
Tackling the Take-Home Challenge

An example EDA challenge with Python and Jupyter Notebooks

Tara Boyle

Photo by Elisa Ventur via Unsplash

A popular take-home assignment for data positions involves exploratory data analysis, or EDA. You are given a dataset or three and told to analyze the data.

A company may give this type of assignment to gain insight into your thought process. They want to see how you tackle a new dataset, and of course to make sure you have the technical skills they require.

While open-ended challenges can be great as they allow you to showcase your strengths and to be creative, it can be hard to know where to even begin.

Oftentimes they say you can use any technology you like. It would make sense to use a language that you're comfortable with and that the company you're interviewing with uses.

Any time I'm given a choice here, I use Python and Jupyter notebooks.

Jupyter notebooks make it easy to show your thought process and document your work in an easy-to-present format.

An important point to remember is that successfully completing the take-home challenge is usually followed by a discussion of your work if the company decides to proceed. It is important to be able to explain your thought process and be comfortable talking about your code in follow-up interviews.

Here we will work through a sample take-home challenge.

The Challenge

In this challenge, the company gives us a very open-ended task: explore some data.

While getting a flexible assignment can be an awesome way for us to highlight our strengths (and perhaps avoid our weaknesses), it can also be challenging to get started with no clear goal to achieve.

The datasets we will use here are the widget factory datasets we created in this article, where we worked through generating fake data with Python.

For our sample challenge, we have a markdown file with instructions:

Instructions for our Take Home Challenge

The markdown file is very helpful to get a feel for what kind of data we'll be working with. It includes data definitions and a very open-ended instruction.

Because there are essentially no constraints, we will use Python and Jupyter notebooks.

Step Zero — Set Up for Success

I have a directory on my computer, coding_interviews, that contains every take-home challenge I've completed in my job searches. Within this directory, I have subdirectories with the names of each company for which I've completed an assignment.

I like keeping the old code. Many times I've gone back to one notebook or another, knowing that I've done something similar in the past, and am able to modify it for the current task.

Before getting started, let's create a widget_factory directory for our challenge and move all the files into it for ease of access and organization.
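If you want to do this from within Python rather than the command line, here is a minimal sketch using pathlib and shutil. The download location and file names below are assumptions for illustration only.

from pathlib import Path
import shutil

# hypothetical paths, for illustration only
challenge_dir = Path('coding_interviews/widget_factory')
data_dir = challenge_dir / 'data'
data_dir.mkdir(parents=True, exist_ok=True)

# move the provided files into the challenge directory (assumed file names)
for filename in ['workers.csv', 'widgets.csv']:
    shutil.move(str(Path('downloads') / filename), str(data_dir / filename))
shutil.move(str(Path('downloads') / 'README.md'), str(challenge_dir / 'README.md'))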

Getting Started — Read in the Data and Ask Basic Questions

The first step I like to take is to read in the data and ask easy questions about each dataset individually:

  • How much data do I have?
  • Are there missing values?
  • What are the data types?

Let's explore our datasets:

# imports used throughout the notebook
import pandas as pd

# read in worker data
worker_df = pd.read_csv('data/workers.csv')
print(worker_df.shape)
worker_df.head()

Worker Dataset: Sample

Because we moved all the files to our widget_factory directory, we can use relative paths to read in the data.

I like to use relative paths in my coding assignments for a few reasons (a small illustration follows the list):

  • It makes the code look clean and neat.
  • The reviewer won't be able to tell if you're on a Mac or PC.
  • I like to think you could get bonus points for making it simple for the reviewer to run your code without needing to change the path.
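As a quick illustration of the difference (the absolute path below is made up), only the relative version works for anyone who opens the notebook from the project directory:

# an absolute path ties the notebook to one specific machine (hypothetical example)
# worker_df = pd.read_csv('/Users/someone/Documents/widget_factory/data/workers.csv')

# a relative path works for anyone running the notebook from the project directory
worker_df = pd.read_csv('data/workers.csv')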

We have the files read in, checked the shape, and printed a sample. Other steps I like to take right away are to check the datatypes, count the number of unique values, and check for null values.

# check the number of unique values in each column
for col in worker_df.columns:
    print(f'Unique {col}: {worker_df[col].nunique()}')

Worker Dataset: Unique values by column
# checking for null values
worker_df.isnull().sum()

Worker Dataset: Null values

We have a relatively clean dataset with no missing values.

# statistics about numerical data
# 'Worker ID' is the only numerical column - it is an identity column according to the readme
worker_df.describe()

Worker Dataset: Numeric Data Description

Our only numeric column is Worker ID, which is an identity column.

# checking column types
# the 'Hire Date' column isn't a date - we'll need to fix it
worker_df.info()

Worker Dataset: Info
# convert 'Hire Date' to datetime
worker_df['Hire Date'] = pd.to_datetime(worker_df['Hire Date'])
# check that it worked
print(worker_df.info())
# check the date range of the dataset
print(f"Min Date: {worker_df['Hire Date'].min()}")
print(f"Max Date: {worker_df['Hire Date'].max()}")

Worker Dataset: Info after converting 'Hire Date' column to datetime

For the widgets dataset we follow the same steps as above. The code for these steps can be found in the full notebook on GitHub.

Plotting — Visualizing the Data

After answering the easy questions about the data, the next step is visualization.

I like to start with more basic visualizations. For me this means working with one variable at a time and then exploring relationships between the features.

Our worker dataset contains five features. Let's work through each feature individually.

We know Worker ID is an identification column from the readme file. We also confirmed this when we checked unique values. With a unique value for each row, we can safely skip visualizing this column.

While Worker Name has only 4,820 unique values, I think it's safe to assume this is also an identity column. While we could argue that it may be interesting to see which workers share a name, or to check for possible duplicate records, we'll skip further exploration of this feature for now.
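If we did want that quick check, a one-off snippet like the following would do it; this is my addition, not part of the original notebook.

# which worker names appear more than once?
name_counts = worker_df['Worker Name'].value_counts()
print(name_counts[name_counts > 1].head(10))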

The next feature we have is Hire Date. Here we can plot the most common hire dates to explore whether workers share hire dates.

To accomplish this, first we need to count the number of unique dates. While there are many ways to do this, I like using Counter.

From the Python documentation, Counter creates a dictionary "where elements are stored as dictionary keys and their counts are stored as dictionary values". This will make it easy to create a bar chart to display our counts.
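A tiny, self-contained example of how Counter behaves (the dates here are made up):

from collections import Counter

# toy example: counting repeated dates
dates = ['2021-01-04', '2021-01-04', '2021-02-15', '2021-03-01', '2021-02-15', '2021-01-04']
counts = Counter(dates)
print(counts.most_common(2))  # [('2021-01-04', 3), ('2021-02-15', 2)]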

After getting our counts, we can then use Seaborn to create a barplot.

# visualize hire date
# first count all unique dates
from collections import Counter

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

hire_dates = Counter(worker_df['Hire Date'].dt.date)

# get the 15 most common dates and their counts
common_dates = [d[0] for d in hire_dates.most_common(15)]
common_counts = [d[1] for d in hire_dates.most_common(15)]

# https://stackoverflow.com/questions/43214978/seaborn-barplot-displaying-values
# function to show values on bars
def show_values_on_bars(axs):
    def _show_on_single_plot(ax):
        for p in ax.patches:
            _x = p.get_x() + p.get_width() / 2
            _y = p.get_y() + p.get_height()
            value = '{:.0f}'.format(p.get_height())
            ax.text(_x, _y, value, ha="center")
    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)

# plot the most common hire dates
fig, ax = plt.subplots()
g = sns.barplot(x=common_dates, y=common_counts, palette='colorblind')
g.set_yticklabels([])
# show values on the bars to make the chart more readable and cleaner
show_values_on_bars(ax)
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.ylabel('')
plt.title('Most Common Hire Dates', fontsize=30)
plt.tick_params(axis='x', which='major', labelsize=15)
fig.autofmt_xdate()
plt.show()

When creating visualizations for any presentation, I like to keep my charts clean by removing unneeded labels and lines.

I also make sure to use a consistent color palette — here we're using Seaborn's colorblind palette.

In a take-home challenge, it can be the small details that make your assignment stand out! Make sure your charts have appropriate titles and labels!

After each plot I like to use a markdown cell to remark on any key observations that can be drawn. Even if these observations are simple, creating a short, meaningful caption can help your EDA look complete and well thought out.

Worker Dataset: Common Hire Dates Barplot

We can see from the above bar chart that workers are often hired alone and not in cohorts.

We could also explore common hiring months and years to determine if there is any pattern in hire dates.

# visualize hire year
# first count all unique years
hire_dates = Counter(worker_df['Hire Date'].dt.year)
# get years and year counts
common_dates = [d[0] for d in hire_dates.most_common()]
common_counts = [d[1] for d in hire_dates.most_common()]
# plot workers by year hired
fig, ax = plt.subplots()
g = sns.barplot(x=common_dates, y=common_counts, palette='colorblind')
g.set_yticklabels([])
# show values on the bars to make the chart more readable and cleaner
show_values_on_bars(ax)
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.ylabel('')
plt.title('Workers by Year Hired', fontsize=30)
plt.tick_params(axis='x', which='major', labelsize=15)
fig.autofmt_xdate()
plt.show()

Worker Dataset: Workers by year hired

Here we can see workers hired by year. The years with the fewest hires are the first and last years of the dataset. Above, when we checked the date range of the data, we found the minimum date to be 1991-07-11 and the maximum date to be 2021-07-08, which explains the lower number of hires in those partial years.
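The month-level view mentioned earlier isn't shown in these excerpts; a minimal sketch of how it could start:

# sketch: hires by calendar month, to look for seasonality in hiring
hires_by_month = worker_df['Hire Date'].dt.month.value_counts().sort_index()
print(hires_by_month)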

Moving on to Worker Status.

# visualize the status feature
fig, ax = plt.subplots()
g = sns.countplot(x=worker_df['Worker Status'],
                  order=worker_df['Worker Status'].value_counts().index,
                  palette='colorblind')
g.set_yticklabels([])
# show values on the bars to make the chart more readable and cleaner
show_values_on_bars(ax)
plt.title('Workers by Worker Status')
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.ylabel('')
plt.tick_params(axis='x', which='major', labelsize=15)
plt.show()

Worker Dataset: Workers by Status

Here we can see that a majority of workers are full time. We have fewer workers categorized as part time and per diem.

And finally, let's explore the team feature.

# visualize the team feature
sns.countplot(y=worker_df['Team'],
              order=worker_df['Team'].value_counts().index,
              palette='colorblind')
sns.despine(left=True, bottom=True)
plt.title('Workers by Team')
plt.show()

Worker Dataset: Workers by Team

It can be helpful to start asking questions about the data. Does one feature relate to another feature we have?

Here we see that the teams are of similar sizes. MidnightBlue has the most members and Crimson the fewest. It could be interesting to explore how the teams are assigned. Could it be based on job title, location, or worker status?

Now we can start working with groups of features. Let's visualize team by worker status.

# visualize team by worker status
fig, ax = plt.subplots()
g = sns.countplot(x=worker_df['Team'],
                  hue=worker_df['Worker Status'],
                  palette='colorblind')
g.set_yticklabels([])
show_values_on_bars(ax)

# position the legend so that it doesn't cover any bars
leg = plt.legend(loc='upper right')
plt.draw()

# get the bounding box of the original legend
# (transformed(...inverted()) replaces the deprecated inverse_transformed)
bb = leg.get_bbox_to_anchor().transformed(ax.transAxes.inverted())
# shift the legend to the right
xOffset = 0.1
bb.x0 += xOffset
bb.x1 += xOffset
leg.set_bbox_to_anchor(bb, transform=ax.transAxes)

sns.despine(left=True, bottom=True)
plt.title('Workers by Team and Worker Status')
plt.ylabel('')
plt.xlabel('')
plt.show()

Worker Dataset: Workers by Team and Worker Status

Here we can see there is a relatively equal distribution of worker statuses across teams. This suggests that workers are not assigned to teams by their status.
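A quick numeric check of the same idea (my addition, not part of the original notebook) is a cross-tabulation of team against status:

# share of each worker status within each team
pd.crosstab(worker_df['Team'], worker_df['Worker Status'], normalize='index')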

This completes our worker dataset! Let's move on to the widget dataset.

We can create histograms of our numerical data:

# create histograms of all numerical data
# we know worker id is an identity column,
# so we remove it from this visualization
widget_df_hist = widget_df[['Step 1', 'Step 2', 'Step 3']]
widget_df_hist.hist()
sns.despine(left=True, bottom=True)
plt.show()

Widget Dataset: Histograms of Steps 1, 2, and 3

Here we see histograms of each step in the widget-making process.

For steps 1 and 3, it looks like a majority of workers complete the steps quickly, and there are long tails where the task takes much longer to complete. The long tails could be due to errors in recording the data, or to workers having trouble completing the steps. This would be interesting to explore further.

Step 2 appears to have a normal distribution. Could step 2 be an easier or more automated step in the widget-making process?
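One way to back up these visual impressions (my addition, not from the original notebook) is to look at the upper quantiles of each step directly:

# inspect the tails numerically
widget_df[['Step 1', 'Step 2', 'Step 3']].quantile([0.50, 0.95, 0.99, 1.00])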

We've successfully explored all our features in both datasets! While we could stop here, this would not be the most useful or interesting analysis.

The next step we will take is to merge our datasets to explore the relationships between features.

Going Further — Combining Datasets

Our next step is to combine our worker and widget datasets. This demonstrates our ability to merge datasets, a crucial skill for any data job.

We will merge the datasets on Worker ID, as that is the common feature between them:

# merge dataframes together
merged_df = pd.merge(worker_df,
                     widget_df,
                     how='inner',
                     on='Worker ID')
print(merged_df.shape)
merged_df.head()

Merged Datasets
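As an optional sanity check (my addition, not part of the original notebook), pandas' merge can validate the expected key relationship, and comparing row counts confirms nothing was silently dropped:

# each worker should match many widget rows; duplicate worker keys would raise an error
checked = pd.merge(worker_df, widget_df, how='inner',
                   on='Worker ID', validate='one_to_many')
print(worker_df.shape[0], widget_df.shape[0], checked.shape[0])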

With the data successfully merged, we can continue plotting. Let's plot item count by team:

# visualize item count by team
fig, ax = plt.subplots()
g = sns.countplot(x=merged_df['Team'],
                  order=merged_df['Team'].value_counts().index,
                  palette='colorblind')
g.set_yticklabels([])
# show values on the bars to make the chart more readable and cleaner
show_values_on_bars(ax)
plt.title('Item Count by Team')
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.ylabel('')
plt.tick_params(axis='x', which='major', labelsize=15)
plt.show()

Merged Dataset: Item Count by Team

The MidnightBlue team created the most items, while Crimson created the fewest. We can infer this is related to the number of workers assigned to each team. Above we found MidnightBlue is the largest team and Crimson the smallest.

We can also explore item count by worker status:

# visualize item count by worker status
fig, ax = plt.subplots()
g = sns.countplot(x=merged_df['Worker Status'],
                  order=merged_df['Worker Status'].value_counts().index,
                  palette='colorblind')
g.set_yticklabels([])
# show values on the bars to make the chart more readable and cleaner
show_values_on_bars(ax)
plt.title('Item Count by Worker Status')
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.ylabel('')
plt.tick_params(axis='x', which='major', labelsize=15)
plt.show()

Merged Dataset: Item Count by Worker Status

Here we see item count by worker status. As expected, full time workers created the most items.

We can also explore item counts by individual workers. This can show the most and least productive workers. Let's look at the workers with the lowest item counts; the plotting snippet relies on a grouped_df of per-worker item counts, sketched below.
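The grouped_df used below is built in the full notebook rather than in these excerpts. Here is a minimal sketch of how it might be created, assuming the widget data identifies each widget with an 'Item Number' column (that column name is an assumption):

# sketch: per-worker item counts (assumes an 'Item Number' column in the widget data)
grouped_df = (merged_df
              .groupby(['Worker ID', 'Worker Name', 'Worker Status'], as_index=False)
              .agg({'Item Number': 'count'})
              .rename(columns={'Item Number': 'Item Number Count'}))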

# visualize workers with the lowest item counts
# first create a temporary df
tmp = grouped_df.sort_values(by='Item Number Count',
                             ascending=True).head(20)
fig, ax = plt.subplots()
g = sns.barplot(y=tmp['Worker Name'],
                x=tmp['Item Number Count'],
                palette='colorblind')
plt.title('Workers with Lowest Item Count')
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.ylabel('')
plt.tick_params(axis='x', which='major', labelsize=15)
plt.show()

Merged Dataset: Workers with Lowest Item Count

Here we see the 20 workers with the lowest item count. It would be interesting to explore the worker status of these workers. Are they part time or per diem? If we had data relating to time off or hours worked, it would be interesting to explore whether there is any correlation.

For completeness, we can check the worker status of the plotted workers by printing out the full tmp dataframe we created for plotting and checking visually.

Another, more succinct, option is to use value_counts() to get the count of unique values in the Worker Status column.

# check worker status for the workers plotted above
tmp['Worker Status'].value_counts()
>>> Per Diem 20
>>> Name: Worker Status, dtype: int64

We can see the workers creating the fewest items are all per diem. A logical assumption here is that these workers may have worked the fewest shifts or hours, as per diem workers are often used on an as-needed basis.
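If we wanted to quantify that, a rough follow-up (my addition, reusing the grouped_df sketched above) would be the average item count per worker within each status:

# average items per worker by status
grouped_df.groupby('Worker Status')['Item Number Count'].mean()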

Let's now exclude per diem workers and explore the distributions of each step in the widget-making process for full and part time workers:

# create a temp df with only part and full time workers
tmp = merged_df.loc[merged_df['Worker Status'].isin(
    ['Full Time', 'Part Time'])]
# list of steps to loop over
steps = ['Step 1', 'Step 2', 'Step 3']
# create a plot for each step
for step in steps:
    fig, ax = plt.subplots()
    g = sns.violinplot(x='Team',
                       y=step,
                       hue='Worker Status',
                       split=True, data=tmp)
    sns.despine(left=True, bottom=True)
    plt.xlabel('')
    plt.ylabel('Time')
    plt.title(f'{step} by Worker Status', fontsize=20)
    plt.show()

Merged Dataset: Step 1 Violin Plot

For step 1, we see a similar distribution for both full and part time workers. All distributions have long tails. This could be due to actual slow step completion times or to possible data collection errors. This could be interesting to explore further.

Merged Dataset: Step 2 Violin Plot

The distributions for step 2 appear normal. For both full and part time workers across all teams, step 2 times resemble a bell curve.

Merged Dataset: Step 3 Violin Plot

In step 3, we can see very long tails across all groups. It appears that this step is generally completed quickly, with a few outliers taking longer.

Across all steps, we don't see any major differences between groups in the violin plots. This suggests that full time and part time workers across all teams are mostly consistent in the time it takes to make widgets.
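If we wanted to chase those long tails further, one simple next step (my addition, not in the original notebook) is to flag unusually slow records with the 1.5 * IQR rule:

# flag unusually slow Step 3 times
q1, q3 = merged_df['Step 3'].quantile([0.25, 0.75])
iqr = q3 - q1
slow_step3 = merged_df[merged_df['Step 3'] > q3 + 1.5 * iqr]
print(f'{len(slow_step3)} potentially slow Step 3 records')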

A limitation we can note here is that we have no time units for the step columns. The instruction file merely noted that the values in these columns are times, but did not give any units.

While we by no means explored all possible data visualizations, we have completed a thorough exploratory analysis and will begin wrapping up the challenge here.

Wrapping Up — Conclusions, Limitations, and Further Exploration

To finish any take-home assignment, adding a short conclusion section is a great idea. I like to touch on three areas here:

  • Conclusions
  • Limitations
  • Further Exploration

Are there any conclusions we can draw from the data? What limitations do we observe about the data? What would be interesting to explore further if we had more time or data?

This section doesn't need to be long; a few sentences on each topic should be plenty.

Conclusions

We found that the MidnightBlue team has the most workers and also created the greatest number of widgets. Crimson, with the fewest members, created the lowest number of widgets.

For the timeframe of this dataset, it appears that the number of widgets created is correlated with team size.

We also found that the workers who created the fewest widgets are all per diem. This suggests per diem workers may work fewer hours than part time and full time workers.

Full time and part time workers across all teams appear to be similar in their widget creation times.

Limitations

The data given here does not include the time frame of the data collection period. We are only able to analyze the data as a single snapshot in time, and are not able to explore how widget creation may vary over time.

As we noted above, we do not know the time unit for the data in the widget table.

Further Exploration

It would be interesting to explore widget creation over time. Do teams with the most workers always create the most widgets?

We could also further explore the timings of the widget creation steps. It would be interesting to see if these change over time, or to explore any potential outliers.

One final thing I like to do here is restart my notebook and run all cells from top to bottom. This shows attention to detail and, perhaps more importantly, ensures that the notebook will run in order, error free.

If the reviewer tries to run our notebook, we want to be confident that it will run in order with no errors!

Summary

Here we worked through a sample interview take-home challenge from start to finish.

We began by reading in the data and asking simple questions.

  • Do we have outliers?
  • Do we have missing values?
  • What type of data do we have?
  • What does the data look like?

We then moved on to plotting, first visualizing single features and then exploring the relationships among the features.

By manipulating the data, merging it, and creating visualizations, we were able to showcase our Python and data exploration skills.

The full notebook can be found on GitHub.

Best of luck on your next take-home challenge!
