5  Cleaning Data

5.1 Cleaning and data science

Imagine you have finally gotten that dataset that you need to work with to answer those questions you’ve been asked to clear up. But wait! Before you get started in earnest working away with generating insights, it is important that you take a closer look at the “quality” of the data.

There is much to look at and consider when it comes to the topic of data cleaning. You probably have heard it said that 80% (or some high percentage) of a data scientist’s time is spent cleaning data, that is, putting it in a form that will better suit its downstream uses.

Not surprisingly, data cleaning is a topic difficult to cover systematically in an introductory course.

It’s not so bad, promise!

So in this tutorial we will focus on a few cleaning techniques - with pandas - that are likely going to be leveraged over and over again when you work with data in python.

  • How to identify and clean up missing data
  • When and how to change datatypes
  • When and how to modify values in your dataset

5.2 Dealing with Missing Data

One of the most common issues in data cleaning has to do with data that is missing from your dataset. Most data, even data that is meticulously collected and managed, is likely to have empty cells. Empty data can be harmless, or it can represent a challenge to you, depending on what you need your data to do once you have finished working with it.

The best way to learn about these concepts is to see them in action with real (or real-ish) data! Before we get started with fetching the data, let’s go ahead and get pandas for today’s session.

# import pandas
import pandas as pd 

Now, imagine we had our own online streaming service. We have access to some data that will help us predict what our users would like to see. To fetch it, run the code below:

# import data
mr = "https://raw.githubusercontent.com/bcgov/" \
        "ds-intro-to-python/main/data/movieratings.csv"
movie_ratings = pd.read_csv(mr)
movie_ratings.head()
Rater Star Wars Finding Nemo Forrest Gump Parasite Citizen Kane
0 Floriana NaN 5.0 5.0 3.0 NaN
1 Raymundo 4.0 NaN NaN NaN 5.0
2 Jung 5.0 NaN NaN 5.0 NaN
3 Kumar 5.0 NaN 4.0 NaN 4.0
4 Maria 5.0 4.0 5.0 NaN NaN

This is a 10 row dataframe, with each row containing a movie rater as well as their ratings for selected movies that they have given ratings for. For interpretative purposes, each movie is rated on a 5 point scale, with 5 being super awesome and 1 being dismal. Not every movie is rated by every rater, so several cells are blank. These are the missing data cells and are marked with “NaN”, indicating missing data.

This dataset, while small and made up, exemplifies a classic “recommendation engine” use case problem - there are lots of raters and lots of movies, but an awful lot of missing data. So what one chooses to do with missing data is a key cleaning question.

Locating Missing Data

In pandas, there is a widely used method named isna() which can be applied to a whole dataframe or just to a particular series or column.

Below we apply it to the entire dataframe to get a cell by cell view of whether the cell data is missing or not (True means it is missing, False means it is not):

#find the missing values
movie_ratings.isna()
Rater Star Wars Finding Nemo Forrest Gump Parasite Citizen Kane
0 False True False False False True
1 False False True True True False
2 False False True True False True
3 False False True False True False
4 False False False False True True
5 False False False False False False
6 False True True False False False
7 False False False False False False
8 False False False False False False
9 False False True True True True

It is probably helpful to note there is a similar function called isnull() that is an alias for isna(). You may run into both, and both will return the same results. There is also a notna()(notnull()) function, which is the inverse function of isna()(isnull()), returning True for values that are not NA. For simplicity, let’s just stick with isna() in this tutorial.

To get a better summary overview (i.e. one that summarizes with numbers) of how many missing values there are per column, then we need to chain togther a .sum() function to the line of code like so:

###listing out the missing values in the dataframe for each series
movie_ratings.isna().sum()
Rater           0
Star Wars       2
Finding Nemo    5
Forrest Gump    3
Parasite        4
Citizen Kane    4
dtype: int64

The reason that this works is because a boolean value of True is represented in pandas by the value of 1, while False is represented by a 0. Therefore, we can apply the sum()method to the dataframe with series being returned that contains the counts of missing items in each column.

Now we see no missing data for Rater, but missing data for each of the movies. For our use case, missing data is a real challenge. With pandas, we can deal with this in many different ways.

Dropping rows and columns

One cleaning option we have is to drop entire rows or columns with missing values in them from the dataframe. The method is called dropna()and takes a number of parameters:

  • axis: {0 or ‘index’, 1 or ‘columns’}, default 0
  • how: {‘any’, ‘all’}, default ‘any’
  • subset: column label or sequence of labels

Imagine we would like to drop all of the rows with any missing values in them. Since the axis=0 and how=any parameters are the defaults, we can write an elegant and simple line of code:

reduced_rows = movie_ratings.dropna()
reduced_rows.head(11)
Rater Star Wars Finding Nemo Forrest Gump Parasite Citizen Kane
5 Arthur 2.0 2.0 3.0 3.0 3.0
7 Martina 5.0 5.0 5.0 5.0 5.0
8 Orson 1.0 1.0 1.0 2.0 5.0

We still have three raters who have given us ratings for each movie. The dataset is nice and clean now, although it has come at a pretty big cost, losing seven rows of data! If we go in the other direction, and insert a parameter of how='all', then we have the opposite problem, with no rows being eliminated (as there were no instances where all of the columns contained missing data for a given row)!

Another option is to use subset to eliminate rows on the basis of whether values are missing in a specified subset. Let’s say we want to have just the rows of data where there is not missing ratings data for the film “Parasite”. We would use subset= parameter and identify the column we want to clean up:

par_df = movie_ratings.dropna(subset='Parasite')
par_df.head(11)
Rater Star Wars Finding Nemo Forrest Gump Parasite Citizen Kane
0 Floriana NaN 5.0 5.0 3.0 NaN
2 Jung 5.0 NaN NaN 5.0 NaN
5 Arthur 2.0 2.0 3.0 3.0 3.0
6 Marcellus NaN NaN 4.0 5.0 4.0
7 Martina 5.0 5.0 5.0 5.0 5.0
8 Orson 1.0 1.0 1.0 2.0 5.0

We can see that, while there is no NaN in the Parasite column, we can see they are still there sprinkled throughout other columns.

Now what would happen if we applied that same drop logic to all columns?

reduced_cols = movie_ratings.dropna(axis=1)
reduced_cols.head()
Rater
0 Floriana
1 Raymundo
2 Jung
3 Kumar
4 Maria

We see that we pretty much all the movie columns have been removed, as each of these columns had at least one missing value! So be careful when using dropna() as it can be a powerful eraser of data. Whether you should use it or how you should will depend on the downstream uses for your data.


Saving the dataframe “inplace”

Inplace is an argument used in different functions such as dropna() that we have just looked at. The default value of this attribute is False, which means whatever the function does to the dataframe, it does so as a copy of the object. When the argument is inplace=True, however, whatever changes are made by the given function are made to the original object inplace, not just to a copy of it!

Let’s look at how this works starting with fetching the data from the source:

# Fetching the data to make sure we start again with the original
mr = "https://raw.githubusercontent.com/bcgov/" \
        "ds-intro-to-python/main/data/movieratings.csv"
movie_ratings = pd.read_csv(mr)
movie_ratings.head()
Rater Star Wars Finding Nemo Forrest Gump Parasite Citizen Kane
0 Floriana NaN 5.0 5.0 3.0 NaN
1 Raymundo 4.0 NaN NaN NaN 5.0
2 Jung 5.0 NaN NaN 5.0 NaN
3 Kumar 5.0 NaN 4.0 NaN 4.0
4 Maria 5.0 4.0 5.0 NaN NaN

Now let’s “modify” the frame with inplace=False with the code below and see what happens:

# Making change to dataframe with inplace=False
movie_ratings.dropna(how='any', inplace=False)

# Looking at the dataframe
movie_ratings.head()
Rater Star Wars Finding Nemo Forrest Gump Parasite Citizen Kane
0 Floriana NaN 5.0 5.0 3.0 NaN
1 Raymundo 4.0 NaN NaN NaN 5.0
2 Jung 5.0 NaN NaN 5.0 NaN
3 Kumar 5.0 NaN 4.0 NaN 4.0
4 Maria 5.0 4.0 5.0 NaN NaN

Ok, it kind of looks like “nothing happened”! And in a way, with the code we ran, nothing did happen, as the result was not assigned to a new object. Soooo, what’s the point of that? Any guesses?

When we run exactly the same code except with the parameter inplace set to 'True', the dropped data is actually gone!

# Making change to dataframe with inplace=False
movie_ratings.dropna(how='any', inplace=True)

# Looking at the dataframe
movie_ratings.head()
Rater Star Wars Finding Nemo Forrest Gump Parasite Citizen Kane
5 Arthur 2.0 2.0 3.0 3.0 3.0
7 Martina 5.0 5.0 5.0 5.0 5.0
8 Orson 1.0 1.0 1.0 2.0 5.0

So now if we want to work with the “movie_ratings” object without the columns we just dropped, we would have to go back and rerun the read_csv() function and fetch the original data again! It is therefore important when writing your cleaning code that you use this parameter carefully.

Filling in missing data

Sometimes it is the right cleaning decision to fill in missing data with some values.

Take the example of our movie ratings dataframe. The basic way to fill in missing values would you to specify a value to take the place of each missing value. One approach is to calculate and plug in the mean value - which we know to be (3.85) for the entire dataframe:

# Fetching the data, yet again, to make sure we start again with the original
mr = "https://raw.githubusercontent.com/bcgov/" \
        "ds-intro-to-python/main/data/movieratings.csv"
movie_ratings = pd.read_csv(mr)

# using fillna and a passing in a constant
filled_385 = movie_ratings.fillna(3.85)
filled_385.head(10)
Rater Star Wars Finding Nemo Forrest Gump Parasite Citizen Kane
0 Floriana 3.85 5.00 5.00 3.00 3.85
1 Raymundo 4.00 3.85 3.85 3.85 5.00
2 Jung 5.00 3.85 3.85 5.00 3.85
3 Kumar 5.00 3.85 4.00 3.85 4.00
4 Maria 5.00 4.00 5.00 3.85 3.85
5 Arthur 2.00 2.00 3.00 3.00 3.00
6 Marcellus 3.85 3.85 4.00 5.00 4.00
7 Martina 5.00 5.00 5.00 5.00 5.00
8 Orson 1.00 1.00 1.00 2.00 5.00
9 Luke 5.00 3.85 3.85 3.85 3.85

Ok, that’s good. Good, but not great for our use case! Maybe we could avail ourselves to some options within the method= parameter with the parentheses the fillna() function. With a specification of bfill we can fill a NaN value with the preceding valid value in that column. ffill fills with the value after it.

What would that look like for our dataset?

bfilled_mr = movie_ratings.fillna(method='bfill')
bfilled_mr.head(10)
Rater Star Wars Finding Nemo Forrest Gump Parasite Citizen Kane
0 Floriana 4.0 5.0 5.0 3.0 5.0
1 Raymundo 4.0 4.0 4.0 5.0 5.0
2 Jung 5.0 4.0 4.0 5.0 4.0
3 Kumar 5.0 4.0 4.0 3.0 4.0
4 Maria 5.0 4.0 5.0 3.0 3.0
5 Arthur 2.0 2.0 3.0 3.0 3.0
6 Marcellus 5.0 5.0 4.0 5.0 4.0
7 Martina 5.0 5.0 5.0 5.0 5.0
8 Orson 1.0 1.0 1.0 2.0 5.0
9 Luke 5.0 NaN NaN NaN NaN

Notice that the bottom row stays NaN when we use bfill. That’s because it can not find a value to backfill with for this row! Using forward fill creates the same issue for the first row. For our use case, admittedly bfill and ffill probably are sub-optimal solutions.

Since we have some ratings data already for each movie, one approach would be to fill in the missing data with the mean rating for that movie from the valid data.

Let’s try that approach on a single column. For the movie Parasite, we fill in the missing values for that column with the average value:

movie_ratings['Parasite'] = movie_ratings['Parasite']. \
  fillna(movie_ratings['Parasite'].mean())
movie_ratings.head(10)
Rater Star Wars Finding Nemo Forrest Gump Parasite Citizen Kane
0 Floriana NaN 5.0 5.0 3.000000 NaN
1 Raymundo 4.0 NaN NaN 3.833333 5.0
2 Jung 5.0 NaN NaN 5.000000 NaN
3 Kumar 5.0 NaN 4.0 3.833333 4.0
4 Maria 5.0 4.0 5.0 3.833333 NaN
5 Arthur 2.0 2.0 3.0 3.000000 3.0
6 Marcellus NaN NaN 4.0 5.000000 4.0
7 Martina 5.0 5.0 5.0 5.000000 5.0
8 Orson 1.0 1.0 1.0 2.000000 5.0
9 Luke 5.0 NaN NaN 3.833333 NaN

And we can extend that for all numeric columns at once!

movie_ratings = movie_ratings.fillna(movie_ratings.mean(numeric_only=True))
movie_ratings.head(10)
Rater Star Wars Finding Nemo Forrest Gump Parasite Citizen Kane
0 Floriana 4.0 5.0 5.000000 3.000000 4.333333
1 Raymundo 4.0 3.4 3.857143 3.833333 5.000000
2 Jung 5.0 3.4 3.857143 5.000000 4.333333
3 Kumar 5.0 3.4 4.000000 3.833333 4.000000
4 Maria 5.0 4.0 5.000000 3.833333 4.333333
5 Arthur 2.0 2.0 3.000000 3.000000 3.000000
6 Marcellus 4.0 3.4 4.000000 5.000000 4.000000
7 Martina 5.0 5.0 5.000000 5.000000 5.000000
8 Orson 1.0 1.0 1.000000 2.000000 5.000000
9 Luke 5.0 3.4 3.857143 3.833333 4.333333

While far from perfect, now we have some filled in data that has some reasonable chance of adding value.

Similar to fillna() is a method called interpolate(), which takes the average of the values near the missing values and plugs them in. To do this, we pass in the interpolate() in similar fashion to how we used fillna() above.

Challenge 1

You are given a dataset for several days time frame in April/May 2021 of the average daily temperature in celcius for three Canadian cities. You would like to ultimately create a plot of the temperatures but don’t want to have gaps in the data. Run the code below to get the data and see what it looks like.

import pandas as pd

temps = "https://raw.githubusercontent.com/bcgov/" \
        "ds-intro-to-python/main/data/citytemps.csv"
city_temps = pd.read_csv(temps)
city_temps.head(10)
Day Victoria Regina Sudbury
0 25-Feb-21 15.0 NaN 7.0
1 26-Feb-21 18.0 -11.0 NaN
2 27-Feb-21 NaN -6.0 NaN
3 28-Feb-21 12.0 -2.0 4.0
4 01-Mar-21 NaN NaN 7.0
5 02-Mar-21 NaN NaN NaN
6 03-Mar-21 NaN NaN 14.0
7 04-Mar-21 15.0 7.0 11.0
8 05-Mar-21 13.0 NaN 9.0
9 06-Mar-21 11.0 4.0 14.0

Unfortunately you notice that several of the days have data missing. Write code that leverages the interpolate() function to fill in the missing data. Run the code. Is this an appropriate solution for the problem? What are the pros and cons to this approach.

Code you would write is below.

interpolate_temps = city_temps.interpolate()
interpolate_temps.head(10)
Day Victoria Regina Sudbury
0 25-Feb-21 15.00 NaN 7.0
1 26-Feb-21 18.00 -11.00 6.0
2 27-Feb-21 15.00 -6.00 5.0
3 28-Feb-21 12.00 -2.00 4.0
4 01-Mar-21 12.75 0.25 7.0
5 02-Mar-21 13.50 2.50 10.5
6 03-Mar-21 14.25 4.75 14.0
7 04-Mar-21 15.00 7.00 11.0
8 05-Mar-21 13.00 5.50 9.0
9 06-Mar-21 11.00 4.00 14.0

Here we can see how the values “smooth” out the data. Were we to go ahead and plot the data, it would make more sense.

Dealing with missing data is one normal task in cleaning, changing data and header content is another, let’s take a look at that next.

5.3 Modifying existing data

Now let’s grad a different dataset, this one is quite redacted snip of the 2014 Mental Health in Tech Survey https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey.

techhealth = "https://raw.githubusercontent.com/bcgov/" \
        "ds-intro-to-python/main/data/techhealth.csv"
m_health = pd.read_csv(techhealth)
m_health.head()
Timestamp Age Gender Country self employed family history treatment work interfere remote work tech company benefits leave mental health consequence
0 27/08/2014 11:35 46 Male United States No No Yes Often Yes Yes Yes Don't know Maybe
1 27/08/2014 11:36 41 Male United States No No Yes Never No No Don't know Don't know Maybe
2 27/08/2014 11:36 33 male United States No Yes Yes Rarely No Yes Yes Don't know No
3 27/08/2014 11:37 35 male United States No Yes Yes Sometimes No No Yes Very easy Yes
4 27/08/2014 11:42 35 M United States No No Yes Rarely Yes Yes Yes Very easy No


This abridged version of the dataset contains 24 cases and 13 variables:

  • Timestamp
  • Age
  • Gender
  • Country
  • Self Employed: Are you self-employed?
  • Family History: Do you have a family history of mental illness?
  • Treatment: Have you sought treatment for a mental health condition?
  • Work Interfere: If you have a mental health condition, do you feel that it interferes with your work?
  • Remote Work: Do you work remotely (outside of an office) at least 50% of the time?
  • Tech Company: Is your employer primarily a tech company/organization?
  • Benefits: Does your employer provide mental health benefits?
  • Leave: How easy is it for you to take medical leave for a mental health condition?
  • Mental Health Consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?

Cleaning the header row

A good practice, both in terms of ensuring a consistent convention for column names and for the ability to write more efficient code, is to ensure that all column header names do not have spaces in them and are all in lower case. To do this with our current dataset, we use the str.replace() function on all the header row, identified in pandas via .columns:

m_health.columns = m_health.columns.str.replace(' ', '_')
m_health.columns = m_health.columns.str.lower()
m_health.head(10)
m_health.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   timestamp                  24 non-null     object
 1   age                        24 non-null     int64 
 2   gender                     24 non-null     object
 3   country                    24 non-null     object
 4   self_employed              24 non-null     object
 5   family_history             24 non-null     object
 6   treatment                  24 non-null     object
 7   work_interfere             24 non-null     object
 8   remote_work                24 non-null     object
 9   tech_company               24 non-null     object
 10  benefits                   24 non-null     object
 11  leave                      24 non-null     object
 12  mental_health_consequence  24 non-null     object
dtypes: int64(1), object(12)
memory usage: 2.6+ KB

The other thing to notice that there are no missing values! So we can just concentrate on other cleaning tasks.

Ensuring the right datatypes

A good practice is to notice whether there are series in the dataframe that are improperly datatyped (that is, where the “Dtype” indicated does not correspond to the datatype that it should be).

Taking a look at the data, we notice that “Timestamp” is datatyped as an “object”, which in this context means a “string”. However, that is sub-optimal because the content is time-oriented, and by coding it as an object/string, we limit a lot of its value. If we identify datatype as a time oriented one, that can give us advantages downstream in being able to work with the data more elegantly.

Pandas has functions specifically designed to cast datatypes from one to another, as was covered in the previous section with the astype() function.

To cast a variable as a datetime datatype, we can use the pandas to_datetime() function and pass in the pandas series you want to modify:

m_health['timestamp'] = pd.to_datetime(m_health['timestamp'])
m_health.head()
m_health.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   timestamp                  24 non-null     datetime64[ns]
 1   age                        24 non-null     int64         
 2   gender                     24 non-null     object        
 3   country                    24 non-null     object        
 4   self_employed              24 non-null     object        
 5   family_history             24 non-null     object        
 6   treatment                  24 non-null     object        
 7   work_interfere             24 non-null     object        
 8   remote_work                24 non-null     object        
 9   tech_company               24 non-null     object        
 10  benefits                   24 non-null     object        
 11  leave                      24 non-null     object        
 12  mental_health_consequence  24 non-null     object        
dtypes: datetime64[ns](1), int64(1), object(11)
memory usage: 2.6+ KB

Et voila - timestamp is now of the datatime64 datatype! Were we wanting to leverage timestamp’s date time qualities, we could now do so more easily. It should be added, that this example works well because the existing data is nicely formatted to be understood as data/time datatype data. In the real world, if the timestamp column has different date types, or isn’t actually a date column at all, the method raises exceptions that you will have to work through. For example, look at what happens if we try to force a different variable into a data/time data type:

m_health['tech_company'] = pd.to_datetime(m_health['tech_company'])
m_health.head()
m_health.info()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py:2236, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2235 try:
-> 2236     values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
   2237     # If tzaware, these values represent unix timestamps, so we
   2238     #  return them as i8 to distinguish from wall times

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/_libs/tslibs/conversion.pyx:360, in pandas._libs.tslibs.conversion.datetime_to_datetime64()

TypeError: Unrecognized value type: <class 'str'>

During handling of the above exception, another exception occurred:

ParserError                               Traceback (most recent call last)
Cell In[20], line 1
----> 1 m_health['tech_company'] = pd.to_datetime(m_health['tech_company'])
      2 m_health.head()
      3 m_health.info()

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/tools/datetimes.py:1051, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1049         result = arg.map(cache_array)
   1050     else:
-> 1051         values = convert_listlike(arg._values, format)
   1052         result = arg._constructor(values, index=arg.index, name=arg.name)
   1053 elif isinstance(arg, (ABCDataFrame, abc.MutableMapping)):

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/tools/datetimes.py:402, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    400 assert format is None or infer_datetime_format
    401 utc = tz == "utc"
--> 402 result, tz_parsed = objects_to_datetime64ns(
    403     arg,
    404     dayfirst=dayfirst,
    405     yearfirst=yearfirst,
    406     utc=utc,
    407     errors=errors,
    408     require_iso8601=require_iso8601,
    409     allow_object=True,
    410 )
    412 if tz_parsed is not None:
    413     # We can take a shortcut since the datetime64 numpy array
    414     # is in UTC
    415     dta = DatetimeArray(result, dtype=tz_to_dtype(tz_parsed))

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py:2242, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2240         return values.view("i8"), tz_parsed
   2241     except (ValueError, TypeError):
-> 2242         raise err
   2244 if tz_parsed is not None:
   2245     # We can take a shortcut since the datetime64 numpy array
   2246     #  is in UTC
   2247     # Return i8 values to denote unix timestamps
   2248     return result.view("i8"), tz_parsed

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py:2224, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2222 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
   2223 try:
-> 2224     result, tz_parsed = tslib.array_to_datetime(
   2225         data.ravel("K"),
   2226         errors=errors,
   2227         utc=utc,
   2228         dayfirst=dayfirst,
   2229         yearfirst=yearfirst,
   2230         require_iso8601=require_iso8601,
   2231         allow_mixed=allow_mixed,
   2232     )
   2233     result = result.reshape(data.shape, order=order)
   2234 except ValueError as err:

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/_libs/tslib.pyx:381, in pandas._libs.tslib.array_to_datetime()

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/_libs/tslib.pyx:613, in pandas._libs.tslib.array_to_datetime()

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/_libs/tslib.pyx:751, in pandas._libs.tslib._array_to_datetime_object()

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/_libs/tslib.pyx:742, in pandas._libs.tslib._array_to_datetime_object()

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/_libs/tslibs/parsing.pyx:281, in pandas._libs.tslibs.parsing.parse_datetime_string()

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/dateutil/parser/_parser.py:1368, in parse(timestr, parserinfo, **kwargs)
   1366     return parser(parserinfo).parse(timestr, **kwargs)
   1367 else:
-> 1368     return DEFAULTPARSER.parse(timestr, **kwargs)

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/dateutil/parser/_parser.py:643, in parser.parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    640 res, skipped_tokens = self._parse(timestr, **kwargs)
    642 if res is None:
--> 643     raise ParserError("Unknown string format: %s", timestr)
    645 if len(res) == 0:
    646     raise ParserError("String does not contain a date: %s", timestr)

ParserError: Unknown string format: Yes

The error let’s us know that this data typing conversion will not work as coded.

Recoding values

Value recoding is a data cleaning task to put the column values into more manageable categories. This is a common data recoding task that most data analyst type folks will be familiar with. Pandas alone has many functions that solve recoding issues in different ways. We will just highlight a few in this tutorial.

With this dataset we’ve called there are a few columns that we would like to make changes to so that it will be easier to analyze the data in the way we want. The value_counts() method shows the range of responses and their frequency in the dataframe.

m_health['gender'].value_counts()
Male        12
male         6
M            2
Female       1
m            1
F            1
Cis Male     1
Name: gender, dtype: int64

With that knowledge, we can utilize the replace() function and pass in custom lists with all of the responses that should be replaced by a single specfied value.

m_health['gender'].replace(['Male ', 'male', 'M', 'm', 'Male', 'Cis Male'],
                            'Male', inplace=True)
m_health['gender'].replace(['Female ', 'female', 'F'],
                            'Female', inplace=True)
m_health['gender'].value_counts()
Male      22
Female     2
Name: gender, dtype: int64

Another task is to put categorical objects into a form that is more amenable to computational tasks. Let’s take a look at a categorial variable that consists of responses of “Yes” and “No”.

m_health['self_employed'].value_counts()
No     23
Yes     1
Name: self_employed, dtype: int64

If we want to leverage this data purely as a segment to split other metrics with, we would probably leave as is. But perhaps we would like to treat it as a dummy variable so we can do more sophisticated math with it.

For our self-employed variable, we begin by implementing replace() to transform the existing content to numbers as shown below:

m_health['self_employed'].replace(['Yes'], '1', inplace=True)
m_health['self_employed'].replace(['No'], '0', inplace=True)
m_health['self_employed'].value_counts()
0    23
1     1
Name: self_employed, dtype: int64

The pandas astype() function is able to interpret the 1s and 0s as integers and so we can transform the series into an int datatype.

m_health['self_employed'] = m_health['self_employed'] \
        .astype(dtype='int')
m_health.head()
timestamp age gender country self_employed family_history treatment work_interfere remote_work tech_company benefits leave mental_health_consequence
0 2014-08-27 11:35:00 46 Male United States 0 No Yes Often Yes Yes Yes Don't know Maybe
1 2014-08-27 11:36:00 41 Male United States 0 No Yes Never No No Don't know Don't know Maybe
2 2014-08-27 11:36:00 33 Male United States 0 Yes Yes Rarely No Yes Yes Don't know No
3 2014-08-27 11:37:00 35 Male United States 0 Yes Yes Sometimes No No Yes Very easy Yes
4 2014-08-27 11:42:00 35 Male United States 0 No Yes Rarely Yes Yes Yes Very easy No

Now we see that variable as a series of int-typed 0s and 1s.

Challenge 2

You would like to clean up the scale used in the question about whether one’s mental illness affects one’s work.

  1. Find the column (pandas series) that needs to be transformed and transform its values using replace() with numerical values that make sense.

  2. Display the modified dataframe showing the replaced values.

First, find the distribution of cases per value label:

m_health['work_interfere'].value_counts()
Sometimes    12
Never         5
Rarely        4
Often         3
Name: work_interfere, dtype: int64

Then show how to use replace to transform the scale into a numeric one:

m_health['work_interfere'].replace(['Never'], '0', inplace=True)
m_health['work_interfere'].replace(['Rarely'], '1', inplace=True)
m_health['work_interfere'].replace(['Sometimes'], '2', inplace=True)
m_health['work_interfere'].replace(['Often'], '3', inplace=True)
m_health['work_interfere'].value_counts()
2    12
0     5
1     4
3     3
Name: work_interfere, dtype: int64

Distribution of values looks good. Let’s take a look at the datatypes for the dataframe object.

m_health.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   timestamp                  24 non-null     datetime64[ns]
 1   age                        24 non-null     int64         
 2   gender                     24 non-null     object        
 3   country                    24 non-null     object        
 4   self_employed              24 non-null     int64         
 5   family_history             24 non-null     object        
 6   treatment                  24 non-null     object        
 7   work_interfere             24 non-null     object        
 8   remote_work                24 non-null     object        
 9   tech_company               24 non-null     object        
 10  benefits                   24 non-null     object        
 11  leave                      24 non-null     object        
 12  mental_health_consequence  24 non-null     object        
dtypes: datetime64[ns](1), int64(2), object(10)
memory usage: 2.6+ KB

But it is still typed as an object. So change datatype to int.

m_health['work_interfere'] = m_health['work_interfere'] \
        .astype(dtype='int')
m_health.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   timestamp                  24 non-null     datetime64[ns]
 1   age                        24 non-null     int64         
 2   gender                     24 non-null     object        
 3   country                    24 non-null     object        
 4   self_employed              24 non-null     int64         
 5   family_history             24 non-null     object        
 6   treatment                  24 non-null     object        
 7   work_interfere             24 non-null     int64         
 8   remote_work                24 non-null     object        
 9   tech_company               24 non-null     object        
 10  benefits                   24 non-null     object        
 11  leave                      24 non-null     object        
 12  mental_health_consequence  24 non-null     object        
dtypes: datetime64[ns](1), int64(3), object(9)
memory usage: 2.6+ KB

Finally, show the dataframe itself.

m_health
timestamp age gender country self_employed family_history treatment work_interfere remote_work tech_company benefits leave mental_health_consequence
0 2014-08-27 11:35:00 46 Male United States 0 No Yes 3 Yes Yes Yes Don't know Maybe
1 2014-08-27 11:36:00 41 Male United States 0 No Yes 0 No No Don't know Don't know Maybe
2 2014-08-27 11:36:00 33 Male United States 0 Yes Yes 1 No Yes Yes Don't know No
3 2014-08-27 11:37:00 35 Male United States 0 Yes Yes 2 No No Yes Very easy Yes
4 2014-08-27 11:42:00 35 Male United States 0 No Yes 1 Yes Yes Yes Very easy No
5 2014-08-27 11:42:00 24 Male United Kingdom 0 No Yes 2 No Yes No Don't know Maybe
6 2014-08-27 11:42:00 35 Male United States 0 No No 2 Yes Yes Yes Somewhat difficult Yes
7 2014-08-27 11:43:00 27 Male Canada 0 Yes Yes 2 No Yes No Very difficult Maybe
8 2014-08-27 11:43:00 18 Male Netherlands 0 No No 3 No Yes No Somewhat difficult Yes
9 2014-08-27 11:43:00 30 Male United States 0 No Yes 2 No Yes Don't know Don't know No
10 2014-08-27 11:43:00 38 Female United States 0 Yes Yes 2 No Yes Yes Somewhat easy No
11 2014-08-27 11:43:00 28 Male United Kingdom 0 No No 2 No Yes Don't know Don't know No
12 2014-08-27 11:43:00 34 Male United States 0 No No 2 No Yes No Don't know No
13 2014-08-27 11:43:00 26 Male Canada 1 No No 2 No Yes No Don't know No
14 2014-08-27 11:44:00 30 Male United States 0 Yes Yes 1 No Yes Yes Don't know Maybe
15 2014-08-27 12:12:00 31 Male United States 0 No No 0 No Yes Yes Somewhat easy No
16 2014-08-27 12:13:00 40 Male United States 0 No Yes 0 No Yes No Very difficult No
17 2014-08-27 12:14:00 34 Male United States 0 No No 0 No Yes Don't know Somewhat easy Maybe
18 2014-08-27 12:15:00 25 Female Canada 0 No Yes 2 No Yes Yes Don't know Maybe
19 2014-08-27 12:15:00 29 Male United States 0 No No 0 No Yes Don't know Don't know No
20 2014-08-27 12:16:00 24 Male United States 0 Yes No 1 No Yes Don't know Somewhat easy Maybe
21 2014-08-27 12:17:00 31 Male Mexico 0 Yes Yes 2 No Yes Don't know Don't know No
22 2014-08-27 12:18:00 33 Male United States 0 No Yes 2 No Yes No Somewhat easy Maybe
23 2014-08-27 12:18:00 30 Male United States 0 No Yes 3 No Yes No Somewhat easy No