# import pandas
import pandas as pd
5 Cleaning Data
5.1 Cleaning and data science
Imagine you have finally gotten the dataset you need to answer the questions you’ve been asked to clear up. But wait! Before you start generating insights in earnest, it is important that you take a closer look at the “quality” of the data.
There is much to look at and consider when it comes to the topic of data cleaning. You probably have heard it said that 80% (or some high percentage) of a data scientist’s time is spent cleaning data, that is, putting it in a form that will better suit its downstream uses.
Not surprisingly, data cleaning is a topic difficult to cover systematically in an introductory course.
It’s not so bad, promise!
So in this tutorial we will focus on a few cleaning techniques - using pandas - that you are likely to leverage over and over again when you work with data in Python:
- How to identify and clean up missing data
- When and how to change datatypes
- When and how to modify values in your dataset
5.2 Dealing with Missing Data
One of the most common issues in data cleaning has to do with data that is missing from your dataset. Most data, even data that is meticulously collected and managed, is likely to have empty cells. Empty data can be harmless, or it can represent a challenge to you, depending on what you need your data to do once you have finished working with it.
The best way to learn about these concepts is to see them in action with real (or real-ish) data! Before we get started with fetching the data, make sure pandas is imported, as we did at the top of this tutorial.
Now, imagine we had our own online streaming service. We have access to some data that will help us predict what our users would like to see. To fetch it, run the code below:
# import data
mr = "https://raw.githubusercontent.com/bcgov/" \
     "ds-intro-to-python/main/data/movieratings.csv"
movie_ratings = pd.read_csv(mr)
movie_ratings.head()
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | NaN | 5.0 | 5.0 | 3.0 | NaN |
1 | Raymundo | 4.0 | NaN | NaN | NaN | 5.0 |
2 | Jung | 5.0 | NaN | NaN | 5.0 | NaN |
3 | Kumar | 5.0 | NaN | 4.0 | NaN | 4.0 |
4 | Maria | 5.0 | 4.0 | 5.0 | NaN | NaN |
This is a 10 row dataframe, with each row containing a movie rater and the ratings they have given to selected movies. For interpretive purposes, each movie is rated on a 5 point scale, with 5 being super awesome and 1 being dismal. Not every movie is rated by every rater, so several cells are blank. These are the missing data cells and are marked with “NaN”, indicating missing data.
This dataset, while small and made up, exemplifies a classic “recommendation engine” use case problem - there are lots of raters and lots of movies, but an awful lot of missing data. So what one chooses to do with missing data is a key cleaning question.
Locating Missing Data
In pandas, there is a widely used method named isna()
which can be applied to a whole dataframe or just to a particular series or column.
Below we apply it to the entire dataframe to get a cell by cell view of whether the cell data is missing or not (True means it is missing, False means it is not):
#find the missing values
movie_ratings.isna()
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | False | True | False | False | False | True |
1 | False | False | True | True | True | False |
2 | False | False | True | True | False | True |
3 | False | False | True | False | True | False |
4 | False | False | False | False | True | True |
5 | False | False | False | False | False | False |
6 | False | True | True | False | False | False |
7 | False | False | False | False | False | False |
8 | False | False | False | False | False | False |
9 | False | False | True | True | True | True |
It is probably helpful to note there is a similar function called isnull()
that is an alias for isna()
. You may run into both, and both will return the same results. There is also a notna()
(notnull()
) function, which is the inverse function of isna()
(isnull()
), returning True for values that are not NA. For simplicity, let’s just stick with isna()
in this tutorial.
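These checks also work on a single column, which is handy when you only care about one variable. A minimal sketch, using the movie_ratings dataframe from above:

# check one column for missing values (True means missing)
movie_ratings['Star Wars'].isna()

# notna() gives the inverse view of the same column
movie_ratings['Star Wars'].notna()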
To get a better summary overview (i.e. one that summarizes with numbers) of how many missing values there are per column, we need to chain together a .sum()
function to the line of code like so:
# listing out the missing values in the dataframe for each series
movie_ratings.isna().sum()
Rater 0
Star Wars 2
Finding Nemo 5
Forrest Gump 3
Parasite 4
Citizen Kane 4
dtype: int64
The reason that this works is that a boolean value of True is represented in pandas by the value 1, while False is represented by 0. Therefore, we can apply the sum()
method to the dataframe, and a series is returned that contains the count of missing items in each column.
Now we see no missing data for Rater, but missing data for each of the movies. For our use case, missing data is a real challenge. With pandas, we can deal with this in many different ways.
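The same trick works in the other direction, too. If we wanted to know how many ratings each rater is missing, we could sum across the columns instead; a quick sketch on the same dataframe:

# count missing values per row (i.e. per rater) rather than per column
movie_ratings.isna().sum(axis=1)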
Dropping rows and columns
One cleaning option we have is to drop entire rows or columns with missing values in them from the dataframe. The method is called dropna()
and takes a number of parameters:
- axis: {0 or ‘index’, 1 or ‘columns’}, default 0
- how: {‘any’, ‘all’}, default ‘any’
- subset: column label or sequence of labels
Imagine we would like to drop all of the rows with any missing values in them. Since the axis=0
and how='any'
parameters are the defaults, we can write an elegant and simple line of code:
reduced_rows = movie_ratings.dropna()
reduced_rows.head(11)
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
5 | Arthur | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 |
7 | Martina | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
8 | Orson | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 |
We still have three raters who have given us ratings for each movie. The dataset is nice and clean now, although it has come at a pretty big cost, losing seven rows of data! If we go in the other direction, and insert a parameter of how='all'
, then we have the opposite problem, with no rows being eliminated (as there were no instances where all of the columns contained missing data for a given row)!
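If you want to verify that for yourself, the call is a small variation on the one above; a quick sketch (all 10 rows should survive, since no rater left every rating blank):

# how='all' drops only rows where every column is missing
movie_ratings.dropna(how='all').head(11)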
Another option is to use subset to eliminate rows on the basis of whether values are missing in a specified subset of columns. Let’s say we want just the rows of data where the ratings for the film “Parasite” are not missing. We would use the subset=
parameter and identify the column we want to clean up:
par_df = movie_ratings.dropna(subset='Parasite')
par_df.head(11)
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | NaN | 5.0 | 5.0 | 3.0 | NaN |
2 | Jung | 5.0 | NaN | NaN | 5.0 | NaN |
5 | Arthur | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 |
6 | Marcellus | NaN | NaN | 4.0 | 5.0 | 4.0 |
7 | Martina | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
8 | Orson | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 |
We can see that, while there are no NaN values left in the Parasite column, they are still sprinkled throughout the other columns.
Now what would happen if we applied that same drop logic to all columns?
reduced_cols = movie_ratings.dropna(axis=1)
reduced_cols.head()
|   | Rater |
|---|---|
0 | Floriana |
1 | Raymundo |
2 | Jung |
3 | Kumar |
4 | Maria |
We see that all of the movie columns have been removed, as each of these columns had at least one missing value! So be careful when using dropna()
, as it can be a powerful eraser of data. Whether and how you should use it will depend on the downstream uses for your data.
Saving the dataframe “inplace”
inplace
is an argument used in different functions, such as the dropna()
we have just looked at. The default value of this parameter is False
, which means that whatever the function does to the dataframe, it does to a copy of the object. When the argument is inplace=True
, however, whatever changes are made by the given function are made to the original object in place, not just to a copy of it!
Let’s look at how this works starting with fetching the data from the source:
# Fetching the data to make sure we start again with the original
mr = "https://raw.githubusercontent.com/bcgov/" \
     "ds-intro-to-python/main/data/movieratings.csv"
movie_ratings = pd.read_csv(mr)
movie_ratings.head()
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | NaN | 5.0 | 5.0 | 3.0 | NaN |
1 | Raymundo | 4.0 | NaN | NaN | NaN | 5.0 |
2 | Jung | 5.0 | NaN | NaN | 5.0 | NaN |
3 | Kumar | 5.0 | NaN | 4.0 | NaN | 4.0 |
4 | Maria | 5.0 | 4.0 | 5.0 | NaN | NaN |
Now let’s “modify” the frame with inplace=False
with the code below and see what happens:
# Making change to dataframe with inplace=False
movie_ratings.dropna(how='any', inplace=False)

# Looking at the dataframe
movie_ratings.head()
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | NaN | 5.0 | 5.0 | 3.0 | NaN |
1 | Raymundo | 4.0 | NaN | NaN | NaN | 5.0 |
2 | Jung | 5.0 | NaN | NaN | 5.0 | NaN |
3 | Kumar | 5.0 | NaN | 4.0 | NaN | 4.0 |
4 | Maria | 5.0 | 4.0 | 5.0 | NaN | NaN |
Ok, it kind of looks like “nothing happened”! And in a way, with the code we ran, nothing did happen, as the result was not assigned to a new object. Soooo, what’s the point of that? Any guesses?
When we run exactly the same code except with the parameter inplace
set to True
, the dropped data is actually gone!
# Making change to dataframe with inplace=True
movie_ratings.dropna(how='any', inplace=True)

# Looking at the dataframe
movie_ratings.head()
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
5 | Arthur | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 |
7 | Martina | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
8 | Orson | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 |
So now if we want to work with the “movie_ratings” object with the rows we just dropped still in it, we would have to go back and rerun the read_csv()
function and fetch the original data again! It is therefore important when writing your cleaning code that you use this parameter carefully.
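A safer pattern, and one worth getting in the habit of, is to skip inplace altogether and assign the result instead, either to a new name or back over the original; a quick sketch:

# assignment achieves the same end while leaving the original intact
movie_ratings_clean = movie_ratings.dropna(how='any')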
Filling in missing data
Sometimes it is the right cleaning decision to fill in missing data with some values.
Take the example of our movie ratings dataframe. The basic way to fill in missing values would be to specify a value to take the place of each missing value. One approach is to calculate and plug in the mean value - which we know to be 3.85 for the entire dataframe:
# Fetching the data, yet again, to make sure we start again with the original
mr = "https://raw.githubusercontent.com/bcgov/" \
     "ds-intro-to-python/main/data/movieratings.csv"
movie_ratings = pd.read_csv(mr)

# using fillna and passing in a constant
filled_385 = movie_ratings.fillna(3.85)
filled_385.head(10)
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | 3.85 | 5.00 | 5.00 | 3.00 | 3.85 |
1 | Raymundo | 4.00 | 3.85 | 3.85 | 3.85 | 5.00 |
2 | Jung | 5.00 | 3.85 | 3.85 | 5.00 | 3.85 |
3 | Kumar | 5.00 | 3.85 | 4.00 | 3.85 | 4.00 |
4 | Maria | 5.00 | 4.00 | 5.00 | 3.85 | 3.85 |
5 | Arthur | 2.00 | 2.00 | 3.00 | 3.00 | 3.00 |
6 | Marcellus | 3.85 | 3.85 | 4.00 | 5.00 | 4.00 |
7 | Martina | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 |
8 | Orson | 1.00 | 1.00 | 1.00 | 2.00 | 5.00 |
9 | Luke | 5.00 | 3.85 | 3.85 | 3.85 | 3.85 |
Ok, that’s good. Good, but not great for our use case! Maybe we could avail ourselves of some options within the method=
parameter of the fillna()
function. With a specification of bfill
(“backward fill”) we can fill a NaN value with the next valid value below it in that column, while ffill
(“forward fill”) fills with the last valid value above it. (In newer versions of pandas, the method= parameter is deprecated in favour of the equivalent .bfill() and .ffill() methods.)
What would that look like for our dataset?
bfilled_mr = movie_ratings.fillna(method='bfill')
bfilled_mr.head(10)
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | 4.0 | 5.0 | 5.0 | 3.0 | 5.0 |
1 | Raymundo | 4.0 | 4.0 | 4.0 | 5.0 | 5.0 |
2 | Jung | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 |
3 | Kumar | 5.0 | 4.0 | 4.0 | 3.0 | 4.0 |
4 | Maria | 5.0 | 4.0 | 5.0 | 3.0 | 3.0 |
5 | Arthur | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 |
6 | Marcellus | 5.0 | 5.0 | 4.0 | 5.0 | 4.0 |
7 | Martina | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
8 | Orson | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 |
9 | Luke | 5.0 | NaN | NaN | NaN | NaN |
Notice that the bottom row stays NaN when we use bfill. That’s because there is no value below it to backfill from! Using forward fill creates the same issue for the first row. For our use case, admittedly, bfill
and ffill
are probably sub-optimal solutions.
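If the direction of the two fills is hard to keep straight, a tiny throwaway series can make it concrete; a minimal sketch:

# ffill copies the last valid value downward; bfill pulls the next valid value upward
s = pd.Series([1.0, None, None, 4.0])
s.ffill()  # 1.0, 1.0, 1.0, 4.0
s.bfill()  # 1.0, 4.0, 4.0, 4.0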
Since we have some ratings data already for each movie, one approach would be to fill in the missing data with the mean rating for that movie from the valid data.
Let’s try that approach on a single column. For the movie Parasite, we fill in the missing values for that column with the average value:
movie_ratings['Parasite'] = movie_ratings['Parasite'] \
    .fillna(movie_ratings['Parasite'].mean())
movie_ratings.head(10)
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | NaN | 5.0 | 5.0 | 3.000000 | NaN |
1 | Raymundo | 4.0 | NaN | NaN | 3.833333 | 5.0 |
2 | Jung | 5.0 | NaN | NaN | 5.000000 | NaN |
3 | Kumar | 5.0 | NaN | 4.0 | 3.833333 | 4.0 |
4 | Maria | 5.0 | 4.0 | 5.0 | 3.833333 | NaN |
5 | Arthur | 2.0 | 2.0 | 3.0 | 3.000000 | 3.0 |
6 | Marcellus | NaN | NaN | 4.0 | 5.000000 | 4.0 |
7 | Martina | 5.0 | 5.0 | 5.0 | 5.000000 | 5.0 |
8 | Orson | 1.0 | 1.0 | 1.0 | 2.000000 | 5.0 |
9 | Luke | 5.0 | NaN | NaN | 3.833333 | NaN |
And we can extend that for all numeric columns at once!
movie_ratings = movie_ratings.fillna(movie_ratings.mean(numeric_only=True))
movie_ratings.head(10)
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | 4.0 | 5.0 | 5.000000 | 3.000000 | 4.333333 |
1 | Raymundo | 4.0 | 3.4 | 3.857143 | 3.833333 | 5.000000 |
2 | Jung | 5.0 | 3.4 | 3.857143 | 5.000000 | 4.333333 |
3 | Kumar | 5.0 | 3.4 | 4.000000 | 3.833333 | 4.000000 |
4 | Maria | 5.0 | 4.0 | 5.000000 | 3.833333 | 4.333333 |
5 | Arthur | 2.0 | 2.0 | 3.000000 | 3.000000 | 3.000000 |
6 | Marcellus | 4.0 | 3.4 | 4.000000 | 5.000000 | 4.000000 |
7 | Martina | 5.0 | 5.0 | 5.000000 | 5.000000 | 5.000000 |
8 | Orson | 1.0 | 1.0 | 1.000000 | 2.000000 | 5.000000 |
9 | Luke | 5.0 | 3.4 | 3.857143 | 3.833333 | 4.333333 |
While far from perfect, now we have some filled in data that has some reasonable chance of adding value.
Similar to fillna()
is a method called interpolate()
, which by default fills each missing value by linearly interpolating between the valid values on either side of it. To do this, we call interpolate()
in similar fashion to how we used fillna()
above.
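To see interpolate() in action you would first need to re-fetch the original data, since we have just filled in all the gaps; a minimal sketch on a single column:

# re-fetch the original data, then interpolate one ratings column
movie_ratings = pd.read_csv(mr)
movie_ratings['Finding Nemo'].interpolate()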
Dealing with missing data is one common cleaning task; changing data and header content is another. Let’s take a look at that next.
5.3 Modifying existing data
Now let’s grab a different dataset, this one a heavily redacted snippet of the 2014 Mental Health in Tech Survey https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey.
= "https://raw.githubusercontent.com/bcgov/" \
techhealth "ds-intro-to-python/main/data/techhealth.csv"
= pd.read_csv(techhealth)
m_health m_health.head()
|   | Timestamp | Age | Gender | Country | self employed | family history | treatment | work interfere | remote work | tech company | benefits | leave | mental health consequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27/08/2014 11:35 | 46 | Male | United States | No | No | Yes | Often | Yes | Yes | Yes | Don't know | Maybe |
1 | 27/08/2014 11:36 | 41 | Male | United States | No | No | Yes | Never | No | No | Don't know | Don't know | Maybe |
2 | 27/08/2014 11:36 | 33 | male | United States | No | Yes | Yes | Rarely | No | Yes | Yes | Don't know | No |
3 | 27/08/2014 11:37 | 35 | male | United States | No | Yes | Yes | Sometimes | No | No | Yes | Very easy | Yes |
4 | 27/08/2014 11:42 | 35 | M | United States | No | No | Yes | Rarely | Yes | Yes | Yes | Very easy | No |
This abridged version of the dataset contains 24 cases and 13 variables:
- Timestamp
- Age
- Gender
- Country
- Self Employed: Are you self-employed?
- Family History: Do you have a family history of mental illness?
- Treatment: Have you sought treatment for a mental health condition?
- Work Interfere: If you have a mental health condition, do you feel that it interferes with your work?
- Remote Work: Do you work remotely (outside of an office) at least 50% of the time?
- Tech Company: Is your employer primarily a tech company/organization?
- Benefits: Does your employer provide mental health benefits?
- Leave: How easy is it for you to take medical leave for a mental health condition?
- Mental Health Consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?
Cleaning the header row
A good practice, both for ensuring a consistent convention for column names and for the ability to write more efficient code, is to ensure that column header names have no spaces in them and are all lower case. To do this with our current dataset, we use the str.replace()
method on the header row, accessed in pandas via .columns:
m_health.columns = m_health.columns.str.replace(' ', '_')
m_health.columns = m_health.columns.str.lower()
m_health.head(10)
m_health.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 timestamp 24 non-null object
1 age 24 non-null int64
2 gender 24 non-null object
3 country 24 non-null object
4 self_employed 24 non-null object
5 family_history 24 non-null object
6 treatment 24 non-null object
7 work_interfere 24 non-null object
8 remote_work 24 non-null object
9 tech_company 24 non-null object
10 benefits 24 non-null object
11 leave 24 non-null object
12 mental_health_consequence 24 non-null object
dtypes: int64(1), object(12)
memory usage: 2.6+ KB
The other thing to notice is that there are no missing values! So we can just concentrate on other cleaning tasks.
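As an aside, if you only need to fix one or two column names rather than transform the whole header row, the rename() method is a handy alternative; a minimal sketch (the new name here is purely for illustration):

# rename() returns a modified copy by default, leaving m_health itself unchanged
m_health.rename(columns={'work_interfere': 'work_interference'}).head()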
Ensuring the right datatypes
A good practice is to notice whether there are series in the dataframe that are improperly datatyped (that is, where the “Dtype” indicated does not correspond to the datatype that it should be).
Taking a look at the data, we notice that “timestamp” is datatyped as an “object”, which in this context means a “string”. However, that is sub-optimal because the content is time-oriented, and by coding it as an object/string, we limit a lot of its value. If we give it a time-oriented datatype instead, that can give us advantages downstream in being able to work with the data more elegantly.
Pandas has functions specifically designed to cast datatypes from one to another, as was covered in the previous section with the astype()
function.
To cast a variable as a datetime datatype, we can use the pandas to_datetime()
function and pass in the pandas series you want to modify:
m_health['timestamp'] = pd.to_datetime(m_health['timestamp'])

m_health.head()
m_health.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 timestamp 24 non-null datetime64[ns]
1 age 24 non-null int64
2 gender 24 non-null object
3 country 24 non-null object
4 self_employed 24 non-null object
5 family_history 24 non-null object
6 treatment 24 non-null object
7 work_interfere 24 non-null object
8 remote_work 24 non-null object
9 tech_company 24 non-null object
10 benefits 24 non-null object
11 leave 24 non-null object
12 mental_health_consequence 24 non-null object
dtypes: datetime64[ns](1), int64(1), object(11)
memory usage: 2.6+ KB
Et voilà - timestamp is now of the datetime64 datatype! If we wanted to leverage timestamp’s date time qualities, we could now do so more easily. It should be added that this example works well because the existing data is nicely formatted to be understood as date/time data. In the real world, if the timestamp column holds mixed date formats, or isn’t actually a date column at all, the method raises exceptions that you will have to work through. For example, look at what happens if we try to force a different variable into a date/time datatype:
m_health['tech_company'] = pd.to_datetime(m_health['tech_company'])

m_health.head()
m_health.info()
ParserError: Unknown string format: Yes
The error lets us know that this datatype conversion will not work as coded.
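If you would rather not have an exception halt your script, to_datetime() can be told to soften the failure; a sketch using its errors= parameter (here every unparseable “Yes”/“No” simply becomes NaT, pandas’ missing-datetime marker):

# errors='coerce' turns unparseable values into NaT instead of raising
pd.to_datetime(m_health['tech_company'], errors='coerce')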
Recoding values
Value recoding is a data cleaning task that puts column values into more manageable categories - a common task that most data-analyst-type folks will be familiar with. Pandas alone has many functions that solve recoding issues in different ways. We will just highlight a few in this tutorial.
With this dataset we’ve fetched, there are a few columns we would like to change so that it will be easier to analyze the data in the way we want. The value_counts()
method shows the range of responses and their frequency in the dataframe.
m_health['gender'].value_counts()
Male 12
male 6
M 2
Female 1
m 1
F 1
Cis Male 1
Name: gender, dtype: int64
With that knowledge, we can utilize the replace()
function and pass in custom lists with all of the responses that should be replaced by a single specified value.
m_health['gender'].replace(['Male ', 'male', 'M', 'm', 'Male', 'Cis Male'],
                           'Male', inplace=True)
m_health['gender'].replace(['Female ', 'female', 'F'],
                           'Female', inplace=True)
m_health['gender'].value_counts()
Male 22
Female 2
Name: gender, dtype: int64
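An equivalent approach, which some find more readable, is to pass replace() a single dictionary mapping old values to new ones; a sketch covering the same recodes:

# one dictionary, one call: each key is replaced by its value
m_health['gender'] = m_health['gender'].replace(
    {'Male ': 'Male', 'male': 'Male', 'M': 'Male', 'm': 'Male', 'Cis Male': 'Male',
     'Female ': 'Female', 'female': 'Female', 'F': 'Female'})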
Another task is to put categorical objects into a form that is more amenable to computational tasks. Let’s take a look at a categorical variable that consists of responses of “Yes” and “No”.
m_health['self_employed'].value_counts()
No 23
Yes 1
Name: self_employed, dtype: int64
If we want to leverage this data purely as a segment to split other metrics with, we would probably leave it as is. But perhaps we would like to treat it as a dummy variable so we can do more sophisticated math with it.
For our self-employed variable, we begin by implementing replace()
to transform the existing content to numbers as shown below:
m_health['self_employed'].replace(['Yes'], '1', inplace=True)
m_health['self_employed'].replace(['No'], '0', inplace=True)
m_health['self_employed'].value_counts()
0 23
1 1
Name: self_employed, dtype: int64
The pandas astype()
function is able to interpret the 1s and 0s as integers and so we can transform the series into an int
datatype.
m_health['self_employed'] = m_health['self_employed'] \
    .astype(dtype='int')
m_health.head()
|   | timestamp | age | gender | country | self_employed | family_history | treatment | work_interfere | remote_work | tech_company | benefits | leave | mental_health_consequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-08-27 11:35:00 | 46 | Male | United States | 0 | No | Yes | Often | Yes | Yes | Yes | Don't know | Maybe |
1 | 2014-08-27 11:36:00 | 41 | Male | United States | 0 | No | Yes | Never | No | No | Don't know | Don't know | Maybe |
2 | 2014-08-27 11:36:00 | 33 | Male | United States | 0 | Yes | Yes | Rarely | No | Yes | Yes | Don't know | No |
3 | 2014-08-27 11:37:00 | 35 | Male | United States | 0 | Yes | Yes | Sometimes | No | No | Yes | Very easy | Yes |
4 | 2014-08-27 11:42:00 | 35 | Male | United States | 0 | No | Yes | Rarely | Yes | Yes | Yes | Very easy | No |
Now we see that variable as a series of int-typed 0s and 1s.
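As a closing aside, the two-step replace-then-cast above can be collapsed into a single boolean comparison; a sketch of that shortcut (it assumes the column still holds the original “Yes”/“No” strings):

# compare to 'Yes' (True/False), then cast the booleans straight to 1s and 0s
m_health['self_employed'] = (m_health['self_employed'] == 'Yes').astype(int)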