# import pandas
import pandas as pd
5 Cleaning Data
5.1 Cleaning and data science
Imagine you have finally gotten the dataset you need to answer the questions you’ve been asked to clear up. But wait! Before you start generating insights in earnest, it is important that you take a closer look at the “quality” of the data.
There is much to look at and consider when it comes to the topic of data cleaning. You probably have heard it said that 80% (or some high percentage) of a data scientist’s time is spent cleaning data, that is, putting it in a form that will better suit its downstream uses.
Not surprisingly, data cleaning is a topic difficult to cover systematically in an introductory course.
It’s not so bad, promise!
So in this tutorial we will focus on a few cleaning techniques - using pandas - that you are likely to leverage over and over again when you work with data in Python:
- How to identify and clean up missing data
- When and how to change datatypes
- When and how to modify values in your dataset
5.2 Dealing with Missing Data
One of the most common issues in data cleaning has to do with data that is missing from your dataset. Most data, even data that is meticulously collected and managed, is likely to have empty cells. Empty data can be harmless, or it can represent a challenge to you, depending on what you need your data to do once you have finished working with it.
The best way to learn about these concepts is to see them in action with real (or real-ish) data! Before we get started with fetching the data, make sure pandas is imported, as we did at the top of this tutorial.
Now, imagine we had our own online streaming service. We have access to some data that will help us predict what our users would like to see. To fetch it, run the code below:
# import data
mr = "https://raw.githubusercontent.com/bcgov/" \
     "ds-intro-to-python/main/data/movieratings.csv"
movie_ratings = pd.read_csv(mr)
movie_ratings.head()
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | NaN | 5.0 | 5.0 | 3.0 | NaN |
1 | Raymundo | 4.0 | NaN | NaN | NaN | 5.0 |
2 | Jung | 5.0 | NaN | NaN | 5.0 | NaN |
3 | Kumar | 5.0 | NaN | 4.0 | NaN | 4.0 |
4 | Maria | 5.0 | 4.0 | 5.0 | NaN | NaN |
This is a 10 row dataframe, with each row containing a movie rater and the ratings they have given to selected movies. For interpretive purposes, each movie is rated on a 5 point scale, with 5 being super awesome and 1 being dismal. Not every movie is rated by every rater, so several cells are blank. These are the missing data cells and are marked with “NaN”, indicating missing data.
This dataset, while small and made up, exemplifies a classic “recommendation engine” use case problem - there are lots of raters and lots of movies, but an awful lot of missing data. So what one chooses to do with missing data is a key cleaning question.
Locating Missing Data
In pandas, there is a widely used method named isna()
which can be applied to a whole dataframe or just to a particular series or column.
Below we apply it to the entire dataframe to get a cell by cell view of whether the cell data is missing or not (True means it is missing, False means it is not):
#find the missing values
movie_ratings.isna()
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | False | True | False | False | False | True |
1 | False | False | True | True | True | False |
2 | False | False | True | True | False | True |
3 | False | False | True | False | True | False |
4 | False | False | False | False | True | True |
5 | False | False | False | False | False | False |
6 | False | True | True | False | False | False |
7 | False | False | False | False | False | False |
8 | False | False | False | False | False | False |
9 | False | False | True | True | True | True |
It is probably helpful to note there is a similar function called isnull()
that is an alias for isna()
. You may run into both, and both will return the same results. There is also a notna()
(notnull()
) function, which is the inverse function of isna()
(isnull()
), returning True for values that are not NA. For simplicity, let’s just stick with isna()
in this tutorial.
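These checks also work on a single column, which is handy when you only care about one variable. A minimal sketch, using the movie_ratings dataframe from above:

# check one column for missing values (True means missing)
movie_ratings['Star Wars'].isna()

# notna() gives the inverse view of the same column
movie_ratings['Star Wars'].notna()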
To get a better summary overview (i.e. one that summarizes with numbers) of how many missing values there are per column, we need to chain together a .sum()
function to the line of code like so:
# listing out the missing values in the dataframe for each series
movie_ratings.isna().sum()
Rater 0
Star Wars 2
Finding Nemo 5
Forrest Gump 3
Parasite 4
Citizen Kane 4
dtype: int64
The reason that this works is that a boolean value of True is represented in pandas by the value 1, while False is represented by 0. Therefore, we can apply the sum()
method to the dataframe, and a series is returned that contains the count of missing items in each column.
Now we see no missing data for Rater, but missing data for each of the movies. For our use case, missing data is a real challenge. With pandas, we can deal with this in many different ways.
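The same trick works in the other direction, too. If we wanted to know how many ratings each rater is missing, we could sum across the columns instead; a quick sketch on the same dataframe:

# count missing values per row (i.e. per rater) rather than per column
movie_ratings.isna().sum(axis=1)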
Dropping rows and columns
One cleaning option we have is to drop entire rows or columns with missing values in them from the dataframe. The method is called dropna()
and takes a number of parameters:
- axis: {0 or ‘index’, 1 or ‘columns’}, default 0
- how: {‘any’, ‘all’}, default ‘any’
- subset: column label or sequence of labels
Imagine we would like to drop all of the rows with any missing values in them. Since the axis=0
and how='any'
parameters are the defaults, we can write an elegant and simple line of code:
reduced_rows = movie_ratings.dropna()
reduced_rows.head(11)
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
5 | Arthur | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 |
7 | Martina | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
8 | Orson | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 |
We still have three raters who have given us ratings for each movie. The dataset is nice and clean now, although it has come at a pretty big cost, losing seven rows of data! If we go in the other direction, and insert a parameter of how='all'
, then we have the opposite problem, with no rows being eliminated (as there were no instances where all of the columns contained missing data for a given row)!
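If you want to verify that for yourself, the call is a small variation on the one above; a quick sketch (all 10 rows should survive, since no rater left every rating blank):

# how='all' drops only rows where every column is missing
movie_ratings.dropna(how='all').head(11)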
Another option is to use subset to eliminate rows on the basis of whether values are missing in a specified subset of columns. Let’s say we want just the rows of data where the ratings for the film “Parasite” are not missing. We would use the subset=
parameter and identify the column we want to clean up:
par_df = movie_ratings.dropna(subset='Parasite')
par_df.head(11)
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | NaN | 5.0 | 5.0 | 3.0 | NaN |
2 | Jung | 5.0 | NaN | NaN | 5.0 | NaN |
5 | Arthur | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 |
6 | Marcellus | NaN | NaN | 4.0 | 5.0 | 4.0 |
7 | Martina | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
8 | Orson | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 |
We can see that, while there are no NaN values left in the Parasite column, they are still sprinkled throughout the other columns.
Now what would happen if we applied that same drop logic to all columns?
reduced_cols = movie_ratings.dropna(axis=1)
reduced_cols.head()
|   | Rater |
|---|---|
0 | Floriana |
1 | Raymundo |
2 | Jung |
3 | Kumar |
4 | Maria |
We see that all of the movie columns have been removed, as each of these columns had at least one missing value! So be careful when using dropna()
, as it can be a powerful eraser of data. Whether and how you should use it will depend on the downstream uses for your data.
Saving the dataframe “inplace”
inplace
is an argument used in different functions, such as the dropna()
we have just looked at. The default value of this parameter is False
, which means that whatever the function does to the dataframe, it does to a copy of the object. When the argument is inplace=True
, however, whatever changes are made by the given function are made to the original object in place, not just to a copy of it!
Let’s look at how this works starting with fetching the data from the source:
# Fetching the data to make sure we start again with the original
mr = "https://raw.githubusercontent.com/bcgov/" \
     "ds-intro-to-python/main/data/movieratings.csv"
movie_ratings = pd.read_csv(mr)
movie_ratings.head()
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | NaN | 5.0 | 5.0 | 3.0 | NaN |
1 | Raymundo | 4.0 | NaN | NaN | NaN | 5.0 |
2 | Jung | 5.0 | NaN | NaN | 5.0 | NaN |
3 | Kumar | 5.0 | NaN | 4.0 | NaN | 4.0 |
4 | Maria | 5.0 | 4.0 | 5.0 | NaN | NaN |
Now let’s “modify” the frame with inplace=False
with the code below and see what happens:
# Making change to dataframe with inplace=False
movie_ratings.dropna(how='any', inplace=False)

# Looking at the dataframe
movie_ratings.head()
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | NaN | 5.0 | 5.0 | 3.0 | NaN |
1 | Raymundo | 4.0 | NaN | NaN | NaN | 5.0 |
2 | Jung | 5.0 | NaN | NaN | 5.0 | NaN |
3 | Kumar | 5.0 | NaN | 4.0 | NaN | 4.0 |
4 | Maria | 5.0 | 4.0 | 5.0 | NaN | NaN |
Ok, it kind of looks like “nothing happened”! And in a way, with the code we ran, nothing did happen, as the result was not assigned to a new object. Soooo, what’s the point of that? Any guesses?
When we run exactly the same code except with the parameter inplace
set to True
, the dropped data is actually gone!
# Making change to dataframe with inplace=True
movie_ratings.dropna(how='any', inplace=True)

# Looking at the dataframe
movie_ratings.head()
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
5 | Arthur | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 |
7 | Martina | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
8 | Orson | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 |
So now if we want to work with the “movie_ratings” object with the rows we just dropped still in it, we would have to go back and rerun the read_csv()
function and fetch the original data again! It is therefore important when writing your cleaning code that you use this parameter carefully.
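A safer pattern, and one worth getting in the habit of, is to skip inplace altogether and assign the result instead, either to a new name or back over the original; a quick sketch:

# assignment achieves the same end while leaving the original intact
movie_ratings_clean = movie_ratings.dropna(how='any')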
Filling in missing data
Sometimes it is the right cleaning decision to fill in missing data with some values.
Take the example of our movie ratings dataframe. The basic way to fill in missing values would be to specify a value to take the place of each missing value. One approach is to calculate and plug in the mean value - which we know to be 3.85 for the entire dataframe:
# Fetching the data, yet again, to make sure we start again with the original
mr = "https://raw.githubusercontent.com/bcgov/" \
     "ds-intro-to-python/main/data/movieratings.csv"
movie_ratings = pd.read_csv(mr)

# using fillna and passing in a constant
filled_385 = movie_ratings.fillna(3.85)
filled_385.head(10)
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | 3.85 | 5.00 | 5.00 | 3.00 | 3.85 |
1 | Raymundo | 4.00 | 3.85 | 3.85 | 3.85 | 5.00 |
2 | Jung | 5.00 | 3.85 | 3.85 | 5.00 | 3.85 |
3 | Kumar | 5.00 | 3.85 | 4.00 | 3.85 | 4.00 |
4 | Maria | 5.00 | 4.00 | 5.00 | 3.85 | 3.85 |
5 | Arthur | 2.00 | 2.00 | 3.00 | 3.00 | 3.00 |
6 | Marcellus | 3.85 | 3.85 | 4.00 | 5.00 | 4.00 |
7 | Martina | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 |
8 | Orson | 1.00 | 1.00 | 1.00 | 2.00 | 5.00 |
9 | Luke | 5.00 | 3.85 | 3.85 | 3.85 | 3.85 |
Ok, that’s good. Good, but not great for our use case! Maybe we could avail ourselves of some options within the method=
parameter of the fillna()
function. With a specification of bfill
(“backward fill”) we can fill a NaN value with the next valid value below it in that column, while ffill
(“forward fill”) fills with the last valid value above it. (In newer versions of pandas, the method= parameter is deprecated in favour of the equivalent .bfill() and .ffill() methods.)
What would that look like for our dataset?
bfilled_mr = movie_ratings.fillna(method='bfill')
bfilled_mr.head(10)
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | 4.0 | 5.0 | 5.0 | 3.0 | 5.0 |
1 | Raymundo | 4.0 | 4.0 | 4.0 | 5.0 | 5.0 |
2 | Jung | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 |
3 | Kumar | 5.0 | 4.0 | 4.0 | 3.0 | 4.0 |
4 | Maria | 5.0 | 4.0 | 5.0 | 3.0 | 3.0 |
5 | Arthur | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 |
6 | Marcellus | 5.0 | 5.0 | 4.0 | 5.0 | 4.0 |
7 | Martina | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
8 | Orson | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 |
9 | Luke | 5.0 | NaN | NaN | NaN | NaN |
Notice that the bottom row stays NaN when we use bfill. That’s because there is no value below it to backfill from! Using forward fill creates the same issue for the first row. For our use case, admittedly, bfill
and ffill
are probably sub-optimal solutions.
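If the direction of the two fills is hard to keep straight, a tiny throwaway series can make it concrete; a minimal sketch:

# ffill copies the last valid value downward; bfill pulls the next valid value upward
s = pd.Series([1.0, None, None, 4.0])
s.ffill()  # 1.0, 1.0, 1.0, 4.0
s.bfill()  # 1.0, 4.0, 4.0, 4.0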
Since we have some ratings data already for each movie, one approach would be to fill in the missing data with the mean rating for that movie from the valid data.
Let’s try that approach on a single column. For the movie Parasite, we fill in the missing values for that column with the average value:
movie_ratings['Parasite'] = movie_ratings['Parasite'] \
    .fillna(movie_ratings['Parasite'].mean())
movie_ratings.head(10)
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | NaN | 5.0 | 5.0 | 3.000000 | NaN |
1 | Raymundo | 4.0 | NaN | NaN | 3.833333 | 5.0 |
2 | Jung | 5.0 | NaN | NaN | 5.000000 | NaN |
3 | Kumar | 5.0 | NaN | 4.0 | 3.833333 | 4.0 |
4 | Maria | 5.0 | 4.0 | 5.0 | 3.833333 | NaN |
5 | Arthur | 2.0 | 2.0 | 3.0 | 3.000000 | 3.0 |
6 | Marcellus | NaN | NaN | 4.0 | 5.000000 | 4.0 |
7 | Martina | 5.0 | 5.0 | 5.0 | 5.000000 | 5.0 |
8 | Orson | 1.0 | 1.0 | 1.0 | 2.000000 | 5.0 |
9 | Luke | 5.0 | NaN | NaN | 3.833333 | NaN |
And we can extend that for all numeric columns at once!
movie_ratings = movie_ratings.fillna(movie_ratings.mean(numeric_only=True))
movie_ratings.head(10)
|   | Rater | Star Wars | Finding Nemo | Forrest Gump | Parasite | Citizen Kane |
|---|---|---|---|---|---|---|
0 | Floriana | 4.0 | 5.0 | 5.000000 | 3.000000 | 4.333333 |
1 | Raymundo | 4.0 | 3.4 | 3.857143 | 3.833333 | 5.000000 |
2 | Jung | 5.0 | 3.4 | 3.857143 | 5.000000 | 4.333333 |
3 | Kumar | 5.0 | 3.4 | 4.000000 | 3.833333 | 4.000000 |
4 | Maria | 5.0 | 4.0 | 5.000000 | 3.833333 | 4.333333 |
5 | Arthur | 2.0 | 2.0 | 3.000000 | 3.000000 | 3.000000 |
6 | Marcellus | 4.0 | 3.4 | 4.000000 | 5.000000 | 4.000000 |
7 | Martina | 5.0 | 5.0 | 5.000000 | 5.000000 | 5.000000 |
8 | Orson | 1.0 | 1.0 | 1.000000 | 2.000000 | 5.000000 |
9 | Luke | 5.0 | 3.4 | 3.857143 | 3.833333 | 4.333333 |
While far from perfect, now we have some filled in data that has some reasonable chance of adding value.
Similar to fillna()
is a method called interpolate()
, which by default fills each missing value by linearly interpolating between the valid values on either side of it. To do this, we call interpolate()
in similar fashion to how we used fillna()
above.
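To see interpolate() in action you would first need to re-fetch the original data, since we have just filled in all the gaps; a minimal sketch on a single column:

# re-fetch the original data, then interpolate one ratings column
movie_ratings = pd.read_csv(mr)
movie_ratings['Finding Nemo'].interpolate()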
Dealing with missing data is one common cleaning task; changing data and header content is another. Let’s take a look at that next.
5.3 Modifying existing data
Now let’s grab a different dataset, this one a heavily redacted snippet of the 2014 Mental Health in Tech Survey https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey.
= "https://raw.githubusercontent.com/bcgov/" \
techhealth "ds-intro-to-python/main/data/techhealth.csv"
= pd.read_csv(techhealth)
m_health m_health.head()
|   | Timestamp | Age | Gender | Country | self employed | family history | treatment | work interfere | remote work | tech company | benefits | leave | mental health consequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27/08/2014 11:35 | 46 | Male | United States | No | No | Yes | Often | Yes | Yes | Yes | Don't know | Maybe |
1 | 27/08/2014 11:36 | 41 | Male | United States | No | No | Yes | Never | No | No | Don't know | Don't know | Maybe |
2 | 27/08/2014 11:36 | 33 | male | United States | No | Yes | Yes | Rarely | No | Yes | Yes | Don't know | No |
3 | 27/08/2014 11:37 | 35 | male | United States | No | Yes | Yes | Sometimes | No | No | Yes | Very easy | Yes |
4 | 27/08/2014 11:42 | 35 | M | United States | No | No | Yes | Rarely | Yes | Yes | Yes | Very easy | No |
This abridged version of the dataset contains 24 cases and 13 variables:
- Timestamp
- Age
- Gender
- Country
- Self Employed: Are you self-employed?
- Family History: Do you have a family history of mental illness?
- Treatment: Have you sought treatment for a mental health condition?
- Work Interfere: If you have a mental health condition, do you feel that it interferes with your work?
- Remote Work: Do you work remotely (outside of an office) at least 50% of the time?
- Tech Company: Is your employer primarily a tech company/organization?
- Benefits: Does your employer provide mental health benefits?
- Leave: How easy is it for you to take medical leave for a mental health condition?
- Mental Health Consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?
Cleaning the header row
A good practice, both for ensuring a consistent convention for column names and for the ability to write more efficient code, is to ensure that column header names have no spaces in them and are all lower case. To do this with our current dataset, we use the str.replace()
method on the header row, accessed in pandas via .columns:
m_health.columns = m_health.columns.str.replace(' ', '_')
m_health.columns = m_health.columns.str.lower()
m_health.head(10)
m_health.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 timestamp 24 non-null object
1 age 24 non-null int64
2 gender 24 non-null object
3 country 24 non-null object
4 self_employed 24 non-null object
5 family_history 24 non-null object
6 treatment 24 non-null object
7 work_interfere 24 non-null object
8 remote_work 24 non-null object
9 tech_company 24 non-null object
10 benefits 24 non-null object
11 leave 24 non-null object
12 mental_health_consequence 24 non-null object
dtypes: int64(1), object(12)
memory usage: 2.6+ KB
The other thing to notice is that there are no missing values! So we can just concentrate on other cleaning tasks.
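As an aside, if you only need to fix one or two column names rather than transform the whole header row, the rename() method is a handy alternative; a minimal sketch (the new name here is purely for illustration):

# rename() returns a modified copy by default, leaving m_health itself unchanged
m_health.rename(columns={'work_interfere': 'work_interference'}).head()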
Ensuring the right datatypes
A good practice is to notice whether there are series in the dataframe that are improperly datatyped (that is, where the “Dtype” indicated does not correspond to the datatype that it should be).
Taking a look at the data, we notice that “timestamp” is datatyped as an “object”, which in this context means a “string”. However, that is sub-optimal because the content is time-oriented, and by coding it as an object/string, we limit a lot of its value. If we give it a time-oriented datatype instead, that can give us advantages downstream in being able to work with the data more elegantly.
Pandas has functions specifically designed to cast datatypes from one to another, as was covered in the previous section with the astype()
function.
To cast a variable as a datetime datatype, we can use the pandas to_datetime()
function and pass in the pandas series you want to modify:
m_health['timestamp'] = pd.to_datetime(m_health['timestamp'])

m_health.head()
m_health.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 timestamp 24 non-null datetime64[ns]
1 age 24 non-null int64
2 gender 24 non-null object
3 country 24 non-null object
4 self_employed 24 non-null object
5 family_history 24 non-null object
6 treatment 24 non-null object
7 work_interfere 24 non-null object
8 remote_work 24 non-null object
9 tech_company 24 non-null object
10 benefits 24 non-null object
11 leave 24 non-null object
12 mental_health_consequence 24 non-null object
dtypes: datetime64[ns](1), int64(1), object(11)
memory usage: 2.6+ KB
Et voilà - timestamp is now of the datetime64 datatype! If we wanted to leverage timestamp’s date time qualities, we could now do so more easily. It should be added that this example works well because the existing data is nicely formatted to be understood as date/time data. In the real world, if the timestamp column holds mixed date formats, or isn’t actually a date column at all, the method raises exceptions that you will have to work through. For example, look at what happens if we try to force a different variable into a date/time datatype:
m_health['tech_company'] = pd.to_datetime(m_health['tech_company'])

m_health.head()
m_health.info()
ParserError: Unknown string format: Yes
The error lets us know that this datatype conversion will not work as coded.
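If you would rather not have an exception halt your script, to_datetime() can be told to soften the failure; a sketch using its errors= parameter (here every unparseable “Yes”/“No” simply becomes NaT, pandas’ missing-datetime marker):

# errors='coerce' turns unparseable values into NaT instead of raising
pd.to_datetime(m_health['tech_company'], errors='coerce')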
Recoding values
Value recoding is a data cleaning task that puts column values into more manageable categories - a common task that most data-analyst-type folks will be familiar with. Pandas alone has many functions that solve recoding issues in different ways. We will just highlight a few in this tutorial.
With this dataset we’ve fetched, there are a few columns we would like to change so that it will be easier to analyze the data in the way we want. The value_counts()
method shows the range of responses and their frequency in the dataframe.
m_health['gender'].value_counts()
Male 12
male 6
M 2
Female 1
m 1
F 1
Cis Male 1
Name: gender, dtype: int64
With that knowledge, we can utilize the replace()
function and pass in custom lists with all of the responses that should be replaced by a single specified value.
m_health['gender'].replace(['Male ', 'male', 'M', 'm', 'Male', 'Cis Male'],
                           'Male', inplace=True)
m_health['gender'].replace(['Female ', 'female', 'F'],
                           'Female', inplace=True)
m_health['gender'].value_counts()
Male 22
Female 2
Name: gender, dtype: int64
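An equivalent approach, which some find more readable, is to pass replace() a single dictionary mapping old values to new ones; a sketch covering the same recodes:

# one dictionary, one call: each key is replaced by its value
m_health['gender'] = m_health['gender'].replace(
    {'Male ': 'Male', 'male': 'Male', 'M': 'Male', 'm': 'Male', 'Cis Male': 'Male',
     'Female ': 'Female', 'female': 'Female', 'F': 'Female'})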
Another task is to put categorical objects into a form that is more amenable to computational tasks. Let’s take a look at a categorical variable that consists of responses of “Yes” and “No”.
m_health['self_employed'].value_counts()
No 23
Yes 1
Name: self_employed, dtype: int64
If we want to leverage this data purely as a segment to split other metrics with, we would probably leave it as is. But perhaps we would like to treat it as a dummy variable so we can do more sophisticated math with it.
For our self-employed variable, we begin by implementing replace()
to transform the existing content to numbers as shown below:
m_health['self_employed'].replace(['Yes'], '1', inplace=True)
m_health['self_employed'].replace(['No'], '0', inplace=True)
m_health['self_employed'].value_counts()
0 23
1 1
Name: self_employed, dtype: int64
The pandas astype()
function is able to interpret the 1s and 0s as integers and so we can transform the series into an int
datatype.
m_health['self_employed'] = m_health['self_employed'] \
    .astype(dtype='int')
m_health.head()
|   | timestamp | age | gender | country | self_employed | family_history | treatment | work_interfere | remote_work | tech_company | benefits | leave | mental_health_consequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-08-27 11:35:00 | 46 | Male | United States | 0 | No | Yes | Often | Yes | Yes | Yes | Don't know | Maybe |
1 | 2014-08-27 11:36:00 | 41 | Male | United States | 0 | No | Yes | Never | No | No | Don't know | Don't know | Maybe |
2 | 2014-08-27 11:36:00 | 33 | Male | United States | 0 | Yes | Yes | Rarely | No | Yes | Yes | Don't know | No |
3 | 2014-08-27 11:37:00 | 35 | Male | United States | 0 | Yes | Yes | Sometimes | No | No | Yes | Very easy | Yes |
4 | 2014-08-27 11:42:00 | 35 | Male | United States | 0 | No | Yes | Rarely | Yes | Yes | Yes | Very easy | No |
Now we see that variable as a series of int-typed 0s and 1s.
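As a closing aside, the two-step replace-then-cast above can be collapsed into a single boolean comparison; a sketch of that shortcut (it assumes the column still holds the original “Yes”/“No” strings):

# compare to 'Yes' (True/False), then cast the booleans straight to 1s and 0s
m_health['self_employed'] = (m_health['self_employed'] == 'Yes').astype(int)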