4  Getting Data With Pandas

Pandas!

4.1 Introduction to Pandas

Pandas is one of the most widely-used Python libraries for statistics, and excels at processing tabular data. If you have ever had any experience with R, it is modeled extensively on the dataframes used there. In this section we aim to learn about the fundamental objects used in pandas, bring our own data into these objects, and then view basic information about it.

Did You Know?

Did you know that the pandas library package is aptly named? It is a portmanteau of the words panel and data. We are literally viewing panels of data!

First, we need to make sure that we import the library into Python so we can use it:

import pandas as pd

Objects in pandas are typically two-dimensional tables called dataframes. You can think of this as being similar to a single spreadsheet in excel. Each column (called series) in the dataframe can have a name, as well as an indexing value.

A pandas dataframe

We can create both single series in pandas or full dataframes. Let’s consider a single series first. We can create this using the pandas Series method, as well as a list of elements we wish to include in our series.

series_x = pd.Series([1, 2, 3])
series_x
0    1
1    2
2    3
dtype: int64

When this displays in python, we see 2 columns. The left column is always the index, while the right column contains the data values. The series is also given a dtype. This is similar to the datatypes we considered previously, and tells us what type of data is held inside the series. When a series is first created, pandas will try to guess the dtype for the series based on the individual elements inside it. We can make a series made of float numbers:

series_y = pd.Series([1, 2.0, 3.14])
series_y
0    1.00
1    2.00
2    3.14
dtype: float64

Notice that here, the first element was actually converted from an int to a float by the program! This will often happen when working with int and float type objects.

Next, a series of str elements will default to the dtype of object.

series_z = pd.Series(['settlers', 'of', 'catan'])
series_z
0    settlers
1          of
2       catan
dtype: object

The dtype: object is pandas ‘catchall’ Series type, and we want to be cautious when we see this! It is also the Series dtype used when we have mixed data:

series_mixed = pd.Series([1, 1.0, 'one'])
series_mixed
0      1
1    1.0
2    one
dtype: object

When pandas sees data that cannot be obviously converted into a single datatype, it leaves each individual entry alone, as whatever its original datatype happened to be. Be careful when you see this! Many operations on pandas dataframes apply to the entire column (similar to how a new excel column is often created with a function that applies to a previous column). If this operation is built to only work on a single datatype, we might run into errors! To avoid this, we can utilize the .astype() methods. Possible arguments for astype() include the standard datatypes:

  • int
  • float
  • str
  • bool

However if you need to be more specific, you can be as well. These arguments are all required to be included in quotations, as they refer to aliases for pandas specific datatypes:

  • 'Int8', 'Int32', 'Int64' provide nullable integer types (note the capital I! more on nullable typing below)
  • 'string' provides access to a series that will be string specific, instead of the general objects type. Pandas recommends using this type for strings whenever possible.
display(series_y)
display(series_y.astype(int))
display(series_y.astype(bool))
display(series_y.astype(str))
display(series_y.astype('string'))
0    1.00
1    2.00
2    3.14
dtype: float64
0    1
1    2
2    3
dtype: int64
0    True
1    True
2    True
dtype: bool
0     1.0
1     2.0
2    3.14
dtype: object
0     1.0
1     2.0
2    3.14
dtype: string

You will have noticed that each of these series has an index associated with it. We can access series indicies using the index attribute:

series_a = pd.Series([2, 4, 6])
display(series_a)
display(series_a.index)
0    2
1    4
2    6
dtype: int64
RangeIndex(start=0, stop=3, step=1)

When an index is first assigned to a series, it is automatically assigned as an integer index, with similar properties to a list index (starts at 0, can be sliced, etc.). However we can change this index to be whatever we want by directly modifying the index attribute:

series_a.index = ['a', 'b', 'c']
display(series_a.index)
display(series_a)
Index(['a', 'b', 'c'], dtype='object')
a    2
b    4
c    6
dtype: int64

A useful feature in pandas is the ability to reindex the dataset. Reindexing a dataset will do two things:

  1. Reorder the data according to the order we ask for.
  2. Add new rows for indicies that are missing in the original dataset but are included in our new index.

One popular use for this method is in filling out a timeseries dataset: if there were missing years in a dataset but we do not wish to simply skip over them, we can add the extra years to the index.

series_a = series_a.reindex(['d', 'c', 'b', 'a'])
display(series_a.index)
display(series_a)
Index(['d', 'c', 'b', 'a'], dtype='object')
d    NaN
c    6.0
b    4.0
a    2.0
dtype: float64
Beware NaN values!

Sometimes in our datasets, we want to allow a row to contain missing, or null values. The default null value for pandas is NaN. In python, NaN is considered a float value. In the above example, we introduced a missing value for the new ‘d’ index, which defaults to the NaN float value. Because its type is float, this converted our entire series to float as well. If we wish to keep the series as an int, we can coerce it back using the astype() method with one of the nullable integer types. This will introduce a slightly different looking null value that works for integers!

series_a.astype('Int64')
d    <NA>
c       6
b       4
a       2
dtype: Int64

We commonly want to make new series out of old ones inside our dataframes. Operations are typically done on an element by element basis. We will see many examples of these in future sessions as we learn to manipulate dataframes, but here is a short example of what we may wish to do:

series_a = pd.Series([2, 4, 6])
series_b = pd.Series([1, 2, 3])

display(series_a + series_b)
display(series_a > 3)
display(series_a*5)
0    3
1    6
2    9
dtype: int64
0    False
1     True
2     True
dtype: bool
0    10
1    20
2    30
dtype: int64

4.2 Bringing in our own data

Enough about series. Let’s talk dataframes! This is the main tool in our pandas toolkit. As we showed earlier, it is simply a collection of series all stacked together like an excel spreadsheet. There are many different ways to create a dataframe including:

  • Within python itself
  • From a csv, excel file, or some other local tabular format
  • From more exotic data sources, such as a parquet file, json, or a website
  • From a SQL database

For the purposes of this course, we are going to focus on opening up local datafiles (with the most common type being a csv or excel file), and then utilizing the data once it is in python. To bring in data from a csv or excel file, we utilize the pandas methods read_csv() or read_excel(), with the only required argument being the path to the datafile.

But first we need some data! Navigate to this URL, right click on the data, and save the csv as gapfinder.csv in a folder called data inside our project folder. Now that we have a dataset, let’s load it into pandas.

The Data Folder

While everyone may organize their projects and folders slightly differently, there are some general principles to adhere to that make project management easier. Arguably the most important of these is to treat input data as read only. If you have an excel spreadsheet, it is tempting to go make changes to the data directly in the spreadsheet: I’ll just tweak a single value here, or add a column there. However, once we start doing this we lose the concept of reproducibility! How did we get certain values if the changes are all hidden in the excel spreadsheet? By keeping data as read only and making all of our changes in the python scripts, the processing is not only reproducible but also transparent to others.

Because of this, it is common to have a folder exclusively for raw data. Secondary folders may then be set-up for ‘cleaned’ data, python scripts, analysis outputs and more.

df = pd.read_csv('../data/gapfinder.csv')
df
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710
3 Afghanistan 1967 11537966.0 Asia 34.020 836.197138
4 Afghanistan 1972 13079460.0 Asia 36.088 739.981106
... ... ... ... ... ... ...
1699 Zimbabwe 1987 9216418.0 Africa 62.351 706.157306
1700 Zimbabwe 1992 10704340.0 Africa 60.377 693.420786
1701 Zimbabwe 1997 11404948.0 Africa 46.809 792.449960
1702 Zimbabwe 2002 11926563.0 Africa 39.989 672.038623
1703 Zimbabwe 2007 12311143.0 Africa 43.487 469.709298

1704 rows × 6 columns

.. ?

The .. used in the above path tells python to look in the directory directly above the one you are currently in - the one where your notebook is saved.

As an example, my directory structure looks like:

/intro_to_python
│
└───/src
│   │   hello_world.ipynb
│   │   hello_world.py
│   │   intro_to_pandas.ipynb  <------ (THIS FILE!)
│
└───/data
│   │   gapfinder.csv <--------------- (The file we want access to!)

This file is in a different subfolder relative to the csv, so we first ‘back out’ of this folder using .., and then ‘enter’ the data folder using a regular file path. This is called relative pathing and can be useful for accessing data within a single project that will always be in the same spot!

This dataset has 6 columns, 1704 rows, and a mixture of different datatypes. Just like we were able to access the index of a series, we can do the same with a dataframe. Now, we can also access (and change if need be!) the column names as well:

display(df.index)
display(df.columns)

df.columns = ['country', 'year', 'pop', 'continent', 'life expectancy', 'gdpPercap']
display(df.columns)
RangeIndex(start=0, stop=1704, step=1)
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')
Index(['country', 'year', 'pop', 'continent', 'life expectancy', 'gdpPercap'], dtype='object')

We can access each of these series individually if we want. There are two ways to access a series in a dataframe - either with square bracket indexing, or treating the column name as an attribute:

display(df['country'])
display(df.country)
0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object
0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object
No Spaces!

As a general rule of thumb, we never want to include special characters such as spaces, periods, hyphens and so on in our column names, as this will alter pandas capability of calling each of the columns as an attribute. Above, you will notice we reset one of the columns to have a space in the name. If we try to access this series as an attribute now, it will fail:

df['life expectancy']
0       28.801
1       30.332
2       31.997
3       34.020
4       36.088
         ...  
1699    62.351
1700    60.377
1701    46.809
1702    39.989
1703    43.487
Name: life expectancy, Length: 1704, dtype: float64
df.life expectancy
  Cell In[16], line 1
    df.life expectancy
            ^
SyntaxError: invalid syntax

Try to stick to the same naming conventions for columns as for your python variables: lowercase letters, numbers (but not at the start of the name) and underscores only! (And as a matter of fact, let us change it back now):

df.columns = ['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap']

When reading data into a dataframe from a csv (or an excel file), there are multiple optional arguments we can use to start the process of data wrangling, which is writing code to shape the data into the format we want it for our analysis. Some important options include:

  • header: row number to use as the column names. This allows us to skip past rows in the dataset and start from lower down if need be.
  • index_col: name of the column we might wish to use for the index of the dataframe instead of the default integer list.
  • usecols: list of columns we wish to use. If the dataset is large with many columns that we do not care about, we can pull in only those of interest!

Challenge 1

Challenge 1

Bring the dataset into pandas again. This time:

  • use the country as the index
  • only include the year, continent and population columns.
df_challenge = pd.read_csv(
    '../data/gapfinder.csv',
    index_col = 'country',
    usecols = ['country', 'continent', 'year', 'pop']
)
display(df_challenge)
year pop continent
country
Afghanistan 1952 8425333.0 Asia
Afghanistan 1957 9240934.0 Asia
Afghanistan 1962 10267083.0 Asia
Afghanistan 1967 11537966.0 Asia
Afghanistan 1972 13079460.0 Asia
... ... ... ...
Zimbabwe 1987 9216418.0 Africa
Zimbabwe 1992 10704340.0 Africa
Zimbabwe 1997 11404948.0 Africa
Zimbabwe 2002 11926563.0 Africa
Zimbabwe 2007 12311143.0 Africa

1704 rows × 3 columns

Notice that, because we specified an index, the index now has a name! You can access the name of the index via df_challenge.index.name:

df_challenge.index.name
'country'

4.3 Learning about our data

Okay, so now we have a dataset. Great! Now what can we do with it? In the next few sessions, we will explore in detail some of the more in-depth tools that pandas gives us. For now, let’s stick to learning how to view different portions of the data, as well as learning how to describe the overall dataset.

When viewing data, we do not want to be scrolling past multiple lines of individual rows of the data. This might be a shift in mindset if you are used to working with tables of data directly in front of you! An excel spreadsheet just has all the data right there for you to look at! Why not do that here? The simple answer is magnitude. If you only have 10s to 100s of rows of data, seeing it visually is okay. But once you start to deal with thousands, millions or even trillions of rows of data, it’s going to take a while to scroll through the entire thing. At this stage, the important piece of information is how we are treating the data we see, not the actual values.

Typically, we just want to view a small slice of the data to get an understanding of the types of data we have in our dataset. We have three tools in the toolkit for this:

  • head(): This returns the first N rows of data (default N = 5)
  • tail(): This returns the last N rows of data (default N = 5)
  • sample(): This returns a random sampling of N rows of data (default N = 1)
print('The first 5 rows of data:')
display(df.head())

print('The last 3 rows of data:')
display(df.tail(3))

print('A random sampling of 7 rows of data:')
display(df.sample(7))
The first 5 rows of data:
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710
3 Afghanistan 1967 11537966.0 Asia 34.020 836.197138
4 Afghanistan 1972 13079460.0 Asia 36.088 739.981106
The last 3 rows of data:
country year pop continent lifeExp gdpPercap
1701 Zimbabwe 1997 11404948.0 Africa 46.809 792.449960
1702 Zimbabwe 2002 11926563.0 Africa 39.989 672.038623
1703 Zimbabwe 2007 12311143.0 Africa 43.487 469.709298
A random sampling of 7 rows of data:
country year pop continent lifeExp gdpPercap
563 Gambia 2007 1688359.0 Africa 59.448 752.749726
1190 Paraguay 1962 2009813.0 Americas 64.361 2148.027146
640 Haiti 1972 4698301.0 Americas 48.042 1654.456946
148 Bosnia and Herzegovina 1972 3819000.0 Europe 67.450 2860.169750
99 Bangladesh 1967 62821884.0 Asia 43.453 721.186086
1189 Paraguay 1957 1770902.0 Americas 63.196 2046.154706
422 Djibouti 1962 89898.0 Africa 39.693 3020.989263

Once we have looked at the data, and it seems to look normal at first glance, we can ask some basic questions about the dataset. How many columns are there? How many rows? Are there null values in any of our columns? What about some basic statistics??

Luckily for us, pandas has done all of the hard work here. Two valuable methods built into pandas will give us basic information about the overall dataset: .info() and .describe().

.info() will gives us basic information about each column: what data type it is storing and how many non-null values are in the column.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   year       1704 non-null   int64  
 2   pop        1704 non-null   float64
 3   continent  1704 non-null   object 
 4   lifeExp    1704 non-null   float64
 5   gdpPercap  1704 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB

.describe() will give us basic statistical information about every numerical column: mean, standard deviation, quartiles, and counts are all included with a call to a single method!

df.describe()
year pop lifeExp gdpPercap
count 1704.00000 1.704000e+03 1704.000000 1704.000000
mean 1979.50000 2.960121e+07 59.474439 7215.327081
std 17.26533 1.061579e+08 12.917107 9857.454543
min 1952.00000 6.001100e+04 23.599000 241.165876
25% 1965.75000 2.793664e+06 48.198000 1202.060309
50% 1979.50000 7.023596e+06 60.712500 3531.846988
75% 1993.25000 1.958522e+07 70.845500 9325.462346
max 2007.00000 1.318683e+09 82.603000 113523.132900

Finally, if we want basic information about the non-numerical columns, we can use the value_counts() method. For a given series (or multiple series), this tells us how freqeuntly a given value appears. We will learn more about what this is doing under the hood when we learning about aggregation methods in a later section, but we can apply it to singular text columns here as a teaser

df.country.value_counts()
Afghanistan          12
Pakistan             12
New Zealand          12
Nicaragua            12
Niger                12
                     ..
Eritrea              12
Equatorial Guinea    12
El Salvador          12
Egypt                12
Zimbabwe             12
Name: country, Length: 142, dtype: int64

Challenge 2

Challenge 2

What will happen if we run the following code:

df.sample(42).describe()
  1. Do we expect the results to be the same as df.describe()? Why or why not?
  2. Run the code again. Are the results the same or different than before? Can you explain?
  1. This should output a different result from df.describe(). We can break this down into two portions:
  • We create a new dataframe that holds the sampled dataframe via df.sample(42)
df_sample = df.sample(42)
  • We are now sending the results of this sampling to the describe method. Because the sampled dataset has only 42 rows that were randomly chosen from the original 1704, it would be an impressive coincidence if all the outputs were identical!
df_sample.describe()
year pop lifeExp gdpPercap
count 42.000000 4.200000e+01 42.000000 42.000000
mean 1981.404762 1.697625e+07 61.123762 7200.368298
std 18.877086 3.127289e+07 12.244976 8340.174027
min 1952.000000 2.854830e+05 40.533000 379.564628
25% 1962.000000 2.304787e+06 51.001000 1552.887023
50% 1984.500000 4.847150e+06 63.604000 4193.800963
75% 2000.750000 1.023027e+07 70.635000 8049.055235
max 2007.000000 1.367250e+08 82.208000 39724.978670
  1. The results should be different from the previous call! This is because sample() outputs a random sampling of the dataframe. Everytime we sample the dataset, we get a different subset! Each subset should end up with slightly different statistics if it is small enough relative to the entire dataset!
df.sample(42).describe()
year pop lifeExp gdpPercap
count 42.000000 4.200000e+01 42.000000 42.000000
mean 1977.595238 1.293377e+07 56.925929 7911.151033
std 19.729957 2.369760e+07 13.629115 17939.263861
min 1952.000000 2.128460e+05 28.801000 369.165080
25% 1958.250000 1.750682e+06 48.229000 937.538587
50% 1977.000000 5.243198e+06 58.335000 2727.373282
75% 1995.750000 1.028661e+07 68.672500 8021.253545
max 2007.000000 1.137046e+08 75.744000 113523.132900

4.4 Saving our data

Yes, you can even save to a pickle.

What do we do once we have cleaned up our data or produced some analysis? It is very likely that we will want to save that clean dataset or analysis to a new file. Pandas to the rescue! As simple as it is to read in data via read_csv() or read_excel(), we can export it back out. While I’ve shown the entire list of to_file() options available in pandas (it’s extensive!), we will focus on to_csv(). Required arguments to this method are:

  • path_or_buf - full path/filename where you wish to save this file

That’s it! However, there are some useful optional arguments as well:

  • index: True or False. Whether we wish to include the index in our output (default is True). We will often want to set this to False, as the index is just a set of integers labeling the row numbers.
  • columns: list of columns to keep in the output
pd.method() or df.method()?

Sometimes, in order to access a function, we directly access it via the library (pd), or we access it as a method of the dataframe we are using (df). It can be hard to keep track of which functions live where. As a general rule of thumb, if the method is being used to do something to a specific dataframe, it probably belongs to the dataframe object (which is, let’s be honest, most of the functions we might use). Don’t be afraid to use all of your resources to keep it straight! (Yes, google counts as a valid resource). Using the help function is also a quick and easy way to check where a method lives.

For example, to find the to_csv() function, we can see that it belongs to the dataframe by checking help(df.to_csv). However, trying to use pd.to_csv will throw an error - a good hint that it was a dataframe method all along!

help(pd.to_csv)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[27], line 1
----> 1 help(pd.to_csv)

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/__init__.py:261, in __getattr__(name)
    257     from pandas.core.arrays.sparse import SparseArray as _SparseArray
    259     return _SparseArray
--> 261 raise AttributeError(f"module 'pandas' has no attribute '{name}'")

AttributeError: module 'pandas' has no attribute 'to_csv'

Inversely, the read_csv function belongs directly to pandas, and so trying something like this will break:

help(df.read_csv)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2370/72358213.py in ?()
----> 1 help(df.read_csv)

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/generic.py in ?(self, name)
   5571             and name not in self._accessors
   5572             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573         ):
   5574             return self[name]
-> 5575         return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'read_csv'

Challenge 3

Challenge 3

Save a copy of all of the summary statistics for the gapfinder dataset. Only include the statistics for the pop and lifeExp columns. What happens when we include or exclude the index?

First, we want to make sure we have a folder to save our outputs to. As discussed in the Working with Code section, I have made a folder called output that lives at the same level as the data folder. Next, we can invoke the to_csv method on a copy of our descriptive statistics:

df_descriptive = df.describe()
df_descriptive.to_csv('../output/challenge_output.csv', columns=['pop', 'lifeExp'])

In this case, if we exclude the index, we will actually lose information about what each row represents. This is one case when keeping the index will retain valuable information!

df.describe().to_csv(
    '../output/challenge_output_no_index.csv', 
    columns=['pop', 'lifeExp'],
    index=False
    )

Up next, we will learn how to clean our data.