import pandas as pd
= "https://raw.githubusercontent.com/bcgov/"\
url "ds-intro-to-python/main/data/gapfinder.csv"
= pd.read_csv(url) country_metrics
6 Exploring Data Structures
6.1 Exploring and Understanding Data
Getting a high-level summary of the data is important, but data is particularly valuable when refined. Your analysis will start to come alive when you begin slicing, dicing, and grouping the data, or even creating additional variables. In an Excel world, this is like using filter options on columns, creating pivot tables, or writing a formula in a new column to create a new variable. In data science parlance, this kind of work is referred to as data wrangling - getting the data you want in the form you want it in.
In the world of python, this usually means working with a library you have already been introduced to: pandas. It lets you do so much!
The first dataset we’ll look at contains data for countries around the world: population counts, life expectancy, and GDP per capita over a number of years. Let’s import the pandas library and then use the .read_csv() function to get some population by country data. We will read the data into an object that we will call country_metrics, so that it is easier for us to remember what it consists of.
It’s usually a good idea to take a quick look at the data right away to make sure what we have read in makes sense. The .info() method prints the number of columns, column labels, column data types, memory usage, range index, and the number of non-null values in each column.
When you create a variable or object, you can name it pretty much whatever you want. You will see widespread use of the name “df” on the internet, which stands for “dataframe”. However, it is good practice to name your data something that relates to its contents. Naming it something intuitive will reduce the chance of confusion as you develop code. Objects you create should be named with a noun that is intuitive but not overly verbose. The form is important too: remember not to leave spaces in your names, and to always be consistent in the naming convention you use (e.g. camelCase, underscore_case, etc.).
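As a quick illustrative sketch (reusing the url variable from above), any of these names would be valid; what matters is picking an intuitive one and applying the convention consistently:

countryMetrics = pd.read_csv(url)     # camelCase
country_metrics = pd.read_csv(url)    # underscore_case
df = pd.read_csv(url)                 # widespread, but tells you nothing about the contents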
country_metrics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 1704 non-null object
1 year 1704 non-null int64
2 pop 1704 non-null float64
3 continent 1704 non-null object
4 lifeExp 1704 non-null float64
5 gdpPercap 1704 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB
So this data is looking good so far. We can see that we have created a pandas dataframe within our python environment. There are 1704 rows of data and six columns or variables. You can see there are only non-null values… so happily there is no missing data to worry about.
The .head() pandas method lets us look at the actual data, in a way somewhat similar to how one views it in an Excel spreadsheet.
country_metrics.head(2)
 | country | year | pop | continent | lifeExp | gdpPercap
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
While we don’t learn too much from this look, we do see some sample data and can start getting familiar with it visually.
6.2 Selecting columns (variables)
In the “real world” of data it is not uncommon to see hundreds or even thousands of columns in a single dataframe, so knowing how to pare down our dataset so it is not overly bloated is important.
First things first. If we want to look at the contents of a single column, we can do it like this, specifying the column name after the . without parentheses:
country_metrics.country
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
...
1699 Zimbabwe
1700 Zimbabwe
1701 Zimbabwe
1702 Zimbabwe
1703 Zimbabwe
Name: country, Length: 1704, dtype: object
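As an aside, the same column can also be selected with square brackets, which is the more robust habit. Our dataset is itself a case in point: the pop column clashes with a built-in dataframe method, so dot notation does not work for it.

country_metrics['country']   # bracket notation, equivalent to country_metrics.country
country_metrics['pop']       # required here: country_metrics.pop refers to the .pop() method, not the column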
When we want to select more than one column to include in our dataframe, we use the convention df[['col1', 'col2']]
to select the columns we want.
In our example below, we will create a new dataframe called narrow_country_metrics. We are primarily interested in the population metric from the original dataset, and since we want to be able to analyze it by year, country, and continent, we will add those to our new dataframe also. So the dataset will be narrower than the one it was created from.
There are two ways to approach this. The first is to identify all the columns you want to keep.
narrow_country_metrics = country_metrics[
    ['country', 'year', 'pop', 'continent']
]
narrow_country_metrics.head(2)
 | country | year | pop | continent
---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia |
1 | Afghanistan | 1957 | 9240934.0 | Asia |
The second way is to use the pandas .drop() method to eliminate from the original object the columns that you do not want to keep. In the end we achieve the same result as the example above by simply dropping two variables instead of naming four. With this method, we also need to add the specification axis=1, which indicates that it is columns (and not rows) being referenced for dropping.
narrow_country_metrics = country_metrics.drop(
    ['lifeExp', 'gdpPercap'], axis=1
)
narrow_country_metrics.head(2)
 | country | year | pop | continent
---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia |
1 | Afghanistan | 1957 | 9240934.0 | Asia |
Either way, we have a more manageable dataset to work with called narrow_country_metrics. The original dataframe country_metrics is still there, and it is still the same shape as before; we have not changed it.
However, it is important to realize that in python it is very common to change an object by referring to it on both sides of the assignment operator. In the code below, we narrow the narrow_country_metrics dataset down even further by removing the pop column as well.
narrow_country_metrics = narrow_country_metrics.drop(['pop'], axis=1)
narrow_country_metrics.head(2)
 | country | year | continent
---|---|---|---|
0 | Afghanistan | 1952 | Asia |
1 | Afghanistan | 1957 | Asia |
If we want the pop variable back in the narrow_country_metrics object, we will need to recreate it from the source it came from, where pop was still intact.
When you create a variable or dataframe object in python, the name of the object you want to make or modify always goes to the left of the = sign. The content that defines the object, or shows how the existing one is to be modified, goes to the right. It may feel counterintuitive, but don’t be concerned: it is the standard way that code is structured to be understood.
6.3 Selecting rows
Of course, in data analysis we are usually interested in looking at just some rows of data, not all rows all the time.
Relational operators
When we want to look at just selected rows (i.e. select rows that have certain values within a given column) we can supply a condition that must be met for a given row in that column. To do this, we must use one of the following comparison operators, which are also called relational operators:
- equal to: ==
- not equal to: !=
- less than: <
- greater than: >
- less than or equal to: <=
- greater than or equal to: >=
Let’s try an example. The most commonly used relational operator is likely equal to (==). In the example below we have a statement that pandas evaluates row by row in the dataframe, assigning each row a boolean value of True or False.
country_metrics['year'] == 1972
0 False
1 False
2 False
3 False
4 True
...
1699 False
1700 False
1701 False
1702 False
1703 False
Name: year, Length: 1704, dtype: bool
When we create an object from this evaluation in pandas, it returns to that object only the rows that evaluate as True:
filtered_country_metrics = country_metrics[country_metrics['year'] == 1972]
filtered_country_metrics.head(2)
 | country | year | pop | continent | lifeExp | gdpPercap
---|---|---|---|---|---|---|
4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |
16 | Albania | 1972 | 2263554.0 | Europe | 67.690 | 3313.422188 |
Logical operators
In python, it is common to use the logical operators and, or, and not to evaluate expressions. For example, in the code below, python evaluates the and to see whether BOTH operands are true:
x = 5
print(x > 3 and x < 10)
True
However, when we ask pandas to evaluate whether a set of logical relations holds within a pandas series object, we must use the pandas bitwise logical operators, whose syntax is different:

- and: &
- or: |
- not: ~
When combining multiple conditional statements in a pandas series object, each condition must be surrounded by parentheses ()
within the square brackets []
. In the example below, we use an &
to indicate that both conditions must be true for a row to be returned:
filtered_country_metrics = country_metrics[
    (country_metrics['year'] == 1972) &
    (country_metrics['country'] == 'Albania')
]
filtered_country_metrics.head()
 | country | year | pop | continent | lifeExp | gdpPercap
---|---|---|---|---|---|---|
16 | Albania | 1972 | 2263554.0 | Europe | 67.69 | 3313.422188 |
In the example above, there is only one row of data that matches the condition.
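The ~ operator, for its part, negates a condition. A minimal sketch: this keeps every row whose continent is not Asia:

not_asia = country_metrics[~(country_metrics['continent'] == 'Asia')]
not_asia.head(2)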
We can use the |
which indicates that if either or both of the conditions are true for a given row, that row is returned into the object:
filtered_country_metrics = country_metrics[
    (country_metrics['year'] == 1972) |
    (country_metrics['country'] == 'Albania')
]
filtered_country_metrics.head()
 | country | year | pop | continent | lifeExp | gdpPercap
---|---|---|---|---|---|---|
4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |
12 | Albania | 1952 | 1282697.0 | Europe | 55.230 | 1601.056136 |
13 | Albania | 1957 | 1476505.0 | Europe | 59.280 | 1942.284244 |
14 | Albania | 1962 | 1728137.0 | Europe | 64.820 | 2312.888958 |
15 | Albania | 1967 | 1984060.0 | Europe | 66.220 | 2760.196931 |
In the example above, there are many rows that satisfy one or the other condition. As an aside, were we to want to know how many rows fulfill this either/or condition, we could call the .info() method instead of .head().
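For instance, a quick sketch of either approach to getting that count:

filtered_country_metrics.info()    # the entry count shows how many rows matched
len(filtered_country_metrics)      # or simply count the rows directly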
You will find that it is very common within python/pandas to see the .iloc[] indexer used, so it is worthwhile to give that a brief introduction also. It enables the selection of data from a particular row or column, according to the integer-based position index shown in the image below:
Row and column values .iloc[] refers to
By referring to these locations, we can use .iloc[] to retrieve the exact cells we want to see. For example, if we just want the row of data corresponding to the index integer value 5, we would run the following code:
country_metrics.iloc[5]
country Afghanistan
year 1977
pop 14880372.0
continent Asia
lifeExp 38.438
gdpPercap 786.11336
Name: 5, dtype: object
We can also modify the above call to look at column(s)! We specify the row(s) of interest to the left of a comma in the square brackets, and the column(s) of interest to the right of it. In the example below, the code retrieves just the content found at the intersection of row 5 and column 3:
country_metrics.iloc[5, 3]
'Asia'
We can also retrieve ranges of rows and columns by employing a colon along with the starting and ending positions we want. Note that the end of a slice is exclusive: in the example below, 0:2 retrieves rows 0 and 1, and 0:3 retrieves columns 0 through 2:
country_metrics.iloc[0:2, 0:3]
 | country | year | pop
---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 |
1 | Afghanistan | 1957 | 9240934.0 |
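As a further sketch, .iloc[] also accepts lists of positions, so non-adjacent rows and columns can be retrieved in a single call:

country_metrics.iloc[[0, 5, 10], [0, 2]]   # rows 0, 5 and 10; the country and pop columns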
It is a flexible tool that can be very helpful as you build and review your code.
6.4 Sorting rows
So now you’ve mastered how to select the rows and columns you want - congratulations! Instead of hunting and pecking for insights, one way to quickly make some sense of the data is to sort it - something you probably do in Excel all the time.
Let’s say we wanted to know more about Asian countries since 2002. First we should go ahead and create a dataframe that consists of just the data we want to look at.
countries_2000s = country_metrics[
    (country_metrics['year'] >= 2002) &
    (country_metrics['continent'] == 'Asia')
]
countries_2000s.head()
 | country | year | pop | continent | lifeExp | gdpPercap
---|---|---|---|---|---|---|
10 | Afghanistan | 2002 | 25268405.0 | Asia | 42.129 | 726.734055 |
11 | Afghanistan | 2007 | 31889923.0 | Asia | 43.828 | 974.580338 |
94 | Bahrain | 2002 | 656397.0 | Asia | 74.795 | 23403.559270 |
95 | Bahrain | 2007 | 708573.0 | Asia | 75.635 | 29796.048340 |
106 | Bangladesh | 2002 | 135656790.0 | Asia | 62.013 | 1136.390430 |
This is a nice start. But to see some meaningful insights, it is helpful to sort by the metric we are interested in, rather than relying on the default order in which the rows appear in the dataframe.
The .sort_values() method lets us specify a column (or multiple columns) to sort by, and takes an argument indicating the direction of the sort, from low to high or vice-versa. We want life expectancy from high to low, so the ascending parameter should be set to False.
countries_2000s = countries_2000s.sort_values('lifeExp', ascending=False)
countries_2000s.head()
 | country | year | pop | continent | lifeExp | gdpPercap
---|---|---|---|---|---|---|
803 | Japan | 2007 | 127467972.0 | Asia | 82.603 | 31656.06806 |
671 | Hong Kong China | 2007 | 6980412.0 | Asia | 82.208 | 39724.97867 |
802 | Japan | 2002 | 127065841.0 | Asia | 82.000 | 28604.59190 |
670 | Hong Kong China | 2002 | 6762476.0 | Asia | 81.495 | 30209.01516 |
767 | Israel | 2007 | 6426679.0 | Asia | 80.745 | 25523.27710 |
That was pretty straightforward and it helps us get more insights. Looks like Japan, in 2007, was the Asian country with the highest life expectancy.
Often one will want to sort by more than one column. To do that, instead of passing in a single column, we pass in a list of the columns we want to sort on. We can also control whether we would like each column to be sorted in ascending or descending order by passing in a respective list of boolean values in the ascending=
parameter. In the example below, the code indicates that we want to sort first by ‘country’ in reverse alphabetical order, then by life expectancy from high to low.
countries_2000s = countries_2000s.sort_values(
    ['country', 'lifeExp'], ascending=[False, False]
)
countries_2000s.head()
 | country | year | pop | continent | lifeExp | gdpPercap
---|---|---|---|---|---|---|
1679 | Yemen Rep. | 2007 | 22211743.0 | Asia | 62.698 | 2280.769906 |
1678 | Yemen Rep. | 2002 | 18701257.0 | Asia | 60.308 | 2234.820827 |
1667 | West Bank and Gaza | 2007 | 4018332.0 | Asia | 73.422 | 3025.349798 |
1666 | West Bank and Gaza | 2002 | 3389578.0 | Asia | 72.370 | 4515.487575 |
1655 | Vietnam | 2007 | 85262356.0 | Asia | 74.249 | 2441.576404 |
We see that when we created and sorted this new object, the index values are those that were associated with the original country_metrics
dataframe.
Depending on what we want to do with countries_2000s
going forward, we might want to create a new “default” index for it instead of the existing one. That’s where the .reset_index()
function comes into play.
In the code below, we reset the index.
countries_2000s = countries_2000s.reset_index(drop=False)
countries_2000s.head()
 | index | country | year | pop | continent | lifeExp | gdpPercap
---|---|---|---|---|---|---|---|
0 | 1679 | Yemen Rep. | 2007 | 22211743.0 | Asia | 62.698 | 2280.769906 |
1 | 1678 | Yemen Rep. | 2002 | 18701257.0 | Asia | 60.308 | 2234.820827 |
2 | 1667 | West Bank and Gaza | 2007 | 4018332.0 | Asia | 73.422 | 3025.349798 |
3 | 1666 | West Bank and Gaza | 2002 | 3389578.0 | Asia | 72.370 | 4515.487575 |
4 | 1655 | Vietnam | 2007 | 85262356.0 | Asia | 74.249 | 2441.576404 |
Notice how the original index has now been turned into a column itself, called “index”. Depending on the use case, we might want to keep this column in the new dataframe. For example, if we ultimately wanted to join this dataframe back with the original, it could act as a key. If we know, however, that it is of no value to us anymore, we can and probably should delete it. To do this, we would have set the drop parameter above to True.
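That alternative would have looked like this:

countries_2000s = countries_2000s.reset_index(drop=True)   # discards the old index instead of keeping it as a column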
Let’s use the .iloc[] indexer to look at the positional content of this countries_2000s object and confirm that Yemen shows up in row 0 as we expect.
countries_2000s.iloc[0:4, 0:5]
 | index | country | year | pop | continent
---|---|---|---|---|---|
0 | 1679 | Yemen Rep. | 2007 | 22211743.0 | Asia |
1 | 1678 | Yemen Rep. | 2002 | 18701257.0 | Asia |
2 | 1667 | West Bank and Gaza | 2007 | 4018332.0 | Asia |
3 | 1666 | West Bank and Gaza | 2002 | 3389578.0 | Asia |
6.5 Putting multiple methods together
The next step is usually to bring several of these commands together to get a nicely refined look at the data. Essentially you will need to invoke a number of calls sequentially on some object, with each one in turn performing some action on it.
A common approach is to put the object name on each successive line of code along with the = operator and the modifying code to the right. Imagine we were given the following task:
- Select only the rows where the country is equal to Ireland AND where the year is 1992 or greater
- Take these rows and sort them by year, going from high to low values, top to bottom
The example below shows how we would employ the line-by-line approach:
chained2_df = country_metrics[(country_metrics['country'] == 'Ireland')]
chained2_df = chained2_df[(chained2_df['year'] >= 1992)]
chained2_df = chained2_df.sort_values('year', ascending=False)
chained2_df.head(3)
 | country | year | pop | continent | lifeExp | gdpPercap
---|---|---|---|---|---|---|
755 | Ireland | 2007 | 4109086.0 | Europe | 78.885 | 40675.99635 |
754 | Ireland | 2002 | 3879155.0 | Europe | 77.783 | 34077.04939 |
753 | Ireland | 1997 | 3667233.0 | Europe | 76.122 | 24521.94713 |
You can read fairly clearly what is happening in each line.
You can also put multiple methods in the same statement, as long as you separate each of the elements. In pandas, this is sometimes referred to as method chaining, as you are essentially chaining actions together.
The code below accomplishes what the more verbose code above does but in a more efficient way:
chained_df = country_metrics[
    (country_metrics['country'] == 'Ireland') &
    (country_metrics['year'] >= 1992)
].sort_values('year', ascending=False)
chained_df.head(3)
 | country | year | pop | continent | lifeExp | gdpPercap
---|---|---|---|---|---|---|
755 | Ireland | 2007 | 4109086.0 | Europe | 78.885 | 40675.99635 |
754 | Ireland | 2002 | 3879155.0 | Europe | 77.783 | 34077.04939 |
753 | Ireland | 1997 | 3667233.0 | Europe | 76.122 | 24521.94713 |
Each line is executed in sequence, so be careful when constructing code that the order in which you modify your dataframe is what you intend. Take the example below, in which we want to find European values for GDP per capita. The first line of code selects the rows that contain “Europe”; the next line then keeps the column values for “year”, “country”, and “gdpPercap”.
year_country_gdp = country_metrics[
    (country_metrics['continent'] == 'Europe')
]
year_country_gdp = year_country_gdp[['year', 'country', 'gdpPercap']]
year_country_gdp.head(3)
 | year | country | gdpPercap
---|---|---|---|
12 | 1952 | Albania | 1601.056136 |
13 | 1957 | Albania | 1942.284244 |
14 | 1962 | Albania | 2312.888958 |
Had we reversed the lines of code, and selected the “year”, “country”, and “gdpPercap” columns first to put into our new object, we would not have been able to filter for “Europe” in the “continent” column, as “continent” would have been removed from the dataframe in the previous step. So an error would have been thrown.
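To make that concrete, the reversed order would look something like the sketch below, and the second statement would raise a KeyError because “continent” is no longer in the dataframe:

year_country_gdp = country_metrics[['year', 'country', 'gdpPercap']]
year_country_gdp = year_country_gdp[
    (year_country_gdp['continent'] == 'Europe')   # KeyError: 'continent'
]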
6.6 Creating new columns of data
Very often we have some data in our dataset that we want to transform to give us additional information. In Excel this is something that is done all the time by creating a formula in a cell that refers to other columns and applies some sort of logic or mathematical expression to it.
There are different ways you can go about this in pandas. The most straightforward way is to define a new column on the left of the = and then reference the existing column, along with whatever transformation you would like, on the right. In the example below, the existing population variable “pop” is converted to a value that shows population in millions.
new_cols_df = country_metrics  # note: this creates a new reference to the same dataframe, not a copy
new_cols_df['pop_millions'] = new_cols_df['pop'] / 1000000
new_cols_df.head(2)
 | country | year | pop | continent | lifeExp | gdpPercap | pop_millions
---|---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 | 8.425333 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 | 9.240934 |
Another way is to use the pandas .apply() method, to which other, more complicated functions can be passed for a given column. This could be a complicated mathematical formula, a function acting on strings or dates, or any other function that cannot be represented by a simple multiplication or division. In the example below, the len() function is applied to each value in the “country” column, and a new column with the number of characters for that row is returned:
new_cols_df['country_name_chars'] = new_cols_df.country.apply(len)
new_cols_df.sample(5)
 | country | year | pop | continent | lifeExp | gdpPercap | pop_millions | country_name_chars
---|---|---|---|---|---|---|---|---|
261 | Central African Republic | 1997 | 3696513.0 | Africa | 46.066 | 740.506332 | 3.696513 | 24 |
1595 | Uganda | 2007 | 29170398.0 | Africa | 51.542 | 1056.380121 | 29.170398 | 6 |
1371 | Slovak Republic | 1967 | 4442238.0 | Europe | 70.980 | 8412.902397 | 4.442238 | 15 |
1567 | Tunisia | 1987 | 7724976.0 | Africa | 66.894 | 3810.419296 | 7.724976 | 7 |
545 | Gabon | 1977 | 706367.0 | Africa | 52.790 | 21745.573280 | 0.706367 | 5 |
There is also a special python function called lambda. It is known as an anonymous function: it appears in a single line of code and is not given a name other than lambda. It can take any number of arguments, but contains just a single expression.
The .assign() method evaluates the expression with lambda in it and assigns the returned value to the variable name it is given. This is how it all comes together to give us actual GDP for each row in our dataframe:
GDP_df = new_cols_df.assign(GDP=lambda x: x['pop'] * x['gdpPercap'])
GDP_df.head(5)
 | country | year | pop | continent | lifeExp | gdpPercap | pop_millions | country_name_chars | GDP
---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 | 8.425333 | 11 | 6.567086e+09 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 | 9.240934 | 11 | 7.585449e+09 |
2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 | 10.267083 | 11 | 8.758856e+09 |
3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.020 | 836.197138 | 11.537966 | 11 | 9.648014e+09 |
4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 | 13.079460 | 11 | 9.678553e+09 |
So now we can see the newly created column “GDP” and its value for each country in our new object. Notice that the GDP data is in scientific notation (i.e. a decimal number times a power of ten), so it’s a bit hard to read. If we wanted readers to consume that data we would go ahead and change how it is displayed. But for current purposes we’ll leave that alone.
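If we did want to do that, one possible sketch is pandas’ float display option, which changes how numbers are printed without altering the underlying data:

pd.set_option('display.float_format', '{:,.0f}'.format)  # every float column now displays with thousands separators, no decimals
pd.reset_option('display.float_format')                   # restores the default display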
6.7 Joining datasets together
One of the most important tasks in data analysis is to be able to join multiple datasets together. With pandas, there are functions called .merge()
and .join()
that are similar to each other. As .merge()
is perhaps the more used, intuitive, and powerful of the two, we will introduce that method in this tutorial in some depth. There is also a cool function called .concat()
that will be introduced below as well.
But first, let’s get a second file that gives us country size in square kilometers by country. We will use this data to put together with our country_metrics
dataframe.
= "https://raw.githubusercontent.com/bcgov/"\
country_size_url "ds-intro-to-python/main/data/countrysize.csv"
= pd.read_csv(country_size_url, encoding= 'unicode_escape')
country_size
country_size.info()2) country_size.sample(
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 nation 235 non-null object
1 area_square_kms 235 non-null int64
dtypes: int64(1), object(1)
memory usage: 3.8+ KB
 | nation | area_square_kms
---|---|---|
32 | Venezuela | 916445 |
204 | United States Virgin Islands | 347 |
Ok, this country_size dataset looks like we expect - a list of countries alongside the land area of each country in square kilometers. There are some omissions in this list - not all countries in our country_metrics dataset are present in this country_size object. But for our present purposes this is ok. We just want to get that square kilometers data into a combined dataset, and it is sufficient for that.
Essentially what we want to do is a classic “left-join” operation of the sort in the diagram below. Conceptually, the country_metrics
dataset is like the purple table and the country_size
dataset is like the red one.
Types of Joins
The .merge() method (with reference material available in the pandas documentation) works similarly to how table joining works in SQL or how the VLOOKUP function works in Excel. One needs to specify both dataframes, the key variables on which to join them, and the kind of join desired.
So let’s look at the example below to see how it all comes together in code.
combined_df = country_metrics.merge(
    country_size,
    left_on='country',
    right_on='nation',
    how='left'
)
combined_df.head(2)
 | country | year | pop | continent | lifeExp | gdpPercap | pop_millions | country_name_chars | nation | area_square_kms
---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 | 8.425333 | 11 | Afghanistan | 652230.0 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 | 9.240934 | 11 | Afghanistan | 652230.0 |
When you run the code above you will notice that both the nation key column and the area_square_kms column have been joined into the new combined_df object. One can keep that nation column in there for control purposes, or it can be removed by using the .drop() method we used earlier:
combined_df = combined_df.drop('nation', axis=1)
combined_df.head()
 | country | year | pop | continent | lifeExp | gdpPercap | pop_millions | country_name_chars | area_square_kms
---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 | 8.425333 | 11 | 652230.0 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 | 9.240934 | 11 | 652230.0 |
2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 | 10.267083 | 11 | 652230.0 |
3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.020 | 836.197138 | 11.537966 | 11 | 652230.0 |
4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 | 13.079460 | 11 | 652230.0 |
Another innovative way to put data together in pandas is with the .concat() function. Conceptually you can think of it like “stacking” two data objects on top of each other or side-by-side, as shown in the diagram below.
Ways to stack data
To illustrate, let’s fetch two simple dataframes. Each contains the average scores for three subjects by year for two separate schools.
= "https://raw.githubusercontent.com/bcgov/"\
school1_url "ds-intro-to-python/main/data/school1.csv"
= "https://raw.githubusercontent.com/bcgov/"\
school2_url "ds-intro-to-python/main/data/school2.csv"
= pd.read_csv(school1_url)
school1_df = pd.read_csv(school2_url) school2_df
Let’s take a quick look at the two dataframes.
school1_df
 | Subject | Avg Score 2018 | Avg Score 2019 | Avg Score 2020
---|---|---|---|---|
0 | Science | 67 | 88 | 89 |
1 | English | 86 | 67 | 77 |
2 | Math | 84 | 73 | 75 |
school2_df
 | Subject | Avg Score 2018 | Avg Score 2019 | Avg Score 2020
---|---|---|---|---|
0 | Science | 32 | 88 | 90 |
1 | English | 88 | 44 | 77 |
2 | Math | 84 | 73 | 53 |
We call .concat() and pass in the objects that we want to stack vertically. This is a similar operation to union in SQL.
vertical_stack_df = pd.concat([school1_df, school2_df])
vertical_stack_df
 | Subject | Avg Score 2018 | Avg Score 2019 | Avg Score 2020
---|---|---|---|---|
0 | Science | 67 | 88 | 89 |
1 | English | 86 | 67 | 77 |
2 | Math | 84 | 73 | 75 |
0 | Science | 32 | 88 | 90 |
1 | English | 88 | 44 | 77 |
2 | Math | 84 | 73 | 53 |
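Notice that the index values 0, 1, 2 from the original dataframes repeat in the stacked result. If a fresh sequential index is preferred, .concat() takes an ignore_index parameter, as in this sketch:

vertical_stack_df = pd.concat([school1_df, school2_df], ignore_index=True)   # index now runs 0 through 5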
It is also possible to stack the data horizontally. Here it is necessary to specify the columnar axis (axis=1) as the default setting is for rows (axis=0).
horizontal_stack = pd.concat([school1_df, school2_df], axis=1)
horizontal_stack
 | Subject | Avg Score 2018 | Avg Score 2019 | Avg Score 2020 | Subject | Avg Score 2018 | Avg Score 2019 | Avg Score 2020
---|---|---|---|---|---|---|---|---|
0 | Science | 67 | 88 | 89 | Science | 32 | 88 | 90 |
1 | English | 86 | 67 | 77 | English | 88 | 44 | 77 |
2 | Math | 84 | 73 | 75 | Math | 84 | 73 | 53 |
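One thing to watch with horizontal stacking is the duplicated column names. One possible refinement, sketched below with the keys parameter, labels each source and produces a two-level column index:

horizontal_stack = pd.concat(
    [school1_df, school2_df], axis=1, keys=['school1', 'school2']
)   # columns become ('school1', 'Subject'), ('school2', 'Subject'), etc.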
This introduction only scratches the surface of how to leverage this way of joining datasets together. But it can be a powerful tool in the toolkit for the right use case. More detail can be found in the pandas documentation.
6.8 Grouping and summarizing
Sometimes of course one would prefer to group rows together for the purpose of summarizing them in various ways.
In pandas, we can accomplish this using the .groupby()
method. A .groupby()
operation involves some combination of splitting the object, applying a function, and combining the results. It is used together with one or more aggregation functions:
- count(): Total number of items
- first(), last(): First and last item
- mean(), median(): Mean and median
- min(), max(): Minimum and maximum
- std(), var(): Standard deviation and variance
- mad(): Mean absolute deviation
- prod(): Product of all items
- sum(): Sum of all items
First let us look at a simple example where we want to get the mean life expectancy for each continent in the data. To do this, we use the .groupby() method with the appropriate segment (i.e. continent), metric (i.e. lifeExp), and type of aggregation (i.e. mean):
simple_mean = country_metrics.groupby('continent').lifeExp.mean()
simple_mean
continent
Africa 48.865330
Americas 64.658737
Asia 60.064903
Europe 71.903686
Oceania 74.326208
Name: lifeExp, dtype: float64
Fortunately, if we want to look at several aggregations at once, we can do that too. To extend our example above, we would specify the “continent” column in the .groupby()
function, then pass additional aggregation functions (in our case, we will add “mean”, “min”, and “max”) as a dictionary within the .agg()
function.
Remember that a dictionary in Python is a particular type of data structure that contains a collection of key: value pairs. It is analogous to a regular word dictionary, which maps words to their meanings. Python dictionaries allow us to associate a value with a unique key, and then to quickly access that value. They are generally created within curly braces {} and have specified key names and values (e.g. the basic form for a python dictionary is {"key1": "value1"}).
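As a tiny sketch, the dictionary we are about to pass to .agg() could be built and inspected on its own:

aggregations = {'lifeExp': ['mean', 'min', 'max']}   # key: column name, value: list of functions
aggregations['lifeExp']                              # looking up the key returns ['mean', 'min', 'max']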
This dictionary takes the column that we are aggregating - in this case life expectancy - as a key, and a list of aggregation functions as its value. We also add a line of code that gives each of the columns a new name by assigning to the .columns attribute.
grouped_single = country_metrics.groupby('continent') \
    .agg({'lifeExp': ['mean', 'min', 'max']})
grouped_single.columns = ['lifeExp_mean', 'lifeExp_min', 'lifeExp_max']
grouped_single
 | lifeExp_mean | lifeExp_min | lifeExp_max
---|---|---|---
continent | | |
Africa | 48.865330 | 23.599 | 76.442 |
Americas | 64.658737 | 37.579 | 80.653 |
Asia | 60.064903 | 28.801 | 82.603 |
Europe | 71.903686 | 43.585 | 81.757 |
Oceania | 74.326208 | 69.120 | 81.235 |
If you look closely at grouped_single you will see that the variable “continent” is on a different line than the columns that are aggregated. That is because continent in this instance is actually an index, not a column.
We can nest additional “groups within groups” by creating a list of column names and passing them to the .groupby()
function instead of passing a single value.
The example below adds more granularity with the introduction of ‘country’ and the creation of a list to hold both ‘continent’ and ‘country’.
grouped_multiple = country_metrics.groupby(
    ['continent', 'country']
).agg({'pop': ['mean', 'min', 'max']})
grouped_multiple.columns = ['pop_mean', 'pop_min', 'pop_max']
grouped_multiple.head(10)
 | | pop_mean | pop_min | pop_max
---|---|---|---|---
continent | country | | |
Africa | Algeria | 1.987541e+07 | 9279525.0 | 33333216.0 |
 | Angola | 7.309390e+06 | 4232095.0 | 12420476.0 |
 | Benin | 4.017497e+06 | 1738315.0 | 8078314.0 |
 | Botswana | 9.711862e+05 | 442308.0 | 1639131.0 |
 | Burkina Faso | 7.548677e+06 | 4469979.0 | 14326203.0 |
 | Burundi | 4.651608e+06 | 2445618.0 | 8390505.0 |
 | Cameroon | 9.816648e+06 | 5009067.0 | 17696293.0 |
 | Central African Republic | 2.560963e+06 | 1291695.0 | 4369038.0 |
 | Chad | 5.329256e+06 | 2682462.0 | 10238807.0 |
 | Comoros | 3.616839e+05 | 153936.0 | 710960.0 |
Again you will notice that this dataframe looks a little different than some others we have seen, in that there are blank spaces below each grouped value of continent in that index. This object is a multi-indexed dataframe. It presents the data nicely for visual consumption, but it is not ideal to work with if you want to do further manipulation.
This is likely again a good time to use the reset_index()
function. This is a step that resets the index to an integer based index and re-creates a non-indexed pandas dataframe.
The code block below is identical to the one we just ran except for the line of code that resets the index:
grouped_multiple = country_metrics.groupby(
    ['continent', 'country']
).agg({'pop': ['mean', 'min', 'max']})
grouped_multiple.columns = ['pop_mean', 'pop_min', 'pop_max']
grouped_multiple = grouped_multiple.reset_index()  # resets the index
grouped_multiple.head(10)
 | continent | country | pop_mean | pop_min | pop_max
---|---|---|---|---|---|
0 | Africa | Algeria | 1.987541e+07 | 9279525.0 | 33333216.0 |
1 | Africa | Angola | 7.309390e+06 | 4232095.0 | 12420476.0 |
2 | Africa | Benin | 4.017497e+06 | 1738315.0 | 8078314.0 |
3 | Africa | Botswana | 9.711862e+05 | 442308.0 | 1639131.0 |
4 | Africa | Burkina Faso | 7.548677e+06 | 4469979.0 | 14326203.0 |
5 | Africa | Burundi | 4.651608e+06 | 2445618.0 | 8390505.0 |
6 | Africa | Cameroon | 9.816648e+06 | 5009067.0 | 17696293.0 |
7 | Africa | Central African Republic | 2.560963e+06 | 1291695.0 | 4369038.0 |
8 | Africa | Chad | 5.329256e+06 | 2682462.0 | 10238807.0 |
9 | Africa | Comoros | 3.616839e+05 | 153936.0 | 710960.0 |
This looks more like the structure of the dataframes we know already and will be easier to manipulate further.
By now you should be able to select the rows and columns, join together dataframes, and create some basic summarizations of data. The next section will look at some of the more sophisticated and elegant tools for understanding and presenting your data. But the building blocks you have just learned about are the meat and potatoes of data analysis in the python world.