9 Advanced Pandas

Advanced. Pandas.

In this section we will go over some Pandas functionality that is slightly more involved than the prior sessions. We hope to cover, or at least provide reference for operations such as:

Conditional Column Assignment
Advanced grouping techniques
Rolling and cumulative functions
Reshaping our data

9.1 Conditional Column Assignment

In our previous sections on Pandas, we introduced the idea of column creation based on the data we currently have. We limited ourselves to fairly standard examples that are likely to crop up in typical data analysis: arithmetic operations, applying column specific functions, and the use of the lambda function inside a call to either apply or assign. Here we will expand our column possibilities to include conditional assignments, where the value of the column is dependent on the value/state of some other column.

Let’s look at the penguins dataset we considered previously. This dataset has a good mix of categorical and numerical datatypes for us to create new columns from.

import pandas as pd
from seaborn import load_dataset

penguins = load_dataset('penguins')
penguins.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female

Binary If/Else

The first type of conditional assignment we will consider is the simplest: if original column is value A, then assign value x, otherwise assign value y. The most straightforward way to do this is via the apply method again, combined with a lambda function acting on the column of interest. For this example, we are going to make a shorthand version of the sex column - if the sex column is Female, then assign F, otherwise assign M.

penguins['if_else_column'] = penguins.sex.apply(lambda x: 'F' if x=='Female' else 'M')
penguins[['sex', 'if_else_column']].sample(5)

	sex	if_else_column
68	Female	F
186	Male	M
185	Male	M
171	Male	M
270	Female	F

While apply is robust and can be used in many different applications and settings, it is often the slowest method available in Pandas. If you are doing very specific operations, Pandas often provides faster, vectorized methods to do the same thing. For instance, this if/else example can also be calculated using the Pandas where and mask methods. These methods complement each other, and you may find one more intuitive than the other based on the specific circumstances of the data you are working with.

Both methods assume you already have some column with a default value in place. For example, we can assign the value ‘M’ to an entire column. where and mask then work by updating the values of this column where some other condition is met. The syntax for where is:

If condition is met, keep the default value, otherwise update to the other value:

penguins['where_column'] = 'M'
penguins['where_column'] = penguins.where_column.where(penguins.sex=='Male', 'F')
penguins[['sex', 'if_else_column', 'where_column']].sample(5)

	sex	if_else_column	where_column
341	Male	M	M
287	Male	M	M
223	Male	M	M
214	Female	F	F
77	Male	M	M

While the syntax for mask is:

If condition is met, update to other value, otherwise keep the default value:

penguins['mask_column'] = 'M'
penguins['mask_column'] = penguins.mask_column.mask(penguins.sex=='Female', 'F')
penguins[['sex', 'if_else_column', 'where_column', 'mask_column']].sample(5)

	sex	if_else_column	where_column	mask_column
220	Female	F	F	F
242	Female	F	F	F
201	Female	F	F	F
286	NaN	M	F	M
249	Male	M	M	M

Again, while these all result in the same column, you may find one method more intuitive, efficient, or faster for the data you are working with. Explore and use what works best for you!

Multiple Conditions

Sometimes, a simple if/else statement doesn’t cut it. Perhaps there are multiple options, and creating an if…elseif…elseif…else block will not only only be less intuitive but can easily be used incorrectly, leading to undesired outputs. In this case, the mapping functionality of Pandas provides a vectorized efficient method to create multiple changes to a column. The map method, which we apply to a single series, is optimized for mapping elements from one domain to another. The best example of this is using a dictionary to provide a mapping of your keys (initial column values) to your new value set.

Let’s see this in action with the penguins. Perhaps we do not wish to bias our analysis by having knowledge of which island each penguin came from. In this case, we wish to replace the actual names of the three islands with generic island IDs. First, we supply our mapping dictionary that will transform each island to its generic ID. Then, we supply this dictionary to our map method:

my_mapping = {
    'Torgersen': 'Island 1',
    'Biscoe': 'Island 2',
    'Dream': 'Island 3'
}
penguins['mapped_islands'] = penguins.island.map(my_mapping)
penguins[['island', 'mapped_islands']].sample(10)

	island	mapped_islands
254	Biscoe	Island 2
5	Torgersen	Island 1
278	Biscoe	Island 2
290	Biscoe	Island 2
33	Dream	Island 3
72	Torgersen	Island 1
76	Torgersen	Island 1
109	Biscoe	Island 2
342	Biscoe	Island 2
20	Biscoe	Island 2

Complex Conditions

The above example used the map method, which is inherently a method to be used for Series only. As such, we cannot use it for more complicated conditions that may involve multiple columns. The final example we will consider is a more complicated set of conditions: we wish to create new categories based on the penguins body mass: underweight, overweight, or average weight. However, we also wish to consider the penguin species when creating these categories: what is underweight for the Gentoo species may be overweight for the Adelie! Thus, in theory we have 9 new categories to account for that depend on two columns.

The best (read: efficient) way to approach this multi-column multi-conditional scenario is actually to go outside of the Pandas toolkit and use a tool that has been implemented by numpy. Numpy is a computational tool-kit for Python that utilizes the efficiency of powerful computational languages such as Fortran and C to provide fast and versatile computational tools for the vast majority of numerical and scientific anaylsis in Python. We have seen it already, having used numpy’s statistical functions such as mean and median within our aggregations. Here, we wish to make use of the numpy.select() method.

In numpy, the select method is fairly simple. We supply a list of conditions we wish to consider, and a list of choices for what we wish to happen for each of the conditions. We may also supply a default value, for if none of the conditions are met. select will go through the list one by one, and will apply the choice for the first condition that is met. Let’s see what this looks like for our penguins:

import numpy as np

# list all possible combos of species and specific body masses
# note that we don't include the overweight combos, as we can catch these
# with the default option in select!
conditions = [
    (penguins.species=='Adelie') & (penguins.body_mass_g<3300),
    (penguins.species=='Adelie') & (penguins.body_mass_g<3700),
    (penguins.species=='Chinstrap') & (penguins.body_mass_g<3333),
    (penguins.species=='Chinstrap') & (penguins.body_mass_g<3733),
    (penguins.species=='Gentoo') & (penguins.body_mass_g<4676),
    (penguins.species=='Gentoo') & (penguins.body_mass_g<5076)
]

# for each listed condition, tell select what value to use
choices = [
    'Underweight',
    'Average weight', 
    'Underweight', 
    'Average weight',
    'Underweight',
    'Average weight'
]

# use select to create our new column
penguins['general_weight'] = np.select(conditions, choices, default='Overweight')

# display results
penguins[['species', 'body_mass_g', 'general_weight']].sample(10)

	species	body_mass_g	general_weight
299	Gentoo	5950.0	Overweight
243	Gentoo	5050.0	Average weight
102	Adelie	3075.0	Underweight
100	Adelie	3725.0	Overweight
112	Adelie	3200.0	Underweight
339	Gentoo	NaN	Overweight
74	Adelie	3700.0	Overweight
302	Gentoo	4725.0	Average weight
192	Chinstrap	3950.0	Overweight
304	Gentoo	4750.0	Average weight

Order Matters!

When using the select method, the order of the conditions matters! Numpy always goes through the list in order, looking for the first TRUE condition. This ordering will matter, for example, if creating a list of bins for some numerical category. Conditions of the form:

conditions = [
    penguins.body_mass_g < 5_000,
    penguins.body_mass_g < 4_000,
    penguins.body_mass_g < 3_000,
    penguins.body_mass_g < 2_000,
    penguins.body_mass_g < 1_000
]

Will never evaluate the choices for the conditions 2-5, as the first condition will be true as well. Instead we should flip this list and write our conditions starting from 1,000 and ending at 5,000. Keep this in mind when creating your conditions!

Challenge 1

Create a column for 5-10 flipper length bins and plot the resulting frequencies (counts for each bin).

Solution to Challenge 1

Here we’re going to pull together some Seaborn code and our new-found conditional knowledge to create the bins.

# make sure seaborn is imported
import seaborn as sns
import matplotlib.pyplot as plt

# check roughly where we could place bins
display(penguins.agg({'flipper_length_mm': ['mean', 'std', 'min', 'max']}))

# create some conditions for bins
conditions = [
    penguins.flipper_length_mm < 180,
    penguins.flipper_length_mm < 190,
    penguins.flipper_length_mm < 200,
    penguins.flipper_length_mm < 210,
    penguins.flipper_length_mm < 220,
    penguins.flipper_length_mm < 230
]

# assign names to each condition
choices = [
    '0 - 180',
    '180 - 190',
    '190 - 200',
    '200 - 210',
    '210 - 220',
    '220 - 230'
]

# add the new column
penguins['flipper_length_bins'] = np.select(conditions, choices, default='230 +')

# plot this data using the 'count' option
sns.catplot(
    penguins, 
    x='flipper_length_bins', 
    kind='count',
    order = choices,
    color = 'blue'
)
plt.xticks(rotation=45)
plt.show()

	flipper_length_mm
mean	200.915205
std	14.061714
min	172.000000
max	231.000000

9.2 Advanced Grouping

groupby cuteness

In our previous section on Pandas, we explored the basic tools we need to group our data by various columns, apply a host of statistical aggregation operations, and look at these outcomes. While incredibly powerful as is, these tools lack some functionality that you might wish to have. Some examples include:

Combining row level values and group level aggregations in a single command.
Filtering and subsetting a dataframe based on group values.
Combining aggregations across different columns.

In order to handle these operations, we need to introduce two new methods: transform and apply.

Apply… again?

You might question whether apply is actually a new function, seeing as we have used it previously when creating new columns in a dataframe. In fact, the effect of apply within pandas is subtly different depending on how and where we apply it. The effect of apply is different when used on:

A dataframe as a whole - dataframe.apply(...)
A single series - series.apply(...)
A grouped dataframe object - dataframe.groupby('column').apply(...)

Make sure you keep track of what kind of apply you are using when doing your dataframe manipulations, as identical functions in different scenarios can lead to unusually different results! In this section, we are focusing on the grouped dataframe objects.

Both transform and apply can be used on a grouped dataframe object. This is the object that is created when we use the .groupby() method. Let’s create a small dataset and see what this object looks like:

starwars = pd.DataFrame(
    {
        'unit' : [
            'Blaster', 'Blaster', 'Blaster', 
            'Lightsaber', 'Lightsaber', 'Lightsaber', 'Lightsaber', 
            'Stick', 'Stick', 'Stick'
            ], 
        'cost' : [42, 60, 40, 900, 4000, 2000, 100, 10, 1, 5],
        'sale_price' : [50, 75, 42, 1000, 5000, 2000, 4242, 4, 2, 20]
    }
)

starwars

	unit	cost	sale_price
0	Blaster	42	50
1	Blaster	60	75
2	Blaster	40	42
3	Lightsaber	900	1000
4	Lightsaber	4000	5000
5	Lightsaber	2000	2000
6	Lightsaber	100	4242
7	Stick	10	4
8	Stick	1	2
9	Stick	5	20

Here we have a column that looks categorical (unit), and a couple that look numerical (cost and sale_price). We are going to focus on grouping our data according to the categorical column, and see what we can do with the numerical columns.

grouped_starwars = starwars.groupby('unit')
grouped_starwars

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb461b2f310>

Note that this object is no longer a dataframe, but some sort of DataFrameGroupBy object. By itself, it is not very useful, but we can use it to get all sorts of interesting information. Recall briefly how we used this earlier. Maybe we want the average buy and sell prices of each type of unit:

grouped_starwars.mean().reset_index()

	unit	cost	sale_price
0	Blaster	47.333333	55.666667
1	Lightsaber	1750.000000	3060.500000
2	Stick	5.333333	8.666667

What if we wanted to know how much each particular unit contributed to it’s groups overall sale? Or perhaps how much each group profited, on average? This is where transform and apply come in. Let’s look at what each of these do.

Transform

Used to apply a function to a dataframe that produces a new dataframe with the same shape as the original.
While an entire dataframe can be passed to transform, it only ever sees (operates on) a single column/series at a time.
The function must either return a scalar value or a sequence that is the same length as the original dataframe.
Useful for getting groups aggregations back into the original dataframe
Useful for filtering based on group values

Ex. 1: Percent of Group

Let’s see each this in action. We will use transform to determine how much each unit contributed to the total sales of its group.

pct_of_sales = (
    grouped_starwars
    .sale_price
    .transform(lambda series_: series_/series_.sum())
)
pct_of_sales

0    0.299401
1    0.449102
2    0.251497
3    0.081686
4    0.408430
5    0.163372
6    0.346512
7    0.153846
8    0.076923
9    0.769231
Name: sale_price, dtype: float64

Notice here that we used the lambda function again, which allows us to define a function inline acting on a given argument (in this case, the argument is named series_). This let us allow the value of the series to be the sum of the series. Because we are acting on a grouped object, Pandas knows that the sum we wish to employ is that of the group, not the overall sum! This new object will be in the same order (have the same indexing) as our initial dataframe, and so we can add it back to the dataframe as an extra column if we wish. This works explicitly because the transform method is required to return a series of the same length as the original object, giving us exactly what we want!

starwars['percent_of_sales'] = pct_of_sales
starwars

	unit	cost	sale_price	percent_of_sales
0	Blaster	42	50	0.299401
1	Blaster	60	75	0.449102
2	Blaster	40	42	0.251497
3	Lightsaber	900	1000	0.081686
4	Lightsaber	4000	5000	0.408430
5	Lightsaber	2000	2000	0.163372
6	Lightsaber	100	4242	0.346512
7	Stick	10	4	0.153846
8	Stick	1	2	0.076923
9	Stick	5	20	0.769231

Ex 2: Grouped Aggregate

Transform can be used exclusively with an aggregation function as well, but it will still return a value for every row of the initial dataset. This allows us to add grouped level values as columns to the data:

starwars['average_unit_cost'] = grouped_starwars.cost.transform('mean')
starwars

	unit	cost	sale_price	percent_of_sales	average_unit_cost
0	Blaster	42	50	0.299401	47.333333
1	Blaster	60	75	0.449102	47.333333
2	Blaster	40	42	0.251497	47.333333
3	Lightsaber	900	1000	0.081686	1750.000000
4	Lightsaber	4000	5000	0.408430	1750.000000
5	Lightsaber	2000	2000	0.163372	1750.000000
6	Lightsaber	100	4242	0.346512	1750.000000
7	Stick	10	4	0.153846	5.333333
8	Stick	1	2	0.076923	5.333333
9	Stick	5	20	0.769231	5.333333

Ex 3: Multiple Series

Instead of passing a single column to the transform method, we could also pass the entire dataframe (or some subset of the columns in the dataframe), and calculate values for each series:

grouped_starwars.transform(lambda series_: (series_-series_.mean())/series_.std())

	cost	sale_price
0	-0.484182	-0.329183
1	1.149932	1.123095
2	-0.665750	-0.793912
3	-0.502909	-1.099922
4	1.331229	1.035331
5	0.147914	-0.566109
6	-0.976235	0.630700
7	1.034910	-0.473016
8	-0.960988	-0.675737
9	-0.073922	1.148754

This has created two columns that have both had the same function applied. Note that if we wanted to merge these in with the original dataframe, we should be careful to rename the columns so that they do not overlap!

Ex 4: Filtering Original Dataframe

Another use case for transform is in filtering a dataframe. Maybe we want to zero in on those items where the sale price of the object was less than the average unit cost. Let’s filter our dataframe to only those rows:

condition = starwars.sale_price < grouped_starwars.cost.transform(lambda x: x.mean())
starwars[condition]

	unit	cost	sale_price	percent_of_sales	average_unit_cost
2	Blaster	40	42	0.251497	47.333333
3	Lightsaber	900	1000	0.081686	1750.000000
7	Stick	10	4	0.153846	5.333333
8	Stick	1	2	0.076923	5.333333

Apply

Used to apply a function (aggregated or otherwise) across multiple columns
Implicitly passes all the columns of the dataframe as a dataframe to the function, allowing for column interactions
The function can return a scalar or a sequence of any length.

Ex 1. Aggregate over Multiple Columns

Let’s try something with apply. What if we want to know the average overall profit of each group. We could produce a profit column, group up on the unit, and then calculate the mean.

starwars['profit'] = starwars.sale_price - starwars.cost
starwars.groupby('unit').profit.mean()

unit
Blaster          8.333333
Lightsaber    1310.500000
Stick            3.333333
Name: profit, dtype: float64

We could also do this using the apply function, applied to our grouped object:

grouped_starwars.apply(lambda df_: (df_.sale_price - df_.cost).mean())

unit
Blaster          8.333333
Lightsaber    1310.500000
Stick            3.333333
dtype: float64

Again, we made use of that lambda function, this time applying it to a df_ argument. You might notice that here I used a df_ argument instead of series_: this is merely notation, and we could have used anything (series_, df_, x, this_is_my_argument would all work the same). However, we have used df_ and series_ so that we can remind ourselves exactly what type of object we are acting on.

To see how transform differs from apply, let’s try to do that exact same operation:

grouped_starwars.transform(lambda _df: (_df.sale_price - _df.cost).mean())

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2563/2867570833.py in ?()
----> 1 grouped_starwars.transform(lambda _df: (_df.sale_price - _df.cost).mean())

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(self, func, engine, engine_kwargs, *args, **kwargs)
   1181     @Substitution(klass="DataFrame")
   1182     @Appender(_transform_template)
   1183     def transform(self, func, *args, engine=None, engine_kwargs=None, **kwargs):
-> 1184         return self._transform(
   1185             func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
   1186         )

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in ?(self, func, engine, engine_kwargs, *args, **kwargs)
   1638         # optimized transforms
   1639         func = com.get_cython_func(func) or func
   1640 
   1641         if not isinstance(func, str):
-> 1642             return self._transform_general(func, *args, **kwargs)
   1643 
   1644         elif func not in base.transform_kernel_allowlist:
   1645             msg = f"'{func}' is not a valid function name for transform(name)"

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(self, func, *args, **kwargs)
   1157             except TypeError:
   1158                 return self._transform_item_by_item(obj, fast_path)
   1159             except ValueError as err:
   1160                 msg = "transform must return a scalar value for each group"
-> 1161                 raise ValueError(msg) from err
   1162             if group.size > 0:
   1163                 res = _wrap_transform_general_frame(self.obj, group, res)
   1164                 applied.append(res)

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(self, fast_path, slow_path, group)
   1206     def _choose_path(self, fast_path: Callable, slow_path: Callable, group: DataFrame):
   1207         path = slow_path
-> 1208         res = slow_path(group)
   1209 
   1210         # if we make it here, test if we can use the fast path
   1211         try:

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(group)
-> 1201             slow_path = lambda group: group.apply(
   1202                 lambda x: func(x, *args, **kwargs), axis=self.axis
   1203             )

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/frame.py in ?(self, func, axis, raw, result_type, args, **kwargs)
   8844             result_type=result_type,
   8845             args=args,
   8846             kwargs=kwargs,
   8847         )
-> 8848         return op.apply().__finalize__(self, method="apply")

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/apply.py in ?(self)
    729         # raw
    730         elif self.raw:
    731             return self.apply_raw()
    732 
--> 733         return self.apply_standard()

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/apply.py in ?(self)
    856     def apply_standard(self):
--> 857         results, res_index = self.apply_series_generator()
    858 
    859         # wrap results
    860         return self.wrap_results(results, res_index)

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/apply.py in ?(self)
    873                 results[i] = self.f(v)
    874                 if isinstance(results[i], ABCSeries):
    875                     # If we have a view on v, we need to make a copy because
    876                     #  series_generator will swap out the underlying data
--> 877                     results[i] = results[i].copy(deep=False)
    878 
    879         return results, res_index

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(x)
-> 1202                 lambda x: func(x, *args, **kwargs), axis=self.axis

/tmp/ipykernel_2563/2867570833.py in ?(_df)
----> 1 grouped_starwars.transform(lambda _df: (_df.sale_price - _df.cost).mean())

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/generic.py in ?(self, name)
   5571             and name not in self._accessors
   5572             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573         ):
   5574             return self[name]
-> 5575         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'sale_price'

This fails because the object being acted on inside transform is itself just a series object, not the entire dataframe! As such, it has no attribute for sale_price or buy_price like our original dataframe does. Instead, it acts on the sale_price series, then the buy_price series, and returns the results.

Ex 2. Mix Row/Aggregate Levels

While the first apply example returns a rolled up aggregated dataframe, we can also use apply to return the individual rows of the dataframe by mixing aggregation functions with row level functions.

grouped_starwars.apply(lambda df_: df_.sale_price - df_.cost.mean())

unit         
Blaster     0       2.666667
            1      27.666667
            2      -5.333333
Lightsaber  3    -750.000000
            4    3250.000000
            5     250.000000
            6    2492.000000
Stick       7      -1.333333
            8      -3.333333
            9      14.666667
Name: sale_price, dtype: float64

In this way we have to be careful: transform will always return a dataframe that is the same size as the original, while apply will return something that varies with the type of function we have utilized.

Ex 3. Partial Aggregates

Apply lets us play some cool tricks as well. Suppose we only wanted to know about the two most expensive sales in each category. How could we filter to show this? We can use the nlargest (or smallest) method in conjunction with apply. nlargest does exactly what we might expect, returning the rows with the n largest values according to the provided column(s):

starwars.nlargest(2, 'sale_price')

	unit	cost	sale_price	percent_of_sales	average_unit_cost	profit
4	Lightsaber	4000	5000	0.408430	1750.0	1000
6	Lightsaber	100	4242	0.346512	1750.0	4142

But if we mix this with apply and our grouping dataframe, we can get the largest for each group!

grouped_starwars.apply(lambda df_: df_.nlargest(2, 'sale_price'))

		unit	cost	sale_price	percent_of_sales	average_unit_cost	profit
unit
Blaster	1	Blaster	60	75	0.449102	47.333333	15
Blaster	0	Blaster	42	50	0.299401	47.333333	8
Lightsaber	4	Lightsaber	4000	5000	0.408430	1750.000000	1000
Lightsaber	6	Lightsaber	100	4242	0.346512	1750.000000	4142
Stick	9	Stick	5	20	0.769231	5.333333	15
Stick	7	Stick	10	4	0.153846	5.333333	-6

Note that when apply produces some different sized aggregate than the original dataframe, it tacks on an extra index indicating what the grouper was. We do not always care about this, and can eliminate it in the initial creation of our grouped dataframe via the group_keys argument:

(
    starwars
    .groupby('unit', group_keys=False)
    .apply(lambda df_: df_.nlargest(2, 'sale_price'))
)

	unit	cost	sale_price	percent_of_sales	average_unit_cost	profit
1	Blaster	60	75	0.449102	47.333333	15
0	Blaster	42	50	0.299401	47.333333	8
4	Lightsaber	4000	5000	0.408430	1750.000000	1000
6	Lightsaber	100	4242	0.346512	1750.000000	4142
9	Stick	5	20	0.769231	5.333333	15
7	Stick	10	4	0.153846	5.333333	-6

Ex 4. Create Multiple Columns

We can use the dataframe behaviour of apply to create multiple additional columns all at once within an apply. Let’s create a function that will act on each dataframe group, create new values, and return the dataframe.

def create_multiple_columns(df_):
    df_['average_group_profit'] = (df_.sale_price - df_.cost).mean()
    df_['profit_relative_to_group'] = df_['profit']/df_['average_group_profit']
    return df_

grouped_starwars.apply(create_multiple_columns)

	unit	cost	sale_price	percent_of_sales	average_unit_cost	profit	average_group_profit	profit_relative_to_group
0	Blaster	42	50	0.299401	47.333333	8	8.333333	0.960000
1	Blaster	60	75	0.449102	47.333333	15	8.333333	1.800000
2	Blaster	40	42	0.251497	47.333333	2	8.333333	0.240000
3	Lightsaber	900	1000	0.081686	1750.000000	100	1310.500000	0.076307
4	Lightsaber	4000	5000	0.408430	1750.000000	1000	1310.500000	0.763068
5	Lightsaber	2000	2000	0.163372	1750.000000	0	1310.500000	0.000000
6	Lightsaber	100	4242	0.346512	1750.000000	4142	1310.500000	3.160626
7	Stick	10	4	0.153846	5.333333	-6	3.333333	-1.800000
8	Stick	1	2	0.076923	5.333333	1	3.333333	0.300000
9	Stick	5	20	0.769231	5.333333	15	3.333333	4.500000

When done this way, we are explicitly returning the dataframe, regardless of if it was grouped or not. Even if all of the new columns are aggregates, this will still produce a non-aggregated output:

def create_multiple_aggregated_columns(df_):
    df_['average_group_profit'] = (df_.sale_price - df_.cost).mean()
    df_['average_sale_price'] = df_.sale_price.mean()
    return df_

grouped_starwars.apply(create_multiple_aggregated_columns)

	unit	cost	sale_price	percent_of_sales	average_unit_cost	profit	average_group_profit	average_sale_price
0	Blaster	42	50	0.299401	47.333333	8	8.333333	55.666667
1	Blaster	60	75	0.449102	47.333333	15	8.333333	55.666667
2	Blaster	40	42	0.251497	47.333333	2	8.333333	55.666667
3	Lightsaber	900	1000	0.081686	1750.000000	100	1310.500000	3060.500000
4	Lightsaber	4000	5000	0.408430	1750.000000	1000	1310.500000	3060.500000
5	Lightsaber	2000	2000	0.163372	1750.000000	0	1310.500000	3060.500000
6	Lightsaber	100	4242	0.346512	1750.000000	4142	1310.500000	3060.500000
7	Stick	10	4	0.153846	5.333333	-6	3.333333	8.666667
8	Stick	1	2	0.076923	5.333333	1	3.333333	8.666667
9	Stick	5	20	0.769231	5.333333	15	3.333333	8.666667

Challenge 2

Determine which of the units in each group had the greatest profit, and return a dataframe containing only the rows corresponding to these units.

Solution to Challenge 2

There are a variety of different ways this could be approached. We have done a similar question using apply already, so let’s do this via the transform function.

condition = starwars.profit == starwars.groupby('unit').profit.transform('max')

starwars[condition]

	unit	cost	sale_price	percent_of_sales	average_unit_cost	profit
1	Blaster	60	75	0.449102	47.333333	15
6	Lightsaber	100	4242	0.346512	1750.000000	4142
9	Stick	5	20	0.769231	5.333333	15

9.3 Windowing Functions

Often when we are looking at our data, we may have a question about how one value compares to the values around it. This is typically useful for timeseries data, where we can organize our data by date (or year, month, etc). Once sorted on a date, we might wish to know things such as:

Did this year have a lower or higher average than the previous year?
What is the roling average over the last 5 years?

Or many other comparative questions. We can answer some of these questions using the windowing capabilities of Pandas. Windowing is the act of looking at a single ‘window’ of the data, asking our question (what is the average value for the time period in this window?), and then sliding the window to do it again. You have likely heard the term rolling average - this is what windowing allows us to calculate.

Rolling

To calculate a rolling average, we use two methods. First, we indicate that we want rolling statistics by using the rolling() method, and include in the argument the number of periods (rows) to include. After the rolling method, we indicate which aggregation we would like to use. Let’s try this out on a timeseries dataset:

dowjones = load_dataset('dowjones')
dowjones.head()

	Date	Price
0	1914-12-01	55.00
1	1915-01-01	56.55
2	1915-02-01	56.00
3	1915-03-01	58.30
4	1915-04-01	66.45

This is a monthly aggregate of the Dow Jones index. Let’s compare this monthly aggregate to a 6 month rolling average:

dowjones['six_month_avg'] = dowjones.rolling(6).Price.mean()
dowjones.head(10)

	Date	Price	six_month_avg
0	1914-12-01	55.00	NaN
1	1915-01-01	56.55	NaN
2	1915-02-01	56.00	NaN
3	1915-03-01	58.30	NaN
4	1915-04-01	66.45	NaN
5	1915-05-01	65.95	59.708333
6	1915-06-01	68.40	61.941667
7	1915-07-01	71.85	64.491667
8	1915-08-01	79.25	68.366667
9	1915-09-01	85.50	72.900000

Sort your Data!

When using the rolling method this way, it assumes that the data is already sorted in the order you want! The dowjones dataset comes pre-sorted, but this isn’t always the case. Before doing any sort of windowing function, it is always a good idea to pre-sort your data! For the dowjones dataset, we would sort on the Date column:

dowjones = dowjones.sort_values(by='Date')

You’ll notice that for the first 5 months, this method returns NaN - this is because there were not enough values to include in the average, as the method assumes you always wish to have exactly 6 values in each rolling window. We can alter this however, using the min_periods argument:

dowjones['six_month_avg'] = dowjones.rolling(6, min_periods=1).Price.mean()
dowjones.head(10)

	Date	Price	six_month_avg
0	1914-12-01	55.00	55.000000
1	1915-01-01	56.55	55.775000
2	1915-02-01	56.00	55.850000
3	1915-03-01	58.30	56.462500
4	1915-04-01	66.45	58.460000
5	1915-05-01	65.95	59.708333
6	1915-06-01	68.40	61.941667
7	1915-07-01	71.85	64.491667
8	1915-08-01	79.25	68.366667
9	1915-09-01	85.50	72.900000

Let’s see what these rolling averages look like visually:

import matplotlib.pyplot as plt

# add more averages
dowjones['five_year'] = dowjones.rolling(60, min_periods=1).Price.mean()

# plot the original data
plt.plot(dowjones.Date, dowjones.Price, label='Actual Data')

# plot the smoothed data
plt.plot(dowjones.Date, dowjones.six_month_avg, label='Six Month')
plt.plot(dowjones.Date, dowjones.five_year, label='Five Year')

# add a legend
plt.legend()

plt.show()

The six month average is starting to smooth out the jagged peaks in the original data, while the five-year average has completely smoothed out all the wiggles!

Cumulative

While the rolling capabilities will provide us a fixed window into which we can calculate aggregate statistics, we may also wish to calculate cumulative statistics. The most common cumulative statistic is the cumulative sum, which calculates the sum of the current row and all preceding rows. In Pandas, we can calculate this with the cumsum method.

dowjones['cumulative_price'] = dowjones.Price.cumsum()
dowjones.head()

	Date	Price	six_month_avg	five_year	cumulative_price
0	1914-12-01	55.00	55.0000	55.0000	55.00
1	1915-01-01	56.55	55.7750	55.7750	111.55
2	1915-02-01	56.00	55.8500	55.8500	167.55
3	1915-03-01	58.30	56.4625	56.4625	225.85
4	1915-04-01	66.45	58.4600	58.4600	292.30

Shifts

Finally, knowing the value preceding or following the current value is sometimes of interest. Forecast modeling will often use the current value and past values to predict future values, for example. To retrieve lead and lag values, Pandas has the shift method. This takes as an argument the number of periods by which we wish to shift our dataset (defaulting to 1). The periods can be negative, letting us get future values as well! Let’s look at this with the dowjones dataset again:

dowjones['year_prior'] = dowjones.Price.shift(12)
dowjones['year_after'] = dowjones.Price.shift(-12)

dowjones.head()

	Date	Price	six_month_avg	five_year	cumulative_price	year_prior	year_after
0	1914-12-01	55.00	55.0000	55.0000	55.00	NaN	97.00
1	1915-01-01	56.55	55.7750	55.7750	111.55	NaN	94.70
2	1915-02-01	56.00	55.8500	55.8500	167.55	NaN	93.55
3	1915-03-01	58.30	56.4625	56.4625	225.85	NaN	93.30
4	1915-04-01	66.45	58.4600	58.4600	292.30	NaN	89.75

To show that this is actually the year before and after, let’s filter to a year somewhere in the middle.

Advanced Tip: Date Columns

In order to filter to the middle year, we need to be sure of the type of data we are working with. Let’s take a quick look at the Date column, and see what the datatype is:

print(dowjones.Date.dtype)

datetime64[ns]

This is a new column type that we have not encountered in detail before, but it tells us that this column contains date information. If we were to look at a single value in the Series, we would see that it is not simply a ‘YYYY-MM-DD’ string as displayed when we view the dataframe:

single_date = dowjones.Date[0]
print(single_date)
print(type(single_date))

1914-12-01 00:00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

Because this is not a string, we cannot do a simple string comparison between two dates:

print(single_date == '1914-12-01')

False

To filter based on date columns then, we should instead create a date type object in Python. The easiest way to do that is using the built in python library datetime.

from datetime import datetime

# create a date-like object
new_date = datetime(1914, 12, 1)
print(new_date)

# compare with the single date
print(single_date == new_date)

1914-12-01 00:00:00
True

For more information on dates in Pandas, see the time series/date functionality page for Pandas, and the numpy documentation for datetime64.

from datetime import datetime

# create dates for the filter
current_year = datetime(1942, 8, 1)
previous_year = datetime(1941, 8, 1)
next_year = datetime(1943, 8, 1)

# filter the date column
dowjones[
    dowjones.Date.isin([current_year, previous_year, next_year])
][
    ['Date', 'Price', 'year_prior', 'year_after']
]

	Date	Price	year_prior	year_after
320	1941-08-01	126.55	125.35	106.20
332	1942-08-01	106.20	126.55	136.22
344	1943-08-01	136.22	106.20	146.93

If we look at these three lines, we see that the year_prior and year_after columns do in fact give us the corresponding values from the initial dataset. And if we plot this, we will see three copies of the data, each shifted by a year:

plt.plot(dowjones.Date, dowjones.year_prior, label='past')
plt.plot(dowjones.Date, dowjones.Price, label='current')
plt.plot(dowjones.Date, dowjones.year_after, label='future')

plt.legend()

plt.show()

Challenge 3

Find the average, minimum, and maximum Dow Jones prices on a yearly basis. Plot all three values to see how they compare! (Recall: this dataset is currently monthly)

Solution to Challenge 3

# first let's find the three aggregations we want
dowjones_rolling = dowjones.rolling(12, min_periods=1).agg(
    {'Price': ['min', 'max', 'mean']}
)
display(dowjones_rolling.head())

# get rid of hierarchical naming scheme
dowjones_rolling.columns = ['price_min', 'price_max', 'price_mean']

# plot these (note that this DF doesn't have the dates, so we use the old DF dates)
# here's a fancy plot that will show a coloured band!
plt.fill_between(
    dowjones.Date, 
    dowjones_rolling.price_min, 
    dowjones_rolling.price_max, 
    alpha=0.3 # alpha makes the band more transparent
) 
plt.plot(dowjones.Date, dowjones_rolling.price_mean)
plt.show()

	Price
	min	max	mean
0	55.0	55.00	55.0000
1	55.0	56.55	55.7750
2	55.0	56.55	55.8500
3	55.0	58.30	56.4625
4	55.0	66.45	58.4600

9.4 Reshaping Data

Wide or Long?

Another fairly common task in data analysis is pivoting our data. Typically this means we are creating a wide-form table that has more columns than originally. This may be near the end of a particular analysis and we want to prepare a final table for a report, or perhaps we wish to pick out specific values from a column for more detailed inspection. Another common use of a pivot table might be for timeseries data, and we wish to separate different variables into their own individual columns. In these cases, we can use the pivot and pivot_table methods within Pandas.

In the other direction, we may wish to ‘unpivot’ a dataset. This will often occur early on in an analysis, where we wish to create a long-form table where multiple columns have been combined into a single set of id and value columns. This will be done with the melt method.

Let’s explore these ideas with the gapminder dataset.

url = "https://raw.githubusercontent.com/bcgov/"\
    "ds-intro-to-python/main/data/gapfinder.csv"

gapminder = pd.read_csv(url)
gapminder.head()

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106

Long to Wide

Pivoting a Table

pivot

The first thing we may wish to do is a simple pivot in which we make the table wider than it was before. The simplest way to do this is by the pivot method. This takes as input three arguments:

columns: the only required argument! Column(s) we wish to use for the new columns of our dataframe. If more than one is provided, a hierarchy of columns is created.
index: the column(s) we wish to use for the index of our new dataframe. This is entirely optional, and will just default to the pre-existing index if not supplied.
values: the value(s) we wish to retain for the new dataframe. This is also optional, and will use all leftover columns if nothing is provided. If more than one column is used here, it again creates a hierarchy in the columns.

The output of the pivot method is a new dataframe that has the requested index/columns, with the corresponding value associated with each index-column pair in the dataframe.

For instance, perhaps we wish to explore the population of each country individually overtime. In this case, we can pivot our gapminder dataset so that each of the countries in the original country column become their own column, and compare the population value against the year:

gapminder_year_country = gapminder.pivot(
    index='year',
    columns='country',
    values='pop'
)
gapminder_year_country.head()

country	Afghanistan	Albania	Algeria	Angola	Argentina	Australia	Austria	Bahrain	Bangladesh	Belgium	...	Uganda	United Kingdom	United States	Uruguay	Venezuela	Vietnam	West Bank and Gaza	Yemen Rep.	Zambia	Zimbabwe
year
1952	8425333.0	1282697.0	9279525.0	4232095.0	17876956.0	8691212.0	6927772.0	120447.0	46886859.0	8730405.0	...	5824797.0	50430000.0	157553000.0	2252965.0	5439568.0	26246839.0	1030585.0	4963829.0	2672000.0	3080907.0
1957	9240934.0	1476505.0	10270856.0	4561361.0	19610538.0	9712569.0	6965860.0	138655.0	51365468.0	8989111.0	...	6675501.0	51430000.0	171984000.0	2424959.0	6702668.0	28998543.0	1070439.0	5498090.0	3016000.0	3646340.0
1962	10267083.0	1728137.0	11000948.0	4826015.0	21283783.0	10794968.0	7129864.0	171863.0	56839289.0	9218400.0	...	7688797.0	53292000.0	186538000.0	2598466.0	8143375.0	33796140.0	1133134.0	6120081.0	3421000.0	4277736.0
1967	11537966.0	1984060.0	12760499.0	5247469.0	22934225.0	11872264.0	7376998.0	202182.0	62821884.0	9556500.0	...	8900294.0	54959000.0	198712000.0	2748579.0	9709552.0	39463910.0	1142636.0	6740785.0	3900000.0	4995432.0
1972	13079460.0	2263554.0	14760787.0	5894858.0	24779799.0	13177000.0	7544201.0	230800.0	70759295.0	9709100.0	...	10190285.0	56079000.0	209896000.0	2829526.0	11515649.0	44655014.0	1089572.0	7407075.0	4506497.0	5861135.0

5 rows × 142 columns

Note that doing this moves ‘year’ into the index of the dataframe, and gives the columns the name of ‘country’. We can move ‘year’ back into the core of the dataframe using reset_index(), and remove the column name using the name attribute of the columns:

gapminder_year_country = gapminder_year_country.reset_index()
gapminder_year_country.columns.name = None
gapminder_year_country.head()

	year	Afghanistan	Albania	Algeria	Angola	Argentina	Australia	Austria	Bahrain	Bangladesh	...	Uganda	United Kingdom	United States	Uruguay	Venezuela	Vietnam	West Bank and Gaza	Yemen Rep.	Zambia	Zimbabwe
0	1952	8425333.0	1282697.0	9279525.0	4232095.0	17876956.0	8691212.0	6927772.0	120447.0	46886859.0	...	5824797.0	50430000.0	157553000.0	2252965.0	5439568.0	26246839.0	1030585.0	4963829.0	2672000.0	3080907.0
1	1957	9240934.0	1476505.0	10270856.0	4561361.0	19610538.0	9712569.0	6965860.0	138655.0	51365468.0	...	6675501.0	51430000.0	171984000.0	2424959.0	6702668.0	28998543.0	1070439.0	5498090.0	3016000.0	3646340.0
2	1962	10267083.0	1728137.0	11000948.0	4826015.0	21283783.0	10794968.0	7129864.0	171863.0	56839289.0	...	7688797.0	53292000.0	186538000.0	2598466.0	8143375.0	33796140.0	1133134.0	6120081.0	3421000.0	4277736.0
3	1967	11537966.0	1984060.0	12760499.0	5247469.0	22934225.0	11872264.0	7376998.0	202182.0	62821884.0	...	8900294.0	54959000.0	198712000.0	2748579.0	9709552.0	39463910.0	1142636.0	6740785.0	3900000.0	4995432.0
4	1972	13079460.0	2263554.0	14760787.0	5894858.0	24779799.0	13177000.0	7544201.0	230800.0	70759295.0	...	10190285.0	56079000.0	209896000.0	2829526.0	11515649.0	44655014.0	1089572.0	7407075.0	4506497.0	5861135.0

5 rows × 143 columns

We could make a slightly more complex dataset by including the continent as well in our new column scheme:

gapminder_year_country_continent = gapminder.pivot(
    index='year',
    columns=['country', 'continent'],
    values='pop'
)
gapminder_year_country_continent.head()

country	Afghanistan	Albania	Algeria	Angola	Argentina	Australia	Austria	Bahrain	Bangladesh	Belgium	...	Uganda	United Kingdom	United States	Uruguay	Venezuela	Vietnam	West Bank and Gaza	Yemen Rep.	Zambia	Zimbabwe
continent	Asia	Europe	Africa	Africa	Americas	Oceania	Europe	Asia	Asia	Europe	...	Africa	Europe	Americas	Americas	Americas	Asia	Asia	Asia	Africa	Africa
year
1952	8425333.0	1282697.0	9279525.0	4232095.0	17876956.0	8691212.0	6927772.0	120447.0	46886859.0	8730405.0	...	5824797.0	50430000.0	157553000.0	2252965.0	5439568.0	26246839.0	1030585.0	4963829.0	2672000.0	3080907.0
1957	9240934.0	1476505.0	10270856.0	4561361.0	19610538.0	9712569.0	6965860.0	138655.0	51365468.0	8989111.0	...	6675501.0	51430000.0	171984000.0	2424959.0	6702668.0	28998543.0	1070439.0	5498090.0	3016000.0	3646340.0
1962	10267083.0	1728137.0	11000948.0	4826015.0	21283783.0	10794968.0	7129864.0	171863.0	56839289.0	9218400.0	...	7688797.0	53292000.0	186538000.0	2598466.0	8143375.0	33796140.0	1133134.0	6120081.0	3421000.0	4277736.0
1967	11537966.0	1984060.0	12760499.0	5247469.0	22934225.0	11872264.0	7376998.0	202182.0	62821884.0	9556500.0	...	8900294.0	54959000.0	198712000.0	2748579.0	9709552.0	39463910.0	1142636.0	6740785.0	3900000.0	4995432.0
1972	13079460.0	2263554.0	14760787.0	5894858.0	24779799.0	13177000.0	7544201.0	230800.0	70759295.0	9709100.0	...	10190285.0	56079000.0	209896000.0	2829526.0	11515649.0	44655014.0	1089572.0	7407075.0	4506497.0	5861135.0

5 rows × 142 columns

Or by insisting that we retain information about more than just the population:

gapminder_year_country_two_vals = gapminder.pivot(
    index='year',
    columns='country',
    values=['pop', 'lifeExp']
)
gapminder_year_country_two_vals.head()

	pop										...	lifeExp
country	Afghanistan	Albania	Algeria	Angola	Argentina	Australia	Austria	Bahrain	Bangladesh	Belgium	...	Uganda	United Kingdom	United States	Uruguay	Venezuela	Vietnam	West Bank and Gaza	Yemen Rep.	Zambia	Zimbabwe
year
1952	8425333.0	1282697.0	9279525.0	4232095.0	17876956.0	8691212.0	6927772.0	120447.0	46886859.0	8730405.0	...	39.978	69.18	68.44	66.071	55.088	40.412	43.160	32.548	42.038	48.451
1957	9240934.0	1476505.0	10270856.0	4561361.0	19610538.0	9712569.0	6965860.0	138655.0	51365468.0	8989111.0	...	42.571	70.42	69.49	67.044	57.907	42.887	45.671	33.970	44.077	50.469
1962	10267083.0	1728137.0	11000948.0	4826015.0	21283783.0	10794968.0	7129864.0	171863.0	56839289.0	9218400.0	...	45.344	70.76	70.21	68.253	60.770	45.363	48.127	35.180	46.023	52.358
1967	11537966.0	1984060.0	12760499.0	5247469.0	22934225.0	11872264.0	7376998.0	202182.0	62821884.0	9556500.0	...	48.051	71.36	70.76	68.468	63.479	47.838	51.631	36.984	47.768	53.995
1972	13079460.0	2263554.0	14760787.0	5894858.0	24779799.0	13177000.0	7544201.0	230800.0	70759295.0	9709100.0	...	51.016	72.01	71.34	68.673	65.712	50.254	56.532	39.848	50.107	55.635

5 rows × 284 columns

What if we wish to know about the average population of each continent over time? If we try the same approach as above, we will immediately run into an issue:

gapminder_year_continent = gapminder.pivot(
    index='year',
    columns='continent',
    values='pop'
)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[46], line 1
----> 1 gapminder_year_continent = gapminder.pivot(
      2     index='year',
      3     columns='continent',
      4     values='pop'
      5 )

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/frame.py:7885, in DataFrame.pivot(self, index, columns, values)
   7880 @Substitution("")
   7881 @Appender(_shared_docs["pivot"])
   7882 def pivot(self, index=None, columns=None, values=None) -> DataFrame:
   7883     from pandas.core.reshape.pivot import pivot
-> 7885     return pivot(self, index=index, columns=columns, values=values)

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/reshape/pivot.py:520, in pivot(data, index, columns, values)
    518     else:
    519         indexed = data._constructor_sliced(data[values]._values, index=multiindex)
--> 520 return indexed.unstack(columns_listlike)

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/series.py:4157, in Series.unstack(self, level, fill_value)
   4114 """
   4115 Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.
   4116 
   (...)
   4153 b    2    4
   4154 """
   4155 from pandas.core.reshape.reshape import unstack
-> 4157 return unstack(self, level, fill_value)

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/reshape/reshape.py:491, in unstack(obj, level, fill_value)
    489 if is_1d_only_ea_dtype(obj.dtype):
    490     return _unstack_extension_series(obj, level, fill_value)
--> 491 unstacker = _Unstacker(
    492     obj.index, level=level, constructor=obj._constructor_expanddim
    493 )
    494 return unstacker.get_result(
    495     obj._values, value_columns=None, fill_value=fill_value
    496 )

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/reshape/reshape.py:140, in _Unstacker.__init__(self, index, level, constructor)
    133 if num_cells > np.iinfo(np.int32).max:
    134     warnings.warn(
    135         f"The following operation may generate {num_cells} cells "
    136         f"in the resulting pandas object.",
    137         PerformanceWarning,
    138     )
--> 140 self._make_selectors()

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/reshape/reshape.py:192, in _Unstacker._make_selectors(self)
    189 mask.put(selector, True)
    191 if mask.sum() < len(self.index):
--> 192     raise ValueError("Index contains duplicate entries, cannot reshape")
    194 self.group_index = comp_index
    195 self.mask = mask

ValueError: Index contains duplicate entries, cannot reshape

Here we find that there is more than one value associated with each year-continent pairing, and so the pivot method does not know how to assign a value to the dataframe. This is because pivot is for unique values only. While it is an excellent simple-use tool, we can go one step further to deal with non-unique values.

pivot_table

To address the year-continent question above, we could first group our data by continent, calculate the average population every year, and then pivot the resultant dataframe. This is a completely valid method! However, pandas also offers a built in method that will do aggregations and pivots all at once. This is similar to the functionality of Excel pivot tables, and includes some additional functionality we would not find in groupby alone. Some of the important arguments that we can use in the pivot_table method include:

columns: the only required argument! Column(s) (and values within) we wish to use for the resultant dataframes columns.
values: the column(s) we wish to aggregate. If no columns are provided, all leftover (non column/index columns) columns will be used that make ‘sense’. For example, the average value of a list of strings does not make sense, but the ‘max’ value of a string list could be calculated.
index: the column(s) we wish to use for the resultant dataframes index
aggfunc: the method of aggregation we wish to use. Default is to calculate the average.
fill_value: the value to replace missing values with (after aggregation).

Let’s go through some examples with increasing complexity. The simplest thing we can do is pivot a single column, to which we will get an output for the averages of all numerical columns.

gapminder.pivot_table(
    columns='continent',
    values = ['gdpPercap', 'lifeExp', 'pop', 'year']

)

continent	Africa	Americas	Asia	Europe	Oceania
gdpPercap	2.193755e+03	7.136110e+03	7.902150e+03	1.446948e+04	1.862161e+04
lifeExp	4.886533e+01	6.465874e+01	6.006490e+01	7.190369e+01	7.432621e+01
pop	9.916003e+06	2.450479e+07	7.703872e+07	1.716976e+07	8.874672e+06
year	1.979500e+03	1.979500e+03	1.979500e+03	1.979500e+03	1.979500e+03

Next we may wish to add an index, and focus on a single value for our output. This matches the question we asked above and could not answer with the pivot method:

gapminder_year_continent = gapminder.pivot_table(
    index='year',
    columns='continent',
    values='pop'
)
gapminder_year_continent.head()

continent	Africa	Americas	Asia	Europe	Oceania
year
1952	4.570010e+06	13806097.84	4.228356e+07	1.393736e+07	5343003.0
1957	5.093033e+06	15478156.64	4.735699e+07	1.459635e+07	5970988.0
1962	5.702247e+06	17330810.16	5.140476e+07	1.534517e+07	6641759.0
1967	6.447875e+06	19229864.92	5.774736e+07	1.603930e+07	7300207.0
1972	7.305376e+06	21175368.40	6.518098e+07	1.668784e+07	8053050.0

If we tried to do the same calculation, but use a string column instead of a numerical column for the values we will get a warning about trying to use invalid columns:

gapminder_year_continent_country = gapminder.pivot_table(
    index='year',
    columns='continent',
    values='country'
)
gapminder_year_continent_country.head()

/tmp/ipykernel_2563/2473994706.py:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.mean is deprecated. In a future version, a TypeError will be raised. Before calling .mean, select only columns which should be valid for the function.
  gapminder_year_continent_country = gapminder.pivot_table(

continent
year
1952
1957
1962
1967
1972

However, we can switch our aggregation method to something that is valid for strings (using either string arguments for common aggregations such as ‘max’, or more complicated numpy or user-defined functions as well).

gapminder_year_continent_country = gapminder.pivot_table(
    index='year',
    columns='continent',
    values='country',
    aggfunc='max'
)
gapminder_year_continent_country.head()

continent	Africa	Americas	Asia	Europe	Oceania
year
1952	Zimbabwe	Venezuela	Yemen Rep.	United Kingdom	New Zealand
1957	Zimbabwe	Venezuela	Yemen Rep.	United Kingdom	New Zealand
1962	Zimbabwe	Venezuela	Yemen Rep.	United Kingdom	New Zealand
1967	Zimbabwe	Venezuela	Yemen Rep.	United Kingdom	New Zealand
1972	Zimbabwe	Venezuela	Yemen Rep.	United Kingdom	New Zealand

Finally, similar to our groupby aggregations, we can also pass a dictionary to aggfunc. The dictionary contains the columns we wish to use for values as keys, and the type of aggregation(s) we wish to do as the values.

import numpy as np

gapminder_multiple_pivots = gapminder.pivot_table(
    index='year',
    columns='continent',
    aggfunc={
        'pop': [np.mean, np.max],
        'lifeExp': 'mean',
        'country': [min, max]
    }
)
gapminder_multiple_pivots.head()

	country										...	pop
	max					min					...	max					mean
continent	Africa	Americas	Asia	Europe	Oceania	Africa	Americas	Asia	Europe	Oceania	...	Africa	Americas	Asia	Europe	Oceania	Africa	Americas	Asia	Europe	Oceania
year
1952	Zimbabwe	Venezuela	Yemen Rep.	United Kingdom	New Zealand	Algeria	Argentina	Afghanistan	Albania	Australia	...	33119096.0	157553000.0	5.562635e+08	69145952.0	8691212.0	4.570010e+06	13806097.84	4.228356e+07	1.393736e+07	5343003.0
1957	Zimbabwe	Venezuela	Yemen Rep.	United Kingdom	New Zealand	Algeria	Argentina	Afghanistan	Albania	Australia	...	37173340.0	171984000.0	6.374080e+08	71019069.0	9712569.0	5.093033e+06	15478156.64	4.735699e+07	1.459635e+07	5970988.0
1962	Zimbabwe	Venezuela	Yemen Rep.	United Kingdom	New Zealand	Algeria	Argentina	Afghanistan	Albania	Australia	...	41871351.0	186538000.0	6.657700e+08	73739117.0	10794968.0	5.702247e+06	17330810.16	5.140476e+07	1.534517e+07	6641759.0
1967	Zimbabwe	Venezuela	Yemen Rep.	United Kingdom	New Zealand	Algeria	Argentina	Afghanistan	Albania	Australia	...	47287752.0	198712000.0	7.545500e+08	76368453.0	11872264.0	6.447875e+06	19229864.92	5.774736e+07	1.603930e+07	7300207.0
1972	Zimbabwe	Venezuela	Yemen Rep.	United Kingdom	New Zealand	Algeria	Argentina	Afghanistan	Albania	Australia	...	53740085.0	209896000.0	8.620300e+08	78717088.0	13177000.0	7.305376e+06	21175368.40	6.518098e+07	1.668784e+07	8053050.0

5 rows × 25 columns

Wide to Long

I’m meeelllting!

melt

The opposite of pivoting is melting (…obviously?). This is often used to get data back into long-form. Long-form data will typically have all different possible categories for a single measure in a single column (having a country column instead of Algeria, Albania, Canada… Zimbabwe columns). Long-form data is typically the preferred data style when doing your actual analysis. For example, many of the Seaborns methods we considered for creating stunning visuals assume that the data is in long-form, and is optimized to split out categories internally instead of pulling from multiple different columns.

However, as our data does not always come in the way we want it, we might have to unpivot (melt) our data to get it into long-form first. The melt method is how we attack this in Pandas, and it uses the following arguments:

id_vars: This is a required argument! Column(s) that will be used as identifiers. These are columns that we do not wish to unpivot, but leave as is.
value_vars: Column(s) to unpivot. If left out, the default is to use all non-id columns.

Let’s look at one of the pivoted tables we made earlier, and try to get it back into long form:

gapminder_year_country.head()

	year	Afghanistan	Albania	Algeria	Angola	Argentina	Australia	Austria	Bahrain	Bangladesh	...	Uganda	United Kingdom	United States	Uruguay	Venezuela	Vietnam	West Bank and Gaza	Yemen Rep.	Zambia	Zimbabwe
0	1952	8425333.0	1282697.0	9279525.0	4232095.0	17876956.0	8691212.0	6927772.0	120447.0	46886859.0	...	5824797.0	50430000.0	157553000.0	2252965.0	5439568.0	26246839.0	1030585.0	4963829.0	2672000.0	3080907.0
1	1957	9240934.0	1476505.0	10270856.0	4561361.0	19610538.0	9712569.0	6965860.0	138655.0	51365468.0	...	6675501.0	51430000.0	171984000.0	2424959.0	6702668.0	28998543.0	1070439.0	5498090.0	3016000.0	3646340.0
2	1962	10267083.0	1728137.0	11000948.0	4826015.0	21283783.0	10794968.0	7129864.0	171863.0	56839289.0	...	7688797.0	53292000.0	186538000.0	2598466.0	8143375.0	33796140.0	1133134.0	6120081.0	3421000.0	4277736.0
3	1967	11537966.0	1984060.0	12760499.0	5247469.0	22934225.0	11872264.0	7376998.0	202182.0	62821884.0	...	8900294.0	54959000.0	198712000.0	2748579.0	9709552.0	39463910.0	1142636.0	6740785.0	3900000.0	4995432.0
4	1972	13079460.0	2263554.0	14760787.0	5894858.0	24779799.0	13177000.0	7544201.0	230800.0	70759295.0	...	10190285.0	56079000.0	209896000.0	2829526.0	11515649.0	44655014.0	1089572.0	7407075.0	4506497.0	5861135.0

5 rows × 143 columns

gapminder_year_country.melt(id_vars='year').head()

	year	variable	value
0	1952	Afghanistan	8425333.0
1	1957	Afghanistan	9240934.0
2	1962	Afghanistan	10267083.0
3	1967	Afghanistan	11537966.0
4	1972	Afghanistan	13079460.0

If we wish to only keep a subset of the countries, we can do that too. We can also change the name of the resultant variable and value columns:

gapminder_year_country.melt(
    id_vars='year',
    value_vars = ['Albania', 'Canada', 'Zimbabwe'],
    var_name = 'country',
    value_name = 'pop'
).head()

	year	country	pop
0	1952	Albania	1282697.0
1	1957	Albania	1476505.0
2	1962	Albania	1728137.0
3	1967	Albania	1984060.0
4	1972	Albania	2263554.0

Overmelting!

When melting our data, we want to be careful, and not ‘overmelt’! Think back to the original gapminder dataset:

gapminder.head()

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106

Would the following melt make sense?

gapminder.melt(
    id_vars=['country', 'year', 'continent']
)

	country	year	continent	variable	value
0	Afghanistan	1952	Asia	pop	8.425333e+06
1	Afghanistan	1957	Asia	pop	9.240934e+06
2	Afghanistan	1962	Asia	pop	1.026708e+07
3	Afghanistan	1967	Asia	pop	1.153797e+07
4	Afghanistan	1972	Asia	pop	1.307946e+07
...	...	...	...	...	...
5107	Zimbabwe	1987	Africa	gdpPercap	7.061573e+02
5108	Zimbabwe	1992	Africa	gdpPercap	6.934208e+02
5109	Zimbabwe	1997	Africa	gdpPercap	7.924500e+02
5110	Zimbabwe	2002	Africa	gdpPercap	6.720386e+02
5111	Zimbabwe	2007	Africa	gdpPercap	4.697093e+02

5112 rows × 5 columns

Probably not. Having the population, GDP, and life expectancy all in the same column might lead to confusing results. As a general rule of thumb, including multiple unique measures that have different units (population is measured in number of people, life expectancy in years, and GDP per capita in dollars per person) in the same column is typically avoided. Another way to think of it is this: if I take the average of the entire column, will that actually make sense? Try not to melt further than necessary! Striking a balance between machine-preferred (entirely long-form) and human-preferred (wide-form) is an important and useful skill when creating scripts to analyse our datasets!

These examples only touch the surface of methods to reshape our data. To explore even further, Pandas provides a user guide to reshaping. Check this out for even more reshaping functionality!

Challenge 4

Create a pivot table that counts the number of countries in each continent that has a population greater than 1,000,000 in any given year.

Solution to Challenge 4

First, let’s demonstrate this using the methods we’ve learned about to reshape our data. We can create this aggregated pivot table using the pivot_table method, and supplying ‘count’ to the aggfunc method.

gapminder[gapminder['pop']>1_000_000].pivot_table(
    index='year',
    columns='continent',
    values='country',
    aggfunc='count'
)

continent	Africa	Americas	Asia	Europe	Oceania
year
1952	37	22	28	28	2
1957	37	24	28	28	2
1962	39	24	29	28	2
1967	39	24	30	28	2
1972	40	24	30	28	2
1977	40	25	32	28	2
1982	41	25	32	28	2
1987	43	25	32	28	2
1992	45	25	32	28	2
1997	47	25	32	28	2
2002	47	25	32	28	2
2007	47	25	32	28	2

This type of table is commonly referred to as a crosstable: it gives us the frequency of occurences of items under different groupings. Because it is such a widely used table, it was given its own special method in Pandas: crosstab. Basic functionality is to supply two Series that we wish to cross-tabulate to the method, and let it do all the work. Note that this is a Pandas specific method, not a dataframe specific method!

gapminder_filtered = gapminder[gapminder['pop']>1_000_000]
pd.crosstab(
    index=gapminder_filtered['year'],
    columns=gapminder_filtered['continent']
)

continent	Africa	Americas	Asia	Europe	Oceania
year
1952	37	22	28	28	2
1957	37	24	28	28	2
1962	39	24	29	28	2
1967	39	24	30	28	2
1972	40	24	30	28	2
1977	40	25	32	28	2
1982	41	25	32	28	2
1987	43	25	32	28	2
1992	45	25	32	28	2
1997	47	25	32	28	2
2002	47	25	32	28	2
2007	47	25	32	28	2