9  Advanced Pandas

Advanced. Pandas.

In this section we will go over some Pandas functionality that is slightly more involved than the prior sessions. We hope to cover, or at least provide reference for operations such as:

9.1 Conditional Column Assignment

In our previous sections on Pandas, we introduced the idea of column creation based on the data we currently have. We limited ourselves to fairly standard examples that are likely to crop up in typical data analysis: arithmetic operations, applying column specific functions, and the use of the lambda function inside a call to either apply or assign. Here we will expand our column possibilities to include conditional assignments, where the value of the column is dependent on the value/state of some other column.

Let’s look at the penguins dataset we considered previously. This dataset has a good mix of categorical and numerical datatypes for us to create new columns from.

import pandas as pd
from seaborn import load_dataset

penguins = load_dataset('penguins')
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

Binary If/Else

The first type of conditional assignment we will consider is the simplest: if original column is value A, then assign value x, otherwise assign value y. The most straightforward way to do this is via the apply method again, combined with a lambda function acting on the column of interest. For this example, we are going to make a shorthand version of the sex column - if the sex column is Female, then assign F, otherwise assign M.

penguins['if_else_column'] = penguins.sex.apply(lambda x: 'F' if x=='Female' else 'M')
penguins[['sex', 'if_else_column']].sample(5)
sex if_else_column
68 Female F
186 Male M
185 Male M
171 Male M
270 Female F

While apply is robust and can be used in many different applications and settings, it is often the slowest method available in Pandas. If you are doing very specific operations, Pandas often provides faster, vectorized methods to do the same thing. For instance, this if/else example can also be calculated using the Pandas where and mask methods. These methods complement each other, and you may find one more intuitive than the other based on the specific circumstances of the data you are working with.

Both methods assume you already have some column with a default value in place. For example, we can assign the value ‘M’ to an entire column. where and mask then work by updating the values of this column where some other condition is met. The syntax for where is:

If condition is met, keep the default value, otherwise update to the other value:

penguins['where_column'] = 'M'
penguins['where_column'] = penguins.where_column.where(penguins.sex=='Male', 'F')
penguins[['sex', 'if_else_column', 'where_column']].sample(5)
sex if_else_column where_column
341 Male M M
287 Male M M
223 Male M M
214 Female F F
77 Male M M

While the syntax for mask is:

If condition is met, update to other value, otherwise keep the default value:

penguins['mask_column'] = 'M'
penguins['mask_column'] = penguins.mask_column.mask(penguins.sex=='Female', 'F')
penguins[['sex', 'if_else_column', 'where_column', 'mask_column']].sample(5)
sex if_else_column where_column mask_column
220 Female F F F
242 Female F F F
201 Female F F F
286 NaN M F M
249 Male M M M

Again, while these all result in the same column, you may find one method more intuitive, efficient, or faster for the data you are working with. Explore and use what works best for you!

Multiple Conditions

Sometimes, a simple if/else statement doesn’t cut it. Perhaps there are multiple options, and creating an ifelseifelseifelse block will not only only be less intuitive but can easily be used incorrectly, leading to undesired outputs. In this case, the mapping functionality of Pandas provides a vectorized efficient method to create multiple changes to a column. The map method, which we apply to a single series, is optimized for mapping elements from one domain to another. The best example of this is using a dictionary to provide a mapping of your keys (initial column values) to your new value set.

Let’s see this in action with the penguins. Perhaps we do not wish to bias our analysis by having knowledge of which island each penguin came from. In this case, we wish to replace the actual names of the three islands with generic island IDs. First, we supply our mapping dictionary that will transform each island to its generic ID. Then, we supply this dictionary to our map method:

my_mapping = {
    'Torgersen': 'Island 1',
    'Biscoe': 'Island 2',
    'Dream': 'Island 3'
}
penguins['mapped_islands'] = penguins.island.map(my_mapping)
penguins[['island', 'mapped_islands']].sample(10)
island mapped_islands
254 Biscoe Island 2
5 Torgersen Island 1
278 Biscoe Island 2
290 Biscoe Island 2
33 Dream Island 3
72 Torgersen Island 1
76 Torgersen Island 1
109 Biscoe Island 2
342 Biscoe Island 2
20 Biscoe Island 2

Complex Conditions

The above example used the map method, which is inherently a method to be used for Series only. As such, we cannot use it for more complicated conditions that may involve multiple columns. The final example we will consider is a more complicated set of conditions: we wish to create new categories based on the penguins body mass: underweight, overweight, or average weight. However, we also wish to consider the penguin species when creating these categories: what is underweight for the Gentoo species may be overweight for the Adelie! Thus, in theory we have 9 new categories to account for that depend on two columns.

The best (read: efficient) way to approach this multi-column multi-conditional scenario is actually to go outside of the Pandas toolkit and use a tool that has been implemented by numpy. Numpy is a computational tool-kit for Python that utilizes the efficiency of powerful computational languages such as Fortran and C to provide fast and versatile computational tools for the vast majority of numerical and scientific anaylsis in Python. We have seen it already, having used numpy’s statistical functions such as mean and median within our aggregations. Here, we wish to make use of the numpy.select() method.

In numpy, the select method is fairly simple. We supply a list of conditions we wish to consider, and a list of choices for what we wish to happen for each of the conditions. We may also supply a default value, for if none of the conditions are met. select will go through the list one by one, and will apply the choice for the first condition that is met. Let’s see what this looks like for our penguins:

import numpy as np

# list all possible combos of species and specific body masses
# note that we don't include the overweight combos, as we can catch these
# with the default option in select!
conditions = [
    (penguins.species=='Adelie') & (penguins.body_mass_g<3300),
    (penguins.species=='Adelie') & (penguins.body_mass_g<3700),
    (penguins.species=='Chinstrap') & (penguins.body_mass_g<3333),
    (penguins.species=='Chinstrap') & (penguins.body_mass_g<3733),
    (penguins.species=='Gentoo') & (penguins.body_mass_g<4676),
    (penguins.species=='Gentoo') & (penguins.body_mass_g<5076)
]

# for each listed condition, tell select what value to use
choices = [
    'Underweight',
    'Average weight', 
    'Underweight', 
    'Average weight',
    'Underweight',
    'Average weight'
]

# use select to create our new column
penguins['general_weight'] = np.select(conditions, choices, default='Overweight')

# display results
penguins[['species', 'body_mass_g', 'general_weight']].sample(10)
species body_mass_g general_weight
299 Gentoo 5950.0 Overweight
243 Gentoo 5050.0 Average weight
102 Adelie 3075.0 Underweight
100 Adelie 3725.0 Overweight
112 Adelie 3200.0 Underweight
339 Gentoo NaN Overweight
74 Adelie 3700.0 Overweight
302 Gentoo 4725.0 Average weight
192 Chinstrap 3950.0 Overweight
304 Gentoo 4750.0 Average weight
Order Matters!

When using the select method, the order of the conditions matters! Numpy always goes through the list in order, looking for the first TRUE condition. This ordering will matter, for example, if creating a list of bins for some numerical category. Conditions of the form:

conditions = [
    penguins.body_mass_g < 5_000,
    penguins.body_mass_g < 4_000,
    penguins.body_mass_g < 3_000,
    penguins.body_mass_g < 2_000,
    penguins.body_mass_g < 1_000
]

Will never evaluate the choices for the conditions 2-5, as the first condition will be true as well. Instead we should flip this list and write our conditions starting from 1,000 and ending at 5,000. Keep this in mind when creating your conditions!

Challenge 1

Challenge 1

Create a column for 5-10 flipper length bins and plot the resulting frequencies (counts for each bin).

Here we’re going to pull together some Seaborn code and our new-found conditional knowledge to create the bins.

# make sure seaborn is imported
import seaborn as sns
import matplotlib.pyplot as plt

# check roughly where we could place bins
display(penguins.agg({'flipper_length_mm': ['mean', 'std', 'min', 'max']}))

# create some conditions for bins
conditions = [
    penguins.flipper_length_mm < 180,
    penguins.flipper_length_mm < 190,
    penguins.flipper_length_mm < 200,
    penguins.flipper_length_mm < 210,
    penguins.flipper_length_mm < 220,
    penguins.flipper_length_mm < 230
]

# assign names to each condition
choices = [
    '0 - 180',
    '180 - 190',
    '190 - 200',
    '200 - 210',
    '210 - 220',
    '220 - 230'
]

# add the new column
penguins['flipper_length_bins'] = np.select(conditions, choices, default='230 +')

# plot this data using the 'count' option
sns.catplot(
    penguins, 
    x='flipper_length_bins', 
    kind='count',
    order = choices,
    color = 'blue'
)
plt.xticks(rotation=45)
plt.show()
flipper_length_mm
mean 200.915205
std 14.061714
min 172.000000
max 231.000000

9.2 Advanced Grouping

groupby cuteness

In our previous section on Pandas, we explored the basic tools we need to group our data by various columns, apply a host of statistical aggregation operations, and look at these outcomes. While incredibly powerful as is, these tools lack some functionality that you might wish to have. Some examples include:

  • Combining row level values and group level aggregations in a single command.
  • Filtering and subsetting a dataframe based on group values.
  • Combining aggregations across different columns.

In order to handle these operations, we need to introduce two new methods: transform and apply.

Apply… again?

You might question whether apply is actually a new function, seeing as we have used it previously when creating new columns in a dataframe. In fact, the effect of apply within pandas is subtly different depending on how and where we apply it. The effect of apply is different when used on:

  • A dataframe as a whole - dataframe.apply(...)
  • A single series - series.apply(...)
  • A grouped dataframe object - dataframe.groupby('column').apply(...)

Make sure you keep track of what kind of apply you are using when doing your dataframe manipulations, as identical functions in different scenarios can lead to unusually different results! In this section, we are focusing on the grouped dataframe objects.

Both transform and apply can be used on a grouped dataframe object. This is the object that is created when we use the .groupby() method. Let’s create a small dataset and see what this object looks like:

starwars = pd.DataFrame(
    {
        'unit' : [
            'Blaster', 'Blaster', 'Blaster', 
            'Lightsaber', 'Lightsaber', 'Lightsaber', 'Lightsaber', 
            'Stick', 'Stick', 'Stick'
            ], 
        'cost' : [42, 60, 40, 900, 4000, 2000, 100, 10, 1, 5],
        'sale_price' : [50, 75, 42, 1000, 5000, 2000, 4242, 4, 2, 20]
    }
)

starwars
unit cost sale_price
0 Blaster 42 50
1 Blaster 60 75
2 Blaster 40 42
3 Lightsaber 900 1000
4 Lightsaber 4000 5000
5 Lightsaber 2000 2000
6 Lightsaber 100 4242
7 Stick 10 4
8 Stick 1 2
9 Stick 5 20

Here we have a column that looks categorical (unit), and a couple that look numerical (cost and sale_price). We are going to focus on grouping our data according to the categorical column, and see what we can do with the numerical columns.

grouped_starwars = starwars.groupby('unit')
grouped_starwars
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb461b2f310>

Note that this object is no longer a dataframe, but some sort of DataFrameGroupBy object. By itself, it is not very useful, but we can use it to get all sorts of interesting information. Recall briefly how we used this earlier. Maybe we want the average buy and sell prices of each type of unit:

grouped_starwars.mean().reset_index()
unit cost sale_price
0 Blaster 47.333333 55.666667
1 Lightsaber 1750.000000 3060.500000
2 Stick 5.333333 8.666667

What if we wanted to know how much each particular unit contributed to it’s groups overall sale? Or perhaps how much each group profited, on average? This is where transform and apply come in. Let’s look at what each of these do.

Transform

  • Used to apply a function to a dataframe that produces a new dataframe with the same shape as the original.
  • While an entire dataframe can be passed to transform, it only ever sees (operates on) a single column/series at a time.
  • The function must either return a scalar value or a sequence that is the same length as the original dataframe.
  • Useful for getting groups aggregations back into the original dataframe
  • Useful for filtering based on group values

Ex. 1: Percent of Group

Let’s see each this in action. We will use transform to determine how much each unit contributed to the total sales of its group.

pct_of_sales = (
    grouped_starwars
    .sale_price
    .transform(lambda series_: series_/series_.sum())
)
pct_of_sales
0    0.299401
1    0.449102
2    0.251497
3    0.081686
4    0.408430
5    0.163372
6    0.346512
7    0.153846
8    0.076923
9    0.769231
Name: sale_price, dtype: float64

Notice here that we used the lambda function again, which allows us to define a function inline acting on a given argument (in this case, the argument is named series_). This let us allow the value of the series to be the sum of the series. Because we are acting on a grouped object, Pandas knows that the sum we wish to employ is that of the group, not the overall sum! This new object will be in the same order (have the same indexing) as our initial dataframe, and so we can add it back to the dataframe as an extra column if we wish. This works explicitly because the transform method is required to return a series of the same length as the original object, giving us exactly what we want!

starwars['percent_of_sales'] = pct_of_sales
starwars
unit cost sale_price percent_of_sales
0 Blaster 42 50 0.299401
1 Blaster 60 75 0.449102
2 Blaster 40 42 0.251497
3 Lightsaber 900 1000 0.081686
4 Lightsaber 4000 5000 0.408430
5 Lightsaber 2000 2000 0.163372
6 Lightsaber 100 4242 0.346512
7 Stick 10 4 0.153846
8 Stick 1 2 0.076923
9 Stick 5 20 0.769231

Ex 2: Grouped Aggregate

Transform can be used exclusively with an aggregation function as well, but it will still return a value for every row of the initial dataset. This allows us to add grouped level values as columns to the data:

starwars['average_unit_cost'] = grouped_starwars.cost.transform('mean')
starwars
unit cost sale_price percent_of_sales average_unit_cost
0 Blaster 42 50 0.299401 47.333333
1 Blaster 60 75 0.449102 47.333333
2 Blaster 40 42 0.251497 47.333333
3 Lightsaber 900 1000 0.081686 1750.000000
4 Lightsaber 4000 5000 0.408430 1750.000000
5 Lightsaber 2000 2000 0.163372 1750.000000
6 Lightsaber 100 4242 0.346512 1750.000000
7 Stick 10 4 0.153846 5.333333
8 Stick 1 2 0.076923 5.333333
9 Stick 5 20 0.769231 5.333333

Ex 3: Multiple Series

Instead of passing a single column to the transform method, we could also pass the entire dataframe (or some subset of the columns in the dataframe), and calculate values for each series:

grouped_starwars.transform(lambda series_: (series_-series_.mean())/series_.std())
cost sale_price
0 -0.484182 -0.329183
1 1.149932 1.123095
2 -0.665750 -0.793912
3 -0.502909 -1.099922
4 1.331229 1.035331
5 0.147914 -0.566109
6 -0.976235 0.630700
7 1.034910 -0.473016
8 -0.960988 -0.675737
9 -0.073922 1.148754

This has created two columns that have both had the same function applied. Note that if we wanted to merge these in with the original dataframe, we should be careful to rename the columns so that they do not overlap!

Ex 4: Filtering Original Dataframe

Another use case for transform is in filtering a dataframe. Maybe we want to zero in on those items where the sale price of the object was less than the average unit cost. Let’s filter our dataframe to only those rows:

condition = starwars.sale_price < grouped_starwars.cost.transform(lambda x: x.mean())
starwars[condition]
unit cost sale_price percent_of_sales average_unit_cost
2 Blaster 40 42 0.251497 47.333333
3 Lightsaber 900 1000 0.081686 1750.000000
7 Stick 10 4 0.153846 5.333333
8 Stick 1 2 0.076923 5.333333

Apply

  • Used to apply a function (aggregated or otherwise) across multiple columns
  • Implicitly passes all the columns of the dataframe as a dataframe to the function, allowing for column interactions
  • The function can return a scalar or a sequence of any length.

Ex 1. Aggregate over Multiple Columns

Let’s try something with apply. What if we want to know the average overall profit of each group. We could produce a profit column, group up on the unit, and then calculate the mean.

starwars['profit'] = starwars.sale_price - starwars.cost
starwars.groupby('unit').profit.mean()
unit
Blaster          8.333333
Lightsaber    1310.500000
Stick            3.333333
Name: profit, dtype: float64

We could also do this using the apply function, applied to our grouped object:

grouped_starwars.apply(lambda df_: (df_.sale_price - df_.cost).mean())
unit
Blaster          8.333333
Lightsaber    1310.500000
Stick            3.333333
dtype: float64

Again, we made use of that lambda function, this time applying it to a df_ argument. You might notice that here I used a df_ argument instead of series_: this is merely notation, and we could have used anything (series_, df_, x, this_is_my_argument would all work the same). However, we have used df_ and series_ so that we can remind ourselves exactly what type of object we are acting on.

To see how transform differs from apply, let’s try to do that exact same operation:

grouped_starwars.transform(lambda _df: (_df.sale_price - _df.cost).mean())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2563/2867570833.py in ?()
----> 1 grouped_starwars.transform(lambda _df: (_df.sale_price - _df.cost).mean())

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(self, func, engine, engine_kwargs, *args, **kwargs)
   1181     @Substitution(klass="DataFrame")
   1182     @Appender(_transform_template)
   1183     def transform(self, func, *args, engine=None, engine_kwargs=None, **kwargs):
-> 1184         return self._transform(
   1185             func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
   1186         )

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in ?(self, func, engine, engine_kwargs, *args, **kwargs)
   1638         # optimized transforms
   1639         func = com.get_cython_func(func) or func
   1640 
   1641         if not isinstance(func, str):
-> 1642             return self._transform_general(func, *args, **kwargs)
   1643 
   1644         elif func not in base.transform_kernel_allowlist:
   1645             msg = f"'{func}' is not a valid function name for transform(name)"

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(self, func, *args, **kwargs)
   1157             except TypeError:
   1158                 return self._transform_item_by_item(obj, fast_path)
   1159             except ValueError as err:
   1160                 msg = "transform must return a scalar value for each group"
-> 1161                 raise ValueError(msg) from err
   1162             if group.size > 0:
   1163                 res = _wrap_transform_general_frame(self.obj, group, res)
   1164                 applied.append(res)

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(self, fast_path, slow_path, group)
   1206     def _choose_path(self, fast_path: Callable, slow_path: Callable, group: DataFrame):
   1207         path = slow_path
-> 1208         res = slow_path(group)
   1209 
   1210         # if we make it here, test if we can use the fast path
   1211         try:

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(group)
-> 1201             slow_path = lambda group: group.apply(
   1202                 lambda x: func(x, *args, **kwargs), axis=self.axis
   1203             )

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/frame.py in ?(self, func, axis, raw, result_type, args, **kwargs)
   8844             result_type=result_type,
   8845             args=args,
   8846             kwargs=kwargs,
   8847         )
-> 8848         return op.apply().__finalize__(self, method="apply")

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/apply.py in ?(self)
    729         # raw
    730         elif self.raw:
    731             return self.apply_raw()
    732 
--> 733         return self.apply_standard()

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/apply.py in ?(self)
    856     def apply_standard(self):
--> 857         results, res_index = self.apply_series_generator()
    858 
    859         # wrap results
    860         return self.wrap_results(results, res_index)

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/apply.py in ?(self)
    873                 results[i] = self.f(v)
    874                 if isinstance(results[i], ABCSeries):
    875                     # If we have a view on v, we need to make a copy because
    876                     #  series_generator will swap out the underlying data
--> 877                     results[i] = results[i].copy(deep=False)
    878 
    879         return results, res_index

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(x)
-> 1202                 lambda x: func(x, *args, **kwargs), axis=self.axis

/tmp/ipykernel_2563/2867570833.py in ?(_df)
----> 1 grouped_starwars.transform(lambda _df: (_df.sale_price - _df.cost).mean())

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/generic.py in ?(self, name)
   5571             and name not in self._accessors
   5572             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573         ):
   5574             return self[name]
-> 5575         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'sale_price'

This fails because the object being acted on inside transform is itself just a series object, not the entire dataframe! As such, it has no attribute for sale_price or buy_price like our original dataframe does. Instead, it acts on the sale_price series, then the buy_price series, and returns the results.

Ex 2. Mix Row/Aggregate Levels

While the first apply example returns a rolled up aggregated dataframe, we can also use apply to return the individual rows of the dataframe by mixing aggregation functions with row level functions.

grouped_starwars.apply(lambda df_: df_.sale_price - df_.cost.mean())
unit         
Blaster     0       2.666667
            1      27.666667
            2      -5.333333
Lightsaber  3    -750.000000
            4    3250.000000
            5     250.000000
            6    2492.000000
Stick       7      -1.333333
            8      -3.333333
            9      14.666667
Name: sale_price, dtype: float64

In this way we have to be careful: transform will always return a dataframe that is the same size as the original, while apply will return something that varies with the type of function we have utilized.

Ex 3. Partial Aggregates

Apply lets us play some cool tricks as well. Suppose we only wanted to know about the two most expensive sales in each category. How could we filter to show this? We can use the nlargest (or smallest) method in conjunction with apply. nlargest does exactly what we might expect, returning the rows with the n largest values according to the provided column(s):

starwars.nlargest(2, 'sale_price')
unit cost sale_price percent_of_sales average_unit_cost profit
4 Lightsaber 4000 5000 0.408430 1750.0 1000
6 Lightsaber 100 4242 0.346512 1750.0 4142

But if we mix this with apply and our grouping dataframe, we can get the largest for each group!

grouped_starwars.apply(lambda df_: df_.nlargest(2, 'sale_price'))
unit cost sale_price percent_of_sales average_unit_cost profit
unit
Blaster 1 Blaster 60 75 0.449102 47.333333 15
0 Blaster 42 50 0.299401 47.333333 8
Lightsaber 4 Lightsaber 4000 5000 0.408430 1750.000000 1000
6 Lightsaber 100 4242 0.346512 1750.000000 4142
Stick 9 Stick 5 20 0.769231 5.333333 15
7 Stick 10 4 0.153846 5.333333 -6

Note that when apply produces some different sized aggregate than the original dataframe, it tacks on an extra index indicating what the grouper was. We do not always care about this, and can eliminate it in the initial creation of our grouped dataframe via the group_keys argument:

(
    starwars
    .groupby('unit', group_keys=False)
    .apply(lambda df_: df_.nlargest(2, 'sale_price'))
)
unit cost sale_price percent_of_sales average_unit_cost profit
1 Blaster 60 75 0.449102 47.333333 15
0 Blaster 42 50 0.299401 47.333333 8
4 Lightsaber 4000 5000 0.408430 1750.000000 1000
6 Lightsaber 100 4242 0.346512 1750.000000 4142
9 Stick 5 20 0.769231 5.333333 15
7 Stick 10 4 0.153846 5.333333 -6

Ex 4. Create Multiple Columns

We can use the dataframe behaviour of apply to create multiple additional columns all at once within an apply. Let’s create a function that will act on each dataframe group, create new values, and return the dataframe.

def create_multiple_columns(df_):
    df_['average_group_profit'] = (df_.sale_price - df_.cost).mean()
    df_['profit_relative_to_group'] = df_['profit']/df_['average_group_profit']
    return df_

grouped_starwars.apply(create_multiple_columns)
unit cost sale_price percent_of_sales average_unit_cost profit average_group_profit profit_relative_to_group
0 Blaster 42 50 0.299401 47.333333 8 8.333333 0.960000
1 Blaster 60 75 0.449102 47.333333 15 8.333333 1.800000
2 Blaster 40 42 0.251497 47.333333 2 8.333333 0.240000
3 Lightsaber 900 1000 0.081686 1750.000000 100 1310.500000 0.076307
4 Lightsaber 4000 5000 0.408430 1750.000000 1000 1310.500000 0.763068
5 Lightsaber 2000 2000 0.163372 1750.000000 0 1310.500000 0.000000
6 Lightsaber 100 4242 0.346512 1750.000000 4142 1310.500000 3.160626
7 Stick 10 4 0.153846 5.333333 -6 3.333333 -1.800000
8 Stick 1 2 0.076923 5.333333 1 3.333333 0.300000
9 Stick 5 20 0.769231 5.333333 15 3.333333 4.500000


When done this way, we are explicitly returning the dataframe, regardless of if it was grouped or not. Even if all of the new columns are aggregates, this will still produce a non-aggregated output:

def create_multiple_aggregated_columns(df_):
    df_['average_group_profit'] = (df_.sale_price - df_.cost).mean()
    df_['average_sale_price'] = df_.sale_price.mean()
    return df_

grouped_starwars.apply(create_multiple_aggregated_columns)
unit cost sale_price percent_of_sales average_unit_cost profit average_group_profit average_sale_price
0 Blaster 42 50 0.299401 47.333333 8 8.333333 55.666667
1 Blaster 60 75 0.449102 47.333333 15 8.333333 55.666667
2 Blaster 40 42 0.251497 47.333333 2 8.333333 55.666667
3 Lightsaber 900 1000 0.081686 1750.000000 100 1310.500000 3060.500000
4 Lightsaber 4000 5000 0.408430 1750.000000 1000 1310.500000 3060.500000
5 Lightsaber 2000 2000 0.163372 1750.000000 0 1310.500000 3060.500000
6 Lightsaber 100 4242 0.346512 1750.000000 4142 1310.500000 3060.500000
7 Stick 10 4 0.153846 5.333333 -6 3.333333 8.666667
8 Stick 1 2 0.076923 5.333333 1 3.333333 8.666667
9 Stick 5 20 0.769231 5.333333 15 3.333333 8.666667

Challenge 2

Challenge 2

Determine which of the units in each group had the greatest profit, and return a dataframe containing only the rows corresponding to these units.

There are a variety of different ways this could be approached. We have done a similar question using apply already, so let’s do this via the transform function.

condition = starwars.profit == starwars.groupby('unit').profit.transform('max')

starwars[condition]
unit cost sale_price percent_of_sales average_unit_cost profit
1 Blaster 60 75 0.449102 47.333333 15
6 Lightsaber 100 4242 0.346512 1750.000000 4142
9 Stick 5 20 0.769231 5.333333 15

9.3 Windowing Functions

Often when we are looking at our data, we may have a question about how one value compares to the values around it. This is typically useful for timeseries data, where we can organize our data by date (or year, month, etc). Once sorted on a date, we might wish to know things such as:

  • Did this year have a lower or higher average than the previous year?
  • What is the roling average over the last 5 years?

Or many other comparative questions. We can answer some of these questions using the windowing capabilities of Pandas. Windowing is the act of looking at a single ‘window’ of the data, asking our question (what is the average value for the time period in this window?), and then sliding the window to do it again. You have likely heard the term rolling average - this is what windowing allows us to calculate.

Rolling

To calculate a rolling average, we use two methods. First, we indicate that we want rolling statistics by using the rolling() method, and include in the argument the number of periods (rows) to include. After the rolling method, we indicate which aggregation we would like to use. Let’s try this out on a timeseries dataset:

dowjones = load_dataset('dowjones')
dowjones.head()
Date Price
0 1914-12-01 55.00
1 1915-01-01 56.55
2 1915-02-01 56.00
3 1915-03-01 58.30
4 1915-04-01 66.45

This is a monthly aggregate of the Dow Jones index. Let’s compare this monthly aggregate to a 6 month rolling average:

dowjones['six_month_avg'] = dowjones.rolling(6).Price.mean()
dowjones.head(10)
Date Price six_month_avg
0 1914-12-01 55.00 NaN
1 1915-01-01 56.55 NaN
2 1915-02-01 56.00 NaN
3 1915-03-01 58.30 NaN
4 1915-04-01 66.45 NaN
5 1915-05-01 65.95 59.708333
6 1915-06-01 68.40 61.941667
7 1915-07-01 71.85 64.491667
8 1915-08-01 79.25 68.366667
9 1915-09-01 85.50 72.900000
Sort your Data!

When using the rolling method this way, it assumes that the data is already sorted in the order you want! The dowjones dataset comes pre-sorted, but this isn’t always the case. Before doing any sort of windowing function, it is always a good idea to pre-sort your data! For the dowjones dataset, we would sort on the Date column:

dowjones = dowjones.sort_values(by='Date')

You’ll notice that for the first 5 months, this method returns NaN - this is because there were not enough values to include in the average, as the method assumes you always wish to have exactly 6 values in each rolling window. We can alter this however, using the min_periods argument:

dowjones['six_month_avg'] = dowjones.rolling(6, min_periods=1).Price.mean()
dowjones.head(10)
Date Price six_month_avg
0 1914-12-01 55.00 55.000000
1 1915-01-01 56.55 55.775000
2 1915-02-01 56.00 55.850000
3 1915-03-01 58.30 56.462500
4 1915-04-01 66.45 58.460000
5 1915-05-01 65.95 59.708333
6 1915-06-01 68.40 61.941667
7 1915-07-01 71.85 64.491667
8 1915-08-01 79.25 68.366667
9 1915-09-01 85.50 72.900000

Let’s see what these rolling averages look like visually:

import matplotlib.pyplot as plt

# add more averages
dowjones['five_year'] = dowjones.rolling(60, min_periods=1).Price.mean()

# plot the original data
plt.plot(dowjones.Date, dowjones.Price, label='Actual Data')

# plot the smoothed data
plt.plot(dowjones.Date, dowjones.six_month_avg, label='Six Month')
plt.plot(dowjones.Date, dowjones.five_year, label='Five Year')

# add a legend
plt.legend()

plt.show()

The six month average is starting to smooth out the jagged peaks in the original data, while the five-year average has completely smoothed out all the wiggles!

Cumulative

While the rolling capabilities will provide us a fixed window into which we can calculate aggregate statistics, we may also wish to calculate cumulative statistics. The most common cumulative statistic is the cumulative sum, which calculates the sum of the current row and all preceding rows. In Pandas, we can calculate this with the cumsum method.

dowjones['cumulative_price'] = dowjones.Price.cumsum()
dowjones.head()
Date Price six_month_avg five_year cumulative_price
0 1914-12-01 55.00 55.0000 55.0000 55.00
1 1915-01-01 56.55 55.7750 55.7750 111.55
2 1915-02-01 56.00 55.8500 55.8500 167.55
3 1915-03-01 58.30 56.4625 56.4625 225.85
4 1915-04-01 66.45 58.4600 58.4600 292.30

Shifts

Finally, knowing the value preceding or following the current value is sometimes of interest. Forecast modeling will often use the current value and past values to predict future values, for example. To retrieve lead and lag values, Pandas has the shift method. This takes as an argument the number of periods by which we wish to shift our dataset (defaulting to 1). The periods can be negative, letting us get future values as well! Let’s look at this with the dowjones dataset again:

dowjones['year_prior'] = dowjones.Price.shift(12)
dowjones['year_after'] = dowjones.Price.shift(-12)

dowjones.head()
Date Price six_month_avg five_year cumulative_price year_prior year_after
0 1914-12-01 55.00 55.0000 55.0000 55.00 NaN 97.00
1 1915-01-01 56.55 55.7750 55.7750 111.55 NaN 94.70
2 1915-02-01 56.00 55.8500 55.8500 167.55 NaN 93.55
3 1915-03-01 58.30 56.4625 56.4625 225.85 NaN 93.30
4 1915-04-01 66.45 58.4600 58.4600 292.30 NaN 89.75

To show that this is actually the year before and after, let’s filter to a year somewhere in the middle.

In order to filter to the middle year, we need to be sure of the type of data we are working with. Let’s take a quick look at the Date column, and see what the datatype is:

print(dowjones.Date.dtype)
datetime64[ns]

This is a new column type that we have not encountered in detail before, but it tells us that this column contains date information. If we were to look at a single value in the Series, we would see that it is not simply a ‘YYYY-MM-DD’ string as displayed when we view the dataframe:

single_date = dowjones.Date[0]
print(single_date)
print(type(single_date))
1914-12-01 00:00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

Because this is not a string, we cannot do a simple string comparison between two dates:

print(single_date == '1914-12-01')
False

To filter based on date columns then, we should instead create a date type object in Python. The easiest way to do that is using the built in python library datetime.

from datetime import datetime

# create a date-like object
new_date = datetime(1914, 12, 1)
print(new_date)

# compare with the single date
print(single_date == new_date)
1914-12-01 00:00:00
True

For more information on dates in Pandas, see the time series/date functionality page for Pandas, and the numpy documentation for datetime64.

from datetime import datetime

# create dates for the filter
current_year = datetime(1942, 8, 1)
previous_year = datetime(1941, 8, 1)
next_year = datetime(1943, 8, 1)

# filter the date column
dowjones[
    dowjones.Date.isin([current_year, previous_year, next_year])
][
    ['Date', 'Price', 'year_prior', 'year_after']
]
Date Price year_prior year_after
320 1941-08-01 126.55 125.35 106.20
332 1942-08-01 106.20 126.55 136.22
344 1943-08-01 136.22 106.20 146.93

If we look at these three lines, we see that the year_prior and year_after columns do in fact give us the corresponding values from the initial dataset. And if we plot this, we will see three copies of the data, each shifted by a year:

plt.plot(dowjones.Date, dowjones.year_prior, label='past')
plt.plot(dowjones.Date, dowjones.Price, label='current')
plt.plot(dowjones.Date, dowjones.year_after, label='future')

plt.legend()

plt.show()

Challenge 3

Challenge 3

Find the average, minimum, and maximum Dow Jones prices on a yearly basis. Plot all three values to see how they compare! (Recall: this dataset is currently monthly)

# first let's find the three aggregations we want
dowjones_rolling = dowjones.rolling(12, min_periods=1).agg(
    {'Price': ['min', 'max', 'mean']}
)
display(dowjones_rolling.head())

# get rid of hierarchical naming scheme
dowjones_rolling.columns = ['price_min', 'price_max', 'price_mean']

# plot these (note that this DF doesn't have the dates, so we use the old DF dates)
# here's a fancy plot that will show a coloured band!
plt.fill_between(
    dowjones.Date, 
    dowjones_rolling.price_min, 
    dowjones_rolling.price_max, 
    alpha=0.3 # alpha makes the band more transparent
) 
plt.plot(dowjones.Date, dowjones_rolling.price_mean)
plt.show()
Price
min max mean
0 55.0 55.00 55.0000
1 55.0 56.55 55.7750
2 55.0 56.55 55.8500
3 55.0 58.30 56.4625
4 55.0 66.45 58.4600

9.4 Reshaping Data

Wide or Long?

Another fairly common task in data analysis is pivoting our data. Typically this means we are creating a wide-form table that has more columns than originally. This may be near the end of a particular analysis and we want to prepare a final table for a report, or perhaps we wish to pick out specific values from a column for more detailed inspection. Another common use of a pivot table might be for timeseries data, and we wish to separate different variables into their own individual columns. In these cases, we can use the pivot and pivot_table methods within Pandas.

In the other direction, we may wish to ‘unpivot’ a dataset. This will often occur early on in an analysis, where we wish to create a long-form table where multiple columns have been combined into a single set of id and value columns. This will be done with the melt method.

Let’s explore these ideas with the gapminder dataset.

url = "https://raw.githubusercontent.com/bcgov/"\
    "ds-intro-to-python/main/data/gapfinder.csv"

gapminder = pd.read_csv(url)
gapminder.head()
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710
3 Afghanistan 1967 11537966.0 Asia 34.020 836.197138
4 Afghanistan 1972 13079460.0 Asia 36.088 739.981106

Long to Wide

Pivoting a Table

pivot

The first thing we may wish to do is a simple pivot in which we make the table wider than it was before. The simplest way to do this is by the pivot method. This takes as input three arguments:

  • columns: the only required argument! Column(s) we wish to use for the new columns of our dataframe. If more than one is provided, a hierarchy of columns is created.
  • index: the column(s) we wish to use for the index of our new dataframe. This is entirely optional, and will just default to the pre-existing index if not supplied.
  • values: the value(s) we wish to retain for the new dataframe. This is also optional, and will use all leftover columns if nothing is provided. If more than one column is used here, it again creates a hierarchy in the columns.

The output of the pivot method is a new dataframe that has the requested index/columns, with the corresponding value associated with each index-column pair in the dataframe.

For instance, perhaps we wish to explore the population of each country individually overtime. In this case, we can pivot our gapminder dataset so that each of the countries in the original country column become their own column, and compare the population value against the year:

gapminder_year_country = gapminder.pivot(
    index='year',
    columns='country',
    values='pop'
)
gapminder_year_country.head()
country Afghanistan Albania Algeria Angola Argentina Australia Austria Bahrain Bangladesh Belgium ... Uganda United Kingdom United States Uruguay Venezuela Vietnam West Bank and Gaza Yemen Rep. Zambia Zimbabwe
year
1952 8425333.0 1282697.0 9279525.0 4232095.0 17876956.0 8691212.0 6927772.0 120447.0 46886859.0 8730405.0 ... 5824797.0 50430000.0 157553000.0 2252965.0 5439568.0 26246839.0 1030585.0 4963829.0 2672000.0 3080907.0
1957 9240934.0 1476505.0 10270856.0 4561361.0 19610538.0 9712569.0 6965860.0 138655.0 51365468.0 8989111.0 ... 6675501.0 51430000.0 171984000.0 2424959.0 6702668.0 28998543.0 1070439.0 5498090.0 3016000.0 3646340.0
1962 10267083.0 1728137.0 11000948.0 4826015.0 21283783.0 10794968.0 7129864.0 171863.0 56839289.0 9218400.0 ... 7688797.0 53292000.0 186538000.0 2598466.0 8143375.0 33796140.0 1133134.0 6120081.0 3421000.0 4277736.0
1967 11537966.0 1984060.0 12760499.0 5247469.0 22934225.0 11872264.0 7376998.0 202182.0 62821884.0 9556500.0 ... 8900294.0 54959000.0 198712000.0 2748579.0 9709552.0 39463910.0 1142636.0 6740785.0 3900000.0 4995432.0
1972 13079460.0 2263554.0 14760787.0 5894858.0 24779799.0 13177000.0 7544201.0 230800.0 70759295.0 9709100.0 ... 10190285.0 56079000.0 209896000.0 2829526.0 11515649.0 44655014.0 1089572.0 7407075.0 4506497.0 5861135.0

5 rows × 142 columns

Note that doing this moves ‘year’ into the index of the dataframe, and gives the columns the name of ‘country’. We can move ‘year’ back into the core of the dataframe using reset_index(), and remove the column name using the name attribute of the columns:

gapminder_year_country = gapminder_year_country.reset_index()
gapminder_year_country.columns.name = None
gapminder_year_country.head()
year Afghanistan Albania Algeria Angola Argentina Australia Austria Bahrain Bangladesh ... Uganda United Kingdom United States Uruguay Venezuela Vietnam West Bank and Gaza Yemen Rep. Zambia Zimbabwe
0 1952 8425333.0 1282697.0 9279525.0 4232095.0 17876956.0 8691212.0 6927772.0 120447.0 46886859.0 ... 5824797.0 50430000.0 157553000.0 2252965.0 5439568.0 26246839.0 1030585.0 4963829.0 2672000.0 3080907.0
1 1957 9240934.0 1476505.0 10270856.0 4561361.0 19610538.0 9712569.0 6965860.0 138655.0 51365468.0 ... 6675501.0 51430000.0 171984000.0 2424959.0 6702668.0 28998543.0 1070439.0 5498090.0 3016000.0 3646340.0
2 1962 10267083.0 1728137.0 11000948.0 4826015.0 21283783.0 10794968.0 7129864.0 171863.0 56839289.0 ... 7688797.0 53292000.0 186538000.0 2598466.0 8143375.0 33796140.0 1133134.0 6120081.0 3421000.0 4277736.0
3 1967 11537966.0 1984060.0 12760499.0 5247469.0 22934225.0 11872264.0 7376998.0 202182.0 62821884.0 ... 8900294.0 54959000.0 198712000.0 2748579.0 9709552.0 39463910.0 1142636.0 6740785.0 3900000.0 4995432.0
4 1972 13079460.0 2263554.0 14760787.0 5894858.0 24779799.0 13177000.0 7544201.0 230800.0 70759295.0 ... 10190285.0 56079000.0 209896000.0 2829526.0 11515649.0 44655014.0 1089572.0 7407075.0 4506497.0 5861135.0

5 rows × 143 columns

We could make a slightly more complex dataset by including the continent as well in our new column scheme:

gapminder_year_country_continent = gapminder.pivot(
    index='year',
    columns=['country', 'continent'],
    values='pop'
)
gapminder_year_country_continent.head()
country Afghanistan Albania Algeria Angola Argentina Australia Austria Bahrain Bangladesh Belgium ... Uganda United Kingdom United States Uruguay Venezuela Vietnam West Bank and Gaza Yemen Rep. Zambia Zimbabwe
continent Asia Europe Africa Africa Americas Oceania Europe Asia Asia Europe ... Africa Europe Americas Americas Americas Asia Asia Asia Africa Africa
year
1952 8425333.0 1282697.0 9279525.0 4232095.0 17876956.0 8691212.0 6927772.0 120447.0 46886859.0 8730405.0 ... 5824797.0 50430000.0 157553000.0 2252965.0 5439568.0 26246839.0 1030585.0 4963829.0 2672000.0 3080907.0
1957 9240934.0 1476505.0 10270856.0 4561361.0 19610538.0 9712569.0 6965860.0 138655.0 51365468.0 8989111.0 ... 6675501.0 51430000.0 171984000.0 2424959.0 6702668.0 28998543.0 1070439.0 5498090.0 3016000.0 3646340.0
1962 10267083.0 1728137.0 11000948.0 4826015.0 21283783.0 10794968.0 7129864.0 171863.0 56839289.0 9218400.0 ... 7688797.0 53292000.0 186538000.0 2598466.0 8143375.0 33796140.0 1133134.0 6120081.0 3421000.0 4277736.0
1967 11537966.0 1984060.0 12760499.0 5247469.0 22934225.0 11872264.0 7376998.0 202182.0 62821884.0 9556500.0 ... 8900294.0 54959000.0 198712000.0 2748579.0 9709552.0 39463910.0 1142636.0 6740785.0 3900000.0 4995432.0
1972 13079460.0 2263554.0 14760787.0 5894858.0 24779799.0 13177000.0 7544201.0 230800.0 70759295.0 9709100.0 ... 10190285.0 56079000.0 209896000.0 2829526.0 11515649.0 44655014.0 1089572.0 7407075.0 4506497.0 5861135.0

5 rows × 142 columns

Or by insisting that we retain information about more than just the population:

gapminder_year_country_two_vals = gapminder.pivot(
    index='year',
    columns='country',
    values=['pop', 'lifeExp']
)
gapminder_year_country_two_vals.head()
pop ... lifeExp
country Afghanistan Albania Algeria Angola Argentina Australia Austria Bahrain Bangladesh Belgium ... Uganda United Kingdom United States Uruguay Venezuela Vietnam West Bank and Gaza Yemen Rep. Zambia Zimbabwe
year
1952 8425333.0 1282697.0 9279525.0 4232095.0 17876956.0 8691212.0 6927772.0 120447.0 46886859.0 8730405.0 ... 39.978 69.18 68.44 66.071 55.088 40.412 43.160 32.548 42.038 48.451
1957 9240934.0 1476505.0 10270856.0 4561361.0 19610538.0 9712569.0 6965860.0 138655.0 51365468.0 8989111.0 ... 42.571 70.42 69.49 67.044 57.907 42.887 45.671 33.970 44.077 50.469
1962 10267083.0 1728137.0 11000948.0 4826015.0 21283783.0 10794968.0 7129864.0 171863.0 56839289.0 9218400.0 ... 45.344 70.76 70.21 68.253 60.770 45.363 48.127 35.180 46.023 52.358
1967 11537966.0 1984060.0 12760499.0 5247469.0 22934225.0 11872264.0 7376998.0 202182.0 62821884.0 9556500.0 ... 48.051 71.36 70.76 68.468 63.479 47.838 51.631 36.984 47.768 53.995
1972 13079460.0 2263554.0 14760787.0 5894858.0 24779799.0 13177000.0 7544201.0 230800.0 70759295.0 9709100.0 ... 51.016 72.01 71.34 68.673 65.712 50.254 56.532 39.848 50.107 55.635

5 rows × 284 columns

What if we wish to know about the average population of each continent over time? If we try the same approach as above, we will immediately run into an issue:

gapminder_year_continent = gapminder.pivot(
    index='year',
    columns='continent',
    values='pop'
)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[46], line 1
----> 1 gapminder_year_continent = gapminder.pivot(
      2     index='year',
      3     columns='continent',
      4     values='pop'
      5 )

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/frame.py:7885, in DataFrame.pivot(self, index, columns, values)
   7880 @Substitution("")
   7881 @Appender(_shared_docs["pivot"])
   7882 def pivot(self, index=None, columns=None, values=None) -> DataFrame:
   7883     from pandas.core.reshape.pivot import pivot
-> 7885     return pivot(self, index=index, columns=columns, values=values)

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/reshape/pivot.py:520, in pivot(data, index, columns, values)
    518     else:
    519         indexed = data._constructor_sliced(data[values]._values, index=multiindex)
--> 520 return indexed.unstack(columns_listlike)

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/series.py:4157, in Series.unstack(self, level, fill_value)
   4114 """
   4115 Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.
   4116 
   (...)
   4153 b    2    4
   4154 """
   4155 from pandas.core.reshape.reshape import unstack
-> 4157 return unstack(self, level, fill_value)

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/reshape/reshape.py:491, in unstack(obj, level, fill_value)
    489 if is_1d_only_ea_dtype(obj.dtype):
    490     return _unstack_extension_series(obj, level, fill_value)
--> 491 unstacker = _Unstacker(
    492     obj.index, level=level, constructor=obj._constructor_expanddim
    493 )
    494 return unstacker.get_result(
    495     obj._values, value_columns=None, fill_value=fill_value
    496 )

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/reshape/reshape.py:140, in _Unstacker.__init__(self, index, level, constructor)
    133 if num_cells > np.iinfo(np.int32).max:
    134     warnings.warn(
    135         f"The following operation may generate {num_cells} cells "
    136         f"in the resulting pandas object.",
    137         PerformanceWarning,
    138     )
--> 140 self._make_selectors()

File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/reshape/reshape.py:192, in _Unstacker._make_selectors(self)
    189 mask.put(selector, True)
    191 if mask.sum() < len(self.index):
--> 192     raise ValueError("Index contains duplicate entries, cannot reshape")
    194 self.group_index = comp_index
    195 self.mask = mask

ValueError: Index contains duplicate entries, cannot reshape

Here we find that there is more than one value associated with each year-continent pairing, and so the pivot method does not know how to assign a value to the dataframe. This is because pivot is for unique values only. While it is an excellent simple-use tool, we can go one step further to deal with non-unique values.

pivot_table

To address the year-continent question above, we could first group our data by continent, calculate the average population every year, and then pivot the resultant dataframe. This is a completely valid method! However, pandas also offers a built in method that will do aggregations and pivots all at once. This is similar to the functionality of Excel pivot tables, and includes some additional functionality we would not find in groupby alone. Some of the important arguments that we can use in the pivot_table method include:

  • columns: the only required argument! Column(s) (and values within) we wish to use for the resultant dataframes columns.
  • values: the column(s) we wish to aggregate. If no columns are provided, all leftover (non column/index columns) columns will be used that make ‘sense’. For example, the average value of a list of strings does not make sense, but the ‘max’ value of a string list could be calculated.
  • index: the column(s) we wish to use for the resultant dataframes index
  • aggfunc: the method of aggregation we wish to use. Default is to calculate the average.
  • fill_value: the value to replace missing values with (after aggregation).

Let’s go through some examples with increasing complexity. The simplest thing we can do is pivot a single column, to which we will get an output for the averages of all numerical columns.

gapminder.pivot_table(
    columns='continent',
    values = ['gdpPercap', 'lifeExp', 'pop', 'year']

)
continent Africa Americas Asia Europe Oceania
gdpPercap 2.193755e+03 7.136110e+03 7.902150e+03 1.446948e+04 1.862161e+04
lifeExp 4.886533e+01 6.465874e+01 6.006490e+01 7.190369e+01 7.432621e+01
pop 9.916003e+06 2.450479e+07 7.703872e+07 1.716976e+07 8.874672e+06
year 1.979500e+03 1.979500e+03 1.979500e+03 1.979500e+03 1.979500e+03

Next we may wish to add an index, and focus on a single value for our output. This matches the question we asked above and could not answer with the pivot method:

gapminder_year_continent = gapminder.pivot_table(
    index='year',
    columns='continent',
    values='pop'
)
gapminder_year_continent.head()
continent Africa Americas Asia Europe Oceania
year
1952 4.570010e+06 13806097.84 4.228356e+07 1.393736e+07 5343003.0
1957 5.093033e+06 15478156.64 4.735699e+07 1.459635e+07 5970988.0
1962 5.702247e+06 17330810.16 5.140476e+07 1.534517e+07 6641759.0
1967 6.447875e+06 19229864.92 5.774736e+07 1.603930e+07 7300207.0
1972 7.305376e+06 21175368.40 6.518098e+07 1.668784e+07 8053050.0

If we tried to do the same calculation, but use a string column instead of a numerical column for the values we will get a warning about trying to use invalid columns:

gapminder_year_continent_country = gapminder.pivot_table(
    index='year',
    columns='continent',
    values='country'
)
gapminder_year_continent_country.head()
/tmp/ipykernel_2563/2473994706.py:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.mean is deprecated. In a future version, a TypeError will be raised. Before calling .mean, select only columns which should be valid for the function.
  gapminder_year_continent_country = gapminder.pivot_table(
continent
year
1952
1957
1962
1967
1972

However, we can switch our aggregation method to something that is valid for strings (using either string arguments for common aggregations such as ‘max’, or more complicated numpy or user-defined functions as well).

gapminder_year_continent_country = gapminder.pivot_table(
    index='year',
    columns='continent',
    values='country',
    aggfunc='max'
)
gapminder_year_continent_country.head()
continent Africa Americas Asia Europe Oceania
year
1952 Zimbabwe Venezuela Yemen Rep. United Kingdom New Zealand
1957 Zimbabwe Venezuela Yemen Rep. United Kingdom New Zealand
1962 Zimbabwe Venezuela Yemen Rep. United Kingdom New Zealand
1967 Zimbabwe Venezuela Yemen Rep. United Kingdom New Zealand
1972 Zimbabwe Venezuela Yemen Rep. United Kingdom New Zealand

Finally, similar to our groupby aggregations, we can also pass a dictionary to aggfunc. The dictionary contains the columns we wish to use for values as keys, and the type of aggregation(s) we wish to do as the values.

import numpy as np

gapminder_multiple_pivots = gapminder.pivot_table(
    index='year',
    columns='continent',
    aggfunc={
        'pop': [np.mean, np.max],
        'lifeExp': 'mean',
        'country': [min, max]
    }
)
gapminder_multiple_pivots.head()
country ... pop
max min ... max mean
continent Africa Americas Asia Europe Oceania Africa Americas Asia Europe Oceania ... Africa Americas Asia Europe Oceania Africa Americas Asia Europe Oceania
year
1952 Zimbabwe Venezuela Yemen Rep. United Kingdom New Zealand Algeria Argentina Afghanistan Albania Australia ... 33119096.0 157553000.0 5.562635e+08 69145952.0 8691212.0 4.570010e+06 13806097.84 4.228356e+07 1.393736e+07 5343003.0
1957 Zimbabwe Venezuela Yemen Rep. United Kingdom New Zealand Algeria Argentina Afghanistan Albania Australia ... 37173340.0 171984000.0 6.374080e+08 71019069.0 9712569.0 5.093033e+06 15478156.64 4.735699e+07 1.459635e+07 5970988.0
1962 Zimbabwe Venezuela Yemen Rep. United Kingdom New Zealand Algeria Argentina Afghanistan Albania Australia ... 41871351.0 186538000.0 6.657700e+08 73739117.0 10794968.0 5.702247e+06 17330810.16 5.140476e+07 1.534517e+07 6641759.0
1967 Zimbabwe Venezuela Yemen Rep. United Kingdom New Zealand Algeria Argentina Afghanistan Albania Australia ... 47287752.0 198712000.0 7.545500e+08 76368453.0 11872264.0 6.447875e+06 19229864.92 5.774736e+07 1.603930e+07 7300207.0
1972 Zimbabwe Venezuela Yemen Rep. United Kingdom New Zealand Algeria Argentina Afghanistan Albania Australia ... 53740085.0 209896000.0 8.620300e+08 78717088.0 13177000.0 7.305376e+06 21175368.40 6.518098e+07 1.668784e+07 8053050.0

5 rows × 25 columns

Wide to Long

I’m meeelllting!

melt

The opposite of pivoting is melting (…obviously?). This is often used to get data back into long-form. Long-form data will typically have all different possible categories for a single measure in a single column (having a country column instead of Algeria, Albania, Canada… Zimbabwe columns). Long-form data is typically the preferred data style when doing your actual analysis. For example, many of the Seaborns methods we considered for creating stunning visuals assume that the data is in long-form, and is optimized to split out categories internally instead of pulling from multiple different columns.

However, as our data does not always come in the way we want it, we might have to unpivot (melt) our data to get it into long-form first. The melt method is how we attack this in Pandas, and it uses the following arguments:

  • id_vars: This is a required argument! Column(s) that will be used as identifiers. These are columns that we do not wish to unpivot, but leave as is.
  • value_vars: Column(s) to unpivot. If left out, the default is to use all non-id columns.

Let’s look at one of the pivoted tables we made earlier, and try to get it back into long form:

gapminder_year_country.head()
year Afghanistan Albania Algeria Angola Argentina Australia Austria Bahrain Bangladesh ... Uganda United Kingdom United States Uruguay Venezuela Vietnam West Bank and Gaza Yemen Rep. Zambia Zimbabwe
0 1952 8425333.0 1282697.0 9279525.0 4232095.0 17876956.0 8691212.0 6927772.0 120447.0 46886859.0 ... 5824797.0 50430000.0 157553000.0 2252965.0 5439568.0 26246839.0 1030585.0 4963829.0 2672000.0 3080907.0
1 1957 9240934.0 1476505.0 10270856.0 4561361.0 19610538.0 9712569.0 6965860.0 138655.0 51365468.0 ... 6675501.0 51430000.0 171984000.0 2424959.0 6702668.0 28998543.0 1070439.0 5498090.0 3016000.0 3646340.0
2 1962 10267083.0 1728137.0 11000948.0 4826015.0 21283783.0 10794968.0 7129864.0 171863.0 56839289.0 ... 7688797.0 53292000.0 186538000.0 2598466.0 8143375.0 33796140.0 1133134.0 6120081.0 3421000.0 4277736.0
3 1967 11537966.0 1984060.0 12760499.0 5247469.0 22934225.0 11872264.0 7376998.0 202182.0 62821884.0 ... 8900294.0 54959000.0 198712000.0 2748579.0 9709552.0 39463910.0 1142636.0 6740785.0 3900000.0 4995432.0
4 1972 13079460.0 2263554.0 14760787.0 5894858.0 24779799.0 13177000.0 7544201.0 230800.0 70759295.0 ... 10190285.0 56079000.0 209896000.0 2829526.0 11515649.0 44655014.0 1089572.0 7407075.0 4506497.0 5861135.0

5 rows × 143 columns

gapminder_year_country.melt(id_vars='year').head()
year variable value
0 1952 Afghanistan 8425333.0
1 1957 Afghanistan 9240934.0
2 1962 Afghanistan 10267083.0
3 1967 Afghanistan 11537966.0
4 1972 Afghanistan 13079460.0

If we wish to only keep a subset of the countries, we can do that too. We can also change the name of the resultant variable and value columns:

gapminder_year_country.melt(
    id_vars='year',
    value_vars = ['Albania', 'Canada', 'Zimbabwe'],
    var_name = 'country',
    value_name = 'pop'
).head()
year country pop
0 1952 Albania 1282697.0
1 1957 Albania 1476505.0
2 1962 Albania 1728137.0
3 1967 Albania 1984060.0
4 1972 Albania 2263554.0

When melting our data, we want to be careful, and not ‘overmelt’! Think back to the original gapminder dataset:

gapminder.head()
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710
3 Afghanistan 1967 11537966.0 Asia 34.020 836.197138
4 Afghanistan 1972 13079460.0 Asia 36.088 739.981106

Would the following melt make sense?

gapminder.melt(
    id_vars=['country', 'year', 'continent']
)
country year continent variable value
0 Afghanistan 1952 Asia pop 8.425333e+06
1 Afghanistan 1957 Asia pop 9.240934e+06
2 Afghanistan 1962 Asia pop 1.026708e+07
3 Afghanistan 1967 Asia pop 1.153797e+07
4 Afghanistan 1972 Asia pop 1.307946e+07
... ... ... ... ... ...
5107 Zimbabwe 1987 Africa gdpPercap 7.061573e+02
5108 Zimbabwe 1992 Africa gdpPercap 6.934208e+02
5109 Zimbabwe 1997 Africa gdpPercap 7.924500e+02
5110 Zimbabwe 2002 Africa gdpPercap 6.720386e+02
5111 Zimbabwe 2007 Africa gdpPercap 4.697093e+02

5112 rows × 5 columns

Probably not. Having the population, GDP, and life expectancy all in the same column might lead to confusing results. As a general rule of thumb, including multiple unique measures that have different units (population is measured in number of people, life expectancy in years, and GDP per capita in dollars per person) in the same column is typically avoided. Another way to think of it is this: if I take the average of the entire column, will that actually make sense? Try not to melt further than necessary! Striking a balance between machine-preferred (entirely long-form) and human-preferred (wide-form) is an important and useful skill when creating scripts to analyse our datasets!

These examples only touch the surface of methods to reshape our data. To explore even further, Pandas provides a user guide to reshaping. Check this out for even more reshaping functionality!

Challenge 4

Challenge 4

Create a pivot table that counts the number of countries in each continent that has a population greater than 1,000,000 in any given year.

First, let’s demonstrate this using the methods we’ve learned about to reshape our data. We can create this aggregated pivot table using the pivot_table method, and supplying ‘count’ to the aggfunc method.

gapminder[gapminder['pop']>1_000_000].pivot_table(
    index='year',
    columns='continent',
    values='country',
    aggfunc='count'
)
continent Africa Americas Asia Europe Oceania
year
1952 37 22 28 28 2
1957 37 24 28 28 2
1962 39 24 29 28 2
1967 39 24 30 28 2
1972 40 24 30 28 2
1977 40 25 32 28 2
1982 41 25 32 28 2
1987 43 25 32 28 2
1992 45 25 32 28 2
1997 47 25 32 28 2
2002 47 25 32 28 2
2007 47 25 32 28 2

This type of table is commonly referred to as a crosstable: it gives us the frequency of occurences of items under different groupings. Because it is such a widely used table, it was given its own special method in Pandas: crosstab. Basic functionality is to supply two Series that we wish to cross-tabulate to the method, and let it do all the work. Note that this is a Pandas specific method, not a dataframe specific method!

gapminder_filtered = gapminder[gapminder['pop']>1_000_000]
pd.crosstab(
    index=gapminder_filtered['year'],
    columns=gapminder_filtered['continent']
)
continent Africa Americas Asia Europe Oceania
year
1952 37 22 28 28 2
1957 37 24 28 28 2
1962 39 24 29 28 2
1967 39 24 30 28 2
1972 40 24 30 28 2
1977 40 25 32 28 2
1982 41 25 32 28 2
1987 43 25 32 28 2
1992 45 25 32 28 2
1997 47 25 32 28 2
2002 47 25 32 28 2
2007 47 25 32 28 2