In this section we will go over some Pandas functionality that is slightly more involved than the prior sessions. We hope to cover, or at least provide reference for operations such as:
Conditional Column Assignment
Advanced grouping techniques
Rolling and cumulative functions
Reshaping our data
9.1 Conditional Column Assignment
In our previous sections on Pandas, we introduced the idea of column creation based on the data we currently have. We limited ourselves to fairly standard examples that are likely to crop up in typical data analysis: arithmetic operations, applying column specific functions, and the use of the lambda function inside a call to either apply or assign. Here we will expand our column possibilities to include conditional assignments, where the value of the column is dependent on the value/state of some other column.
Let’s look at the penguins dataset we considered previously. This dataset has a good mix of categorical and numerical datatypes for us to create new columns from.
import pandas as pdfrom seaborn import load_datasetpenguins = load_dataset('penguins')penguins.head()
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
0
Adelie
Torgersen
39.1
18.7
181.0
3750.0
Male
1
Adelie
Torgersen
39.5
17.4
186.0
3800.0
Female
2
Adelie
Torgersen
40.3
18.0
195.0
3250.0
Female
3
Adelie
Torgersen
NaN
NaN
NaN
NaN
NaN
4
Adelie
Torgersen
36.7
19.3
193.0
3450.0
Female
Binary If/Else
The first type of conditional assignment we will consider is the simplest: if original column is value A, then assign value x, otherwise assign value y. The most straightforward way to do this is via the apply method again, combined with a lambda function acting on the column of interest. For this example, we are going to make a shorthand version of the sex column - if the sex column is Female, then assign F, otherwise assign M.
While apply is robust and can be used in many different applications and settings, it is often the slowest method available in Pandas. If you are doing very specific operations, Pandas often provides faster, vectorized methods to do the same thing. For instance, this if/else example can also be calculated using the Pandas where and mask methods. These methods complement each other, and you may find one more intuitive than the other based on the specific circumstances of the data you are working with.
Both methods assume you already have some column with a default value in place. For example, we can assign the value ‘M’ to an entire column. where and mask then work by updating the values of this column where some other condition is met. The syntax for where is:
If condition is met, keep the default value, otherwise update to the other value:
Again, while these all result in the same column, you may find one method more intuitive, efficient, or faster for the data you are working with. Explore and use what works best for you!
Multiple Conditions
Sometimes, a simple if/else statement doesn’t cut it. Perhaps there are multiple options, and creating an if…elseif…elseif…else block will not only only be less intuitive but can easily be used incorrectly, leading to undesired outputs. In this case, the mapping functionality of Pandas provides a vectorized efficient method to create multiple changes to a column. The map method, which we apply to a single series, is optimized for mapping elements from one domain to another. The best example of this is using a dictionary to provide a mapping of your keys (initial column values) to your new value set.
Let’s see this in action with the penguins. Perhaps we do not wish to bias our analysis by having knowledge of which island each penguin came from. In this case, we wish to replace the actual names of the three islands with generic island IDs. First, we supply our mapping dictionary that will transform each island to its generic ID. Then, we supply this dictionary to our map method:
The above example used the map method, which is inherently a method to be used for Series only. As such, we cannot use it for more complicated conditions that may involve multiple columns. The final example we will consider is a more complicated set of conditions: we wish to create new categories based on the penguins body mass: underweight, overweight, or average weight. However, we also wish to consider the penguin species when creating these categories: what is underweight for the Gentoo species may be overweight for the Adelie! Thus, in theory we have 9 new categories to account for that depend on two columns.
The best (read: efficient) way to approach this multi-column multi-conditional scenario is actually to go outside of the Pandas toolkit and use a tool that has been implemented by numpy. Numpy is a computational tool-kit for Python that utilizes the efficiency of powerful computational languages such as Fortran and C to provide fast and versatile computational tools for the vast majority of numerical and scientific anaylsis in Python. We have seen it already, having used numpy’s statistical functions such as mean and median within our aggregations. Here, we wish to make use of the numpy.select() method.
In numpy, the select method is fairly simple. We supply a list of conditions we wish to consider, and a list of choices for what we wish to happen for each of the conditions. We may also supply a default value, for if none of the conditions are met. select will go through the list one by one, and will apply the choice for the first condition that is met. Let’s see what this looks like for our penguins:
import numpy as np# list all possible combos of species and specific body masses# note that we don't include the overweight combos, as we can catch these# with the default option in select!conditions = [ (penguins.species=='Adelie') & (penguins.body_mass_g<3300), (penguins.species=='Adelie') & (penguins.body_mass_g<3700), (penguins.species=='Chinstrap') & (penguins.body_mass_g<3333), (penguins.species=='Chinstrap') & (penguins.body_mass_g<3733), (penguins.species=='Gentoo') & (penguins.body_mass_g<4676), (penguins.species=='Gentoo') & (penguins.body_mass_g<5076)]# for each listed condition, tell select what value to usechoices = ['Underweight','Average weight', 'Underweight', 'Average weight','Underweight','Average weight']# use select to create our new columnpenguins['general_weight'] = np.select(conditions, choices, default='Overweight')# display resultspenguins[['species', 'body_mass_g', 'general_weight']].sample(10)
species
body_mass_g
general_weight
299
Gentoo
5950.0
Overweight
243
Gentoo
5050.0
Average weight
102
Adelie
3075.0
Underweight
100
Adelie
3725.0
Overweight
112
Adelie
3200.0
Underweight
339
Gentoo
NaN
Overweight
74
Adelie
3700.0
Overweight
302
Gentoo
4725.0
Average weight
192
Chinstrap
3950.0
Overweight
304
Gentoo
4750.0
Average weight
Order Matters!
When using the select method, the order of the conditions matters! Numpy always goes through the list in order, looking for the first TRUE condition. This ordering will matter, for example, if creating a list of bins for some numerical category. Conditions of the form:
Will never evaluate the choices for the conditions 2-5, as the first condition will be true as well. Instead we should flip this list and write our conditions starting from 1,000 and ending at 5,000. Keep this in mind when creating your conditions!
Challenge 1
Challenge 1
Create a column for 5-10 flipper length bins and plot the resulting frequencies (counts for each bin).
Solution to Challenge 1
Here we’re going to pull together some Seaborn code and our new-found conditional knowledge to create the bins.
# make sure seaborn is importedimport seaborn as snsimport matplotlib.pyplot as plt# check roughly where we could place binsdisplay(penguins.agg({'flipper_length_mm': ['mean', 'std', 'min', 'max']}))# create some conditions for binsconditions = [ penguins.flipper_length_mm <180, penguins.flipper_length_mm <190, penguins.flipper_length_mm <200, penguins.flipper_length_mm <210, penguins.flipper_length_mm <220, penguins.flipper_length_mm <230]# assign names to each conditionchoices = ['0 - 180','180 - 190','190 - 200','200 - 210','210 - 220','220 - 230']# add the new columnpenguins['flipper_length_bins'] = np.select(conditions, choices, default='230 +')# plot this data using the 'count' optionsns.catplot( penguins, x='flipper_length_bins', kind='count', order = choices, color ='blue')plt.xticks(rotation=45)plt.show()
flipper_length_mm
mean
200.915205
std
14.061714
min
172.000000
max
231.000000
9.2 Advanced Grouping
groupby cuteness
In our previous section on Pandas, we explored the basic tools we need to group our data by various columns, apply a host of statistical aggregation operations, and look at these outcomes. While incredibly powerful as is, these tools lack some functionality that you might wish to have. Some examples include:
Combining row level values and group level aggregations in a single command.
Filtering and subsetting a dataframe based on group values.
Combining aggregations across different columns.
In order to handle these operations, we need to introduce two new methods: transform and apply.
Apply… again?
You might question whether apply is actually a new function, seeing as we have used it previously when creating new columns in a dataframe. In fact, the effect of apply within pandas is subtly different depending on how and where we apply it. The effect of apply is different when used on:
A dataframe as a whole - dataframe.apply(...)
A single series - series.apply(...)
A grouped dataframe object - dataframe.groupby('column').apply(...)
Make sure you keep track of what kind of apply you are using when doing your dataframe manipulations, as identical functions in different scenarios can lead to unusually different results! In this section, we are focusing on the grouped dataframe objects.
Both transform and apply can be used on a grouped dataframe object. This is the object that is created when we use the .groupby() method. Let’s create a small dataset and see what this object looks like:
Here we have a column that looks categorical (unit), and a couple that look numerical (cost and sale_price). We are going to focus on grouping our data according to the categorical column, and see what we can do with the numerical columns.
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb461b2f310>
Note that this object is no longer a dataframe, but some sort of DataFrameGroupBy object. By itself, it is not very useful, but we can use it to get all sorts of interesting information. Recall briefly how we used this earlier. Maybe we want the average buy and sell prices of each type of unit:
grouped_starwars.mean().reset_index()
unit
cost
sale_price
0
Blaster
47.333333
55.666667
1
Lightsaber
1750.000000
3060.500000
2
Stick
5.333333
8.666667
What if we wanted to know how much each particular unit contributed to it’s groups overall sale? Or perhaps how much each group profited, on average? This is where transform and apply come in. Let’s look at what each of these do.
Transform
Used to apply a function to a dataframe that produces a new dataframe with the same shape as the original.
While an entire dataframe can be passed to transform, it only ever sees (operates on) a single column/series at a time.
The function must either return a scalar value or a sequence that is the same length as the original dataframe.
Useful for getting groups aggregations back into the original dataframe
Useful for filtering based on group values
Ex. 1: Percent of Group
Let’s see each this in action. We will use transform to determine how much each unit contributed to the total sales of its group.
Notice here that we used the lambda function again, which allows us to define a function inline acting on a given argument (in this case, the argument is named series_). This let us allow the value of the series to be the sum of the series. Because we are acting on a grouped object, Pandas knows that the sum we wish to employ is that of the group, not the overall sum! This new object will be in the same order (have the same indexing) as our initial dataframe, and so we can add it back to the dataframe as an extra column if we wish. This works explicitly because the transform method is required to return a series of the same length as the original object, giving us exactly what we want!
Transform can be used exclusively with an aggregation function as well, but it will still return a value for every row of the initial dataset. This allows us to add grouped level values as columns to the data:
Instead of passing a single column to the transform method, we could also pass the entire dataframe (or some subset of the columns in the dataframe), and calculate values for each series:
This has created two columns that have both had the same function applied. Note that if we wanted to merge these in with the original dataframe, we should be careful to rename the columns so that they do not overlap!
Ex 4: Filtering Original Dataframe
Another use case for transform is in filtering a dataframe. Maybe we want to zero in on those items where the sale price of the object was less than the average unit cost. Let’s filter our dataframe to only those rows:
Used to apply a function (aggregated or otherwise) across multiple columns
Implicitly passes all the columns of the dataframe as a dataframe to the function, allowing for column interactions
The function can return a scalar or a sequence of any length.
Ex 1. Aggregate over Multiple Columns
Let’s try something with apply. What if we want to know the average overall profit of each group. We could produce a profit column, group up on the unit, and then calculate the mean.
unit
Blaster 8.333333
Lightsaber 1310.500000
Stick 3.333333
dtype: float64
Again, we made use of that lambda function, this time applying it to a df_ argument. You might notice that here I used a df_ argument instead of series_: this is merely notation, and we could have used anything (series_, df_, x, this_is_my_argument would all work the same). However, we have used df_ and series_ so that we can remind ourselves exactly what type of object we are acting on.
To see how transform differs from apply, let’s try to do that exact same operation:
---------------------------------------------------------------------------AttributeError Traceback (most recent call last)
/tmp/ipykernel_2563/2867570833.py in ?()----> 1grouped_starwars.transform(lambda _df:(_df.sale_price - _df.cost).mean())/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(self, func, engine, engine_kwargs, *args, **kwargs) 1181@Substitution(klass="DataFrame") 1182@Appender(_transform_template) 1183def transform(self, func,*args, engine=None, engine_kwargs=None,**kwargs):-> 1184 return self._transform(
1185 func,*args, engine=engine, engine_kwargs=engine_kwargs,**kwargs
1186 )
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in ?(self, func, engine, engine_kwargs, *args, **kwargs) 1638# optimized transforms 1639 func = com.get_cython_func(func)or func
1640 1641ifnot isinstance(func, str):-> 1642return self._transform_general(func,*args,**kwargs) 1643 1644elif func notin base.transform_kernel_allowlist: 1645 msg =f"'{func}' is not a valid function name for transform(name)"/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(self, func, *args, **kwargs) 1157except TypeError: 1158return self._transform_item_by_item(obj, fast_path) 1159except ValueError as err: 1160 msg ="transform must return a scalar value for each group"-> 1161raise ValueError(msg)from err
1162if group.size >0: 1163 res = _wrap_transform_general_frame(self.obj, group, res) 1164 applied.append(res)/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(self, fast_path, slow_path, group) 1206def _choose_path(self, fast_path: Callable, slow_path: Callable, group: DataFrame): 1207 path = slow_path
-> 1208res = slow_path(group) 1209 1210# if we make it here, test if we can use the fast path 1211try:/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(group)-> 1201 slow_path = lambda group: group.apply(
1202lambda x: func(x,*args,**kwargs), axis=self.axis
1203 )
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/frame.py in ?(self, func, axis, raw, result_type, args, **kwargs) 8844 result_type=result_type, 8845 args=args, 8846 kwargs=kwargs, 8847 )
-> 8848return op.apply().__finalize__(self, method="apply")/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/apply.py in ?(self) 729# raw 730elif self.raw: 731return self.apply_raw() 732--> 733return self.apply_standard()/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/apply.py in ?(self) 856def apply_standard(self):--> 857results, res_index = self.apply_series_generator() 858 859# wrap results 860return self.wrap_results(results, res_index)/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/apply.py in ?(self) 873 results[i]= self.f(v) 874if isinstance(results[i], ABCSeries): 875# If we have a view on v, we need to make a copy because 876# series_generator will swap out the underlying data--> 877results[i]= results[i].copy(deep=False) 878 879return results, res_index
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/groupby/generic.py in ?(x)-> 1202lambda x: func(x,*args,**kwargs), axis=self.axis
/tmp/ipykernel_2563/2867570833.py in ?(_df)----> 1grouped_starwars.transform(lambda _df:(_df.sale_price - _df.cost).mean())/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/generic.py in ?(self, name) 5571and name notin self._accessors
5572and self._info_axis._can_hold_identifiers_and_holds_name(name) 5573 ):
5574return self[name]-> 5575return object.__getattribute__(self, name)AttributeError: 'Series' object has no attribute 'sale_price'
This fails because the object being acted on inside transform is itself just a series object, not the entire dataframe! As such, it has no attribute for sale_price or buy_price like our original dataframe does. Instead, it acts on the sale_price series, then the buy_price series, and returns the results.
Ex 2. Mix Row/Aggregate Levels
While the first apply example returns a rolled up aggregated dataframe, we can also use apply to return the individual rows of the dataframe by mixing aggregation functions with row level functions.
In this way we have to be careful: transform will always return a dataframe that is the same size as the original, while apply will return something that varies with the type of function we have utilized.
Ex 3. Partial Aggregates
Apply lets us play some cool tricks as well. Suppose we only wanted to know about the two most expensive sales in each category. How could we filter to show this? We can use the nlargest (or smallest) method in conjunction with apply. nlargest does exactly what we might expect, returning the rows with the n largest values according to the provided column(s):
starwars.nlargest(2, 'sale_price')
unit
cost
sale_price
percent_of_sales
average_unit_cost
profit
4
Lightsaber
4000
5000
0.408430
1750.0
1000
6
Lightsaber
100
4242
0.346512
1750.0
4142
But if we mix this with apply and our grouping dataframe, we can get the largest for each group!
Note that when apply produces some different sized aggregate than the original dataframe, it tacks on an extra index indicating what the grouper was. We do not always care about this, and can eliminate it in the initial creation of our grouped dataframe via the group_keys argument:
We can use the dataframe behaviour of apply to create multiple additional columns all at once within an apply. Let’s create a function that will act on each dataframe group, create new values, and return the dataframe.
When done this way, we are explicitly returning the dataframe, regardless of if it was grouped or not. Even if all of the new columns are aggregates, this will still produce a non-aggregated output:
Determine which of the units in each group had the greatest profit, and return a dataframe containing only the rows corresponding to these units.
Solution to Challenge 2
There are a variety of different ways this could be approached. We have done a similar question using apply already, so let’s do this via the transform function.
Often when we are looking at our data, we may have a question about how one value compares to the values around it. This is typically useful for timeseries data, where we can organize our data by date (or year, month, etc). Once sorted on a date, we might wish to know things such as:
Did this year have a lower or higher average than the previous year?
What is the roling average over the last 5 years?
Or many other comparative questions. We can answer some of these questions using the windowing capabilities of Pandas. Windowing is the act of looking at a single ‘window’ of the data, asking our question (what is the average value for the time period in this window?), and then sliding the window to do it again. You have likely heard the term rolling average - this is what windowing allows us to calculate.
Rolling
To calculate a rolling average, we use two methods. First, we indicate that we want rolling statistics by using the rolling() method, and include in the argument the number of periods (rows) to include. After the rolling method, we indicate which aggregation we would like to use. Let’s try this out on a timeseries dataset:
When using the rolling method this way, it assumes that the data is already sorted in the order you want! The dowjones dataset comes pre-sorted, but this isn’t always the case. Before doing any sort of windowing function, it is always a good idea to pre-sort your data! For the dowjones dataset, we would sort on the Date column:
dowjones = dowjones.sort_values(by='Date')
You’ll notice that for the first 5 months, this method returns NaN - this is because there were not enough values to include in the average, as the method assumes you always wish to have exactly 6 values in each rolling window. We can alter this however, using the min_periods argument:
Let’s see what these rolling averages look like visually:
import matplotlib.pyplot as plt# add more averagesdowjones['five_year'] = dowjones.rolling(60, min_periods=1).Price.mean()# plot the original dataplt.plot(dowjones.Date, dowjones.Price, label='Actual Data')# plot the smoothed dataplt.plot(dowjones.Date, dowjones.six_month_avg, label='Six Month')plt.plot(dowjones.Date, dowjones.five_year, label='Five Year')# add a legendplt.legend()plt.show()
The six month average is starting to smooth out the jagged peaks in the original data, while the five-year average has completely smoothed out all the wiggles!
Cumulative
While the rolling capabilities will provide us a fixed window into which we can calculate aggregate statistics, we may also wish to calculate cumulative statistics. The most common cumulative statistic is the cumulative sum, which calculates the sum of the current row and all preceding rows. In Pandas, we can calculate this with the cumsum method.
Finally, knowing the value preceding or following the current value is sometimes of interest. Forecast modeling will often use the current value and past values to predict future values, for example. To retrieve lead and lag values, Pandas has the shift method. This takes as an argument the number of periods by which we wish to shift our dataset (defaulting to 1). The periods can be negative, letting us get future values as well! Let’s look at this with the dowjones dataset again:
To show that this is actually the year before and after, let’s filter to a year somewhere in the middle.
Advanced Tip: Date Columns
In order to filter to the middle year, we need to be sure of the type of data we are working with. Let’s take a quick look at the Date column, and see what the datatype is:
print(dowjones.Date.dtype)
datetime64[ns]
This is a new column type that we have not encountered in detail before, but it tells us that this column contains date information. If we were to look at a single value in the Series, we would see that it is not simply a ‘YYYY-MM-DD’ string as displayed when we view the dataframe:
Because this is not a string, we cannot do a simple string comparison between two dates:
print(single_date =='1914-12-01')
False
To filter based on date columns then, we should instead create a date type object in Python. The easiest way to do that is using the built in python library datetime.
from datetime import datetime# create a date-like objectnew_date = datetime(1914, 12, 1)print(new_date)# compare with the single dateprint(single_date == new_date)
from datetime import datetime# create dates for the filtercurrent_year = datetime(1942, 8, 1)previous_year = datetime(1941, 8, 1)next_year = datetime(1943, 8, 1)# filter the date columndowjones[ dowjones.Date.isin([current_year, previous_year, next_year])][ ['Date', 'Price', 'year_prior', 'year_after']]
Date
Price
year_prior
year_after
320
1941-08-01
126.55
125.35
106.20
332
1942-08-01
106.20
126.55
136.22
344
1943-08-01
136.22
106.20
146.93
If we look at these three lines, we see that the year_prior and year_after columns do in fact give us the corresponding values from the initial dataset. And if we plot this, we will see three copies of the data, each shifted by a year:
Find the average, minimum, and maximum Dow Jones prices on a yearly basis. Plot all three values to see how they compare! (Recall: this dataset is currently monthly)
Solution to Challenge 3
# first let's find the three aggregations we wantdowjones_rolling = dowjones.rolling(12, min_periods=1).agg( {'Price': ['min', 'max', 'mean']})display(dowjones_rolling.head())# get rid of hierarchical naming schemedowjones_rolling.columns = ['price_min', 'price_max', 'price_mean']# plot these (note that this DF doesn't have the dates, so we use the old DF dates)# here's a fancy plot that will show a coloured band!plt.fill_between( dowjones.Date, dowjones_rolling.price_min, dowjones_rolling.price_max, alpha=0.3# alpha makes the band more transparent) plt.plot(dowjones.Date, dowjones_rolling.price_mean)plt.show()
Price
min
max
mean
0
55.0
55.00
55.0000
1
55.0
56.55
55.7750
2
55.0
56.55
55.8500
3
55.0
58.30
56.4625
4
55.0
66.45
58.4600
9.4 Reshaping Data
Wide or Long?
Another fairly common task in data analysis is pivoting our data. Typically this means we are creating a wide-form table that has more columns than originally. This may be near the end of a particular analysis and we want to prepare a final table for a report, or perhaps we wish to pick out specific values from a column for more detailed inspection. Another common use of a pivot table might be for timeseries data, and we wish to separate different variables into their own individual columns. In these cases, we can use the pivot and pivot_table methods within Pandas.
In the other direction, we may wish to ‘unpivot’ a dataset. This will often occur early on in an analysis, where we wish to create a long-form table where multiple columns have been combined into a single set of id and value columns. This will be done with the melt method.
Let’s explore these ideas with the gapminder dataset.
The first thing we may wish to do is a simple pivot in which we make the table wider than it was before. The simplest way to do this is by the pivot method. This takes as input three arguments:
columns: the only required argument! Column(s) we wish to use for the new columns of our dataframe. If more than one is provided, a hierarchy of columns is created.
index: the column(s) we wish to use for the index of our new dataframe. This is entirely optional, and will just default to the pre-existing index if not supplied.
values: the value(s) we wish to retain for the new dataframe. This is also optional, and will use all leftover columns if nothing is provided. If more than one column is used here, it again creates a hierarchy in the columns.
The output of the pivot method is a new dataframe that has the requested index/columns, with the corresponding value associated with each index-column pair in the dataframe.
For instance, perhaps we wish to explore the population of each country individually overtime. In this case, we can pivot our gapminder dataset so that each of the countries in the original country column become their own column, and compare the population value against the year:
Note that doing this moves ‘year’ into the index of the dataframe, and gives the columns the name of ‘country’. We can move ‘year’ back into the core of the dataframe using reset_index(), and remove the column name using the name attribute of the columns:
What if we wish to know about the average population of each continent over time? If we try the same approach as above, we will immediately run into an issue:
---------------------------------------------------------------------------ValueError Traceback (most recent call last)
Cell In[46], line 1----> 1 gapminder_year_continent =gapminder.pivot( 2index='year', 3columns='continent', 4values='pop' 5)
File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/frame.py:7885, in DataFrame.pivot(self, index, columns, values) 7880@Substitution("")
7881@Appender(_shared_docs["pivot"])
7882defpivot(self, index=None, columns=None, values=None) -> DataFrame:
7883frompandas.core.reshape.pivotimport pivot
-> 7885returnpivot(self,index=index,columns=columns,values=values)
File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/reshape/pivot.py:520, in pivot(data, index, columns, values) 518else:
519 indexed = data._constructor_sliced(data[values]._values, index=multiindex)
--> 520returnindexed.unstack(columns_listlike)
File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/series.py:4157, in Series.unstack(self, level, fill_value) 4114""" 4115Unstack, also known as pivot, Series with MultiIndex to produce DataFrame. 4116 (...) 4153b 2 4 4154""" 4155frompandas.core.reshape.reshapeimport unstack
-> 4157returnunstack(self,level,fill_value)
File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/reshape/reshape.py:491, in unstack(obj, level, fill_value) 489if is_1d_only_ea_dtype(obj.dtype):
490return _unstack_extension_series(obj, level, fill_value)
--> 491 unstacker =_Unstacker( 492obj.index,level=level,constructor=obj._constructor_expanddim 493) 494return unstacker.get_result(
495 obj._values, value_columns=None, fill_value=fill_value
496 )
File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/reshape/reshape.py:140, in _Unstacker.__init__(self, index, level, constructor) 133if num_cells > np.iinfo(np.int32).max:
134 warnings.warn(
135f"The following operation may generate {num_cells} cells " 136f"in the resulting pandas object.",
137 PerformanceWarning,
138 )
--> 140self._make_selectors()
File /opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pandas/core/reshape/reshape.py:192, in _Unstacker._make_selectors(self) 189 mask.put(selector, True)
191if mask.sum() <len(self.index):
--> 192raiseValueError("Index contains duplicate entries, cannot reshape")
194self.group_index = comp_index
195self.mask = mask
ValueError: Index contains duplicate entries, cannot reshape
Here we find that there is more than one value associated with each year-continent pairing, and so the pivot method does not know how to assign a value to the dataframe. This is because pivot is for unique values only. While it is an excellent simple-use tool, we can go one step further to deal with non-unique values.
pivot_table
To address the year-continent question above, we could first group our data by continent, calculate the average population every year, and then pivot the resultant dataframe. This is a completely valid method! However, pandas also offers a built in method that will do aggregations and pivots all at once. This is similar to the functionality of Excel pivot tables, and includes some additional functionality we would not find in groupby alone. Some of the important arguments that we can use in the pivot_table method include:
columns: the only required argument! Column(s) (and values within) we wish to use for the resultant dataframes columns.
values: the column(s) we wish to aggregate. If no columns are provided, all leftover (non column/index columns) columns will be used that make ‘sense’. For example, the average value of a list of strings does not make sense, but the ‘max’ value of a string list could be calculated.
index: the column(s) we wish to use for the resultant dataframes index
aggfunc: the method of aggregation we wish to use. Default is to calculate the average.
fill_value: the value to replace missing values with (after aggregation).
Let’s go through some examples with increasing complexity. The simplest thing we can do is pivot a single column, to which we will get an output for the averages of all numerical columns.
Next we may wish to add an index, and focus on a single value for our output. This matches the question we asked above and could not answer with the pivot method:
If we tried to do the same calculation, but use a string column instead of a numerical column for the values we will get a warning about trying to use invalid columns:
/tmp/ipykernel_2563/2473994706.py:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.mean is deprecated. In a future version, a TypeError will be raised. Before calling .mean, select only columns which should be valid for the function.
gapminder_year_continent_country = gapminder.pivot_table(
continent
year
1952
1957
1962
1967
1972
However, we can switch our aggregation method to something that is valid for strings (using either string arguments for common aggregations such as ‘max’, or more complicated numpy or user-defined functions as well).
Finally, similar to our groupby aggregations, we can also pass a dictionary to aggfunc. The dictionary contains the columns we wish to use for values as keys, and the type of aggregation(s) we wish to do as the values.
The opposite of pivoting is melting (…obviously?). This is often used to get data back into long-form. Long-form data will typically have all different possible categories for a single measure in a single column (having a country column instead of Algeria, Albania, Canada… Zimbabwe columns). Long-form data is typically the preferred data style when doing your actual analysis. For example, many of the Seaborns methods we considered for creating stunning visuals assume that the data is in long-form, and is optimized to split out categories internally instead of pulling from multiple different columns.
However, as our data does not always come in the way we want it, we might have to unpivot (melt) our data to get it into long-form first. The melt method is how we attack this in Pandas, and it uses the following arguments:
id_vars: This is a required argument! Column(s) that will be used as identifiers. These are columns that we do not wish to unpivot, but leave as is.
value_vars: Column(s) to unpivot. If left out, the default is to use all non-id columns.
Let’s look at one of the pivoted tables we made earlier, and try to get it back into long form:
Probably not. Having the population, GDP, and life expectancy all in the same column might lead to confusing results. As a general rule of thumb, including multiple unique measures that have different units (population is measured in number of people, life expectancy in years, and GDP per capita in dollars per person) in the same column is typically avoided. Another way to think of it is this: if I take the average of the entire column, will that actually make sense? Try not to melt further than necessary! Striking a balance between machine-preferred (entirely long-form) and human-preferred (wide-form) is an important and useful skill when creating scripts to analyse our datasets!
These examples only touch the surface of methods to reshape our data. To explore even further, Pandas provides a user guide to reshaping. Check this out for even more reshaping functionality!
Challenge 4
Challenge 4
Create a pivot table that counts the number of countries in each continent that has a population greater than 1,000,000 in any given year.
Solution to Challenge 4
First, let’s demonstrate this using the methods we’ve learned about to reshape our data. We can create this aggregated pivot table using the pivot_table method, and supplying ‘count’ to the aggfunc method.
This type of table is commonly referred to as a crosstable: it gives us the frequency of occurences of items under different groupings. Because it is such a widely used table, it was given its own special method in Pandas: crosstab. Basic functionality is to supply two Series that we wish to cross-tabulate to the method, and let it do all the work. Note that this is a Pandas specific method, not a dataframe specific method!