I had a challenge where I need to group a large dataset by a particular feature and then calculate a variety of statistics on those feature groups including a standard score for each record. My inclination was to loop through each group and run these calculations in each iteration–sorta outside my main dataframe–but then I thought, could there be an easier way in pandas to do this work?

Yes there is: transform.

As an example, take the movies dataset that I’ve used in the past:

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


# data from: https://www.kaggle.com/mysarahmadbhat/imdb-top-1000-movies
df = pd.read_csv('./data/regex_imdb.csv').fillna(0)
# filter out movies with no reported Gross
df = df[df.Gross != 0.0]

Suppose you wanted to know how each movie grossed against the average gross for their respective release years. To find this out, my inclination would be to loop through each year, calculate the mean for each year, then merge that value back into my main dataframe so that I could find the mean difference for each movie:

yr_means = []
for yr in df.Year.unique().tolist():
    yr_means.append({'Year': yr, 'year_mean': df[df.Year==yr].Gross.mean()})
    
# put my year mean calculations into a new dataframe
df_year_means = pd.DataFrame(yr_means)

df = df.merge(df_year_means, on='Year')

# and now I can calculate my difference from the mean
df['diff_from_year_mean'] = df.Gross - df.year_mean

But with the transform function, I can do all this work in a single line:

df['year_mean'] = df.groupby('Year').Gross.transform(np.mean)

# and now I can calculate my difference from the mean
df['diff_from_year_mean'] = df.Gross - df.year_mean

And from there you can do interesting work like diverging line charts:

fig, ax = plt.subplots(figsize=(8, 8))

year_to_plot = 2010
plot_data = df[df.Year==year_to_plot][['Name', 'diff_from_year_mean']].sort_values('diff_from_year_mean')
plot_data['color'] = plot_data.diff_from_year_mean.apply(lambda d: 'red' if d < 0 else 'green')

_ = ax.hlines(data=plot_data, y='Name', xmin=0, xmax=plot_data.diff_from_year_mean, color=plot_data.color)
_ = ax.table(cellText=[['${0:.2f} Million'.format(df[df.Year==year_to_plot].year_mean.values[0])]], 
             colLabels=['Avg Gross'], colWidths=[0.25], loc='center right')
_ = ax.set_xlabel('Gross earnings from the average (millions of dollars)')
_ = ax.set_title('Movie gross earnings from the average: top rated movies of {0}'.format(year_to_plot))

The transform function takes a variety of functions, both in the conventional function signature and, sometimes, as a string alias. For example, you can use:

  • np.min or ‘min’ to get the minimum value of the distribution
  • np.max or ‘max’ to get the maximum value of the distribution
  • np.std or ‘std’ to get the standard deviation of the distribution
  • len or ‘count’ to get a record count of your distribution
  • np.var or ‘var’ to get the variance of the distribution

You can even throw other functions/aliases at it like ‘first’ to get the first value of your distribution. However, you may need to do some sorting first or you may not get the values you were expecting.

Transform is yet one more way to do cool, pandas one-liner operations on your dataframes. Give it a whirl!