A friend of mine posed a challenge to me recently: how do you calculate a moving average on a field, by group, and add the calculation as a new column back to the original dataframe?

Moving averages calculate an average of a value over a range of time as that “window” shifts over time. They’re often used to smooth out fluctuations in real data.

For example, let’s take a look at the COVID-19 data I used in my last post. Recall what my Ohio dataframe (df_ohio) looked like:

df_ohio.head()

Before I can even think about calculating moving averages on this data, I need to first tidy it up a bit, but pandas makes that pretty easy:

date_cols = df_ohio.columns.tolist()
rename_cols = {'variable': 'obs_date', 'value': 'confirmed_cases'}

df_ohio_tidy = pd.melt(df_ohio.reset_index(), id_vars=['county'], value_vars=date_cols).rename(columns=rename_cols)
df_ohio_tidy['obs_date'] = pd.to_datetime(df_ohio_tidy.obs_date)

df_ohio_tidy = df_ohio_tidy.set_index('obs_date')
df_ohio_tidy

Now, I’m ready to calculate moving averages. The pandas rolling function is generally used for that purpose. It’s quite a powerful and versatile function, so be sure to check out the documentation. Normally, I just draw the moving average values in a chart along side the actual observations:

fig, ax = plt.subplots(figsize=(8,8))
rename_col = {'confirmed_cases': '7 day moving avg'}
title = 'Confirmed COVID-19 cases in Cuyahoga, Ohio as of {0:%d %b %Y}'.format(df_ohio_tidy.index.max())

_ = df_ohio_tidy[df_ohio_tidy.county=='Cuyahoga, Ohio'][['confirmed_cases']].plot(ax=ax, title=title)
_ = df_ohio_tidy[df_ohio_tidy.county=='Cuyahoga, Ohio'][['confirmed_cases']].rolling(7).mean().\
    rename(columns=rename_col).plot(ax=ax, color='gray')

_ = ax.set_ylabel('Confirmed Cases')

But in this case, I need to calculate moving averages for each county in Ohio and add those calculations to the dataframe as a new column. For this, I use a combination of the rolling function and the equally powerful transform function. With help from this post, pandas has no issue doing that (in one line, no less):

df_ohio_tidy['7ma'] = df_ohio_tidy.groupby('county').confirmed_cases.transform(lambda c: c.rolling(7).mean())

Now, let’s do some spot checking to make sure the results are as expected:

df_ohio_tidy.sort_values(['county', 'obs_date']).iloc[1170:1190,:]

Above, we can see that the 7 day moving average for Crawford County stops at the last entry for Crawford County on March 30 and resets to start calculating for Cuyahoga County.

df_ohio_tidy.sort_values(['county', 'obs_date']).iloc[1235:1250,:]

Above, we spot check the change from Cuyahoga County to Darke County. Again, the calculation for Cuyahoga County stops with the last entry on March 30 and starts over calculating on Darke County.

So, yes, he can both calculate and group the moving average, Mr. Waturi! All the code behind my posts on the COVID-19 data can be found here.