Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

Grouping moving averages with Pandas

A friend of mine posed a challenge to me recently: how do you calculate a moving average on a field, by group, and add the calculation as a new column back to the original dataframe?

A moving average is the mean of a value over a fixed window of time, recalculated as that “window” slides forward. It’s often used to smooth out fluctuations in real data.
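To make the sliding window concrete, here is a minimal sketch in plain Python. The `moving_average` helper and the `daily_counts` numbers are made up for illustration; this is not how pandas implements it, just the same arithmetic spelled out.

```python
from collections import deque


def moving_average(values, window):
    """Return the mean of the trailing `window` values at each position.

    Positions that do not yet have a full window get None, which mirrors
    the NaN values pandas produces for incomplete windows by default.
    """
    recent = deque(maxlen=window)  # the oldest value falls out automatically
    averages = []
    for value in values:
        recent.append(value)
        if len(recent) < window:
            averages.append(None)
        else:
            averages.append(sum(recent) / window)
    return averages


daily_counts = [2, 4, 6, 8, 10]  # made-up numbers
smoothed = moving_average(daily_counts, window=3)
print(smoothed)  # [None, None, 4.0, 6.0, 8.0]
```

The first two positions are empty because a 3-value window isn't full until the third observation.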

For example, let’s take a look at the COVID-19 data I used in my last post. Recall what my Ohio dataframe (df_ohio) looked like:

df_ohio.head()

Before I can even think about calculating moving averages on this data, I need to first tidy it up a bit, but pandas makes that pretty easy:

date_cols = df_ohio.columns.tolist()
rename_cols = {'variable': 'obs_date', 'value': 'confirmed_cases'}

df_ohio_tidy = pd.melt(df_ohio.reset_index(), id_vars=['county'], value_vars=date_cols).rename(columns=rename_cols)
df_ohio_tidy['obs_date'] = pd.to_datetime(df_ohio_tidy.obs_date)

df_ohio_tidy = df_ohio_tidy.set_index('obs_date')
df_ohio_tidy
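If the wide-to-long reshaping is hard to picture, here is a small sketch of the same steps on a toy frame. The county names and counts are invented, and I'm assuming date columns formatted like the ones used in this post.

```python
import pandas as pd

# Hypothetical wide-format data: one row per county, one column per date.
wide = pd.DataFrame(
    {'3/29/2020': [10, 1], '3/30/2020': [15, 2]},
    index=pd.Index(['A, Ohio', 'B, Ohio'], name='county'),
)

# melt() turns each date column into rows: one row per county per date.
tidy = (
    pd.melt(wide.reset_index(), id_vars=['county'], value_vars=wide.columns.tolist())
    .rename(columns={'variable': 'obs_date', 'value': 'confirmed_cases'})
)
tidy['obs_date'] = pd.to_datetime(tidy['obs_date'], format='%m/%d/%Y')
tidy = tidy.set_index('obs_date')
print(tidy)
```

Two counties by two dates become four rows, each with a county and a confirmed_cases value, indexed by observation date.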

Now, I’m ready to calculate moving averages. The pandas rolling function is generally used for that purpose. It’s quite a powerful and versatile function, so be sure to check out the documentation. Normally, I just draw the moving average values in a chart alongside the actual observations:

fig, ax = plt.subplots(figsize=(8,8))
rename_col = {'confirmed_cases': '7 day moving avg'}
title = 'Confirmed COVID-19 cases in Cuyahoga, Ohio as of {0:%d %b %Y}'.format(df_ohio_tidy.index.max())

_ = df_ohio_tidy[df_ohio_tidy.county=='Cuyahoga, Ohio'][['confirmed_cases']].plot(ax=ax, title=title)
_ = df_ohio_tidy[df_ohio_tidy.county=='Cuyahoga, Ohio'][['confirmed_cases']].rolling(7).mean().\
    rename(columns=rename_col).plot(ax=ax, color='gray')

_ = ax.set_ylabel('Confirmed Cases')

But in this case, I need to calculate moving averages for each county in Ohio and add those calculations to the dataframe as a new column. For this, I use a combination of the rolling function and the equally powerful transform function. With help from this post, I found that pandas has no issue doing that (in one line, no less):

df_ohio_tidy['7ma'] = df_ohio_tidy.groupby('county').confirmed_cases.transform(lambda c: c.rolling(7).mean())
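Here is the same pattern on a tiny, made-up dataframe, which makes the per-group reset easier to see. The data and the '3ma' column name are hypothetical, and I shortened the window to 3 to keep the example small.

```python
import pandas as pd

# Made-up data: two groups of four observations each.
df = pd.DataFrame({
    'county': ['A'] * 4 + ['B'] * 4,
    'confirmed_cases': [1, 2, 3, 4, 10, 20, 30, 40],
})

# transform() runs the lambda once per county and returns a result aligned
# to the original rows, so it can be assigned straight back as a column.
df['3ma'] = df.groupby('county').confirmed_cases.transform(lambda c: c.rolling(3).mean())
print(df)
```

County A gets NaN, NaN, 2.0, 3.0 and county B gets NaN, NaN, 20.0, 30.0: the window never reaches across the group boundary. One thing to keep in mind is that an integer window counts rows, not calendar days, so this assumes each group's rows are already in date order with one row per day.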

Now, let’s do some spot checking to make sure the results are as expected:

df_ohio_tidy.sort_values(['county', 'obs_date']).iloc[1170:1190,:]

Above, we can see that the 7 day moving average for Crawford County stops at the last entry for Crawford County on March 30 and resets to start calculating for Cuyahoga County.

df_ohio_tidy.sort_values(['county', 'obs_date']).iloc[1235:1250,:]

Above, we spot check the change from Cuyahoga County to Darke County. Again, the calculation for Cuyahoga County stops with the last entry on March 30 and starts over calculating on Darke County.

So, yes, he can both calculate and group the moving average, Mr. Waturi! All the code behind my posts on the COVID-19 data can be found here.

Pictures and Words

A chart may be worth 1000 words, but sometimes embedding a few words in your chart can convey additional, helpful information. For example, take this chart that I built from the incredible COVID-19 data collected by Johns Hopkins University:

Pretty telling as it is. Now, let’s add some words to it:

Adding some words to the chart 1) conveys additional, helpful information and 2) fills in some awkward whitespace. Seems like a win to me. For completeness’ sake, here’s what I did to build this chart:

Step 1: Import the packages

import pandas as pd
import matplotlib.pyplot as plt

Step 2: Load up the JHU dataset

df = pd.read_csv('./data/time_series_covid19_confirmed_US.csv')

Step 3: Trim the data down to just counties in Ohio

cols = [i for i, v in enumerate(df.columns) if v in ['Admin2', 'Province_State'] or v.endswith('2020')]
df_ohio = df[df.Province_State=='Ohio'].iloc[:,cols].copy()
df_ohio['county'] = df_ohio.Admin2 + ', ' + df_ohio.Province_State  # combine county and State together in a field
df_ohio = df_ohio.drop(columns=['Admin2', 'Province_State']).set_index('county')

Step 4: Build the chart

fig, ax = plt.subplots(figsize=(12,10))
title = 'Top 10 Ohio Counties with Confirmed COVID-19 Cases as of ' + df.columns[-1]
worst_county, worst_co_cases = [(k, v) for k, v in df_ohio['3/30/2020'].sort_values().tail(1).items()][0]

inset = """
There are {0} counties and other Ohio 
entities in this dataset.  As of {1}, 
there are {2:,} confirmed cases of COVID-19.  
{3} represents {4:.1f}% of that population.
""".format(df_ohio.shape[0], df.columns[-1], df_ohio['3/30/2020'].sum(), worst_county, 
           (worst_co_cases/df_ohio['3/30/2020'].sum())*100)

_ = df_ohio['3/30/2020'].sort_values().tail(10).plot(kind='barh', ax=ax, title=title)
_ = ax.set_ylabel('Ohio counties')
_ = ax.set_xlabel('Confirmed Cases')

# you have to experiment a little with the x, y positioning to get your word inset positioned just right
text = fig.text(0.30, 0.35, inset, va='center', ha='left', size=18)

Pretty darn slick!
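On the positioning point: fig.text takes x and y as fractions of the whole figure, which is part of why some trial and error is needed. As a rough sketch (made-up bar values, and the off-screen Agg backend is only my assumption so it runs without a display), ax.text with transform=ax.transAxes places text relative to the plotting area instead:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(['County A', 'County B', 'County C'], [5, 12, 40])  # made-up values

# fig.text: x and y are fractions of the whole figure (0, 0 is bottom left).
fig_note = fig.text(0.30, 0.35, 'placed in figure coordinates', ha='left', va='center')

# ax.text with transform=ax.transAxes: fractions of the plotting area instead,
# so (0.95, 0.05) stays near the bottom-right corner of the axes.
ax_note = ax.text(0.95, 0.05, 'placed in axes coordinates',
                  transform=ax.transAxes, ha='right', va='bottom')
```

Either approach works; axes coordinates just tend to need less fiddling when the figure size or margins change.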

Loguru, ftw

In virtually all the applications and operations I write, I try to incorporate some level of logging so that my code can be adequately supported, particularly in Production environments. Some time ago, I wrote about how I generally log in my Python applications. Well, lately, I’ve switched from that approach to using Loguru and I must say I’m rather satisfied with its ease of use. Here’s a quick example of the package that I put together recently:

Step 1: Do your standard imports

As I explained in other posts on logging, I like adding a “run id” to each log line so that I can easily group lines together belonging to a single instance of my application, so I import the uuid package to help in that regard:

import os
import sys
import uuid
from loguru import logger

Step 2: Set up/customize my logging context

In one line, I can customize how each log line is written and set logging behavior like rolling the file when it hits 10 MB in size:

runid = str(uuid.uuid4()).split('-')[-1]
logger.add('loguru_example.log', format='{time}|{extra[runid]}|{level}|{message}', level='INFO', rotation='10 MB')
logger_ctx = logger.bind(runid=runid)

Step 3: Start logging

def main(argv):
    logger_ctx.info('Starting run of the loguru_example.py script')
    # do some stuff
    logger_ctx.info('Completing run of the loguru_example.py script')


if __name__ == '__main__':
    main(sys.argv[1:])

And now you have a nice and easy log for your application:

Pretty darn simple! So now there’s no excuse: start logging today!
