
Cleaning up Stacked Bar Charts, Part 1

Stacked bar charts are a handy way of conveying a lot of information in a single visual, and pandas makes it pretty easy to generate these charts: just set the stacked parameter to True in the plot function.
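
As a minimal sketch of just that parameter on a made-up dataframe (the column names here are purely illustrative):

import pandas as pd
import matplotlib.pyplot as plt

# toy data: two categories measured over three days
df_toy = pd.DataFrame({'work': [5, 8, 6], 'personal': [3, 2, 4]}, index=['Mon', 'Tue', 'Wed'])
ax = df_toy.plot(kind='bar', stacked=True)  # stacked=True stacks each row's values into one bar
plt.show()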

As great as this operation is, though, you still need to do some cleanup work on your chart afterwards. In this post, I’ll talk about how I clean up the legend generated in a stacked bar chart. For my example, I’ll take a small sample of email data I gathered recently from one of my email accounts.

Bring in the data and do some standard cleanup

import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline


df_email = pd.read_csv('./data/email_data.csv', names=['email_ts', 'subject', 'category'])

df_email['email_ts'] = pd.to_datetime(df_email.email_ts)
df_email['email_dt'] = df_email.email_ts.dt.date
df_email['dow'] = df_email.email_ts.dt.dayofweek
# just look at 30 or so days of data
df_email = df_email[(df_email.email_ts>'2020-04-14 00:00:00') & (df_email.email_ts<'2020-05-15 00:00:00')]

As a standard practice, whenever I have timestamp data, I always add a “date” column and a “day of week” column. If my data spans multiple months, I’ll even add a “month” column. These columns make it much easier to group the data by day, day of week, and month later on.
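
A “month” column is just one more dt accessor call; here’s a small sketch on the same dataframe:

df_email['month'] = df_email.email_ts.dt.month  # 1-12; makes monthly groupbys trivial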

Chart the data

Here’s a quick glimpse of the data in my dataset:

fig, ax = plt.subplots(figsize=(12,8))
title = 'Email counts: {0:%d %b %Y} - {1:%d %b %Y}'.format(df_email.email_dt.min(), df_email.email_dt.max())

df_email[['email_dt','dow']].groupby('email_dt').count().plot(ax=ax)
_ = ax.set_title(title)
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Email Count')
_ = fig.autofmt_xdate()

Now, create a stacked bar chart

Here’s the type of code I normally write to generate a stacked bar chart:

fig, ax = plt.subplots(figsize=(12,8))
title = 'Email counts: {0:%d %b %Y} - {1:%d %b %Y}'.format(df_email.email_dt.min(), df_email.email_dt.max())

df_email[['email_dt','category','dow']].groupby(['email_dt','category']).count().unstack().\
    plot(stacked=True, kind='bar', ax=ax)

_ = ax.set_title(title)
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Email Count')
_ = ax.set_ylim([0,46])  # just to give some space to the legend
_ = fig.autofmt_xdate()
Decent chart, but what’s the deal with that legend?

So, this chart is pretty decent, but that legend needs work. The good news is that three lines of code will clean it up nicely. Here’s my better version:

fig, ax = plt.subplots(figsize=(12,8))
title = 'Email counts: {0:%d %b %Y} - {1:%d %b %Y}'.format(df_email.email_dt.min(), df_email.email_dt.max())

df_email[['email_dt','category','dow']].groupby(['email_dt','category']).count().unstack().\
    plot(stacked=True, kind='bar', ax=ax)

_ = ax.set_title(title)
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Email Count')
_ = fig.autofmt_xdate()
_ = ax.set_ylim([0,46])  # just to give some space to the legend

original_legend = [t.get_text() for t in ax.legend().get_texts()]
new_legend = [t.replace('(dow, ', '').replace(')', '') for t in original_legend]
_ = ax.legend(new_legend, title='Category')
Nicer looking legend

So, with stacked bar charts, this is one approach I take to make the end product look a little nicer. In upcoming posts, I’ll show even more techniques to clean up your charts.

pandas and literal_eval

Occasionally, I have to work with more complicated forms of data where a given column might contain an array of values instead of conventional data types like strings, integers, and timestamps. Often, such complex fields can easily be processed with the excellent literal_eval function from the Abstract Syntax Trees (ast) module. For example, take this movie database from Kaggle:

import pandas as pd
from ast import literal_eval

df = pd.read_csv('./data/tmdb_5000_credits.csv')
df['cast'] = df.cast.apply(literal_eval)
df.head()

And when we take a look at that cast column, specifically:

type(df.loc[0,'cast'])

# this prints out "list" as the type

Basically, literal_eval converted the string of brackets and curly braces into lists of dictionary objects. Very cool and handy.
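
To see what literal_eval does in isolation, here’s a minimal sketch on a made-up string shaped like one entry of the cast column:

from ast import literal_eval

raw = "[{'character': 'Jake Sully', 'name': 'Sam Worthington'}]"
cast = literal_eval(raw)  # parses the string as a Python literal
print(type(cast))         # <class 'list'>
print(cast[0]['name'])    # Sam Worthington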

However, in real life, you don’t always have such clean data to work with. Recently, I was working with some complex data, tried to use literal_eval on it, and got an error like the following:

ValueError: malformed node or string: <_ast.Name object at 0x000001F62A851FC8>

Whoa. There’s something about this data that literal_eval doesn’t like. I wonder what it is? Let me do a little exception handling around literal_eval to see if I can find the row(s) where the function is breaking:

def try_the_eval(row):
    try:
        literal_eval(row.cast)
    except (ValueError, SyntaxError):
        # literal_eval raises these when the string isn't a valid Python literal
        print('Found bad data at row: {0}'.format(row.name))


df_dirty = pd.read_csv('./data/dirty_data.csv')
_ = df_dirty.apply(try_the_eval, axis=1)

This code told me I had badly formatted data in row 32 of my dataset. I can take a closer look at that data like so:

df_dirty.loc[32,'cast']
Careful review of the data (and a handy red box) reveals the problem

literal_eval only works with Python literals (strings, bytes, numbers, tuples, lists, dicts, sets, booleans, even None) but not null, because null isn’t a Python literal, so the parse fails.
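
You can see the difference with a tiny sketch:

from ast import literal_eval

print(literal_eval('None'))  # fine: None is a Python literal
literal_eval('null')         # ValueError: malformed node or string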

So, how do you fix such an issue? I can think of at least two ways:

  1. Wrap null values like these in double quotes or
  2. Replace null values like these with None.

Here, I go with Option #2:

df_dirty['cast'] = df_dirty.cast.str.replace('null', 'None')
df_dirty['cast'] = df_dirty.cast.apply(literal_eval)

# no exceptions are thrown and the below line prints out "list"
print(type(df_dirty.loc[32,'cast']))

One caveat here is that you need to be careful how you search and replace the errant data. For example, the word null without surrounding quotation marks might appear legitimately in other fields like descriptions. You don’t want to accidentally replace such legitimate uses of the word with None or surround them with quotation marks, as that will introduce new problems. So, you may need to come up with slightly more rigorous ways of finding and replacing your errant data, such as using regular expressions (see the sketch below). Once you deal with that data, though, literal_eval works like a charm and saves you lots of time working with complex datasets!
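
Here’s a minimal sketch of that regular-expression idea, replacing null only where it stands alone as a value (the sample string and pattern are illustrative, not from the actual dataset):

import re
from ast import literal_eval

# match 'null' only when it follows a colon and precedes a comma or a closing
# brace/bracket, i.e. when it stands alone as a value, not inside a quoted string
pattern = re.compile(r"(?<=:\s)null(?=\s*[,}\]])")

s = "[{'name': 'J. Doe', 'bio': 'wrote about the null hypothesis', 'gender': null}]"
fixed = pattern.sub('None', s)
print(literal_eval(fixed)[0]['gender'])  # None; the quoted 'null' is untouched
# the same pattern works on a whole column: df_dirty.cast.str.replace(pattern, 'None', regex=True)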

Grouping moving averages with Pandas

A friend of mine posed a challenge to me recently: how do you calculate a moving average on a field, by group, and add the calculation as a new column back to the original dataframe?

Moving averages calculate the average of a value over a window of time as that window shifts forward. They’re often used to smooth out fluctuations in real data.
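
On a toy series, the idea looks like this (a minimal sketch):

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
# each value is the mean of the current and two prior observations;
# the first two are NaN because the window isn't full yet
print(s.rolling(3).mean())  # NaN, NaN, 2.0, 3.0, 4.0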

For example, let’s take a look at the COVID-19 data I used in my last post. Recall what my Ohio dataframe (df_ohio) looked like:

df_ohio.head()

Before I can even think about calculating moving averages on this data, I need to first tidy it up a bit, but pandas makes that pretty easy:

date_cols = df_ohio.columns.tolist()
rename_cols = {'variable': 'obs_date', 'value': 'confirmed_cases'}

df_ohio_tidy = pd.melt(df_ohio.reset_index(), id_vars=['county'], value_vars=date_cols).rename(columns=rename_cols)
df_ohio_tidy['obs_date'] = pd.to_datetime(df_ohio_tidy.obs_date)

df_ohio_tidy = df_ohio_tidy.set_index('obs_date')
df_ohio_tidy

Now, I’m ready to calculate moving averages. The pandas rolling function is generally used for that purpose. It’s quite a powerful and versatile function, so be sure to check out the documentation. Normally, I just draw the moving average values in a chart alongside the actual observations:

fig, ax = plt.subplots(figsize=(8,8))
rename_col = {'confirmed_cases': '7 day moving avg'}
title = 'Confirmed COVID-19 cases in Cuyahoga, Ohio as of {0:%d %b %Y}'.format(df_ohio_tidy.index.max())

_ = df_ohio_tidy[df_ohio_tidy.county=='Cuyahoga, Ohio'][['confirmed_cases']].plot(ax=ax, title=title)
_ = df_ohio_tidy[df_ohio_tidy.county=='Cuyahoga, Ohio'][['confirmed_cases']].rolling(7).mean().\
    rename(columns=rename_col).plot(ax=ax, color='gray')

_ = ax.set_ylabel('Confirmed Cases')

But in this case, I need to calculate moving averages for each county in Ohio and add those calculations to the dataframe as a new column. For this, I use a combination of the rolling function and the equally powerful transform function. With help from this post, pandas has no issue doing that (in one line, no less):

df_ohio_tidy['7ma'] = df_ohio_tidy.groupby('county').confirmed_cases.transform(lambda c: c.rolling(7).mean())

Now, let’s do some spot checking to make sure the results are as expected:

df_ohio_tidy.sort_values(['county', 'obs_date']).iloc[1170:1190,:]

Above, we can see that the 7 day moving average for Crawford County stops at the last entry for Crawford County on March 30 and resets to start calculating for Cuyahoga County.

df_ohio_tidy.sort_values(['county', 'obs_date']).iloc[1235:1250,:]

Above, we spot check the change from Cuyahoga County to Darke County. Again, the calculation for Cuyahoga County stops with the last entry on March 30, and the calculation starts fresh for Darke County.
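
If you’d rather verify that programmatically than by eye, here’s a quick sanity check (assuming the column names used above): within each county, the first six rows of the 7ma column should be NaN, since a 7-day window can’t fill until the seventh observation.

first_six_nan = (
    df_ohio_tidy.sort_values(['county', 'obs_date'])
                .groupby('county')['7ma']
                .apply(lambda s: s.head(6).isna().all())
)
print(first_six_nan.all())  # True: every county's window starts fresh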

So, yes, he can both calculate and group the moving average, Mr. Waturi! All the code behind my posts on the COVID-19 data can be found here.
