In my final mini-series on cleaning up stacked bar charts (Part 1 and Part 2, in case you missed them), let’s talk about how you might order the bars of your chart.
In my last post, each bar in my chart represented a different day of the week and I allowed the bars to be ordered accordingly:
Most people would probably expect this sort of ordering. However, what if your groups don’t have an inherent order like day-of-the-week?
For my example, I generated some random email data for five fake email accounts:
import numpy as np
from datetime import date, timedelta
import pandas as pd
# names compliments of: https://frightanic.com/goodies_content/docker-names.php
email_accounts = ['fervent_saha@test.com', 'serene_cori@test.com', 'agitated_pike@test.com',
'cocky_turing@test.com', 'sad_babbage@test.com']
email_data = []
for acct in email_accounts:
for cat in ['primary', 'promotions', 'social']:
nbr_of_email = np.random.randint(50, high=100)
for i in range(0, nbr_of_email):
email_dt = date(2020, 6, 1) + timedelta(days=np.random.randint(0, high=30))
email_data.append([email_dt, acct, cat])
df_email_accts = pd.DataFrame(email_data, columns=['email_date', 'email_account', 'email_category'])
df_email_accts['email_date'] = pd.to_datetime(df_email_accts.email_date)
df_email_accts.head()
Now, let’s use a stacked bar chart to compare the emails counts, by category, of the five different email accounts:
fig, ax = plt.subplots(figsize=(12,8))
_ = df_email_accts.groupby(['email_account', 'email_category']).count().unstack().plot(kind='barh', stacked=True, ax=ax)
_ = ax.set_title('Email counts by category, June 2020')
_ = ax.set_xlabel('Email Count')
_ = ax.set_ylabel('Email Account')
Technically, matplotlib has ordered the email accounts alphabetically–from agitated_pike@test.com to serene_cori@test.com–but most folks probably don’t care about that: they’ll likely want the chart ordered either greatest count to least or least count to greatest.
How can you then order your stacked bar chart by the total count? There may be a more elegant way to do this in pandas, but I came up with three lines to code to get the order right.
To start with, take a look at the dataframe we get with my standard groupby and unstack approach:
df_email_accts.groupby(['email_account', 'email_category']).count().unstack()
What I need is a way to total the counts of the three categories–primary, promotions, and social–for each of the five email accounts and then sort the dataframe by that total.
No problem! I can use the pandas sum function with axis=1–meaning, sum across the columns–to get that total:
df_rpt = df_email_accts.groupby(['email_account', 'email_category']).count().unstack()
df_rpt['total'] = df_rpt.sum(axis=1)
df_rpt.head()
Putting it all together, then, here’s the code I came up with to nicely sorted my stacked bar chart in a meaningful way:
# two lines of code to provide a "total" column that can be used for sorting
df_rpt = df_email_accts.groupby(['email_account', 'email_category']).count().unstack()
df_rpt['total'] = df_rpt.sum(axis=1)
fig, ax = plt.subplots(figsize=(12,8))
# sort the dataframe by the "total" column, then drop it before rendering the chart
_ = df_rpt.sort_values('total')[df_rpt.columns.tolist()[:-1]].plot(kind='barh', stacked=True, ax=ax)
_ = ax.set_title('Email counts by category, June 2020')
_ = ax.set_xlabel('Email Count')
_ = ax.set_ylabel('Email Account')
# and, of course, clean up the legend
original_legend = [t.get_text() for t in ax.legend().get_texts()]
new_legend = [t.replace('(email_date, ', '').replace(')', '') for t in original_legend]
_ = ax.legend(new_legend, title='Category')
Recent Comments