Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

Cleaning up Stacked Bar Charts, Part 3

In this final installment of my mini-series on cleaning up stacked bar charts (Part 1 and Part 2, in case you missed them), let’s talk about how you might order the bars of your chart.

In my last post, each bar in my chart represented a different day of the week and I allowed the bars to be ordered accordingly:

The bars are ordered Monday – Sunday (starting at the bottom left)

Most people would probably expect this sort of ordering. However, what if your groups don’t have an inherent order like day-of-the-week?

For my example, I generated some random email data for five fake email accounts:

import numpy as np
from datetime import date, timedelta
import pandas as pd
import matplotlib.pyplot as plt  # needed for the plt.subplots() calls below


# names compliments of: https://frightanic.com/goodies_content/docker-names.php
email_accounts = ['fervent_saha@test.com', 'serene_cori@test.com', 'agitated_pike@test.com', 
                  'cocky_turing@test.com', 'sad_babbage@test.com']
email_data = []

for acct in email_accounts:
    for cat in ['primary', 'promotions', 'social']:
        nbr_of_email = np.random.randint(50, high=100)
        for i in range(0, nbr_of_email):
            email_dt = date(2020, 6, 1) + timedelta(days=np.random.randint(0, high=30))
            email_data.append([email_dt, acct, cat])
            
df_email_accts = pd.DataFrame(email_data, columns=['email_date', 'email_account', 'email_category'])
df_email_accts['email_date'] = pd.to_datetime(df_email_accts.email_date)
df_email_accts.head()
A bunch of random, fake email data

Now, let’s use a stacked bar chart to compare the email counts, by category, of the five different email accounts:

fig, ax = plt.subplots(figsize=(12,8))
_ = df_email_accts.groupby(['email_account', 'email_category']).count().unstack().plot(kind='barh', stacked=True, ax=ax)

_ = ax.set_title('Email counts by category, June 2020')
_ = ax.set_xlabel('Email Count')
_ = ax.set_ylabel('Email Account')
Bar chart chaos!

Technically, the email accounts are ordered alphabetically, from agitated_pike@test.com to serene_cori@test.com, because the pandas groupby sorts its group keys by default. Most folks probably don’t care about that, though: they’ll likely want the chart ordered either greatest count to least or least count to greatest.

How, then, can you order your stacked bar chart by the total count? There may be a more elegant way to do this in pandas, but I came up with three lines of code to get the order right.

To start with, take a look at the dataframe we get with my standard groupby and unstack approach:

df_email_accts.groupby(['email_account', 'email_category']).count().unstack()

What I need is a way to total the counts of the three categories–primary, promotions, and social–for each of the five email accounts and then sort the dataframe by that total.

No problem! I can use the pandas sum function with axis=1–meaning, sum across the columns–to get that total:

df_rpt = df_email_accts.groupby(['email_account', 'email_category']).count().unstack()
df_rpt['total'] = df_rpt.sum(axis=1)
df_rpt.head()
The sum function gives me a “total” value I can use for sorting

Putting it all together, then, here’s the code I came up with to sort my stacked bar chart in a meaningful way:

# two lines of code to provide a "total" column that can be used for sorting
df_rpt = df_email_accts.groupby(['email_account', 'email_category']).count().unstack()
df_rpt['total'] = df_rpt.sum(axis=1)

fig, ax = plt.subplots(figsize=(12,8))

# sort the dataframe by the "total" column, then drop it before rendering the chart
_ = df_rpt.sort_values('total')[df_rpt.columns.tolist()[:-1]].plot(kind='barh', stacked=True, ax=ax)
_ = ax.set_title('Email counts by category, June 2020')
_ = ax.set_xlabel('Email Count')
_ = ax.set_ylabel('Email Account')

# and, of course, clean up the legend
original_legend = [t.get_text() for t in ax.legend().get_texts()]
new_legend = [t.replace('(email_date, ', '').replace(')', '') for t in original_legend]
_ = ax.legend(new_legend, title='Category')
A nicely sorted, stacked bar chart where the high and low counts are immediately apparent
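
One possible alternative, offered only as a sketch: counting with size() instead of count() leaves plain category names as the columns, and the rows can then be reordered by their totals without a temporary "total" column. The df_sample frame below is a tiny made-up stand-in for df_email_accts, and the names df_counts, row_order, and df_sorted are my own.

```python
import pandas as pd

# tiny hypothetical sample standing in for df_email_accts
df_sample = pd.DataFrame({
    'email_account': ['a@test.com', 'a@test.com', 'a@test.com',
                      'b@test.com', 'c@test.com', 'c@test.com'],
    'email_category': ['primary', 'social', 'social',
                       'primary', 'primary', 'promotions'],
})

# size() counts rows per group; unstack() then yields flat category columns
df_counts = df_sample.groupby(['email_account', 'email_category']).size().unstack(fill_value=0)

# order the rows by their total without storing a helper column
row_order = df_counts.sum(axis=1).sort_values().index
df_sorted = df_counts.loc[row_order]

print(df_sorted)
```

Passing df_sorted to .plot(kind='barh', stacked=True, ax=ax) should draw the same kind of chart, and because the columns are plain category names, the legend labels need no string cleanup.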

Cleaning up Stacked Bar Charts, Part 2

Here is the second installment in my mini-series on stacked bar charts.

Grouping in your stacked bar charts can be powerful and insightful. With time series data, grouping by the day of the week, by month, or even by year can provide an interesting perspective on your data.

Considering the email data I used in my previous post, I can use the following code to group my data by day of week:

fig, ax = plt.subplots(figsize=(12,8))
title = 'Email counts by day of week: {0:%d %b %Y} - {1:%d %b %Y}'.format(df_email.email_dt.min(), df_email.email_dt.max())

_ = df_email[['email_dt','category','dow']].groupby(['dow','category']).count().unstack().\
    plot(stacked=True, kind='barh', title=title, ax=ax)
Just what are those numbers in the Y column?

Interesting: I certainly receive more email on days 2 and 3 but…wait…what are days 2 and 3?!

Days 2 and 3 correspond to Wednesday and Thursday, respectively. I know this because I used the pandas dayofweek property to get those values, and it numbers the days from 0 (Monday) to 6 (Sunday). I may know that, but the average viewer of my chart won’t. So, I need a way to change those labels to ones the viewer can understand. I can do that with the following code (the most pertinent part is the final block, which replaces the day numbers with their names):

fig, ax = plt.subplots(figsize=(12,8))
title = 'Email counts by day of week: {0:%d %b %Y} - {1:%d %b %Y}'.format(df_email.email_dt.min(), df_email.email_dt.max())

df_email[['email_dt','category','dow']].groupby(['dow','category']).count().unstack().\
    plot(stacked=True, kind='barh', ax=ax)

_ = ax.set_title(title)
_ = ax.set_xlabel('Email Count')
_ = ax.set_ylabel('Day of Week')

# clean up the legend
original_legend = [t.get_text() for t in ax.legend().get_texts()]
new_legend = [t.replace('(email_dt, ', '').replace(')', '') for t in original_legend]
_ = ax.legend(new_legend, title='Category')

# now, replace the day numbers with their names
day_labels = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
# ax.get_yticklabels() avoids the Tick.label attribute, which newer matplotlib versions deprecate
curr_ylabels = [t.get_text() for t in ax.get_yticklabels()]
new_ylabels = [day_labels[int(l)] for l in curr_ylabels]
_ = ax.set_yticklabels(new_ylabels)
Ahhh: much better!
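
The hand-written day_labels dictionary could also be built from the standard library. This is a small sketch, assuming an English locale (the names in calendar.day_name follow the current locale); like the pandas dayofweek values, it numbers Monday as 0.

```python
import calendar

# calendar.day_name runs Monday..Sunday, matching the 0..6 dayofweek numbering
day_labels = dict(enumerate(calendar.day_name))

print(day_labels[2], day_labels[3])
```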

Interestingly, pandas does have a day_name method that returns the name of the day instead of its number. The nice thing about my approach (using the dayofweek numbers and then replacing the numbers with the friendly names) is that the groupby sorts its keys numerically, so my bars are already in a natural order. In this case: Monday through Sunday. Were I to use day_name instead, the groupby would sort the names alphabetically, from Friday to Wednesday. That would make for an oddly arranged bar chart.
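
If you did want to work with day names directly, one way to keep the weekday order (a sketch, not what I used for the chart above) is an ordered pandas Categorical. The df_days frame and the day_order list are made up for illustration.

```python
import pandas as pd

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

# hypothetical sample: one email per day for the week starting Monday, 1 June 2020
df_days = pd.DataFrame({'email_dt': pd.date_range('2020-06-01', periods=7, freq='D')})

# plain day-name strings come back in alphabetical order when grouped
alphabetical = df_days.groupby(df_days.email_dt.dt.day_name()).size().index.tolist()

# an ordered categorical makes the groupby follow the weekday order instead
df_days['day_name'] = pd.Categorical(df_days.email_dt.dt.day_name(),
                                     categories=day_order, ordered=True)
natural = df_days.groupby('day_name', observed=False).size().index.tolist()

print(alphabetical)
print(natural)
```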

Family bingo

During the quarantine, one family activity we’ve begun is weekly virtual meetings with family members we’ve been prevented from seeing face-to-face. To add some structure and fun to the meetings, we play simple games like Bingo. It occurred to me that it might be even more fun and interesting to personalize our Bingo games.

For example, take my favorite TV family, The Bundys:

The Bundys

Now, suppose the Bundys were to reunite virtually for a family get-together and decided to play a personalized game of Bingo in the manner I’m proposing. They might first create a list of their names: Al, Peggy, Kelly, and Bud. They might add other names to the list like Steve, Marcy, and Jefferson. They could add memorable events like “Polk High” and “Four Touchdowns”, family vacations including “Dumpwater, Florida” and “Lower Uncton, England” and possessions such as “the Dodge” and “Buck the dog”.

Based on a previous post of mine, they could generate personalized bingo cards like so:

import matplotlib.pyplot as plt
import matplotlib.style as style
import numpy as np
import random

%matplotlib inline
# on matplotlib 3.6 and later, this style is named 'seaborn-v0_8-poster'
style.use('seaborn-poster')


bundy_data = ['Al', 'Peg', 'Kelly', 'Bud', 'Buck', 'Steve', 'Marcy', 'Jefferson', 'Griff', 'Gary\'s\nShoes', 'Polk High', 
              'Four\nTouchdowns', 'Shoe\nSalesman', 'Lucky', 'Dumpwater,\nFL', 'No Ma\'am', 'Wanker\nCounty', 'Dodge', 
              'Bob\nRooney', 'Officer\nDan', 'Psycho\nDad', 'Ike', 'Seven', 'Anthrax', 'Jim\nJupiter', 'Sticky\nthe Clown',
              'Love &\nMarriage', 'Grandmaster\nB', 'chicken', 'Lower\nUncton', '9674\nJeopardy Ln', 'Ferguson\ntoilets', 
              'Chicago']

rowlen = 5  # bingo cards are usually 5x5

fig = plt.figure(figsize=(8, 8))
ax = fig.gca()
ax.set_xticks(np.arange(0, rowlen + 1))
ax.set_yticks(np.arange(0, rowlen + 1))
plt.grid()
_ = ax.set_xticklabels([])
_ = ax.set_yticklabels([])

for i, ltr in enumerate('BUNDY'):
    x = (i % rowlen) + 0.4
    y = 5.0
    ax.annotate(ltr, xy=(x, y), xytext=(x, y), size=20, weight='bold')
    
random.shuffle(bundy_data)
for i, phrase in enumerate(bundy_data[:rowlen**2]):
    x = (i % rowlen) + 0.29
    y = (i // rowlen) + 0.5
    ax.annotate(phrase, xy=(x, y), xytext=(x, y))
A personalized Bundy family bingo card

The host calling out the bingo squares could simply run Python code like the following to generate a random list of squares to call:

nbr_of_picks = 20  # generate, say, 20 squares to call

for i in np.arange(nbr_of_picks):
    print('{0} - {1}'.format(random.choice('BUNDY'), random.choice(bundy_data).replace('\n', ' ')))

This would generate a list like so:

Y - Marcy
Y - Steve
B - Dodge
N - Ike
U - Grandmaster B
U - Lucky
U - Gary's Shoes
N - Griff
U - Steve
U - Marcy
Y - Bud
B - Psycho Dad
B - Polk High
N - Officer Dan
B - Dodge
B - Wanker County
Y - Anthrax
U - chicken
Y - Shoe Salesman
B - Ferguson toilets
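
Notice that random.choice can repeat a call ("B - Dodge" shows up twice above). If that bothers you, one option, sketched here with a short made-up phrases list standing in for bundy_data, is to build every letter/phrase combination and draw from it with random.sample, which picks without replacement.

```python
import itertools
import random

# hypothetical short phrase list standing in for bundy_data
phrases = ['Al', 'Peg', 'Kelly', 'Bud', 'Polk\nHigh']
letters = 'BUNDY'

# every possible letter/phrase call
all_calls = list(itertools.product(letters, phrases))

# random.sample picks without replacement, so no call repeats
nbr_of_picks = 10
calls = random.sample(all_calls, nbr_of_picks)

for ltr, phrase in calls:
    print('{0} - {1}'.format(ltr, phrase.replace('\n', ' ')))
```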

If your family name is not five characters long, you could use “BINGO” instead or make your cards larger or smaller accordingly. And, of course, come up with your own personal family names, events, and so on for the card data.
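
As a rough sketch of resizing the card, the grid can be driven by the length of whatever word heads the columns. The family_name and phrases values below are made up for illustration; divmod yields the same row and column that the card-drawing loop above derives from the index i and rowlen.

```python
import random

family_name = 'SMITHS'      # hypothetical six-letter name
rowlen = len(family_name)   # one column per letter
phrases = ['phrase {0}'.format(n) for n in range(40)]  # placeholder card data

# a square card needs rowlen * rowlen phrases
cells_needed = rowlen ** 2
if len(phrases) < cells_needed:
    raise ValueError('need at least {0} phrases'.format(cells_needed))

random.shuffle(phrases)

# map each phrase to its (column, row) cell on the grid
card = {}
for i, phrase in enumerate(phrases[:cells_needed]):
    row, col = divmod(i, rowlen)
    card[(col, row)] = phrase

print(len(card), 'cells on a', rowlen, 'x', rowlen, 'card')
```

In the plotting code above, the tick ranges already use rowlen, but the header row's hard-coded y = 5.0 would need to become rowlen as well.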

© 2024 DadOverflow.com