

Cleaning up Stacked Bar Charts, Part 1

Stacked bar charts are a handy way of conveying a lot of information in a single visual, and pandas makes it pretty easy to generate these charts: just set the stacked property to True in the plot function.

As great as this operation is, though, you still need to do some cleanup work on your chart afterwards. In this post, I’ll talk about how I clean up the legend generated in a stacked bar chart. For my example, I’ll take a small sample of email data I gathered recently from one of my email accounts.

Bring in the data and do some standard cleanup

import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline


df_email = pd.read_csv('./data/email_data.csv', names=['email_ts', 'subject', 'category'])

df_email['email_ts'] = pd.to_datetime(df_email.email_ts)
df_email['email_dt'] = df_email.email_ts.dt.date
df_email['dow'] = df_email.email_ts.dt.dayofweek
# just look at 30 or so days of data
df_email = df_email[(df_email.email_ts>'2020-04-14 00:00:00') & (df_email.email_ts<'2020-05-15 00:00:00')]

As a standard practice, whenever I have timestamp data, I always add a “date” column and a “day of week” column. If my data spans multiple months, I’ll even add a “month” column. These columns make it much easier to group the data by day, day of week, and month later on.
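Here’s a minimal sketch of that pattern on a couple of hypothetical timestamps (the column names mirror the ones above):

```python
import pandas as pd

# hypothetical timestamps just to illustrate the helper columns
df = pd.DataFrame({'email_ts': pd.to_datetime(['2020-04-14 09:15', '2020-05-01 17:30'])})

df['email_dt'] = df.email_ts.dt.date    # calendar date
df['dow'] = df.email_ts.dt.dayofweek    # Monday=0 ... Sunday=6
df['month'] = df.email_ts.dt.month      # 1-12

print(df[['email_dt', 'dow', 'month']])
```

With those columns in place, a later groupby('dow') or groupby('month') is a one-liner.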

Chart the data

Here’s a quick glimpse of the data in my dataset:

fig, ax = plt.subplots(figsize=(12,8))
title = 'Email counts: {0:%d %b %Y} - {1:%d %b %Y}'.format(df_email.email_dt.min(), df_email.email_dt.max())

df_email[['email_dt','dow']].groupby('email_dt').count().plot(ax=ax)
_ = ax.set_title(title)
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Email Count')
_ = fig.autofmt_xdate()

Now, create a stacked bar chart

Here’s the type of code I normally write to generate a stacked bar chart:

fig, ax = plt.subplots(figsize=(12,8))
title = 'Email counts: {0:%d %b %Y} - {1:%d %b %Y}'.format(df_email.email_dt.min(), df_email.email_dt.max())

df_email[['email_dt','category','dow']].groupby(['email_dt','category']).count().unstack().\
    plot(stacked=True, kind='bar', ax=ax)

_ = ax.set_title(title)
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Email Count')
_ = ax.set_ylim([0,46])  # just to give some space to the legend
_ = fig.autofmt_xdate()
Decent chart, but what’s the deal with that legend?

So, this chart is pretty decent, but that legend needs work. The good news is that three lines of code will clean it up nicely. Here’s my better version:

fig, ax = plt.subplots(figsize=(12,8))
title = 'Email counts: {0:%d %b %Y} - {1:%d %b %Y}'.format(df_email.email_dt.min(), df_email.email_dt.max())

df_email[['email_dt','category','dow']].groupby(['email_dt','category']).count().unstack().\
    plot(stacked=True, kind='bar', ax=ax)

_ = ax.set_title(title)
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Email Count')
_ = fig.autofmt_xdate()
_ = ax.set_ylim([0,46])  # just to give some space to the legend

original_legend = [t.get_text() for t in ax.legend().get_texts()]
new_legend = [t.replace('(dow, ', '').replace(')', '') for t in original_legend]
_ = ax.legend(new_legend, title='Category')
Nicer looking legend
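Incidentally, those ugly default labels come straight from pandas: unstack() leaves a two-level MultiIndex on the columns, and matplotlib renders each column tuple as a label like “(dow, promotions)”. Here’s a quick sketch with made-up data that shows where the tuples come from, plus a droplevel alternative to the string-replace approach:

```python
import pandas as pd

# made-up stand-in for the email data
df = pd.DataFrame({
    'email_dt': ['2020-04-14', '2020-04-14', '2020-04-15'],
    'category': ['primary', 'social', 'primary'],
    'dow': [1, 1, 2],
})

counts = df.groupby(['email_dt', 'category']).count().unstack()
# the columns are now tuples: [('dow', 'primary'), ('dow', 'social')]
print(counts.columns.tolist())

# an alternative to the string replace: drop the extra column level entirely
counts.columns = counts.columns.droplevel(0)
print(counts.columns.tolist())  # ['primary', 'social']
```

Dropping the level before plotting means the legend labels are clean from the start.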

So, with stacked bar charts, this is one approach I take to make the end product look a little nicer. In upcoming posts, I’ll show even more techniques to clean up your charts.

pandas and literal_eval

Occasionally, I have to work with more complicated forms of data where a given column might contain an array of values instead of conventional data types like strings, integers, or timestamps. Often, such complex fields can easily be processed with the excellent literal_eval function from the Abstract Syntax Trees (ast) module. For example, take this movie database from Kaggle:

import pandas as pd
from ast import literal_eval

df = pd.read_csv('./data/tmdb_5000_credits.csv')
df['cast'] = df.cast.apply(literal_eval)
df.head()

And when we take a look at that cast column, specifically:

type(df.loc[0,'cast'])

# this prints out "list" as the type

Basically, literal_eval converted the string of brackets and curly braces into lists of dictionary objects. Very cool and handy.
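In other words, a cell that comes out of the CSV as one long string becomes a real Python list of dictionaries (the record below is a made-up, abbreviated example of what a cast cell looks like):

```python
from ast import literal_eval

# hypothetical cast cell: stored in the CSV as a single string
raw = "[{'name': 'Some Actor', 'order': 0}, {'name': 'Another Actor', 'order': 1}]"

cast = literal_eval(raw)
print(type(cast).__name__)  # list
print(cast[0]['name'])      # Some Actor
```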

However, in real life, you don’t always have such clean data to work with. Recently, I was working with some complex data, tried to use literal_eval on it, and got an error like the following:

ValueError: malformed node or string: <_ast.Name object at 0x000001F62A851FC8>

Whoa. There’s something about this data that literal_eval doesn’t like. What could it be? Let me do a little exception handling around literal_eval to see if I can find the row(s) where the function breaks:

def try_the_eval(row):
    try:
        literal_eval(row.cast)
    except (ValueError, SyntaxError):
        # literal_eval raises these when the string isn't a valid Python literal
        print('Found bad data at row: {0}'.format(row.name))


df_dirty = pd.read_csv('./data/dirty_data.csv')
_ = df_dirty.apply(try_the_eval, axis=1)

This code told me I had badly formatted data in row 32 of my dataset. I can take a closer look at that data like so:

df_dirty.loc[32,'cast']
Careful review of the data (and a handy red box) reveals the problem

literal_eval only parses Python literals: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None. A bare null is not a Python literal, so the parse fails.
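You can reproduce the failure in a couple of lines: None parses fine, but a bare null gets treated as a variable name, which literal_eval refuses (a contrived example):

```python
from ast import literal_eval

# None is a Python literal, so this parses cleanly
print(literal_eval("[{'character': None}]"))

try:
    # null is parsed as a variable name, not a literal
    literal_eval("[{'character': null}]")
except ValueError as e:
    print('rejected:', e)
```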

So, how do you fix such an issue? I can think of at least two ways:

  1. Wrap null values like these in double quotes or
  2. Replace null values like these with None.

Here, I go with Option #2:

df_dirty['cast'] = df_dirty.cast.str.replace('null', 'None')
df_dirty['cast'] = df_dirty.cast.apply(literal_eval)

# no exceptions are thrown and the below line prints out "list"
print(type(df_dirty.loc[32,'cast']))

One caveat here is that you need to be careful how you search and replace the errant data. For example, the word null might appear legitimately inside quoted fields like descriptions. You don’t want to accidentally replace those legitimate uses with None, or wrap them in quotation marks, as that will introduce new problems. So, you may need slightly more rigorous ways of finding and replacing your errant data, such as regular expressions. Once you deal with that data, though, literal_eval works like a charm and saves you lots of time working with complex datasets!
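One rough regex heuristic, for example, is to replace null only when it appears as a bare value, i.e. when the next non-space character closes the value with a comma, brace, or bracket. The sample string here is hypothetical:

```python
import re
from ast import literal_eval

# made-up record: one bare null (broken) and one 'null' inside a quoted string (legitimate)
dirty = "[{'character': null, 'note': 'null appears here on purpose'}]"

# replace null only when it is immediately followed by , } or ]
fixed = re.sub(r'\bnull\b(?=\s*[,}\]])', 'None', dirty)
record = literal_eval(fixed)
print(record[0]['character'])  # None
print(record[0]['note'])       # the quoted 'null' survives untouched
```

This pattern is only a sketch; a string value that happens to end in the word null right before a quote would still slip past it, so test it against your own data.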

Coding Google Apps Script

Like most people, I use a great many of the free applications provided by Google. Unlike a lot of people, I like to occasionally collect data on how I’m using some of these applications, especially Gmail.

Unfortunately, there do not appear to be easy ways to collect usage data from Gmail; however, Google does provide a scripting platform, Google Apps Script, that allows you to code against Google’s various APIs and get at most of the data you might be interested in.

Recently, I found myself interested in analyzing the categories of email coming into my inbox. How could I collect simple email properties like date, subject line, and category and write the results to a CSV for further analysis?

Step 1: Launch a new spreadsheet

Navigate to Google Docs in your browser and launch a new spreadsheet. This tutorial was excellent help.

Step 2: Launch the script editor

Following that tutorial, open up a new instance of the Script Editor under the Tools menu. This step is pretty important because it seems to “bind” your scripting work to the new spreadsheet. Initially, I simply started a new scripting project in Apps Script apart from any spreadsheet and found no way to get that work to write to any of my spreadsheets.

Step 3: Code away

In this endeavor, my goal was pretty simple: grab particular email metadata and write the data to a spreadsheet. This required coding against both the GmailApp API and the Spreadsheet API. Here’s the code I ultimately came up with (I believe this is vanilla JavaScript):

function write_gmail_stats() {
  // get a reference to the active spreadsheet
  var sheet = SpreadsheetApp.getActiveSheet();
  // the three default Gmail categories to collect
  var categories = ['promotions', 'social', 'primary'];

  for (var c = 0; c < categories.length; c++) {
    var category = categories[c];
    // get the collection of email threads in this category sent after 31 Mar 2020
    var threads = GmailApp.search('category:' + category + ' after:2020/03/31');
    // write the date, subject, and category of each thread to the spreadsheet
    for (var i = 0; i < threads.length; i++) {
      var m = threads[i];
      sheet.appendRow([m.getLastMessageDate(), m.getFirstMessageSubject(), category]);
    }
  }
}

This code will take the date, subject, and category name of the three default categories in Gmail and write the results to my spreadsheet. To keep my results modest, I crafted a search query to only look at email from April (this support page was very helpful in figuring out the query I needed). The results worked out pretty well:

Sweet! This certainly beats screen scraping or manual data collecting.


© 2024 DadOverflow.com
