Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Building clock-style radar charts

When you’re dealing with event data, one neat visualization option is to depict your data over a twenty-four-hour period on a radar chart built to look like the face of a clock. Matplotlib’s polar chart capabilities make this relatively simple.

As an example, I’ll chart crime incident data from the city of Cincinnati.

Step 1: Bring in the data

For starters, set up your standard package import statements and read in the dataset:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime, timedelta, date


# data from : https://data.cincinnati-oh.gov/Safety/PDI-Police-Data-Initiative-Crime-Incidents/k59e-2pvf
df_crime = pd.read_csv('./data/PDI__Police_Data_Initiative__Crime_Incidents.csv')
df_crime['DATE_REPORTED'] = pd.to_datetime(df_crime.DATE_REPORTED)

Let’s try to find the day with the highest number of incidents in 2019:

df_crime[df_crime.DATE_REPORTED.dt.year==2019][['DATE_REPORTED','INSTANCEID']].groupby(df_crime.DATE_REPORTED.dt.date).\
    count().sort_values('INSTANCEID', ascending=False)

It looks like the most incidents took place on January 15. Now, when did those incidents occur over the course of the day?

df_crime[df_crime.DATE_REPORTED.dt.date==date(2019,1,15)][['DATE_REPORTED']].groupby(df_crime.DATE_REPORTED.dt.hour).count()

Grouping by hour, we can see that the vast majority of incidents occurred during the 9am hour. Now, let’s visualize that.

Step 2: Create a handy chart function

To make my work a little more portable, I created a “render_chart” function that takes as parameters the dataframe of hour data, the axis in which to place the chart, and the title:

def render_chart(df, axis, title):
    theta = np.arange(df.shape[0])/float(df.shape[0]) * 2 * np.pi
    _ = axis.bar(theta + theta[1]/2, df.event_count, width=theta[1], color='red')
    ticklabels = [(timedelta(hours=h) + datetime(2021,1,1)).strftime('%#I%p').lower() for h in range(0,24)]
    _ = axis.set_xticks(theta)
    _ = axis.set_xticklabels(ticklabels)
    _ = axis.set_yticklabels([])
    _ = axis.set_title(title)

    axis.set_theta_direction(-1)
    axis.set_theta_zero_location('N')

Some things to note with my function:

  1. The function expects the dataframe to contain a column called “event_count” that is a count of events for each hour over the day
  2. Finding the right time format string so that I could display 1am instead of 01am was actually a bit difficult. That hash mark (#) did the trick on Windows; on Linux and macOS, the equivalent flag is a hyphen (%-I).
  3. Matplotlib polar charts, by default, render counter-clockwise. Calling set_theta_direction(-1) lets you reverse that behavior. Calling set_theta_zero_location('N') starts the chart at the top (north), like a clock.
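
Because the no-padding flag for strftime differs by platform, one option for a script that has to run in more than one place is to build the labels by hand. This is only a sketch; hour_label is a hypothetical helper, not part of the function above:

```python
def hour_label(h):
    """Return a clock-style label such as '12am', '1am', or '3pm' for an hour 0-23."""
    suffix = 'am' if h < 12 else 'pm'
    hour12 = h % 12 or 12  # hours 0 and 12 both display as 12
    return f'{hour12}{suffix}'


# drop-in replacement for the strftime-based ticklabels list
ticklabels = [hour_label(h) for h in range(24)]
```

The resulting list has the same 24 entries, in the same order, as the strftime version, so it can be passed to set_xticklabels unchanged.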

Step 3: Pad your data

For the January 15 data, there are several hours of the day with no reported incidents (e.g., from 4am to 6am). To get the chart to render correctly, I need to pad those empty periods with 0. I solved that problem by creating an “empty hours” dataframe (24 hours, each with an event count of 0) and then joining the real data onto it:

# get my real data
df_chart = df_crime[df_crime.DATE_REPORTED.dt.date==date(2019,1,15)][['DATE_REPORTED']].\
    groupby(df_crime.DATE_REPORTED.dt.hour).count().rename(columns={'DATE_REPORTED':'event_count'})
# create an "empty hour" dataframe
df_empty_hrs = pd.DataFrame(np.zeros(24), index=range(0, 24))
# merge the two together
df_chart = df_empty_hrs.join(df_chart, how='left').fillna(0).drop(columns=[0])
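
As an alternative to the helper dataframe, pandas' reindex can do the same padding in one step. Here is a minimal sketch on made-up counts (df_counts and its values are illustrative, not the Cincinnati data):

```python
import pandas as pd

# hypothetical hourly counts with gaps: only hours 0, 3, and 9 have incidents
df_counts = pd.DataFrame({'event_count': [3, 1, 12]}, index=[0, 3, 9])

# reindex against all 24 hours; hours absent from the data are filled with 0
df_padded = df_counts.reindex(range(24), fill_value=0)
```

The result has one row per hour and keeps the event_count column, so no extra column needs to be dropped afterward.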

Step 4: Finally, render the chart

Now, we can produce the chart:

fig, ax = plt.subplots(figsize=(12, 7), subplot_kw={'projection': 'polar'})

render_chart(df_chart, ax, 'Cincinnati Crime Incidents: 15 Jan 2019')

It is interesting that an overwhelming majority of incidents on this day occurred at the 9am hour.

It would also be interesting to see what an average day in 2019 looked like: maybe weekday versus weekend, or maybe an average for each day, Monday through Sunday. Pandas makes it pretty simple to do these calculations and, with my function, you can easily visualize the results!
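
Here is one rough way the weekday-versus-weekend averages might be calculated. The timestamps below are randomly generated stand-ins for DATE_REPORTED, and names like df_demo and day_type are my own, so treat this as a sketch rather than a result:

```python
import numpy as np
import pandas as pd

# synthetic timestamps spread over four weeks (an assumption, not the real data)
rng = np.random.default_rng(0)
minutes = rng.integers(0, 28 * 24 * 60, size=2000)
stamps = pd.Timestamp('2019-01-01') + pd.to_timedelta(minutes, unit='min')
df_demo = pd.DataFrame({'DATE_REPORTED': stamps})

df_demo['hour'] = df_demo.DATE_REPORTED.dt.hour
df_demo['day'] = df_demo.DATE_REPORTED.dt.date
# dayofweek runs Monday=0 through Sunday=6
df_demo['day_type'] = np.where(df_demo.DATE_REPORTED.dt.dayofweek >= 5, 'weekend', 'weekday')

# count incidents per day and hour, with one column per hour;
# hours with no incidents on a given day become 0
per_day = df_demo.groupby(['day_type', 'day', 'hour']).size().unstack('hour', fill_value=0)
per_day = per_day.reindex(columns=range(24), fill_value=0)

# average across days within each day type
# (days with no incidents at all never appear in per_day, so they are not averaged in)
avg_by_hour = per_day.groupby(level='day_type').mean()

df_weekday = avg_by_hour.loc['weekday'].to_frame('event_count')
df_weekend = avg_by_hour.loc['weekend'].to_frame('event_count')
```

Each of the two frames has 24 rows and an event_count column, which is the shape render_chart expects.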

Functions in Pandas groupby

I’ve written about the pandas groupby function a few times in the past; it’s a valuable command that I use frequently. You typically want to pipe your “group by” operations to a calculation function like count, sum, mean, etc. This blog post has a great write-up on groupby and the calculations you can do with it.

Most examples of groupby depict grouping your dataframes by referencing the literal names of your various columns. For example, working with this movie dataset, suppose I wanted to know how many movies are in the data per year. Typically, I’d code something like the following:

import pandas as pd


df = pd.read_csv('./data/regex_imdb.csv').fillna(0)
df[['Year', 'Name']].groupby('Year').count()

Getting fancier, suppose I wanted to group by both year and genre. I could do this (note that in this dataset, a multi-genre movie lists its genres as one comma-delimited string, so each distinct combination forms its own group):

df[['Year', 'Genre', 'Name']].groupby(['Year', 'Genre']).count()

But what if I wanted to do something slightly trickier, like grouping by year and whether or not a film was a comedy? You could add a new boolean column and use that in your grouping:

df['is_comedy'] = df.Genre.str.contains('Comedy')
df[['Year', 'is_comedy', 'Name']].groupby(['Year', 'is_comedy']).count()

However, instead of taking the extra step of adding a new column to your dataframe, you could do that work inline with the pandas map function, especially if you don’t think you’ll use that new column elsewhere:

df[['Year', 'Genre']].groupby(['Year', df.Genre.map(lambda g: 'Comedy' if 'Comedy' in g else 'Some other genre')]).count()
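
A small variation on the same idea: build the grouping key as a standalone, named Series and hand it to groupby, so the index level in the result gets a meaningful name. The three-row frame below is made up for illustration:

```python
import pandas as pd

# hypothetical stand-in for the movie data
df_demo = pd.DataFrame({
    'Year': [2010, 2010, 2011],
    'Genre': ['Comedy,Drama', 'Action', 'Comedy'],
    'Name': ['Film A', 'Film B', 'Film C'],
})

# the key lives outside the dataframe; rename() sets the index level name
is_comedy = df_demo.Genre.str.contains('Comedy').rename('is_comedy')
counts = df_demo.groupby(['Year', is_comedy]).Name.count()
```

The dataframe itself is left untouched, and the result is indexed by Year and is_comedy.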

I have found this approach especially helpful when grouping by timestamps. Suppose you want to group your dataframe by date and hour. That now becomes pretty simple:

from datetime import datetime


d = {'dt':[datetime(2021,10,30,3,0,0),datetime(2021,10,30,3,0,0),datetime(2021,10,30,3,0,0),datetime(2021,10,30,4,0,0),
           datetime(2021,10,30,5,0,0),datetime(2021,10,30,5,0,0),datetime(2021,10,31,3,0,0),datetime(2021,10,31,3,0,0)],
     'desc':['some event','some other event','big event','small event','medium sized event',
             'nothing to see here','event A','event B']}

df_events = pd.DataFrame(d)
df_events.groupby([df_events.dt.map(lambda d: datetime.date(d)), df_events.dt.map(lambda d: d.hour)]).count()

One important note: initially, I assumed my datetime values had a date property that I could use as I use their hour properties:

df_events.groupby([df_events.dt.map(lambda d: d.date)]).count()

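Unfortunately, that command will throw a strange error. The cause is that date, unlike hour, is a method rather than a property, so d.date passes pandas the method itself instead of a date value. You have to call it, either as d.date() or through the datetime class, as datetime.date(d) does above.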
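
A short sketch of the distinction, using a made-up three-row frame (df_demo and its values are illustrative only). It also shows the .dt accessor, which gets the same grouping without a lambda:

```python
from datetime import datetime

import pandas as pd

df_demo = pd.DataFrame({
    'dt': [datetime(2021, 10, 30, 3), datetime(2021, 10, 30, 3), datetime(2021, 10, 31, 5)],
    'desc': ['event A', 'event B', 'event C'],
})

# hour is an attribute, but date is a method, so it needs parentheses
dates = df_demo['dt'].map(lambda d: d.date()).rename('date')
hours = df_demo['dt'].map(lambda d: d.hour).rename('hour')
by_call = df_demo.groupby([dates, hours])['desc'].count()

# the vectorized .dt accessor exposes both as properties of the column
by_accessor = df_demo.groupby([
    df_demo['dt'].dt.date.rename('date'),
    df_demo['dt'].dt.hour.rename('hour'),
])['desc'].count()
```

Both versions produce the same counts per date and hour.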

The pandas “transform” function

I had a challenge where I needed to group a large dataset by a particular feature and then calculate a variety of statistics on those feature groups, including a standard score for each record. My inclination was to loop through each group and run these calculations in each iteration, sorta outside my main dataframe, but then I thought, could there be an easier way in pandas to do this work?

Yes, there is: transform.

As an example, take the movies dataset that I’ve used in the past:

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


# data from: https://www.kaggle.com/mysarahmadbhat/imdb-top-1000-movies
df = pd.read_csv('./data/regex_imdb.csv').fillna(0)
# filter out movies with no reported Gross
df = df[df.Gross != 0.0]

Suppose you wanted to know how each movie grossed against the average gross for its release year. To find this out, my inclination would be to loop through each year, calculate the mean for each year, then merge that value back into my main dataframe so that I could find each movie's difference from the mean:

yr_means = []
for yr in df.Year.unique().tolist():
    yr_means.append({'Year': yr, 'year_mean': df[df.Year==yr].Gross.mean()})
    
# put my year mean calculations into a new dataframe
df_year_means = pd.DataFrame(yr_means)

df = df.merge(df_year_means, on='Year')

# and now I can calculate my difference from the mean
df['diff_from_year_mean'] = df.Gross - df.year_mean

But with the transform function, I can do all this work in a single line:

df['year_mean'] = df.groupby('Year').Gross.transform(np.mean)

# and now I can calculate my difference from the mean
df['diff_from_year_mean'] = df.Gross - df.year_mean
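
The same pattern extends to a per-group standard score. Below is a minimal sketch on made-up numbers (df_demo is a hypothetical stand-in for the movie data); note that the 'std' alias in pandas computes the sample standard deviation (ddof=1):

```python
import pandas as pd

# hypothetical stand-in for the movie data
df_demo = pd.DataFrame({
    'Year': [2010, 2010, 2010, 2011, 2011],
    'Gross': [10.0, 20.0, 30.0, 5.0, 15.0],
})

grouped = df_demo.groupby('Year').Gross

# transform returns one value per original row, so the arithmetic lines up row by row
df_demo['year_zscore'] = (df_demo.Gross - grouped.transform('mean')) / grouped.transform('std')
```

Each record now carries its distance from its own year's mean, measured in that year's standard deviations.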

And from there you can do interesting work like diverging line charts:

fig, ax = plt.subplots(figsize=(8, 8))

year_to_plot = 2010
plot_data = df[df.Year==year_to_plot][['Name', 'diff_from_year_mean']].sort_values('diff_from_year_mean')
plot_data['color'] = plot_data.diff_from_year_mean.apply(lambda d: 'red' if d < 0 else 'green')

_ = ax.hlines(data=plot_data, y='Name', xmin=0, xmax=plot_data.diff_from_year_mean, color=plot_data.color)
_ = ax.table(cellText=[['${0:.2f} Million'.format(df[df.Year==year_to_plot].year_mean.values[0])]], 
             colLabels=['Avg Gross'], colWidths=[0.25], loc='center right')
_ = ax.set_xlabel('Gross earnings from the average (millions of dollars)')
_ = ax.set_title('Movie gross earnings from the average: top rated movies of {0}'.format(year_to_plot))

The transform function takes a variety of functions, both in the conventional function signature and, sometimes, as a string alias. For example, you can use:

  • np.min or ‘min’ to get the minimum value of the distribution
  • np.max or ‘max’ to get the maximum value of the distribution
  • np.std or ‘std’ to get the standard deviation of the distribution
  • len or ‘count’ to get a record count of your distribution (‘count’ skips missing values; len counts every row)
  • np.var or ‘var’ to get the variance of the distribution

You can even throw other functions/aliases at it like ‘first’ to get the first value of your distribution. However, you may need to do some sorting first or you may not get the values you were expecting.
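
To illustrate why the sorting matters, here is a small sketch with made-up rows (df_demo, its values, and the top_film column are hypothetical). Without the sort, 'first' simply returns whichever row happens to appear first in each group:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Year': [2010, 2010, 2011, 2011],
    'Name': ['B', 'A', 'D', 'C'],
    'Gross': [20.0, 30.0, 5.0, 15.0],
})

# unsorted: 'first' picks the first row as stored, not the top earner
unsorted_first = df_demo.groupby('Year').Name.transform('first')

# sort so the highest-grossing film leads each year, then take 'first'
df_sorted = df_demo.sort_values(['Year', 'Gross'], ascending=[True, False])
df_sorted['top_film'] = df_sorted.groupby('Year').Name.transform('first')
```

After sorting, every row in a given year is tagged with that year's highest-grossing film.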

Transform is yet one more way to do cool pandas one-liner operations on your dataframes. Give it a whirl!


© 2024 DadOverflow.com
