Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

Tag: jupyter_notebook (Page 7 of 17)

Grouping moving averages with Pandas

A friend of mine posed a challenge to me recently: how do you calculate a moving average on a field, by group, and add the calculation as a new column back to the original dataframe?

Moving averages calculate an average of a value over a range of time as that “window” shifts over time. They’re often used to smooth out fluctuations in real data.

For example, let’s take a look at the COVID-19 data I used in my last post. Recall what my Ohio dataframe (df_ohio) looked like:

df_ohio.head()

Before I can even think about calculating moving averages on this data, I need to first tidy it up a bit, but pandas makes that pretty easy:

date_cols = df_ohio.columns.tolist()
rename_cols = {'variable': 'obs_date', 'value': 'confirmed_cases'}

df_ohio_tidy = pd.melt(df_ohio.reset_index(), id_vars=['county'], value_vars=date_cols).rename(columns=rename_cols)
df_ohio_tidy['obs_date'] = pd.to_datetime(df_ohio_tidy.obs_date)

df_ohio_tidy = df_ohio_tidy.set_index('obs_date')
df_ohio_tidy

Now, I’m ready to calculate moving averages. The pandas rolling function is generally used for that purpose. It’s quite a powerful and versatile function, so be sure to check out the documentation. Normally, I just draw the moving average values in a chart along side the actual observations:

fig, ax = plt.subplots(figsize=(8,8))
rename_col = {'confirmed_cases': '7 day moving avg'}
title = 'Confirmed COVID-19 cases in Cuyahoga, Ohio as of {0:%d %b %Y}'.format(df_ohio_tidy.index.max())

_ = df_ohio_tidy[df_ohio_tidy.county=='Cuyahoga, Ohio'][['confirmed_cases']].plot(ax=ax, title=title)
_ = df_ohio_tidy[df_ohio_tidy.county=='Cuyahoga, Ohio'][['confirmed_cases']].rolling(7).mean().\
    rename(columns=rename_col).plot(ax=ax, color='gray')

_ = ax.set_ylabel('Confirmed Cases')

But in this case, I need to calculate moving averages for each county in Ohio and add those calculations to the dataframe as a new column. For this, I use a combination of the rolling function and the equally powerful transform function. With help from this post, pandas has no issue doing that (in one line, no less):

df_ohio_tidy['7ma'] = df_ohio_tidy.groupby('county').confirmed_cases.transform(lambda c: c.rolling(7).mean())

Now, let’s do some spot checking to make sure the results are as expected:

df_ohio_tidy.sort_values(['county', 'obs_date']).iloc[1170:1190,:]

Above, we can see that the 7 day moving average for Crawford County stops at the last entry for Crawford County on March 30 and resets to start calculating for Cuyahoga County.

df_ohio_tidy.sort_values(['county', 'obs_date']).iloc[1235:1250,:]

Above, we spot check the change from Cuyahoga County to Darke County. Again, the calculation for Cuyahoga County stops with the last entry on March 30 and starts over calculating on Darke County.

So, yes, he can both calculate and group the moving average, Mr. Waturi! All the code behind my posts on the COVID-19 data can be found here.

Making Music with Pandas

This year I’ve started taking guitar lessons. While I’m anxious to jump into learning a bunch of songs, my instructor is keen on me developing foundational knowledge in music theory, scales, modes, and so forth–which I’m perfectly fine with, as well.

So far, we’ve covered several ways to play major scales, the pentatonic minor scale, and the natural minor scale. We also talked about scale “relatives:” how every major scale has a minor scale and every minor scale is a subset of a major scale, the two being relatives of each other.

My instructor then gave me this assignment: play any major scale from the low E string to the high E string, transition into the scale’s relative minor by dropping down three frets, and finish playing out the relative minor scale.

As I’ve been practicing this task, though, I often find myself off by a fret. I have to ask myself, “self, what major scale did you start in? C major? So why are you playing the G# natural minor scale?”

What would really help my practice is to have a handy cheatsheet to show me all the notes in each major scale and highlight the relative minor scale of each major. I could write it all out by hand, but why do that when I have Python and Pandas at my disposal! Here’s what I came up with:

Import my packages

I really only need pandas for this work:

import pandas as pd

Generate the twelve major scales

Here’s the code I came up with to calculate all the notes in each scale. Each scale consists of 15 notes spanning three octaves:

# make up my list of notes
chromatic_scale_ascending = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
# since I usually start on the low E string, rearrange the notes starting on E
scale_from_e = (chromatic_scale_ascending + chromatic_scale_ascending)[4:16]

# the scale pattern:
# root, whole step, whole step, half step, whole step, whole step, whole step, half step
key_steps = [2, 2, 1, 2, 2, 2]  # on the guitar, a whole step is two frets
major_keys = []
for root in scale_from_e:
    three_octaves = scale_from_e * 3
    steps_from_root = three_octaves.index(root)
    major_scale = [root]
    # construct the unique notes in the scale
    for step in key_steps:
        steps_from_root += step
        major_scale.append(three_octaves[steps_from_root])
        
    # span the scale across 3 octaves
    major_keys.append(major_scale * 2 + [root])

Drop the scales into Pandas for the looks

Writing my list of lists to a pandas dataframe and then writing that dataframe out in a jupyter notebook makes everything look nice. More importantly, I can use the style function in pandas to highlight the relative minor scales of each major scale:

df_major_keys = pd.DataFrame(major_keys)

# use this function to highlight the relative minor scales in orange
def highlight_natural_minor(data):
    df = data.copy()
    df.iloc[:,5:13] = 'background-color: orange'
    return df

df_major_keys.style.apply(highlight_natural_minor, axis=None)

…and here’s my handy major/minor scale cheatsheet:

My major-relative-minor-scale cheatsheet

Column 0 is the tonic/root of the major scale while columns 5 through 12 represent the relative minor scale of that major. So we can see that that the E major scale contains the C# minor scale. For example, Ozzy’s Crazy Train apparently moves between A major and F# minor scales which sound just great together–assuming you ignore Ozzy’s personal eccentricities.

So here’s a cool way to merge my interests in music and Python and Pandas into one large mash of goodness.

divmod, for the win!

I had a situation recently where I had a list of values laid out in a grid like so:

I had to figure out the row and column positions for each value.

So, let’s start with a list of numbers:

some_list = [i for i in range(15)]

First, how can I easily figure out what row each number belongs to? If you said “mod,” you’d be right! You take the mod of the number divided by the size of the group: in this case, 5:

group_size = 5
for n in some_list:
    print('Number {0} belongs to row {1}'.format(n, n % group_size))
Number 0 belongs to row 0
Number 1 belongs to row 1
Number 2 belongs to row 2
Number 3 belongs to row 3
Number 4 belongs to row 4
Number 5 belongs to row 0
Number 6 belongs to row 1
Number 7 belongs to row 2
Number 8 belongs to row 3
Number 9 belongs to row 4
Number 10 belongs to row 0
Number 11 belongs to row 1
Number 12 belongs to row 2
Number 13 belongs to row 3
Number 14 belongs to row 4

Now, how do I figure out what column each value belongs to? For that, I need to divide each number by the group size and take the int portion of the value. An easier way to do that is to use Python floor division:

group_size = 5
for n in some_list:
    print('Number {0} belongs to column {1}'.format(n, n // group_size))
Number 0 belongs to column 0
Number 1 belongs to column 0
Number 2 belongs to column 0
Number 3 belongs to column 0
Number 4 belongs to column 0
Number 5 belongs to column 1
Number 6 belongs to column 1
Number 7 belongs to column 1
Number 8 belongs to column 1
Number 9 belongs to column 1
Number 10 belongs to column 2
Number 11 belongs to column 2
Number 12 belongs to column 2
Number 13 belongs to column 2
Number 14 belongs to column 2

But I really need both the row and column values together. Sure, I could write my mod operation on one line and my floor division operation on another, but Python has a cool function to do both at the same time, divmod:

group_size = 5
for n in some_list:
    col, row = divmod(n, group_size)
    print('Number {0} belongs at row {1}, column {2}'.format(n, row, col))
Number 0 belongs at row 0, column 0
Number 1 belongs at row 1, column 0
Number 2 belongs at row 2, column 0
Number 3 belongs at row 3, column 0
Number 4 belongs at row 4, column 0
Number 5 belongs at row 0, column 1
Number 6 belongs at row 1, column 1
Number 7 belongs at row 2, column 1
Number 8 belongs at row 3, column 1
Number 9 belongs at row 4, column 1
Number 10 belongs at row 0, column 2
Number 11 belongs at row 1, column 2
Number 12 belongs at row 2, column 2
Number 13 belongs at row 3, column 2
Number 14 belongs at row 4, column 2

But now let’s get more real and use this feature to write out one of the greatest catalogs of all time: the albums of “Weird Al” Yankovic:

import matplotlib.style as style
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline
style.use('seaborn-poster')

group_size = 5
albums = ['"Weird Al" Yankovic (1983)', 
          '"Weird Al" Yankovic in 3-D (1984)',
          'Dare to Be Stupid (1985)',
          'Polka Party! (1986)',
          'Even Worse (1988)',
          'Peter and the Wolf (1988)',
          'UHF - Original Motion Picture\nSoundtrack and Other Stuff (1989)',
          'Off the Deep End (1992)',
          'Alapalooza (1993)',
          'Bad Hair Day (1996)',
          'Running with Scissors (1999)',
          'Poodle Hat (2003)',
          'Straight Outta Lynwood (2006)',
          'Alpocalypse (2011)',
          'Mandatory Fun (2014)']

# set up my grid chart
fig, ax = plt.subplots()
ax.set_xticks(np.arange(0, (len(albums)/group_size) + 1))
ax.set_yticks(np.arange(0, group_size + 1))
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_title('The Catalog of "Weird Al" Yankovic')
plt.grid()

# now, enumerate through the album list and use divmod to get row and column values to write out the album names
for i, album in enumerate(albums):
    col, row = divmod(i, group_size)
    ax.annotate(album, xy=(col+.1, row+.4), xytext=(col+.1, row+.4))

plt.show()

So, divmod: love it, use it! Check out my full code here.

« Older posts Newer posts »

© 2024 DadOverflow.com

Theme by Anders NorenUp ↑