
Recreating a 3d bar chart

As I try to educate myself on the new hotness of large language models, I ran across this post and associated chart and decided I wanted to see if I could recreate that chart programmatically with matplotlib:

A chart on LLMs and the datasets on which they were trained

I got “mostly” there. Here’s what I did.

Step 1: assemble the data

Is this data (the models, what datasets they trained on, and how big those datasets were) already assembled somewhere on the internet? I don’t know, so I just recreated it by hand by looking at the chart:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


data = [{'model': 'GPT-1', 'dataset':'Wikipedia', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'Books', 'size_gb':5}, 
        {'model': 'GPT-1', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'Reddit links', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'CC', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Wikipedia', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Books', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Reddit links', 'size_gb':40}, 
        {'model': 'GPT-2', 'dataset':'CC', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'GPT-3', 'dataset':'Wikipedia', 'size_gb':11}, 
        {'model': 'GPT-3', 'dataset':'Books', 'size_gb':21}, 
        {'model': 'GPT-3', 'dataset':'Academic journals', 'size_gb':101}, 
        {'model': 'GPT-3', 'dataset':'Reddit links', 'size_gb':50}, 
        {'model': 'GPT-3', 'dataset':'CC', 'size_gb':570}, 
        {'model': 'GPT-3', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Wikipedia', 'size_gb':6}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Books', 'size_gb':118}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Academic journals', 'size_gb':244}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Reddit links', 'size_gb':63}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'CC', 'size_gb':227}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Other', 'size_gb':167}, 
        {'model': 'Megatron-11B', 'dataset':'Wikipedia', 'size_gb':11}, 
        {'model': 'Megatron-11B', 'dataset':'Books', 'size_gb':5}, 
        {'model': 'Megatron-11B', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'Megatron-11B', 'dataset':'Reddit links', 'size_gb':38}, 
        {'model': 'Megatron-11B', 'dataset':'CC', 'size_gb':107}, 
        {'model': 'Megatron-11B', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'MT-NLG', 'dataset':'Wikipedia', 'size_gb':6}, 
        {'model': 'MT-NLG', 'dataset':'Books', 'size_gb':118}, 
        {'model': 'MT-NLG', 'dataset':'Academic journals', 'size_gb':77}, 
        {'model': 'MT-NLG', 'dataset':'Reddit links', 'size_gb':63}, 
        {'model': 'MT-NLG', 'dataset':'CC', 'size_gb':983}, 
        {'model': 'MT-NLG', 'dataset':'Other', 'size_gb':127}, 
        {'model': 'Gopher', 'dataset':'Wikipedia', 'size_gb':12}, 
        {'model': 'Gopher', 'dataset':'Books', 'size_gb':2100}, 
        {'model': 'Gopher', 'dataset':'Academic journals', 'size_gb':164}, 
        {'model': 'Gopher', 'dataset':'Reddit links', 'size_gb':None}, 
        {'model': 'Gopher', 'dataset':'CC', 'size_gb':3450}, 
        {'model': 'Gopher', 'dataset':'Other', 'size_gb':4823}, 
        {'model': 'GPT-4', 'dataset':'Wikipedia', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Books', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Reddit links', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'CC', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Other', 'size_gb':None}]

df = pd.DataFrame(data).fillna(0)
df.head()

Note that I converted all NaN values to 0: bad things happen when you try to chart a NaN.

Step 2: code my chart

I borrowed pretty heavily from matplotlib’s 3d bar chart examples. Some points to consider:

  • The datasets are listed along the X axis. In a 3d bar chart, 0 on the X axis sits at the far left and the values increase toward the intersection with the Y axis. To replicate the chart, then, the “Other” dataset needed to land at about position 0 while the “Wikipedia” dataset needed to land at about position 5. However, I assembled the data in my dataframe starting with “Wikipedia” and continuing on up to “Other”, so when it came time to chart the datasets, I had to reverse their order with a handy [::-1] slice. I also had to reverse the colors associated with the datasets.
  • I’m always learning new things about Python with these types of exercises. How cool is the pandas iat indexer? (Not to mention the endless supply of numpy functions I’d never heard of before.)
  • In my dataframe, my X and Y values are categories: datasets and models, respectively. However, bar3d expects numeric coordinates, so I had to plot against placeholder integer positions and later replace those tick labels with the real category names.
  • Setting the width and depth to 0.9 helps create padding between the bars.
  • I had to play with the rotation property of my tick labels to get them to align more or less properly with the chart. I did this by eye, but I wonder if there’s a way to align them exactly.

So, here’s my code:

fig, ax = plt.subplots(subplot_kw={'projection': '3d'}, figsize=(6, 6))

models = df.model.unique().tolist()
datasets = df.dataset.unique().tolist()[::-1]
top = [df.loc[(df.model==m) & (df.dataset==d), 'size_gb'].iat[0] for m in models for d in datasets]
bottom = np.zeros_like(top)

# bar3d doesn't seem to like categories for x and y values
_x, _y = np.meshgrid(np.arange(len(datasets)), np.arange(len(models)))
x, y = _x.ravel(), _y.ravel()
width = depth = 0.9  # allows a little bit of padding between bars

ds_colors = ['greenyellow','fuchsia','turquoise','orangered','dodgerblue','goldenrod'][::-1]
#colors = list(np.array([[c]*len(models) for c in ds_colors]).flat)
colors = ds_colors * len(models)

_ = ax.bar3d(x, y, bottom, width, depth, top, color=colors, shade=True)
_ = ax.set_yticklabels(['', ''] + models + [''], rotation=-20, ha='left')
_ = ax.set_xticklabels([''] + datasets + ['', ''], rotation=50)

# annotate the size numbers onto the bars
for x1, y1, z1 in zip(x, y, top):
    if z1 > 0:
        _ = ax.text(x1, y1, z1, str(int(z1)), horizontalalignment='left', verticalalignment='bottom')

My attempt at the LLM chart

Ok. There’s still a lot to be done here. The Z axis and Z gridlines should go. The size annotations could be better aligned. The scale probably needs to be recalculated so that the 4823 value isn’t hiding all the other values. All the 0 length bars should disappear altogether. And a legend might be nice.
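A couple of those cleanups are quick wins. Here’s a minimal sketch (on toy data rather than the dataframe above) showing two of them: dropping zero-length bars with a boolean mask before calling bar3d, and suppressing the Z ticks and gridlines:

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(subplot_kw={'projection': '3d'}, figsize=(6, 6))

# toy bar positions and heights; two of the bars are zero-length
x = np.array([0, 1, 2, 0, 1, 2])
y = np.array([0, 0, 0, 1, 1, 1])
top = np.array([5.0, 0.0, 12.0, 0.0, 7.0, 3.0])

# keep only the bars with a positive height
mask = top > 0
ax.bar3d(x[mask], y[mask], np.zeros(mask.sum()), 0.9, 0.9, top[mask], shade=True)

# suppress the Z ticks and gridlines
ax.set_zticks([])
ax.grid(False)
```

The same mask trick would work on the real data: filter x, y, and top in lockstep so the empty (model, dataset) combinations never get drawn.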

Anyway, I think I’ve accomplished the crux of what I set out to do, so I’ll leave it there. Hope this helps with some of your 3d charting endeavors!

Matplotlib markers

I’ve used Matplotlib for years and yet I’m always discovering new features. Recently, I was working with some data with date ranges and thought maybe visualizing that data in something like a gantt chart might help me understand it better. When it came time to jazz up the chart with markers, I was really impressed at the options.

For illustration, suppose I have a dataset like this:

import pandas as pd
import matplotlib.pyplot as plt


data = [{'name': 'Larry', 'start_dt': '2023-01-05', 'end_dt': '2023-01-09'}, 
        {'name': 'Moe', 'start_dt': '2023-01-07', 'end_dt': '2023-01-12'},
        {'name': 'Curly', 'start_dt': '2023-01-02', 'end_dt': '2023-01-07'},
        {'name': 'Al', 'start_dt': '2023-01-12', 'end_dt': '2023-01-15'},
        {'name': 'Peggy', 'start_dt': '2023-01-04', 'end_dt': '2023-01-09'},
        {'name': 'Kelly', 'start_dt': '2023-01-08', 'end_dt': '2023-01-12'},
        {'name': 'Bud', 'start_dt': '2023-01-11', 'end_dt': '2023-01-14'}]

df = pd.DataFrame(data)
df['start_dt'] = pd.to_datetime(df.start_dt)
df['end_dt'] = pd.to_datetime(df.end_dt)

Is there an elegant way in pandas to expand the dataset to add a record for each day a person worked? I couldn’t think of any, so I just looped over the dataframe and did it the hard way:

fig, ax = plt.subplots(figsize=(10, 6))
new_ylabels = ['']

for i, r in df.sort_values('name').reset_index(drop=True).iterrows():
    work_dates = pd.date_range(start=r['start_dt'], end=r['end_dt'], freq='1D')
    df_temp = pd.DataFrame([{'name': r['name'], 'ypos': i+1}], index=work_dates)
    _ = df_temp.plot(marker='d', markevery=[0,-1], markersize=5.0, ax=ax)
    new_ylabels.append(r['name'])
    
_ = ax.get_legend().remove()
_ = ax.set(yticklabels=new_ylabels)
_ = ax.set_title('Lame gantt-type chart')

This produced a pretty nifty gantt-type chart with the timelines from my dataset:
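On the question above of whether there’s an elegant way to expand the dataset: one loop-free option (assuming pandas 0.25 or later, which added DataFrame.explode) is to build a list of work dates per row and then explode it into one row per day. A sketch on a couple of the rows:

```python
import pandas as pd

df = pd.DataFrame([{'name': 'Larry', 'start_dt': '2023-01-05', 'end_dt': '2023-01-09'},
                   {'name': 'Moe', 'start_dt': '2023-01-07', 'end_dt': '2023-01-12'}])
df['start_dt'] = pd.to_datetime(df.start_dt)
df['end_dt'] = pd.to_datetime(df.end_dt)

# one list of work dates per row, then one row per work date
df['work_dt'] = df.apply(lambda r: pd.date_range(r.start_dt, r.end_dt), axis=1)
df_long = df.explode('work_dt')[['name', 'work_dt']]
```

Larry’s five days and Moe’s six days expand to eleven rows, which could then feed the plotting step without the hand-rolled loop.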

The idea I want to highlight with this post is the df_temp.plot() call in the code above. I used three marker properties to craft the chart I was after:

  • marker
  • markersize
  • markevery

Marker

You set the type of marker you want with the “marker” property–and there are tons of choices. I chose a lowercase “d” to get a “thin diamond”.

Markersize

The “markersize” property does what it says: sets the size of the marker. In my experience, I’ve just had to set a value, render the chart, and then adjust-and-repeat to get the size I want.

Markevery

I was actually pretty familiar with the “marker” and “markersize” properties–having used them extensively in the past–but I was pretty excited to learn about the “markevery” property. By default, a marker will appear at every datapoint in your chart. However, gantt charts normally only mark the beginning and end of a time range and not every point in between. With the “markevery” property, all I needed to do was pass it a list of [0, -1] to tell it to mark only the first and last points in each time range.
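To see the three properties side by side, here’s a minimal sketch. Per the Line2D API, “markevery” also accepts an integer (mark every Nth point) or a slice, in addition to the list form I used:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
y = list(range(10))

# thin-diamond marker on every point (the default markevery behavior)
line_all, = ax.plot(y, marker='d', markersize=5.0)

# marker on only the first and last points: the gantt-style trick
line_ends, = ax.plot([v + 2 for v in y], marker='d', markevery=[0, -1])

# marker on every 3rd point
line_nth, = ax.plot([v + 4 for v in y], marker='d', markevery=3)
```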

These properties really helped render the chart I wanted. It’s always great to learn more about the versatility of Matplotlib!

Finding sub-ranges in my dataset

File this under: there-has-to-be-a-simpler-way-to-do-this-in-pandas-but-I-haven’t-found-what-that-is

Recently, I’ve been playing with some financial data to get a better understanding of the yield curve. Related to yield and inverted yield curves are the periods of recession in the US economy. In my work, I wanted to first build a chart that indicated the periods of recession and ultimately overlay that with yield curve data. Little did I realize the challenge of just coding that first part.

I downloaded a dataset of recession data, which contains a record for every calendar quarter from the 1960s to present day and a 0 or 1 to indicate whether the economy was in recession for that quarter (“1” indicating that it was). What I needed to do was pull all the records with a “1” indicator and find the start and end dates of each of those ranges so that I could paint them onto a chart.

I’ve heard it said before that any time you have to write a loop over your pandas dataframe, you’re probably doing it wrong. I’m certainly doing a loop here and I have a nagging suspicion there’s probably a more elegant way to achieve the solution. Nevertheless, here’s what I came up with to solve my recession chart problem:

Step 1: Bring in the necessary packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import date  # used in Step 5 to set the x-axis limits

%matplotlib inline  # for easy chart display in jupyter notebook

Step 2: Load in my downloaded recession dataset and take a peek

# recession dates: https://fred.stlouisfed.org/series/JHDUSRGDPBR
df_recessions = pd.read_csv('./data/JHDUSRGDPBR_20220327.csv')

df_recessions['DATE'] = pd.to_datetime(df_recessions.DATE)
df_recessions.head()

The first records of the Recession dataset

df_recessions[df_recessions.JHDUSRGDPBR==1.0].head()

The first records in the dataset where the economy was in recession

Step 3: Mark the start of every period of recession in the dataset

So, now I’m asking myself: how do I extract the start and stop dates for every period of recession identified in the dataset? Let’s start by finding just the start dates. That shouldn’t be too difficult: if I filter down to just the recession quarters and calculate the date difference from one row to the next, then any difference greater than three months (I estimated three months as 93 days) means there was a gap in quarters before the current record; in other words, the current record starts a new recession. Here’s what I came up with (one further note: my yield curve data only starts in 1990, so I filtered the recession data to 1990 and later):

df_spans = df_recessions[(df_recessions.DATE.dt.year>=1990) & (df_recessions.JHDUSRGDPBR==1.0)].copy()
df_spans['days_elapsed'] = df_spans.DATE - df_spans.shift(1).DATE
df_spans['ind'] = df_spans.days_elapsed.dt.days.apply(lambda d: 's' if d > 93 else '')
df_spans.iloc[0, 3] = 's'  # mark first row as a recession start
df_spans

“s” indicates the start of a new recession

Step 4: Find the end date of each recession

Here’s where my approach starts to go off the rails a little. The only way I could think to find the end dates of each recession is to:

  1. Loop through a list of the start dates
  2. In each loop, get the next start date and then grab the date of the record immediately before that one
  3. When I hit the last loop, just consider the last record to be the end date of the most recent recession
  4. With every stop date, add three months since the stop date is only the first day of the quarter and, presumably, the recession more or less lasts the entire quarter

Confusing? Here’s my code:

start_stop_dates = []
start_dates = df_spans.loc[df_spans.ind=='s', ].DATE.tolist()

for i, start_date in enumerate(start_dates):
    if i < len(start_dates)-1:
        stop_date = df_spans.loc[df_spans.DATE < start_dates[i+1]].iloc[-1].DATE
    else:
        stop_date = df_spans.iloc[-1].DATE
        
    # add 3 months to each stop date to stretch the value to the full quarter
    # (use pd.DateOffset for month arithmetic; newer versions of pandas reject
    # month-unit timedelta64 values)
    start_stop_dates.append((start_date, stop_date + pd.DateOffset(months=3)))
    
start_stop_dates

Recessions from 1990 to the beginning of 2022

Step 5: Build my chart

With that start/stop list, I can build my underlying recession chart:

fig, ax = plt.subplots(figsize=(12,6))

_ = ax.plot()
_ = ax.set_xlim([date(1990, 1, 1), date(2022, 4, 1)])
_ = ax.set_ylim([0, 10])

for st, sp in start_stop_dates:
    _ = ax.axvspan(st, sp, alpha=0.2, color='gray')

US Recessions: 1990 – 2021

Phew. All that work and I’m only at the starting point of my yield curve exploration, but that will have to wait for a future post. However, if you can think of a more elegant way to identify these date ranges without having to resort to looping, I’d love to hear it!
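Since I asked: one loop-free approach would be to flag each recession quarter whose gap from the previous recession quarter exceeds one quarter, take a cumulative sum of those flags to give every contiguous run its own label, and then group by that label to pull each run’s start and stop dates. A sketch on toy data (the column names mirror the FRED series above, and pd.DateOffset stands in for the three-month stretch):

```python
import pandas as pd

# toy quarterly flags standing in for the downloaded FRED series
dates = pd.date_range('1990-01-01', periods=12, freq='QS')
flags = [0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1]
df = pd.DataFrame({'DATE': dates, 'JHDUSRGDPBR': flags})

rec = df[df.JHDUSRGDPBR == 1]

# a gap of more than ~one quarter (93 days) starts a new run;
# the cumulative sum gives every run its own label
run_id = rec.DATE.diff().dt.days.gt(93).cumsum()

# first/last date of each run, stretching the stop date to the full quarter
spans = rec.groupby(run_id).DATE.agg(['min', 'max'])
spans['max'] = spans['max'] + pd.DateOffset(months=3)
start_stop_dates = list(spans.itertuples(index=False, name=None))
```

The diff/gt/cumsum chain does the same job as the “s” markers and the loop in Steps 3 and 4, but entirely in pandas.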


© 2024 DadOverflow.com
