Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Interesting way to use Pandas to_datetime

I do a lot of work with timestamps in Pandas dataframes and tend to use the to_datetime function quite a lot to cast my timestamps, usually read in as strings, into proper datetime objects. Here’s a simple scenario where I cast my string timestamp column into a datetime:

import pandas as pd

d = [{'ts': '2023-01-01 12:05:00'}, {'ts': '2023-01-09 13:23:00'}, {'ts': '2023-01-11 08:37:00'}, {'ts': '2023-01-13 15:45:00'}]
df = pd.DataFrame(d)
pd.to_datetime(df.ts)

In the above, I explicitly pass the column “ts” to the function. However, I recently discovered another way to use to_datetime where you don’t have to be so explicit:

d1 = [{'year':2023,'month':1, 'day': 1},{'year':2023,'month':2, 'day': 9},{'year':2023,'month':3, 'day': 11},{'year':2023,'month':4, 'day': 13}]
df1 = pd.DataFrame(d1)
pd.to_datetime(df1)

Passing a dataframe to the function with columns named year, month, and day seems to be enough for the function to do its thing. That’s pretty cool!
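As a quick aside, the function also appears to accept additional time components alongside year, month, and day, such as hour and minute columns. Here is a minimal sketch (the data is made up for illustration):

```python
import pandas as pd

# to_datetime can assemble timestamps from component columns;
# hour and minute are optional extras alongside year/month/day
d = [{'year': 2023, 'month': 1, 'day': 1, 'hour': 12, 'minute': 5},
     {'year': 2023, 'month': 1, 'day': 9, 'hour': 13, 'minute': 23}]
df = pd.DataFrame(d)
ts = pd.to_datetime(df)
```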

The Assign function: at long last

I knew there had to be a way to add new columns inline (in a chain of commands) to a Pandas dataframe. The assign function is a way to do that.

Suppose I only have year and month columns in my dataframe. I can use assign to add a day column and perform my datetime conversion:

d2 = [{'year':2023,'month':1},{'year':2023,'month':2},{'year':2023,'month':3},{'year':2023,'month':4}]
df2 = pd.DataFrame(d2)
pd.to_datetime(df2.assign(day=1))

to_period: another useful datetime-related function

Recently I was working with a dataset where the events only had year/month values. The to_period function is there to help with such situations:

d3 = [{'year':2023,'month':1},{'year':2023,'month':2},{'year':2023,'month':3},{'year':2023,'month':4}]
df3 = pd.DataFrame(d3)
pd.to_datetime(df3.assign(day=1)).dt.to_period('M')
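And if you ever need to hop back from periods to timestamps, the dt accessor offers a to_timestamp function as well. A small sketch:

```python
import pandas as pd

df = pd.DataFrame([{'year': 2023, 'month': 1}, {'year': 2023, 'month': 2}])
# cast to monthly periods, as above
periods = pd.to_datetime(df.assign(day=1)).dt.to_period('M')
# to_timestamp converts each period back to the timestamp at its start
back = periods.dt.to_timestamp()
```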

How intuitive is to_datetime?

Just how intuitive is this less explicit way of using to_datetime? Can it read and cast month names? The answer is: it depends. Pandas version 1.3.5 doesn’t like month names:

d4 = [{'year':2023,'month':'January'},{'year':2023,'month':'February'},{'year':2023,'month':'March'},{'year':2023,'month':'April'}]
df4 = pd.DataFrame(d4)
pd.to_datetime(df4.assign(day=1)).dt.to_period('M')

However, I’ve found that earlier versions of Pandas will successfully parse month names. If you’re on a version that raises these value errors, you’ll have to add a line of code to convert your month names to their associated numeric values:

from datetime import datetime

d4 = [{'year':2023,'month':'January'},{'year':2023,'month':'February'},{'year':2023,'month':'March'},{'year':2023,'month':'April'}]
df4 = pd.DataFrame(d4)
df4['month'] = df4.month.apply(lambda m: datetime.strptime(m, '%B').month)
pd.to_datetime(df4.assign(day=1)).dt.to_period('M')
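Alternatively, you can sidestep the component-column approach entirely by building a date string and handing to_datetime an explicit format, where %B matches full month names. This is just a sketch of one workaround, not the only one:

```python
import pandas as pd

d = [{'year': 2023, 'month': 'January'}, {'year': 2023, 'month': 'February'}]
df = pd.DataFrame(d)
# build strings like "2023-January-1" and parse them with %B for the month name
ts = pd.to_datetime(df.year.astype(str) + '-' + df.month + '-1', format='%Y-%B-%d')
per = ts.dt.to_period('M')
```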

So, here are yet more ways to leverage Pandas with your timestamps!

Recreating a 3d bar chart

As I try to educate myself on the new hotness of large language models, I ran across this post and associated chart and decided I wanted to see if I could recreate that chart programmatically with matplotlib:

A chart on LLMs and the datasets on which they were trained

I got “mostly” there. Here’s what I did.

Step 1: assemble the data

Is this data (the models, what datasets they trained on, and how big those datasets were) already assembled somewhere on the internet? I don’t know, so I just recreated it by hand from looking at the chart:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


data = [{'model': 'GPT-1', 'dataset':'Wikipedia', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'Books', 'size_gb':5}, 
        {'model': 'GPT-1', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'Reddit links', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'CC', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Wikipedia', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Books', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Reddit links', 'size_gb':40}, 
        {'model': 'GPT-2', 'dataset':'CC', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'GPT-3', 'dataset':'Wikipedia', 'size_gb':11}, 
        {'model': 'GPT-3', 'dataset':'Books', 'size_gb':21}, 
        {'model': 'GPT-3', 'dataset':'Academic journals', 'size_gb':101}, 
        {'model': 'GPT-3', 'dataset':'Reddit links', 'size_gb':50}, 
        {'model': 'GPT-3', 'dataset':'CC', 'size_gb':570}, 
        {'model': 'GPT-3', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Wikipedia', 'size_gb':6}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Books', 'size_gb':118}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Academic journals', 'size_gb':244}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Reddit links', 'size_gb':63}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'CC', 'size_gb':227}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Other', 'size_gb':167}, 
        {'model': 'Megatron-11B', 'dataset':'Wikipedia', 'size_gb':11}, 
        {'model': 'Megatron-11B', 'dataset':'Books', 'size_gb':5}, 
        {'model': 'Megatron-11B', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'Megatron-11B', 'dataset':'Reddit links', 'size_gb':38}, 
        {'model': 'Megatron-11B', 'dataset':'CC', 'size_gb':107}, 
        {'model': 'Megatron-11B', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'MT-NLG', 'dataset':'Wikipedia', 'size_gb':6}, 
        {'model': 'MT-NLG', 'dataset':'Books', 'size_gb':118}, 
        {'model': 'MT-NLG', 'dataset':'Academic journals', 'size_gb':77}, 
        {'model': 'MT-NLG', 'dataset':'Reddit links', 'size_gb':63}, 
        {'model': 'MT-NLG', 'dataset':'CC', 'size_gb':983}, 
        {'model': 'MT-NLG', 'dataset':'Other', 'size_gb':127}, 
        {'model': 'Gopher', 'dataset':'Wikipedia', 'size_gb':12}, 
        {'model': 'Gopher', 'dataset':'Books', 'size_gb':2100}, 
        {'model': 'Gopher', 'dataset':'Academic journals', 'size_gb':164}, 
        {'model': 'Gopher', 'dataset':'Reddit links', 'size_gb':None}, 
        {'model': 'Gopher', 'dataset':'CC', 'size_gb':3450}, 
        {'model': 'Gopher', 'dataset':'Other', 'size_gb':4823}, 
        {'model': 'GPT-4', 'dataset':'Wikipedia', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Books', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Reddit links', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'CC', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Other', 'size_gb':None}]

df = pd.DataFrame(data).fillna(0)
df.head()

Note that I converted all NaN values to 0: bad things happen when you try to chart a NaN.

Step 2: code my chart

I borrowed pretty heavily from matplotlib’s 3d bar chart examples. Some points to consider:

  • The datasets are listed along the X axis. In a 3d bar chart, 0 on the X axis is at the furthest left and the values increase as the axis intersects the Y axis. To replicate the chart, then, the “Other” dataset would fall at around point 0 while the “Wikipedia” dataset would fall at around point 5. However, I assembled the data in my dataframe starting with “Wikipedia” and continuing on up to “Other”. So, when it came time to chart the datasets, I had to reverse their order with a handy reversing slice ([::-1]). I had to reverse the colors associated with the datasets, too.
  • I’m always learning new things about Python with these types of exercises. How cool is the Pandas iat function? (Not to mention the endless supply of numpy functions I’ve never heard of before.)
  • In my dataframe, my X and Y values are categories–datasets and models, respectively. However, matplotlib does not like plotting categories, so, I had to use placeholder index/integer values and then later replace those tick labels with the real categories.
  • Setting the width and depth to 0.9 helps create padding between the bars.
  • I had to play with the rotation property of my tick labels to get them to sort of align properly to the chart. I did this by eye, but I wonder if there’s a better way to align them exactly?

So, here’s my code:

fig, ax = plt.subplots(subplot_kw={'projection': '3d'}, figsize=(6, 6))

models = df.model.unique().tolist()
datasets = df.dataset.unique().tolist()[::-1]
top = [df.loc[(df.model==m) & (df.dataset==d), 'size_gb'].iat[0] for m in models for d in datasets]
bottom = np.zeros_like(top)

# bar3d doesn't seem to like categories for x and y values
_x, _y = np.meshgrid(np.arange(len(datasets)), np.arange(len(models)))
x, y = _x.ravel(), _y.ravel()
width = depth = 0.9  # allows a little bit of padding between bars

ds_colors = ['greenyellow','fuchsia','turquoise','orangered','dodgerblue','goldenrod'][::-1]
#colors = list(np.array([[c]*len(models) for c in ds_colors]).flat)
colors = ds_colors * len(models)

_ = ax.bar3d(x, y, bottom, width, depth, top, color=colors, shade=True)
_ = ax.set_yticklabels(['', ''] + models + [''], rotation=-20, ha='left')
_ = ax.set_xticklabels([''] + datasets + ['', ''], rotation=50)

# annotate the size numbers onto the bars
for x1, y1, z1 in zip(x, y, top):
    if z1 > 0:
        _ = ax.text(x1, y1, z1, str(int(z1)), horizontalalignment='left', verticalalignment='bottom')

My attempt at the LLM chart

Ok. There’s still a lot to be done here. The Z axis and Z gridlines should go. The size annotations could be better aligned. The scale probably needs to be recalculated so that the 4823 value isn’t hiding all the other values. All the 0 length bars should disappear altogether. And a legend might be nice.
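For instance, making the zero-length bars disappear could be as simple as masking them out before calling bar3d. A sketch with toy data (the variable names mirror my code above, but the values here are made up):

```python
import numpy as np

# toy stand-ins for the x, y, and top arrays built above
x = np.arange(6)
y = np.zeros(6, dtype=int)
top = np.array([0, 5, 0, 21, 101, 0])

# keep only the bars with a positive height
mask = top > 0
x, y, top = x[mask], y[mask], top[mask]
# ax.bar3d(x, y, np.zeros_like(top), 0.9, 0.9, top, ...) would then skip the empty bars
```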

Anyway, I think I’ve accomplished the crux of what I set out to do, so I’ll leave it there. Hope this helps with some of your 3d charting endeavors!

Matplotlib markers

I’ve used Matplotlib for years and yet I’m always discovering new features. Recently, I was working with some data with date ranges and thought maybe visualizing that data in something like a gantt chart might help me understand it better. When it came time to jazz up the chart with markers, I was really impressed at the options.

For illustration, suppose I have a dataset like this:

import pandas as pd
from datetime import timedelta
import matplotlib.pyplot as plt


data = [{'name': 'Larry', 'start_dt': '2023-01-05', 'end_dt': '2023-01-09'}, 
        {'name': 'Moe', 'start_dt': '2023-01-07', 'end_dt': '2023-01-12'},
        {'name': 'Curly', 'start_dt': '2023-01-02', 'end_dt': '2023-01-07'},
        {'name': 'Al', 'start_dt': '2023-01-12', 'end_dt': '2023-01-15'},
        {'name': 'Peggy', 'start_dt': '2023-01-04', 'end_dt': '2023-01-09'},
        {'name': 'Kelly', 'start_dt': '2023-01-08', 'end_dt': '2023-01-12'},
        {'name': 'Bud', 'start_dt': '2023-01-11', 'end_dt': '2023-01-14'}]

df = pd.DataFrame(data)
df['start_dt'] = pd.to_datetime(df.start_dt)
df['end_dt'] = pd.to_datetime(df.end_dt)

Is there an elegant way in pandas to expand the dataset to add a record for each day a person worked? I couldn’t think of one at the time, so I just looped over the dataframe and did it the hard way:

fig, ax = plt.subplots(figsize=(10, 6))
new_ylabels = ['']

for i, r in df.sort_values('name').reset_index(drop=True).iterrows():
    work_dates = pd.date_range(start=r['start_dt'], end=r['end_dt'], freq='1D')
    df_temp = pd.DataFrame([{'name': r['name'], 'ypos': i+1}], index=work_dates)
    _ = df_temp.plot(marker='d', markevery=[0,-1], markersize=5.0, ax=ax)
    new_ylabels.append(r['name'])
    
_ = ax.get_legend().remove()
_ = ax.set(yticklabels=new_ylabels)
_ = ax.set_title('Lame gantt-type chart')

This produced a pretty nifty gantt-type chart with the timelines from my dataset:
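In hindsight, one more vectorized approach (a sketch, assuming a pandas version new enough to have explode) is to build a date_range per row and then explode it into one row per day:

```python
import pandas as pd

data = [{'name': 'Larry', 'start_dt': '2023-01-05', 'end_dt': '2023-01-09'},
        {'name': 'Moe', 'start_dt': '2023-01-07', 'end_dt': '2023-01-12'}]
df = pd.DataFrame(data)
df['start_dt'] = pd.to_datetime(df.start_dt)
df['end_dt'] = pd.to_datetime(df.end_dt)

# build the list of days each person worked, then explode to one row per day
expanded = (df
            .assign(work_date=df.apply(lambda r: pd.date_range(r.start_dt, r.end_dt), axis=1))
            .explode('work_date'))
```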

The idea I want to highlight with this post is the work I did in the df_temp.plot call in the code above. I used three marker properties to craft the chart I was after:

  • marker
  • markersize
  • markevery

Marker

You set the type of marker you want with the “marker” property–and there are tons of choices. I chose a lowercase “d” to get a “thin diamond”.

Markersize

The “markersize” property does what it says: sets the size of the marker. In my experience, I’ve just had to set a value, render the chart, and then adjust-and-repeat to get the size I want.

Markevery

I was actually pretty familiar with the “marker” and “markersize” properties–having used them extensively in the past–but I was pretty excited to learn about the “markevery” property. By default, a marker will appear at every datapoint in your chart. However, gantt charts normally only mark the beginning and end of a time range and not every point in between. With the “markevery” property, all I needed to do was pass it a list of [0, -1] to tell it to mark only the first and last points in each time range.
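It turns out markevery is even more flexible than a list of indices: it also accepts an integer (mark every Nth point), among other options. A minimal sketch, rendered off-screen:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# a list marks specific indices: here, just the first and last points
line_ends, = ax.plot(range(10), marker='d', markevery=[0, -1])
# an integer N marks every Nth point
line_nth, = ax.plot(range(10), marker='o', markevery=3)
```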

These properties really helped render the chart I wanted. It’s always great to learn more about the versatility of Matplotlib!


© 2024 DadOverflow.com
