DadOverflow.com

Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Recreating a 3d bar chart

While trying to educate myself on the new hotness of large language models, I ran across this post and its associated chart, and decided to see if I could recreate that chart programmatically with matplotlib:

A chart on LLMs and the datasets on which they were trained

I got “mostly” there. Here’s what I did.

Step 1: assemble the data

Is this data–the models, what datasets they trained on, and how big those datasets were–already assembled somewhere on the internet? I just recreated it by hand from looking at the chart:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


data = [{'model': 'GPT-1', 'dataset':'Wikipedia', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'Books', 'size_gb':5}, 
        {'model': 'GPT-1', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'Reddit links', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'CC', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Wikipedia', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Books', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Reddit links', 'size_gb':40}, 
        {'model': 'GPT-2', 'dataset':'CC', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'GPT-3', 'dataset':'Wikipedia', 'size_gb':11}, 
        {'model': 'GPT-3', 'dataset':'Books', 'size_gb':21}, 
        {'model': 'GPT-3', 'dataset':'Academic journals', 'size_gb':101}, 
        {'model': 'GPT-3', 'dataset':'Reddit links', 'size_gb':50}, 
        {'model': 'GPT-3', 'dataset':'CC', 'size_gb':570}, 
        {'model': 'GPT-3', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Wikipedia', 'size_gb':6}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Books', 'size_gb':118}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Academic journals', 'size_gb':244}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Reddit links', 'size_gb':63}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'CC', 'size_gb':227}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Other', 'size_gb':167}, 
        {'model': 'Megatron-11B', 'dataset':'Wikipedia', 'size_gb':11}, 
        {'model': 'Megatron-11B', 'dataset':'Books', 'size_gb':5}, 
        {'model': 'Megatron-11B', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'Megatron-11B', 'dataset':'Reddit links', 'size_gb':38}, 
        {'model': 'Megatron-11B', 'dataset':'CC', 'size_gb':107}, 
        {'model': 'Megatron-11B', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'MT-NLG', 'dataset':'Wikipedia', 'size_gb':6}, 
        {'model': 'MT-NLG', 'dataset':'Books', 'size_gb':118}, 
        {'model': 'MT-NLG', 'dataset':'Academic journals', 'size_gb':77}, 
        {'model': 'MT-NLG', 'dataset':'Reddit links', 'size_gb':63}, 
        {'model': 'MT-NLG', 'dataset':'CC', 'size_gb':983}, 
        {'model': 'MT-NLG', 'dataset':'Other', 'size_gb':127}, 
        {'model': 'Gopher', 'dataset':'Wikipedia', 'size_gb':12}, 
        {'model': 'Gopher', 'dataset':'Books', 'size_gb':2100}, 
        {'model': 'Gopher', 'dataset':'Academic journals', 'size_gb':164}, 
        {'model': 'Gopher', 'dataset':'Reddit links', 'size_gb':None}, 
        {'model': 'Gopher', 'dataset':'CC', 'size_gb':3450}, 
        {'model': 'Gopher', 'dataset':'Other', 'size_gb':4823}, 
        {'model': 'GPT-4', 'dataset':'Wikipedia', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Books', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Reddit links', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'CC', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Other', 'size_gb':None}]

df = pd.DataFrame(data).fillna(0)
df.head()

Note that I converted all NaN values to 0: bad things happen when you try to chart a NaN.
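If typing out every model/dataset pair gets tedious, one alternative is to record only the sizes you actually know and let pandas fill in the gaps. This is a minimal sketch of that idea, not how the list above was built. The names sizes, dataset_order, wide, and long_df are made up for illustration, and only three of the models are included to keep it short.

```python
import pandas as pd

# Known sizes in GB, copied from the hand-built list above (three models only).
# Any dataset missing from a model's dict ends up as NaN, then 0.
sizes = {
    'GPT-1': {'Books': 5},
    'GPT-2': {'Reddit links': 40},
    'GPT-3': {'Wikipedia': 11, 'Books': 21, 'Academic journals': 101,
              'Reddit links': 50, 'CC': 570},
}
dataset_order = ['Wikipedia', 'Books', 'Academic journals',
                 'Reddit links', 'CC', 'Other']

# Wide table: one row per model, one column per dataset
wide = (pd.DataFrame.from_dict(sizes, orient='index')
          .reindex(index=list(sizes), columns=dataset_order)  # pin row/column order
          .fillna(0)
          .rename_axis(index='model', columns='dataset'))

# Long table: one row per (model, dataset) pair, same layout as df above
long_df = wide.stack().rename('size_gb').reset_index()
print(long_df.head())
```

The resulting long_df has the same three columns (model, dataset, size_gb) as the hand-built dataframe, so the charting code should not need to change.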

Step 2: code my chart

I borrowed pretty heavily from matplotlib’s 3d bar chart examples. Some points to consider:

  • The datasets are listed along the X axis. In a 3d bar chart, 0 on the X axis is at the far left and the values increase toward the point where the X axis meets the Y axis. To replicate the chart, then, the “Other” dataset would fall at around point 0 while the “Wikipedia” dataset would fall at around point 5. However, I assembled the data in my dataframe starting with “Wikipedia” and continuing on up to “Other”. So, when it came time to chart the datasets, I had to reverse their order with a handy [::-1] slice. I had to reverse the colors associated with the datasets, too.
  • I’m always learning new things about Python with these types of exercises. How cool is the pandas iat accessor? (Not to mention the endless supply of numpy functions I’ve never heard of before.)
  • In my dataframe, my X and Y values are categories–datasets and models, respectively. However, matplotlib does not like plotting categories, so, I had to use placeholder index/integer values and then later replace those tick labels with the real categories.
  • Setting the width and depth to 0.9 helps create padding between the bars.
  • I had to play with the rotation property of my tick labels to get them to sort of align properly to the chart. I did this by eye but I wonder if there’s a better way to align them exactly?

So, here’s my code:

fig, ax = plt.subplots(subplot_kw={'projection': '3d'}, figsize=(6, 6))

models = df.model.unique().tolist()
datasets = df.dataset.unique().tolist()[::-1]
top = [df.loc[(df.model==m) & (df.dataset==d), 'size_gb'].iat[0] for m in models for d in datasets]
bottom = np.zeros_like(top)

# bar3d doesn't seem to like categories for x and y values
_x, _y = np.meshgrid(np.arange(len(datasets)), np.arange(len(models)))
x, y = _x.ravel(), _y.ravel()
width = depth = 0.9  # allows a little bit of padding between bars

ds_colors = ['greenyellow','fuchsia','turquoise','orangered','dodgerblue','goldenrod'][::-1]
#colors = list(np.array([[c]*len(models) for c in ds_colors]).flat)
colors = ds_colors * len(models)

_ = ax.bar3d(x, y, bottom, width, depth, top, color=colors, shade=True)
_ = ax.set_yticklabels(['', ''] + models + [''], rotation=-20, ha='left')
_ = ax.set_xticklabels([''] + datasets + ['', ''], rotation=50)

# annotate the size numbers onto the bars
for x1, y1, z1 in zip(x, y, top):
    if z1 > 0:
        _ = ax.text(x1, y1, z1, str(int(z1)), horizontalalignment='left', verticalalignment='bottom')

My attempt at the LLM chart

Ok. There’s still a lot to be done here. The Z axis and Z gridlines should go. The size annotations could be better aligned. The scale probably needs to be recalculated so that the 4823 value isn’t dwarfing all the other values. All the zero-height bars should disappear altogether. And a legend might be nice.
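A few items on that to-do list can be sketched fairly quickly. The example below is only one possible approach, run on a tiny stand-in dataframe with four values copied from Step 1. It filters out the zero-height bars before plotting, drops the Z ticks, and builds a legend from proxy Patch handles. It also swaps the padded tick-label lists for explicit tick positions, which is a different technique from the one used above. The bars dataframe and the ds_colors mapping are illustrative names.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import numpy as np
import pandas as pd

# Tiny stand-in for the Step 1 dataframe (values copied from it)
df = pd.DataFrame({
    'model':   ['GPT-2', 'GPT-2', 'GPT-3', 'GPT-3'],
    'dataset': ['Reddit links', 'CC', 'Reddit links', 'CC'],
    'size_gb': [40, 0, 50, 570],
})

models = df.model.unique().tolist()
datasets = df.dataset.unique().tolist()[::-1]
# Illustrative dataset-to-color mapping (same colors these two datasets get above)
ds_colors = dict(zip(datasets, ['dodgerblue', 'orangered']))

# Integer grid positions for each bar, then keep only bars with a real height
bars = df.assign(x=df.dataset.map(datasets.index), y=df.model.map(models.index))
bars = bars[bars.size_gb > 0]

fig, ax = plt.subplots(subplot_kw={'projection': '3d'}, figsize=(6, 6))
width = depth = 0.9
ax.bar3d(bars.x.to_numpy(), bars.y.to_numpy(), np.zeros(len(bars)),
         width, depth, bars.size_gb.to_numpy(),
         color=bars.dataset.map(ds_colors).tolist(), shade=True)

# Pin tick positions to the bar centers instead of padding the label lists
ax.set_xticks(np.arange(len(datasets)) + width / 2)
ax.set_xticklabels(datasets, rotation=50)
ax.set_yticks(np.arange(len(models)) + depth / 2)
ax.set_yticklabels(models, rotation=-20, ha='left')
ax.set_zticks([])  # drop the Z ticks

# bar3d does not feed the legend directly, so use one proxy patch per dataset
handles = [Patch(facecolor=c, label=d) for d, c in ds_colors.items()]
ax.legend(handles=handles, loc='upper left')
fig.savefig('llm_chart_sketch.png')
```

Because the tick positions are set explicitly, the labels stay attached to their bars even if matplotlib picks different default ticks. The rotation values would probably still need adjusting by eye.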

Anyway, I think I’ve accomplished the crux of what I set out to do, so I’ll leave it there. Hope this helps with some of your 3d charting endeavors!

Matplotlib markers

I’ve used Matplotlib for years and yet I’m always discovering new features. Recently, I was working with some data with date ranges and thought maybe visualizing that data in something like a gantt chart might help me understand it better. When it came time to jazz up the chart with markers, I was really impressed at the options.

For illustration, suppose I have a dataset like this:

import pandas as pd
from datetime import timedelta
import matplotlib.pyplot as plt


data = [{'name': 'Larry', 'start_dt': '2023-01-05', 'end_dt': '2023-01-09'}, 
        {'name': 'Moe', 'start_dt': '2023-01-07', 'end_dt': '2023-01-12'},
        {'name': 'Curly', 'start_dt': '2023-01-02', 'end_dt': '2023-01-07'},
        {'name': 'Al', 'start_dt': '2023-01-12', 'end_dt': '2023-01-15'},
        {'name': 'Peggy', 'start_dt': '2023-01-04', 'end_dt': '2023-01-09'},
        {'name': 'Kelly', 'start_dt': '2023-01-08', 'end_dt': '2023-01-12'},
        {'name': 'Bud', 'start_dt': '2023-01-11', 'end_dt': '2023-01-14'}]

df = pd.DataFrame(data)
df['start_dt'] = pd.to_datetime(df.start_dt)
df['end_dt'] = pd.to_datetime(df.end_dt)

Is there an elegant way in pandas to expand the dataset to add a record for each day a person worked? I couldn’t think of any, so I just looped over the dataframe and did it the hard way:

fig, ax = plt.subplots(figsize=(10, 6))
new_ylabels = ['']

for i, r in df.sort_values('name').reset_index(drop=True).iterrows():
    work_dates = pd.date_range(start=r['start_dt'], end=r['end_dt'], freq='1D')
    df_temp = pd.DataFrame([{'name': r['name'], 'ypos': i+1}], index=work_dates)
    _ = df_temp.plot(marker='d', markevery=[0,-1], markersize=5.0, ax=ax)
    new_ylabels.append(r['name'])
    
_ = ax.get_legend().remove()
_ = ax.set(yticklabels=new_ylabels)
_ = ax.set_title('Lame gantt-type chart')

This produced a pretty nifty gantt-type chart with the timelines from my dataset:

The idea I want to highlight with this post is the work I did in the code above on line 7, the df_temp.plot call. I used three marker properties to craft the chart I was after:

  • marker
  • markersize
  • markevery

Marker

You set the type of marker you want with the “marker” property–and there are tons of choices. I chose a lowercase “d” to get a “thin diamond”.

Markersize

The “markersize” property does what it says: sets the size of the marker. In my experience, I’ve just had to set a value, render the chart, and then adjust-and-repeat to get the size I want.

Markevery

I was actually pretty familiar with the “marker” and “markersize” properties–having used them extensively in the past–but I was pretty excited to learn about the “markevery” property. By default, a marker will appear at every datapoint in your chart. However, gantt charts normally only mark the beginning and end of a time range and not every point in between. With the “markevery” property, all I needed to do was pass it a list of [0, -1] to tell it to mark only the first and last points in each time range.
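To see the effect in isolation, here is a small sketch that draws the same flat line three times with different “markevery” values. The data and the settings dictionary are made up for the demonstration; only the marker, markersize, and markevery arguments come from the chart code above.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

x = list(range(10))
y = [1] * 10

# Three markevery settings: the default, every 3rd point, first and last only
settings = {
    'default (every point)': None,
    'every 3rd point': 3,
    'first and last only': [0, -1],
}

fig, axes = plt.subplots(1, 3, figsize=(9, 2.5), sharey=True)
lines = []
for ax, (title, every) in zip(axes, settings.items()):
    line, = ax.plot(x, y, marker='d', markersize=8, markevery=every)
    ax.set_title(title)
    lines.append(line)

fig.savefig('markevery_demo.png')
```

An integer marks every Nth point, while a list of integers marks only the points at those indexes, which is why [0, -1] picks out the two ends of each line.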

These properties really helped render the chart I wanted. It’s always great to learn more about the versatility of Matplotlib!
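On the earlier question of expanding the dataset to one record per working day without a plotting loop: one option that may fit is DataFrame.explode. The sketch below uses two of the rows from the dataset above. The work_dt column and the daily dataframe are names invented for this example.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'Larry', 'start_dt': '2023-01-05', 'end_dt': '2023-01-09'},
    {'name': 'Moe', 'start_dt': '2023-01-07', 'end_dt': '2023-01-12'},
])
df['start_dt'] = pd.to_datetime(df.start_dt)
df['end_dt'] = pd.to_datetime(df.end_dt)

# Store each person's list of working days in a single cell...
df['work_dt'] = [list(pd.date_range(start, end, freq='D'))
                 for start, end in zip(df['start_dt'], df['end_dt'])]

# ...then give every day its own row
daily = df.explode('work_dt').reset_index(drop=True)
daily['work_dt'] = pd.to_datetime(daily['work_dt'])  # explode leaves an object column
print(daily[['name', 'work_dt']])
```

This gives five rows for Larry and six for Moe. The loop in the chart code is still a reasonable way to draw one line per person; explode mainly helps when the daily records are needed for other analysis.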

Python help with slideshows

As in years past, I continue to work on my annual family video as a year-end project. As is my tradition, I always end my videos with a sort-of “outro” segment where I play a slideshow of family photos from the past year over some upbeat song.

The software I use, Cyberlink PowerDirector, has a nifty Slideshow Creator tool that makes it easy for you to drop your photos and music into one of several slick, pre-created templates for a cool slideshow. While this tool produces a neat product in a short amount of time, I’ve encountered a few problems with it:

  • Slide order is not guaranteed. I often want particular photos to start the slideshow and particular ones to end it but no matter how I name my images alphabetically, the Slideshow Creator never seems to order my slides how I want them.
  • I always have a challenge matching the number of images I want in my project to the length of the background music I want playing in the montage. Often, I have too many images for the length of song I’ve chosen and Slideshow Creator will repeat my song until it’s cycled through all my photos. I usually play a game of building my project with a certain number of images and then trying to guess how many I need to delete to avoid Slideshow Creator repeating my song.

This year, I finally explored a second option: Theme Designer. It seems like Slideshow Creator is a layer of abstraction over Theme Designer, but PowerDirector allows you to bypass the Creator tool and work directly with the Designer. There’s less automation, but more control; however, my problems still remain: can I order my slides as I see fit and just how many photos can I use to cover the length of my chosen music?

The tutorial video is helpful and shows you that you have full control over the order of your images, but I still have the question about how many images I can include to fill the length of my chosen music. Here’s how I solved that problem.

Step 1: Measure the length of each template you want to use

PowerDirector Theme Designer

In this example, I’ll focus on the Picture Frames theme. This theme has five templates:

  1. An Opening template that holds two images
  2. A Middle 1 template that holds three images
  3. A Middle 2 template that holds four images
  4. A Middle 3 template that holds five images
  5. and a Closing template that holds four images

What is the runtime for these templates? You can check the runtime in the preview on the right by dragging the timer all the way to the end of the segment–for the Opening template, the preview says it runs for seven seconds–but I’ve not found this preview to be completely honest.

I’ve found I’ve had to add each template to a new project, add images to all the templates, then drag the timer to the end of each template before I was confident in the true length of each sequence. In the case of the Picture Frames theme, I’ve found the runtimes of each template to be (rounding down to the nearest second):

  • The Opening template runs for seven seconds
  • The Middle 1 template runs for six seconds
  • The Middle 2 template runs for ten seconds
  • The Middle 3 template runs for 16 seconds
  • and the Closing template runs for eight seconds

Step 2: Figure out your song length

You can easily figure out the length of the song for your slideshow by right-clicking on the file, choosing Properties, clicking the Details tab, and finding the Length property.

Tom Petty’s Free Fallin’ runs for 255 seconds

Step 3: Let Python tell you the templates you need and the number of images to use

So, I know that my slideshow should run for 255 seconds. I know I want to use the Opening template only once at the beginning of the slideshow and the Closing template only once at the end. That’s 15 seconds out of 255: so I have 240 seconds to fill with some amount of the Middle templates. How many? Here’s some simple code I wrote to figure that out:

song_len_seconds = 255  # free fallin
opening_template = (7, 2)  # nbr of seconds long, nbr of pictures in template (Picture Frames)
middle_template1 = (6, 3)
middle_template2 = (10, 4)
middle_template3 = (16, 5)
closing_template = (8, 4)

remaining_time = song_len_seconds - opening_template[0] - closing_template[0]

# runtime and picture count of one Middle 1 + Middle 2 + Middle 3 combination
combo_secs = middle_template1[0] + middle_template2[0] + middle_template3[0]
combo_pics = middle_template1[1] + middle_template2[1] + middle_template3[1]

print(f'After subtracting the runtime of the opening and closing templates, remaining secs to fill with middle templates: {remaining_time}')
print(f'Number of middle template combos to add to project: {remaining_time / combo_secs}')

middle_factor = remaining_time // combo_secs  # whole combos that fit

print(f'Total seconds consumed by adding {middle_factor} middle template combos: {middle_factor * combo_secs}')
print(f'Remaining seconds to fill: {remaining_time - middle_factor * combo_secs}')

total_pics = opening_template[1] + closing_template[1] + middle_factor * combo_pics

print(f"Number of pictures I'll need: {total_pics}")

For simplicity, I just opted to use all three Middle templates in the same order: Middle 1, Middle 2, then Middle 3. By my calculations, after subtracting the Opening and Closing template runtimes, I will need to include seven Middle 1/Middle 2/Middle 3 combinations. Even after including seven of those combinations, I still have 16 additional seconds to fill–I didn’t write any code to recommend how to fill that time. I could fill it with one more Middle 3 template; of course, I’d want to make sure to place it in my project so that I don’t have two Middle 3 templates back-to-back.

My code also lets me know that an Opening, Closing, and seven Middle 1/Middle 2/Middle 3 combinations requires 90 images–which is nice for planning purposes.
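The leftover time can also be handled in code. Below is a minimal brute-force sketch, assuming the same (seconds, pictures) tuples as above: it tries every mix of Middle templates that fits in the leftover seconds and keeps the one that leaves the least unfilled time, preferring fewer templates on a tie. The function best_fill and the middle_templates dictionary are hypothetical names, and the tie-break rule is a choice made for this example.

```python
from itertools import product

# (seconds long, pictures in template), same numbers as the Picture Frames theme above
middle_templates = {'Middle 1': (6, 3), 'Middle 2': (10, 4), 'Middle 3': (16, 5)}


def best_fill(seconds_to_fill, templates):
    """Return (leftover_secs, counts) for the mix that leaves the least unfilled time."""
    names = list(templates)
    # upper bound on how many copies of each template could possibly fit
    ranges = [range(seconds_to_fill // templates[n][0] + 1) for n in names]
    best = None
    for counts in product(*ranges):
        used = sum(c * templates[n][0] for c, n in zip(counts, names))
        if used > seconds_to_fill:
            continue  # this mix overruns the song
        # smaller leftover wins; on a tie, fewer templates wins
        candidate = (seconds_to_fill - used, sum(counts), counts)
        if best is None or candidate < best:
            best = candidate
    leftover, _, counts = best
    return leftover, dict(zip(names, counts))


leftover, counts = best_fill(16, middle_templates)
extra_pics = sum(counts[n] * middle_templates[n][1] for n in counts)
print(f'Leftover seconds: {leftover}')
print(f'Extra templates: {counts}')
print(f'Extra pictures: {extra_pics}')
```

For the 16 leftover seconds this selects a single Middle 3 template and five more pictures, which agrees with the manual choice described above. It does not decide where in the project to place that template.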

Anyway, using some simple code like this will help me develop future slideshows more quickly and consistently.


© 2024 DadOverflow.com
