

Recreating a 3d bar chart

As I try to educate myself on the new hotness of large language models, I ran across this post and associated chart and decided I wanted to see if I could recreate that chart programmatically with matplotlib:

A chart on LLMs and the datasets on which they were trained

I got “mostly” there. Here’s what I did.

Step 1: assemble the data

Is this data (the models, which datasets they trained on, and how big those datasets were) already assembled somewhere on the internet? Maybe, but I just recreated it by hand from the chart:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


data = [{'model': 'GPT-1', 'dataset':'Wikipedia', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'Books', 'size_gb':5}, 
        {'model': 'GPT-1', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'Reddit links', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'CC', 'size_gb':None}, 
        {'model': 'GPT-1', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Wikipedia', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Books', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Reddit links', 'size_gb':40}, 
        {'model': 'GPT-2', 'dataset':'CC', 'size_gb':None}, 
        {'model': 'GPT-2', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'GPT-3', 'dataset':'Wikipedia', 'size_gb':11}, 
        {'model': 'GPT-3', 'dataset':'Books', 'size_gb':21}, 
        {'model': 'GPT-3', 'dataset':'Academic journals', 'size_gb':101}, 
        {'model': 'GPT-3', 'dataset':'Reddit links', 'size_gb':50}, 
        {'model': 'GPT-3', 'dataset':'CC', 'size_gb':570}, 
        {'model': 'GPT-3', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Wikipedia', 'size_gb':6}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Books', 'size_gb':118}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Academic journals', 'size_gb':244}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Reddit links', 'size_gb':63}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'CC', 'size_gb':227}, 
        {'model': 'GPT-J/GPT-NeoX-20B', 'dataset':'Other', 'size_gb':167}, 
        {'model': 'Megatron-11B', 'dataset':'Wikipedia', 'size_gb':11}, 
        {'model': 'Megatron-11B', 'dataset':'Books', 'size_gb':5}, 
        {'model': 'Megatron-11B', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'Megatron-11B', 'dataset':'Reddit links', 'size_gb':38}, 
        {'model': 'Megatron-11B', 'dataset':'CC', 'size_gb':107}, 
        {'model': 'Megatron-11B', 'dataset':'Other', 'size_gb':None}, 
        {'model': 'MT-NLG', 'dataset':'Wikipedia', 'size_gb':6}, 
        {'model': 'MT-NLG', 'dataset':'Books', 'size_gb':118}, 
        {'model': 'MT-NLG', 'dataset':'Academic journals', 'size_gb':77}, 
        {'model': 'MT-NLG', 'dataset':'Reddit links', 'size_gb':63}, 
        {'model': 'MT-NLG', 'dataset':'CC', 'size_gb':983}, 
        {'model': 'MT-NLG', 'dataset':'Other', 'size_gb':127}, 
        {'model': 'Gopher', 'dataset':'Wikipedia', 'size_gb':12}, 
        {'model': 'Gopher', 'dataset':'Books', 'size_gb':2100}, 
        {'model': 'Gopher', 'dataset':'Academic journals', 'size_gb':164}, 
        {'model': 'Gopher', 'dataset':'Reddit links', 'size_gb':None}, 
        {'model': 'Gopher', 'dataset':'CC', 'size_gb':3450}, 
        {'model': 'Gopher', 'dataset':'Other', 'size_gb':4823}, 
        {'model': 'GPT-4', 'dataset':'Wikipedia', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Books', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Academic journals', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Reddit links', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'CC', 'size_gb':None}, 
        {'model': 'GPT-4', 'dataset':'Other', 'size_gb':None}]

df = pd.DataFrame(data).fillna(0)
df.head()

Note that I converted all NaN values to 0: bad things happen when you try to chart a NaN.

Step 2: code my chart

I borrowed pretty heavily from matplotlib’s 3d bar chart examples. Some points to consider:

  • The datasets are listed along the X axis. In a 3d bar chart, 0 on the X axis is at the far left and the values increase as you move toward the intersection with the Y axis. To replicate the chart, then, the “Other” dataset needed to fall at around point 0 while the “Wikipedia” dataset needed to fall at around point 5. However, I assembled the data in my dataframe starting with “Wikipedia” and continuing on up to “Other”. So, when it came time for me to chart the datasets, I had to reverse their order with a handy [::-1] slice. I had to reverse the colors associated with the datasets as well.
  • I’m always learning new things about Python with these types of exercises. How cool is the pandas iat indexer? (Not to mention the endless supply of numpy functions I’d never heard of before.)
  • In my dataframe, my X and Y values are categories (datasets and models, respectively). However, matplotlib doesn’t seem to like plotting categories here, so I had to use placeholder integer values and then later replace those tick labels with the real category names.
  • Setting the width and depth to 0.9 helps create a little padding between the bars.
  • I had to play with the rotation property of my tick labels to get them to sort of align properly to the chart. I did this by eye but I wonder if there’s a better way to align them exactly?

So, here’s my code:

fig, ax = plt.subplots(subplot_kw={'projection': '3d'}, figsize=(6, 6))

models = df.model.unique().tolist()
datasets = df.dataset.unique().tolist()[::-1]
top = [df.loc[(df.model==m) & (df.dataset==d), 'size_gb'].iat[0] for m in models for d in datasets]
bottom = np.zeros_like(top)

# bar3d doesn't seem to like categories for x and y values
_x, _y = np.meshgrid(np.arange(len(datasets)), np.arange(len(models)))
x, y = _x.ravel(), _y.ravel()
width = depth = 0.9  # allows a little bit of padding between bars

ds_colors = ['greenyellow','fuchsia','turquoise','orangered','dodgerblue','goldenrod'][::-1]
#colors = list(np.array([[c]*len(models) for c in ds_colors]).flat)
colors = ds_colors * len(models)

_ = ax.bar3d(x, y, bottom, width, depth, top, color=colors, shade=True)
_ = ax.set_yticklabels(['', ''] + models + [''], rotation=-20, ha='left')
_ = ax.set_xticklabels([''] + datasets + ['', ''], rotation=50)

# annotate the size numbers onto the bars
for x1, y1, z1 in zip(x, y, top):
    if z1 > 0:
        _ = ax.text(x1, y1, z1, str(int(z1)), horizontalalignment='left', verticalalignment='bottom')

My attempt at the LLM chart

Ok. There’s still a lot to be done here. The Z axis and Z gridlines should go. The size annotations could be better aligned. The scale probably needs to be recalculated so that the 4823 value isn’t hiding all the other values. All the 0 length bars should disappear altogether. And a legend might be nice.
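
For what it’s worth, here’s a rough, untested sketch of how some of that cleanup might look, reusing the variables from the code above (the zero-bar masking, the axis-hiding calls, and the proxy-artist legend are my own guesses at an approach, not something taken from the original chart):

from matplotlib.patches import Patch

fig2, ax2 = plt.subplots(subplot_kw={'projection': '3d'}, figsize=(6, 6))

top_arr = np.array(top)
mask = top_arr > 0                                    # drop the zero-length bars entirely
_ = ax2.bar3d(x[mask], y[mask], bottom[mask], width, depth, top_arr[mask],
              color=np.array(colors)[mask], shade=True)

_ = ax2.set_yticklabels(['', ''] + models + [''], rotation=-20, ha='left')
_ = ax2.set_xticklabels([''] + datasets + ['', ''], rotation=50)
_ = ax2.set_zticks([])                                # hide the Z ticks and labels
ax2.zaxis.line.set_linewidth(0)                       # hide the Z axis line itself
ax2.grid(False)                                       # and the pane gridlines

# bar3d doesn't produce legend handles, so use proxy artists
handles = [Patch(color=c, label=d) for d, c in zip(datasets, ds_colors)]
_ = ax2.legend(handles=handles, loc='upper left', fontsize='small')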

Anyway, I think I’ve accomplished the crux of what I set out to do, so I’ll leave it there. Hope this helps with some of your 3d charting endeavors!

Matplotlib markers

I’ve used Matplotlib for years and yet I’m always discovering new features. Recently, I was working with some data with date ranges and thought maybe visualizing that data in something like a gantt chart might help me understand it better. When it came time to jazz up the chart with markers, I was really impressed at the options.

For illustration, suppose I have a dataset like this:

import pandas as pd
from datetime import timedelta
import matplotlib.pyplot as plt


data = [{'name': 'Larry', 'start_dt': '2023-01-05', 'end_dt': '2023-01-09'}, 
        {'name': 'Moe', 'start_dt': '2023-01-07', 'end_dt': '2023-01-12'},
        {'name': 'Curly', 'start_dt': '2023-01-02', 'end_dt': '2023-01-07'},
        {'name': 'Al', 'start_dt': '2023-01-12', 'end_dt': '2023-01-15'},
        {'name': 'Peggy', 'start_dt': '2023-01-04', 'end_dt': '2023-01-09'},
        {'name': 'Kelly', 'start_dt': '2023-01-08', 'end_dt': '2023-01-12'},
        {'name': 'Bud', 'start_dt': '2023-01-11', 'end_dt': '2023-01-14'}]

df = pd.DataFrame(data)
df['start_dt'] = pd.to_datetime(df.start_dt)
df['end_dt'] = pd.to_datetime(df.end_dt)

Is there an elegant way in pandas to expand the dataset to add a record for each day a person worked? I couldn’t think of one at the time (though see the explode sketch at the end of this post), so I just looped over the dataframe and did it the hard way:

fig, ax = plt.subplots(figsize=(10, 6))
new_ylabels = ['']

for i, r in df.sort_values('name').reset_index(drop=True).iterrows():
    work_dates = pd.date_range(start=r['start_dt'], end=r['end_dt'], freq='1D')
    df_temp = pd.DataFrame([{'name': r['name'], 'ypos': i+1}], index=work_dates)
    _ = df_temp.plot(marker='d', markevery=[0,-1], markersize=5.0, ax=ax)
    new_ylabels.append(r['name'])
    
_ = ax.get_legend().remove()
_ = ax.set(yticklabels=new_ylabels)
_ = ax.set_title('Lame gantt-type chart')

This produced a pretty nifty gantt-type chart with the timelines from my dataset:

The idea I want to highlight with this post is the work I did in the df_temp.plot() call inside the loop above. I used three marker properties to craft the chart I was after:

  • marker
  • markersize
  • markevery

Marker

You set the type of marker you want with the “marker” property–and there are tons of choices. I chose a lowercase “d” to get a “thin diamond”.

Markersize

The “markersize” property does what it says: sets the size of the marker. In my experience, I’ve just had to set a value, render the chart, and then adjust-and-repeat to get the size I want.

Markevery

I was actually pretty familiar with the “marker” and “markersize” properties–having used them extensively in the past–but I was pretty excited to learn about the “markevery” property. By default, a marker will appear at every datapoint in your chart. However, gantt charts normally only mark the beginning and end of a time range and not every point in between. With the “markevery” property, all I needed to do was pass it a list of [0, -1] to tell it to mark only the first and last points in each time range.
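
For reference, “markevery” accepts a few other forms besides a list of indices. Here’s a small, standalone sketch (separate from the gantt chart above) of the ones I’m aware of:

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(0, 10, 101)
fig2, ax2 = plt.subplots()
_ = ax2.plot(xs, np.sin(xs), marker='d', markevery=[0, -1], label='list: first and last points')
_ = ax2.plot(xs, np.sin(xs) + 2, marker='o', markevery=10, label='int: every 10th point')
_ = ax2.plot(xs, np.sin(xs) + 4, marker='s', markevery=0.1, label='float: roughly even visual spacing along the line')
_ = ax2.legend()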

These properties really helped render the chart I wanted. It’s always great to learn more about the versatility of Matplotlib!
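
As an aside, to revisit the question above about expanding the dataset: one possibility (a sketch I haven’t tested against this exact chart, and it still builds a date range per row) is to generate the per-day dates with date_range and then explode them into one row per person per work day:

# one row per person per work day, via assign + explode
df_days = (df.assign(work_date=[pd.date_range(s, e, freq='1D').tolist()
                                for s, e in zip(df.start_dt, df.end_dt)])
             .explode('work_date'))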

Avoiding duplicates in Hive with Anti Join

In the world of data engineering, when an engineer builds a data pipeline to copy data from one system to another, it becomes easy to accidentally insert duplicate records into the target system. For example, your pipeline might break and you have to take steps to backfill the missing information. If your pipeline didn’t break in a clear and obvious spot, you may end up reprocessing the same data more than once.

When I create tables in a conventional relational database, I normally create a primary key field to ensure uniqueness, so that I don’t accidentally insert the same record twice into the table. That’s great if my data pipelines write to a relational database: if I end up having to backfill a broken operation, my database can reject data that I already successfully processed the first time around.

However, if my destination data repository is Apache Hive, I don’t have those same safeguards, like primary key fields. So, how can you avoid inserting duplicate records into your Hive tables? Here’s an option: use ANTI JOIN.

For starters, suppose I have a table called my_db.people_table (note that I’m testing my code in a PySpark shell running in a jupyter/all-spark-notebook Docker container):

create_db_qry = 'CREATE DATABASE my_db'
create_table_qry = """
CREATE TABLE my_db.people_table (
    person_id INT,
    fname STRING,
    lname STRING
);
"""

spark.sql(create_db_qry)
spark.sql(create_table_qry)

And the results:

>>> spark.sql('SELECT * FROM my_db.people_table ORDER BY person_id').show()
+---------+-----+-----+
|person_id|fname|lname|
+---------+-----+-----+
+---------+-----+-----+

Now, let’s add some data to the table:

initial_data = [(1, 'Andy', 'Griffith'), (2, 'Bee', 'Taylor'), (3, 'Opie', 'Griffith'), (4, 'Barney', 'Fife')]
df = spark.createDataFrame(initial_data, ['person_id', 'fname', 'lname'])
df.write.mode('append').insertInto('my_db.people_table')

Now we have some initial data:

>>> spark.sql('SELECT * FROM my_db.people_table ORDER BY person_id').show()     
+---------+------+--------+
|person_id| fname|   lname|
+---------+------+--------+
|        1|  Andy|Griffith|
|        2|   Bee|  Taylor|
|        3|  Opie|Griffith|
|        4|Barney|    Fife|
+---------+------+--------+

Suppose we need to add more data to the table, but we’re not sure if the data is all original or if the new set contains records we previously processed. Here’s how we might normally do that:

more_data = [(3, 'Opie', 'Griffith'), (4, 'Barney', 'Fife'), (5, 'Floyd', 'Lawson'), (6, 'Gomer', 'Pyle'), (7, 'Otis', 'Campbell')]
df = spark.createDataFrame(more_data, ['person_id', 'fname', 'lname'])
df.write.mode('append').insertInto('my_db.people_table')

Uh-oh: looks like that new data did contain some records we already had:

>>> spark.sql('SELECT * FROM my_db.people_table ORDER BY person_id').show()     
+---------+------+--------+
|person_id| fname|   lname|
+---------+------+--------+
|        1|  Andy|Griffith|
|        2|   Bee|  Taylor|
|        3|  Opie|Griffith|
|        3|  Opie|Griffith|
|        4|Barney|    Fife|
|        4|Barney|    Fife|
|        5| Floyd|  Lawson|
|        6| Gomer|    Pyle|
|        7|  Otis|Campbell|
+---------+------+--------+

We can avoid that dilemma by using an ANTI JOIN statement in our insert operation. Here’s how that would look instead:

more_data = [(3, 'Opie', 'Griffith'), (4, 'Barney', 'Fife'), (5, 'Floyd', 'Lawson'), (6, 'Gomer', 'Pyle'), (7, 'Otis', 'Campbell')]
df = spark.createDataFrame(more_data, ['person_id', 'fname', 'lname'])

# write our new dataset to a temporary table
df.createOrReplaceTempView('people_table_tmp')

# now, craft our INSERT statement to "ANTI JOIN" the temp table to the destination table and only write the delta
antijoin_qry = """INSERT INTO my_db.people_table 
    SELECT t.person_id, t.fname, t.lname 
    FROM (SELECT person_id, fname, lname FROM people_table_tmp a LEFT ANTI JOIN my_db.people_table b ON (a.person_id=b.person_id)) t"""

# execute that anti join statement
spark.sql(antijoin_qry)

# cleanup by dropping the temp table
spark.catalog.dropTempView('people_table_tmp')

And the results:

>>> spark.sql('SELECT * FROM my_db.people_table ORDER BY person_id').show()
+---------+------+--------+
|person_id| fname|   lname|
+---------+------+--------+
|        1|  Andy|Griffith|
|        2|   Bee|  Taylor|
|        3|  Opie|Griffith|
|        4|Barney|    Fife|
|        5| Floyd|  Lawson|
|        6| Gomer|    Pyle|
|        7|  Otis|Campbell|
+---------+------+--------+

Wow! Looks so much better. So, if you suffer from duplicate data in your Hive tables, give ANTI JOIN a try!
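
For completeness, here’s a sketch of the same de-duplication done with the DataFrame API’s left_anti join instead of the SQL statement above (assuming the same spark session and my_db.people_table):

existing = spark.table('my_db.people_table')

# keep only the incoming rows whose person_id isn't already in the table
delta = (df.join(existing, on='person_id', how='left_anti')
           .select('person_id', 'fname', 'lname'))   # keep column order for insertInto

delta.write.mode('append').insertInto('my_db.people_table')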

