Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

Month: May 2020

Now this is a commencement address

Here’s a rather unconventional commencement address from Mike Rowe:

I like both his gentle irreverence and the simple truths of his points.

How do you succeed professionally? Mr. Rowe makes some suggestions:

  • Practice your craft everyday.
  • Become indispensable to your employer.
  • Show up early.
  • Stay late.
  • Distinguish yourself on the job at every opportunity.

No matter what your child does after high school, these are solid points to be successful.

If you’re a skilled tradesperson with an entrepreneurial spirit, a willingness to get dirty, a disposition to travel, and a burning curiosity to learn all that you can…I’m telling you, your opportunity to prosper has never been better.

Mike Rowe

I think it’s great that he references an “entrepreneurial spirit”…I wish he would have repeated that point a few more times. It shouldn’t be a given that we all must toil for an employer, when there are opportunities to be your own. Along with equally giving voice to the trades, it would be great if our educational institutions would give ample voice to entrepreneurship, as well.

pandas and literal_eval

Occasionally, I have to work with more complicated forms of data where a given column might contain an array of values instead of conventional data types like Strings, Integers, Timestamps and such. Often, such complex fields can easily be processed with the excellent literal_eval function from the Abstract Syntax Trees (ast) package. For example, take this movie database from Kaggle:

import pandas as pd
from ast import literal_eval

df = pd.read_csv('./data/tmdb_5000_credits.csv')
df['cast'] = df.cast.apply(literal_eval)
df.head()

And when we take a look at that cast column, specifically:

type(df.loc[0,'cast'])

# this prints out "list" as the type

Basically, literal_eval converted the string of brackets and curly braces into lists of dictionary objects. Very cool and handy.

However, in real life, you don’t always have such clean data to work with. Recently, I was working with some complex data, tried to use literal_eval on it, and got an error like the following:

ValueError: malformed node or string: <_ast.Name object at 0x000001F62A851FC8>

Woh. There’s something about this data that literal_eval doesn’t like. I wonder what it is? Let me do a little exception handling around literal_eval to see if I can find the row(s) where the function is breaking:

def try_the_eval(row):
    try:
        literal_eval(row.cast)
    except:
        print('Found bad data at row: {0}'.format(row.name))


df_dirty = pd.read_csv('./data/dirty_data.csv')
_ = df_dirty.apply(try_the_eval, axis=1)

This code told me I had badly formatted data in row 32 of my dataset. I can take a closer look at that data like so:

df_dirty.loc[32,'cast']
Careful review of the data (and a handy red box) reveals the problem

literal_eval only works with certain types of data–strings, bytes, numbers, even None–but not null.

So, how do you fix such an issue? I can think of at least two ways:

  1. Wrap null values like these in double quotes or
  2. Replace null values like these with None.

Here, I go with Option #2:

df_dirty['cast'] = df_dirty.cast.str.replace('null', 'None')
df_dirty['cast'] = df_dirty.cast.apply(literal_eval)

# no exceptions are thrown and the below line prints out "list"
print(type(df_dirty.loc[32,'cast']))

One caveat here is that you need to be careful how you search and replace the errant data. For example, the word null without surround quotation marks might appear legitimately in other fields like descriptions. You don’t want to accidentally replace such legitimate uses of the word with None or surround it with quotation marks as that will introduce new problems. So, you may need to come up with slightly more rigorous ways of finding and replacing your errant data, such as using regular expressions. Once you deal with that data, though, literal_eval works like a charm and saves you lots of time working with complex datasets!

© 2024 DadOverflow.com

Theme by Anders NorenUp ↑