Occasionally, I have to work with more complicated forms of data where a given column might contain an array of values instead of conventional data types like Strings, Integers, Timestamps and such. Often, such complex fields can easily be processed with the excellent literal_eval function from the Abstract Syntax Trees (ast) package. For example, take this movie database from Kaggle:

import pandas as pd
from ast import literal_eval

df = pd.read_csv('./data/tmdb_5000_credits.csv')
df['cast'] = df.cast.apply(literal_eval)
df.head()

And when we take a look at that cast column, specifically:

type(df.loc[0,'cast'])

# this prints out "list" as the type

Basically, literal_eval converted the string of brackets and curly braces into lists of dictionary objects. Very cool and handy.

However, in real life, you don’t always have such clean data to work with. Recently, I was working with some complex data, tried to use literal_eval on it, and got an error like the following:

ValueError: malformed node or string: <_ast.Name object at 0x000001F62A851FC8>

Woh. There’s something about this data that literal_eval doesn’t like. I wonder what it is? Let me do a little exception handling around literal_eval to see if I can find the row(s) where the function is breaking:

def try_the_eval(row):
    try:
        literal_eval(row.cast)
    except:
        print('Found bad data at row: {0}'.format(row.name))


df_dirty = pd.read_csv('./data/dirty_data.csv')
_ = df_dirty.apply(try_the_eval, axis=1)

This code told me I had badly formatted data in row 32 of my dataset. I can take a closer look at that data like so:

df_dirty.loc[32,'cast']
Careful review of the data (and a handy red box) reveals the problem

literal_eval only works with certain types of data–strings, bytes, numbers, even None–but not null.

So, how do you fix such an issue? I can think of at least two ways:

  1. Wrap null values like these in double quotes or
  2. Replace null values like these with None.

Here, I go with Option #2:

df_dirty['cast'] = df_dirty.cast.str.replace('null', 'None')
df_dirty['cast'] = df_dirty.cast.apply(literal_eval)

# no exceptions are thrown and the below line prints out "list"
print(type(df_dirty.loc[32,'cast']))

One caveat here is that you need to be careful how you search and replace the errant data. For example, the word null without surround quotation marks might appear legitimately in other fields like descriptions. You don’t want to accidentally replace such legitimate uses of the word with None or surround it with quotation marks as that will introduce new problems. So, you may need to come up with slightly more rigorous ways of finding and replacing your errant data, such as using regular expressions. Once you deal with that data, though, literal_eval works like a charm and saves you lots of time working with complex datasets!