Often when I’m working with data, I need to filter the data down into sub-groups for more detailed analysis. Usually, my needs are simple, where I only need to parse on fields independently. However, sometimes I need to filter on field combinations and that work can get complicated.
For example, suppose I’m working with a movie dataset and I want to compare Francis Ford Coppola dramas to Alfred Hitchcock mysteries to Martin Scorsese biography pictures. Initially, I might try something like this:
import pandas as pd
# data from: https://www.kaggle.com/mysarahmadbhat/imdb-top-1000-movies
df = pd.read_csv('./data/regex_imdb.csv').fillna(0)
# for simplicity, I'm just going to use the first, comma-delimited genre value
df['Genre'] = df.Genre.apply(lambda g: g.split(',')[0])
directors = ['Francis Ford Coppola', 'Alfred Hitchcock', 'Martin Scorsese']
genres = ['Drama', 'Mystery', 'Biography']
df[(df.Director.isin(directors)) & (df.Genre.isin(genres))]
Ok. Let’s then filter the hard way:
df[((df.Director=='Francis Ford Coppola') & (df.Genre=='Drama')) |
((df.Director=='Alfred Hitchcock') & (df.Genre=='Mystery')) |
((df.Director=='Martin Scorsese') & (df.Genre=='Biography'))]
This code does get me the data I want, but it’s long and a little unreadable. Is there a better way? Yes: tuples!
First, let’s take the director/genre combinations we want and put them together in a single list of tuples:
dirs_genres = [('Francis Ford Coppola', 'Drama'), ('Alfred Hitchcock', 'Mystery'), ('Martin Scorsese', 'Biography')]
Now, we can use the apply function to create an on-the-fly director/genre tuple field that we can parse on:
df[df.apply(lambda row: (row.Director, row.Genre), axis=1).isin(dirs_genres)]
We get the same results as before, but with slightly more readable code. Here’s a neat alternative option using Python’s tuple function:
df[df[['Director', 'Genre']].apply(tuple, axis=1).isin(dirs_genres)]
If you anticipate that you might use this director/genre combination a lot in your work, you could consider adding this tuple as a new column to your dataframe:
df['pg'] = df[['Director', 'Genre']].apply(tuple, axis=1)
df.head()
Then, you could do all your filtering directly against that single field.
Happy filtering!
Recent Comments