Filtering dataframes with tuples

Often when I’m working with data, I need to filter the data down into sub-groups for more detailed analysis. Usually, my needs are simple, where I only need to parse on fields independently. However, sometimes I need to filter on field combinations and that work can get complicated.

Even Mr. Hitchcock finds filtering by field combinations challenging

For example, suppose I’m working with a movie dataset and I want to compare Francis Ford Coppola dramas to Alfred Hitchcock mysteries to Martin Scorsese biography pictures. Initially, I might try something like this:

import pandas as pd


# data from: https://www.kaggle.com/mysarahmadbhat/imdb-top-1000-movies
df = pd.read_csv('./data/regex_imdb.csv').fillna(0)
# for simplicity, I'm just going to use the first, comma-delimited genre value
df['Genre'] = df.Genre.apply(lambda g: g.split(',')[0])

directors = ['Francis Ford Coppola', 'Alfred Hitchcock', 'Martin Scorsese']
genres = ['Drama', 'Mystery', 'Biography']
df[(df.Director.isin(directors)) & (df.Genre.isin(genres))]

This code allowed in Scorsese “mysteries”, for example, when I only wanted Scorsese “biographies”

Ok. Let’s then filter the hard way:

df[((df.Director=='Francis Ford Coppola') & (df.Genre=='Drama')) | 
   ((df.Director=='Alfred Hitchcock') & (df.Genre=='Mystery')) | 
   ((df.Director=='Martin Scorsese') & (df.Genre=='Biography'))]

This code does get me the data I want, but it’s long and a little unreadable. Is there a better way? Yes: tuples!

First, let’s take the director/genre combinations we want and put them together in a single list of tuples:

dirs_genres = [('Francis Ford Coppola', 'Drama'), ('Alfred Hitchcock', 'Mystery'), ('Martin Scorsese', 'Biography')]

Now, we can use the apply function to create an on-the-fly director/genre tuple field that we can parse on:

df[df.apply(lambda row: (row.Director, row.Genre), axis=1).isin(dirs_genres)]

We get the same results as before, but with slightly more readable code. Here’s a neat alternative option using Python’s tuple function:

df[df[['Director', 'Genre']].apply(tuple, axis=1).isin(dirs_genres)]

If you anticipate that you might use this director/genre combination a lot in your work, you could consider adding this tuple as a new column to your dataframe:

df['pg'] = df[['Director', 'Genre']].apply(tuple, axis=1)
df.head()

Then, you could do all your filtering directly against that single field.

Happy filtering!

Filtering dataframes with tuples

Recent Posts

Recent Comments

Archives

Meta

Filtering dataframes with tuples

Recent Posts

Recent Comments

Archives

Tags

Meta