Often when I’m working with data, I need to filter the data down into sub-groups for more detailed analysis. Usually, my needs are simple, where I only need to parse on fields independently. However, sometimes I need to filter on field combinations and that work can get complicated.

Even Mr. Hitchcock finds filtering by field combinations challenging

For example, suppose I’m working with a movie dataset and I want to compare Francis Ford Coppola dramas to Alfred Hitchcock mysteries to Martin Scorsese biography pictures. Initially, I might try something like this:

import pandas as pd


# data from: https://www.kaggle.com/mysarahmadbhat/imdb-top-1000-movies
df = pd.read_csv('./data/regex_imdb.csv').fillna(0)
# for simplicity, I'm just going to use the first, comma-delimited genre value
df['Genre'] = df.Genre.apply(lambda g: g.split(',')[0])

directors = ['Francis Ford Coppola', 'Alfred Hitchcock', 'Martin Scorsese']
genres = ['Drama', 'Mystery', 'Biography']
df[(df.Director.isin(directors)) & (df.Genre.isin(genres))]
This code allowed in Scorsese “mysteries”, for example, when I only wanted Scorsese “biographies”

Ok. Let’s then filter the hard way:

df[((df.Director=='Francis Ford Coppola') & (df.Genre=='Drama')) | 
   ((df.Director=='Alfred Hitchcock') & (df.Genre=='Mystery')) | 
   ((df.Director=='Martin Scorsese') & (df.Genre=='Biography'))]

This code does get me the data I want, but it’s long and a little unreadable. Is there a better way? Yes: tuples!

First, let’s take the director/genre combinations we want and put them together in a single list of tuples:

dirs_genres = [('Francis Ford Coppola', 'Drama'), ('Alfred Hitchcock', 'Mystery'), ('Martin Scorsese', 'Biography')]

Now, we can use the apply function to create an on-the-fly director/genre tuple field that we can parse on:

df[df.apply(lambda row: (row.Director, row.Genre), axis=1).isin(dirs_genres)]

We get the same results as before, but with slightly more readable code. Here’s a neat alternative option using Python’s tuple function:

df[df[['Director', 'Genre']].apply(tuple, axis=1).isin(dirs_genres)]

If you anticipate that you might use this director/genre combination a lot in your work, you could consider adding this tuple as a new column to your dataframe:

df['pg'] = df[['Director', 'Genre']].apply(tuple, axis=1)
df.head()

Then, you could do all your filtering directly against that single field.

Happy filtering!