Not the catchiest blog title, but stick with me here.

Recently, a friend came to me with an interesting pandas problem. He basically had a set of survey questions and answers. The surveys were of different software applications, but each application had one or more versions of survey results. His goal was to filter on the most recent survey results of each application.

Spinal Tap: too many drummer changes to keep track

For demo purposes, consider a dataset of different rock bands and their lineup changes over the years. Van Halen has had about four lineup changes: from the early David Lee Roth years, to the Sammy years, to the Gary Cherone era, and back to David Lee Roth. ZZ Top, on the other hand, has had only one lineup: the original three members.

The challenge: given a dataset of different bands and their different lineup changes over the years, how might you filter the dataframe to only show the most recent lineup of each band? (Here’s the dataset I’m using.)

I found two ways to solve this problem and a third potential option that bears some consideration. Let’s take a look at these solutions!

Of course, we must first import pandas and load our dataset

import pandas as pd
df = pd.read_csv('bands.csv')

Solution 1: idxmax and Merge

The idxmax function returns the row locations where a given column contains the maximum value. With our dataset, then, we can group by the band then run idxmax on the version column. Unfortunately, though, idxmax only returns one row per band; otherwise, it might have solved our problem straight-away:

df.loc[df.groupby('band').version.idxmax()]

With idxmax, we can still build a “most recent” dataframe: as in the most recent version numbers for each band. Then, we can use pandas merge to join our full dataset to the one representing the most recent versions of each band. This will effectively filter down our full dataframe to one that only shows the most recent versions of each band and, thus, solve our problem:

df_most_recent = df.loc[df.groupby('band').version.idxmax()]
df.merge(df_most_recent, how='right', on=['band', 'version'])
You would drop the “member_y” column, of course

Solution 2: Max and clever filtering

Similar to idxmax, we can use the pandas max function to also find the latest version numbers of each band, but this time as a Series.

df.groupby('band').version.max()

Series objects can be easily casted to a dictionary objects, which will be handy in a minute:

dict(most_recent).items()

This next part is really cool: I was not aware you could do this, but you can use the apply function to convert dataframe columns into a Series of tuples. Then, you can use the isin function and some clever list comprehension to match the band/version tuple of the main dataframe to the band/version tuple of the “most recent” Series. The result is the solution we seek: a dataframe of the most recent band lineups:

most_recent = dict(df.groupby('band').version.max())
df[ (df[['band','version']].apply(tuple, axis=1)).isin([(b,v) for b, v in most_recent.items()]) ]
No columns to drop for this solution

A third, half-baked idea

The above two approaches seem to solve this problem decently. One other option to explore is changing the index of the dataframe to a multi-index dataframe on the band and lineup version:

df_indexed = df.set_index(['band', 'version'])

Pandas can do some interesting slicing based on indices, including making use of the “cross section” (xs) function. In my example below, I’m filtering the dataframe on the second index and requesting all rows where the second index has a value of “1”. Basically, give me a dataframe of all the original band lineups:

df_indexed.xs(1, level=1, drop_level=False)

Pretty slick! Now, how do I use this multi-index approach to return the most recent band lineups? Unfortunately, I have not found a way to do that with the xs function. Nevertheless, it would not surprise me if such a way were possible with indexing and multi-indexing.

You can download my full code solution here.