Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

Author: Brad (Page 29 of 57)

Dad. Technologist. Fan of English poet of John Lillison.

Filtering on the most recent version of groups in Pandas

Not the catchiest blog title, but stick with me here.

Recently, a friend came to me with an interesting pandas problem. He basically had a set of survey questions and answers. The surveys were of different software applications, but each application had one or more versions of survey results. His goal was to filter on the most recent survey results of each application.

Spinal Tap: too many drummer changes to keep track

For demo purposes, consider a dataset of different rock bands and their lineup changes over the years. Van Halen has had about four lineup changes: from the early David Lee Roth years, to the Sammy years, to the Gary Cherone era, and back to David Lee Roth. ZZ Top, on the other hand, has had only one lineup: the original three members.

The challenge: given a dataset of different bands and their different lineup changes over the years, how might you filter the dataframe to only show the most recent lineup of each band? (Here’s the dataset I’m using.)

I found two ways to solve this problem and a third potential option that bears some consideration. Let’s take a look at these solutions!

Of course, we must first import pandas and load our dataset

import pandas as pd
df = pd.read_csv('bands.csv')

Solution 1: idxmax and Merge

The idxmax function returns the row locations where a given column contains the maximum value. With our dataset, then, we can group by the band then run idxmax on the version column. Unfortunately, though, idxmax only returns one row per band; otherwise, it might have solved our problem straight-away:

df.loc[df.groupby('band').version.idxmax()]

With idxmax, we can still build a “most recent” dataframe: as in the most recent version numbers for each band. Then, we can use pandas merge to join our full dataset to the one representing the most recent versions of each band. This will effectively filter down our full dataframe to one that only shows the most recent versions of each band and, thus, solve our problem:

df_most_recent = df.loc[df.groupby('band').version.idxmax()]
df.merge(df_most_recent, how='right', on=['band', 'version'])
You would drop the “member_y” column, of course

Solution 2: Max and clever filtering

Similar to idxmax, we can use the pandas max function to also find the latest version numbers of each band, but this time as a Series.

df.groupby('band').version.max()

Series objects can be easily casted to a dictionary objects, which will be handy in a minute:

dict(most_recent).items()

This next part is really cool: I was not aware you could do this, but you can use the apply function to convert dataframe columns into a Series of tuples. Then, you can use the isin function and some clever list comprehension to match the band/version tuple of the main dataframe to the band/version tuple of the “most recent” Series. The result is the solution we seek: a dataframe of the most recent band lineups:

most_recent = dict(df.groupby('band').version.max())
df[ (df[['band','version']].apply(tuple, axis=1)).isin([(b,v) for b, v in most_recent.items()]) ]
No columns to drop for this solution

A third, half-baked idea

The above two approaches seem to solve this problem decently. One other option to explore is changing the index of the dataframe to a multi-index dataframe on the band and lineup version:

df_indexed = df.set_index(['band', 'version'])

Pandas can do some interesting slicing based on indices, including making use of the “cross section” (xs) function. In my example below, I’m filtering the dataframe on the second index and requesting all rows where the second index has a value of “1”. Basically, give me a dataframe of all the original band lineups:

df_indexed.xs(1, level=1, drop_level=False)

Pretty slick! Now, how do I use this multi-index approach to return the most recent band lineups? Unfortunately, I have not found a way to do that with the xs function. Nevertheless, it would not surprise me if such a way were possible with indexing and multi-indexing.

You can download my full code solution here.

Teaching the kid to drive

I’m now on to teaching my second child to drive. Here are some thoughts I’ve accumulated over the time that might be of some assistance to you novice parent instructors out there:

Review the car mechanics

Your child should obviously know the gas pedal from the brake, but don’t forget all the other knobs and switches that we know by heart but your child doesn’t. Gear shift, turn signals, headlights, bright lights, windshield wipers/fluid, parking brake, hazard lights…make sure to review all these features and how to activate them. As an advanced topic, I show my child how to halfway hold the turn signal arm up or down–thus, engaging the turn signal but not locking the arm in place, which can be handy for short-duration signaling like changing lanes.

Your safety radar

I try to get my kids to imagine they’re detectives or air traffic controllers. They need to be looking all around them, looking for new threats, obstacles, and other information appearing on their radars; they should be looking for clues to let them know how these objects might cause safety concerns. Is a given car in the vicinity driving erratically, not using turn signals, or otherwise providing evidence of potential danger? Add that car to your radar and keep an eye in it–and some distance. Are there children nearby that have the potential of running in front of you? Add them to your radar and take caution.

The first lesson: the parking lot

It’s probably a no-brainer, but my first lesson with the kids is in a large, unpopulated parking lot. In fact, the first several lessons will be in said parking lot. I have them drive in circles, clockwise and counter-clockwise. Several lessons in, I will start having them drive in reverse–maybe even doing a full circle or two in reverse. Of course, I also have the kids practice parking: both nose-first and rear-first. In every lesson, I insist we pretend that the parking lot is full and my young driver is not allowed to hit any imaginary car.

Residential Driving

Lazy suburbs are ideal starting points for getting your child on a real road. Hopefully, your chosen suburb has limited cars parked on the side of the roads. Initially, you may want to target early mornings or evenings to limit pedestrian traffic, as well. One alternative to the suburb is to practice on the access roads of a medium-sized cemetery.

U.S. Routes and State Highways

Once my child seems comfortable driving on the suburbs and local roads, we graduate to the nearby U.S. route. U.S. routes are a fantastic resource for beginner drivers. They traverse a variety of driving conditions: small town driving at 25-35 mph, mid-town driving at 40-45 mph, and even highway driving at 60+ mph. State routes have stop sign intersections, stoplights, railroad crossings, and multi-lane roadways. They’re also great to log lots of drive-time hours. Often, the kid and I will get up early on a Saturday or Sunday, drive two hours out then turn around and come back home.

Interstate Highway Driving

Interstate highway driving is almost like a final exam for my kids. Things get real on the highways. Obviously, the less traffic the better when starting out on a highway, so early mornings and even the weekday gaps between rush hours seem to work well. Here’s one technique I heard of recently that I may incorporate in my future lessons: drive to a convenient beltway, then proceed to take the exit. From the exit, work your way over to the closest entrance ramp to get right back on the beltway. Rinse and repeat at every exit until you’ve made your way around the entire highway. That’s sure to train your child up on highway exits and entrances.

Nighttime Driving

Repeat all the above, but this time, at night.

Odds and Ends

Don’t forget some of these other aspects of driving:

  • Gassing up the car (or charging, as the case may be)
  • Parking in full parking lots
  • Drive-up ATMs and fast food drive-throughs
  • What to do if you have an accident or other emergency
  • How to read your dashboard and where to look for errors and warnings
  • How to listen to the sound of the car under normal operating conditions and how that sound might change if the car is malfunctioning
  • Parallel parking?!

So, those are a few of the items I work through with my children as I teach them to drive. Any ideas I missed? Feel free to add them in the comments!

How do you transpose a Question/Answer dataset?

Recently, a friend came to me with an interesting challenge. He had a dataset of questions and answers where each record contained a single question and the answer to the question. Arguably, this dataset was already in a tidy format, but my friend wanted to transpose the data such that each unique question became a column of its own with the answers as values.

Before I could come to his aid, my friend already found a great answer at Medium.com using the pandas function: pivot_table.

Here’s what he did:

Let’s suppose you have this table of question/answer, tab-delimited data:

person	question	answer
Sir Robin	What is your name?	Sir Robin of Camelot
Sir Robin	What is your quest?	To seek the Holy Grail
Sir Robin	What is the capital of Assyria?	I don't know that
Sir Lancelot	What is your name?	Sir Lancelot of Camelot
Sir Lancelot	What is your quest?	To seek the Holy Grail
Sir Lancelot	What is your favorite colour?	Blue
Sir Galahad	What is your name?	Sir Galahad of Camelot
Sir Galahad	What is your quest?	I seek the Grail
Sir Galahad	What is your favorite colour?	"Blue, no Yellow"
King Arthur	What is your name?	"Arthur, King of the Britons"
King Arthur	What is your quest?	I seek the Holy Grail
King Arthur	What is the air speed of an unladened swallow?	What do you mean?  An African or European swallow?

Step 1: Import pandas and read in your data

import pandas as pd

df = pd.read_csv('questions.csv', sep='\t')

Step 2: pivot_table

df_pivotted = df.pivot_table(index='person', values=['answer'], 
                             columns=['question'], aggfunc=lambda x: ' '.join(str(v) for v in x))
df_pivotted.head()
pivot_table does the job nicely

The trick here is the aggfunc operation. The aggfunc parameter is normally used to sum, average, or perform some other type of numeric operation on your values columns. Interestingly, though, you can apparently supply your own custom function to this parameter instead. Here, the Medium.com author found that he could simply loop through every letter of the answer and re-join them with spaces, effectively return the original answer.

That seems pretty complicated

The use of pivot_table certainly works in this example and it’s pretty sweet to see that you can pass your own custom function to it. However, pandas also has a more generic, pivot function. Could that have worked here?

The answer is: yes. When you google pandas pivot vs pivot_table, one of the top responses is this Stackoverflow.com post that suggests pivot_table only allows numerically-typed columns in the values parameter while pivot will take strings. I don’t think this is quite true, since the above example passed a string column to the values parameter, but it does suggest that pivot might be more disposed to working with strings than pivot_table. Let’s give it a try:

df.pivot(index='person', values='answer', columns='question')
Whaddya know?! Pivot can do the job, too!

Not only can pivot do the transformation, it certainly seems less complicated. Check out my full code here.

« Older posts Newer posts »

© 2024 DadOverflow.com

Theme by Anders NorenUp ↑