I’ve written about the pandas groupby function a few times in the past; it’s a valuable command that I use frequently. You typically want to chain your “group by” operations with an aggregation function like count, sum, or mean. This blog post has a great write-up on groupby and the calculations you can do with it.
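For instance, here’s a minimal sketch using a small made-up dataframe (not the movie dataset below) just to illustrate the pattern:
import pandas as pd
sales = pd.DataFrame({'region': ['East', 'East', 'West', 'West'],
                      'revenue': [100, 150, 200, 50]})
sales.groupby('region').sum()   # total revenue per region
sales.groupby('region').mean()  # average revenue per region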
Most examples of groupby depict grouping your dataframes by referencing the literal names of your various columns. For example, working with this movie dataset, suppose I wanted to know how many movies are in the data per year. Typically, I’d code something like the following:
import pandas as pd
df = pd.read_csv('./data/regex_imdb.csv').fillna(0)
df[['Year', 'Name']].groupby('Year').count()
Getting fancier, suppose I wanted to group by both year and genre. I could do this (note that in this dataset, a multi-genre movie has the multiple genres comma-delimited):
df[['Year', 'Genre', 'Name']].groupby(['Year', 'Genre']).count()
But what if I wanted to do something slightly trickier, like grouping by year and whether or not a film was a comedy? You could add a new boolean column and use that in your grouping:
df['is_comedy'] = df.Genre.str.contains('Comedy')
df[['Year', 'is_comedy', 'Name']].groupby(['Year', 'is_comedy']).count()
However, instead of taking the extra step of adding a new column to your dataframe, you could do that work inline with the pandas map function, especially if you don’t think you’ll use that new column elsewhere:
df[['Year', 'Genre']].groupby(['Year', df.Genre.map(lambda g: 'Comedy' if 'Comedy' in g else 'Some other genre')]).count()
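This works because map returns a Series aligned to the dataframe’s index, and groupby accepts a Series as a grouping key. One small wrinkle: the mapped Series keeps the name Genre, so that’s what the resulting index level will be called. If you want a clearer label, a minor variation is to rename the Series inline:
df[['Year', 'Genre']].groupby(['Year', df.Genre.map(lambda g: 'Comedy' if 'Comedy' in g else 'Some other genre').rename('is_comedy')]).count()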
I have found this approach especially helpful when grouping by timestamps. Suppose you want to group your dataframe by date and hour. That now becomes pretty simple:
from datetime import datetime
d = {'dt': [datetime(2021, 10, 30, 3, 0, 0), datetime(2021, 10, 30, 3, 0, 0),
            datetime(2021, 10, 30, 3, 0, 0), datetime(2021, 10, 30, 4, 0, 0),
            datetime(2021, 10, 30, 5, 0, 0), datetime(2021, 10, 30, 5, 0, 0),
            datetime(2021, 10, 31, 3, 0, 0), datetime(2021, 10, 31, 3, 0, 0)],
     'desc': ['some event', 'some other event', 'big event', 'small event',
              'medium sized event', 'nothing to see here', 'event A', 'event B']}
df_events = pd.DataFrame(d)
df_events.groupby([df_events.dt.map(lambda d: datetime.date(d)), df_events.dt.map(lambda d: d.hour)]).count()  # events per calendar date and hour
One important note: initially, I assumed my datetime values had a date property that I could use the same way I use their hour property:
df_events.groupby([df_events.dt.map(lambda d: d.date)]).count()
Unfortunately, that command will throw a strange error: unlike hour, date is a method rather than an attribute, so d.date hands groupby a bound method object instead of an actual date. Instead, you’ll have to cast your values to a date, as datetime.date() does in the example above.
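Equivalently, you can just call the method inside the lambda; this sketch should produce the same grouping as the date/hour example above:
df_events.groupby([df_events.dt.map(lambda d: d.date()), df_events.dt.map(lambda d: d.hour)]).count()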