I work with a lot of time-based data and occasionally have to work with data where there are chunks of missing time. For example, consider my kindle “free time” reading data. May 2019 was a good reading month for me, so I’d like to take a closer look at that month and chart out the ebb and flow of my reading time by day. To start with, I’ll just take a look at my reading times by day for the month:
cols = ['accessdate', 'read_mins']
df_may = df_dayinfo[(df_dayinfo.accessdate >= '2019-05-01') & (df_dayinfo.accessdate <= '2019-05-31')][cols]
df_may
If I were to chart that data out, it wouldn’t quite be accurate as there are gaps in the month:
fig, ax = plt.subplots(figsize=(15, 6))
_ = df_may.groupby('accessdate').sum().plot(ax=ax, marker='o')
_ = ax.axhline(y=0.0, xmin=0.0, xmax=1.0, color='gray', ls='--')
_ = ax.set_title('My Reading Time for May 2019')
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Reading Time (minutes)')
In the past, to accommodate for these missing days, I’d build a second dataframe of all the days in the month and 0.0 read minutes. Then, I’d merge the dataframes together so that I would have entries for all the days of the month:
# create a dataframe with all the days of the month and 0.0 read times
start = datetime(2019, 5, 1)
end = datetime(2019, 6, 1)
may_zeros = [[start + timedelta(days=x), 0.0] for x in range(0, (end-start).days)]
df_may_zeros = pd.DataFrame(may_zeros, columns=['accessdate', 'read_mins'])
# now, merge my 0.0 read time df with my actual data to get a full representation of the month
df_may1 = pd.concat([df_may, df_may_zeros]).groupby('accessdate').sum()
# finally, create the chart
fig, ax = plt.subplots(figsize=(15, 6))
_ = df_may1.groupby('accessdate').sum().plot(ax=ax)
_ = ax.axhline(y=0.0, xmin=0.0, xmax=1.0, color='gray', ls='--')
_ = ax.set_title('My Reading Time for May 2019')
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Reading Time (minutes)')
So, problem solved, but it turns out Pandas has an even better way to solve this problem: use Pandas’ date_range function along with reindex:
# use date_range to create an index for every day in May 2019
idx = pd.date_range('05-01-2019', '05-31-2019')
# now, group my real data by day, reindex it with the days in May, and fill any missing values with 0
df_may2 = df_may.groupby('accessdate').sum().reindex(idx, fill_value=0)
# now, we can create the chart
fig, ax = plt.subplots(figsize=(15, 6))
# i can overlay my original chart to see the differences if I want
# _ = df_may.groupby('accessdate').sum().plot(ax=ax, color='r')
_ = df_may2.plot(ax=ax, marker='o')
_ = ax.axhline(y=0.0, xmin=0.0, xmax=1.0, color='gray', ls='--')
_ = ax.set_title('My Reading Time for May 2019')
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Reading Time (minutes)')
If you don’t want to assume your missing days are simply 0.0 values, reindex will, by default, fill the values with NaN. You could then run interpolate over your dataframe and calculate some other value.
In addition to date_range, Pandas has lots of other general-purpose functions worth checking out. So now you have two ways to fill in missing dates in your dataframes!
Recent Comments