Dealing with missing dates in your dataframes

I work with a lot of time-based data and occasionally have to work with data where there are chunks of missing time. For example, consider my kindle “free time” reading data. May 2019 was a good reading month for me, so I’d like to take a closer look at that month and chart out the ebb and flow of my reading time by day. To start with, I’ll just take a look at my reading times by day for the month:

cols = ['accessdate', 'read_mins']
df_may = df_dayinfo[(df_dayinfo.accessdate >= '2019-05-01') & (df_dayinfo.accessdate <= '2019-05-31')][cols]
df_may

My reading minutes for May 2019: there are a few days when I didn’t read at all

If I were to chart that data out, it wouldn’t quite be accurate as there are gaps in the month:

fig, ax = plt.subplots(figsize=(15, 6))
_ = df_may.groupby('accessdate').sum().plot(ax=ax, marker='o')
_ = ax.axhline(y=0.0, xmin=0.0, xmax=1.0, color='gray', ls='--')
_ = ax.set_title('My Reading Time for May 2019')
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Reading Time (minutes)')

In the past, to accommodate for these missing days, I’d build a second dataframe of all the days in the month and 0.0 read minutes. Then, I’d merge the dataframes together so that I would have entries for all the days of the month:

# create a dataframe with all the days of the month and 0.0 read times
start = datetime(2019, 5, 1)
end = datetime(2019, 6, 1)
may_zeros = [[start + timedelta(days=x), 0.0] for x in range(0, (end-start).days)]
df_may_zeros = pd.DataFrame(may_zeros, columns=['accessdate', 'read_mins'])

# now, merge my 0.0 read time df with my actual data to get a full representation of the month
df_may1 = pd.concat([df_may, df_may_zeros]).groupby('accessdate').sum()

# finally, create the chart
fig, ax = plt.subplots(figsize=(15, 6))
_ = df_may1.groupby('accessdate').sum().plot(ax=ax)
_ = ax.axhline(y=0.0, xmin=0.0, xmax=1.0, color='gray', ls='--')
_ = ax.set_title('My Reading Time for May 2019')
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Reading Time (minutes)')

Reading times in May including missed days

So, problem solved, but it turns out Pandas has an even better way to solve this problem: use Pandas’ date_range function along with reindex:

# use date_range to create an index for every day in May 2019
idx = pd.date_range('05-01-2019', '05-31-2019')
# now, group my real data by day, reindex it with the days in May, and fill any missing values with 0
df_may2 = df_may.groupby('accessdate').sum().reindex(idx, fill_value=0)

# now, we can create the chart
fig, ax = plt.subplots(figsize=(15, 6))

# i can overlay my original chart to see the differences if I want
# _ = df_may.groupby('accessdate').sum().plot(ax=ax, color='r')
_ = df_may2.plot(ax=ax, marker='o')
_ = ax.axhline(y=0.0, xmin=0.0, xmax=1.0, color='gray', ls='--')
_ = ax.set_title('My Reading Time for May 2019')
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Reading Time (minutes)')

The same results with the aid of date_range

If you don’t want to assume your missing days are simply 0.0 values, reindex will, by default, fill the values with NaN. You could then run interpolate over your dataframe and calculate some other value.

In addition to date_range, Pandas has lots of other general-purpose functions worth checking out. So now you have two ways to fill in missing dates in your dataframes!

Dealing with missing dates in your dataframes

Recent Posts

Recent Comments

Archives

Meta

Dealing with missing dates in your dataframes

Recent Posts

Recent Comments

Archives

Tags

Meta