I do a lot of work with timestamps in Pandas dataframes and often use the to_datetime function to cast my timestamps, usually read in as strings, into proper datetime objects. Here’s a simple scenario where I cast a string timestamp column into a datetime:
import pandas as pd
d = [{'ts': '2023-01-01 12:05:00'}, {'ts': '2023-01-09 13:23:00'}, {'ts': '2023-01-11 08:37:00'}, {'ts': '2023-01-13 15:45:00'}]
df = pd.DataFrame(d)
pd.to_datetime(df.ts)
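One thing worth spelling out: to_datetime returns a new Series and leaves the dataframe untouched, so the result has to be assigned back to keep it. Here is a small sketch of that, where passing an explicit format string is my own addition (it is optional, and the sample rows are just a shortened copy of the ones above):

```python
import pandas as pd

# Two of the sample rows from above
df = pd.DataFrame({'ts': ['2023-01-01 12:05:00', '2023-01-09 13:23:00']})

# Assign the converted Series back; the explicit format is optional
# and is included here only to make the expected layout obvious
df['ts'] = pd.to_datetime(df['ts'], format='%Y-%m-%d %H:%M:%S')

print(df.dtypes)
```

After this, the ts column holds datetimes rather than strings, so the .dt accessor becomes available on it.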

In the above, I explicitly pass the column “ts” to the function. However, I recently discovered another way to use to_datetime where you don’t have to be so explicit:
d1 = [{'year':2023,'month':1, 'day': 1},{'year':2023,'month':2, 'day': 9},{'year':2023,'month':3, 'day': 11},{'year':2023,'month':4, 'day': 13}]
df1 = pd.DataFrame(d1)
pd.to_datetime(df1)

Passing a dataframe to the function with columns named as year, month, and day seems to be enough to get the function to do its thing. That’s pretty cool!
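As far as I can tell, year, month, and day are the required columns, and time-of-day columns such as hour and minute can ride along as well. The sketch below uses a made-up dataframe (the name parts and its values are only for illustration) and also shows what happens when the day column is missing:

```python
import pandas as pd

# Hypothetical sample data with optional hour and minute columns
parts = pd.DataFrame({
    'year': [2023, 2023],
    'month': [1, 2],
    'day': [1, 9],
    'hour': [12, 13],
    'minute': [5, 23],
})

# The component columns are assembled into one datetime Series
assembled = pd.to_datetime(parts)
print(assembled)

# Without a day column the assembly is rejected with a ValueError
try:
    pd.to_datetime(parts.drop(columns='day'))
    missing_day_ok = True
except ValueError as err:
    missing_day_ok = False
    print(f"could not assemble: {err}")
```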
The Assign function: at long last
I knew there had to be a way to add new columns inline (in a chain of commands) to a Pandas dataframe. The assign function is a way to do that.
Suppose I only have year and month columns in my dataframe. I can use assign to add a day column and perform my datetime conversion:
d2 = [{'year':2023,'month':1},{'year':2023,'month':2},{'year':2023,'month':3},{'year':2023,'month':4}]
df2 = pd.DataFrame(d2)
pd.to_datetime(df2.assign(day=1))
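Because assign returns a new dataframe instead of modifying the original, it slots nicely into a longer chain. Here is a rough sketch of one way to do that, where the date column name and the final drop are my own choices for illustration:

```python
import pandas as pd

d2 = [{'year': 2023, 'month': 1}, {'year': 2023, 'month': 2}]
df2 = pd.DataFrame(d2)

result = (
    df2
    .assign(day=1)  # add a constant day column
    # a callable receives the intermediate dataframe, including the new day column
    .assign(date=lambda x: pd.to_datetime(x[['year', 'month', 'day']]))
    .drop(columns='day')  # the helper column is no longer needed
)

print(result)
print(df2.columns.tolist())  # the original dataframe is unchanged
```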

to_period: another useful datetime-related function
Recently I was working with a dataset where the events only had year/month values. The to_period function is there to help with such situations:
d3 = [{'year':2023,'month':1},{'year':2023,'month':2},{'year':2023,'month':3},{'year':2023,'month':4}]
df3 = pd.DataFrame(d3)
pd.to_datetime(df3.assign(day=1)).dt.to_period('M')
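To see what the period values look like and how to get back out of them, here is a short sketch. The variable names are mine, and the sample is a trimmed copy of the data above:

```python
import pandas as pd

df3 = pd.DataFrame({'year': [2023, 2023], 'month': [1, 2]})

# Monthly periods: each value stands for a whole month, not a single instant
periods = pd.to_datetime(df3.assign(day=1)).dt.to_period('M')

# String labels such as '2023-01', handy for grouping or display
labels = periods.astype(str)

# Convert back to timestamps at the start of each period
month_starts = periods.dt.to_timestamp()

print(periods)
print(labels.tolist())
print(month_starts)
```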

How intuitive is to_datetime?
Just how intuitive is this less explicit way of using to_datetime? Can it read and cast month names? The answer is: it depends. Pandas version 1.3.5 doesn’t like month names and raises a ValueError:
d4 = [{'year':2023,'month':'January'},{'year':2023,'month':'February'},{'year':2023,'month':'March'},{'year':2023,'month':'April'}]
df4 = pd.DataFrame(d4)
pd.to_datetime(df4.assign(day=1)).dt.to_period('M')
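If you want to see the failure without stopping a script, the call can be wrapped in a try/except. This is only a sketch of what I would expect on a version that rejects month names; the failed flag is a name I made up for the example:

```python
import pandas as pd

d4 = [{'year': 2023, 'month': 'January'}, {'year': 2023, 'month': 'February'}]
df4 = pd.DataFrame(d4)

try:
    pd.to_datetime(df4.assign(day=1))
    failed = False
except ValueError as err:
    # Month names cannot be treated as numeric date components
    failed = True
    print(f"conversion failed: {err}")
```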

However, I’ve found that earlier versions of Pandas will successfully parse month names. To resolve the ValueError on versions that reject month names, add a line of code that converts the month names to their numeric values:
from datetime import datetime
d4 = [{'year':2023,'month':'January'},{'year':2023,'month':'February'},{'year':2023,'month':'March'},{'year':2023,'month':'April'}]
df4 = pd.DataFrame(d4)
df4['month'] = df4.month.apply(lambda m: datetime.strptime(m, '%B').month)
pd.to_datetime(df4.assign(day=1)).dt.to_period('M')
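Another option, which swaps the dataframe-assembly trick for plain string parsing, is to glue the year and month together and hand to_datetime an explicit format. Treat this as a sketch: it assumes English month names and a locale in which %B matches them.

```python
import pandas as pd

d4 = [{'year': 2023, 'month': 'January'}, {'year': 2023, 'month': 'February'},
      {'year': 2023, 'month': 'March'}, {'year': 2023, 'month': 'April'}]
df4 = pd.DataFrame(d4)

# Build strings like '2023 January' and parse them in one vectorised call;
# %B is the full month name, and a missing day defaults to the 1st
dates = pd.to_datetime(df4['year'].astype(str) + ' ' + df4['month'], format='%Y %B')

print(dates.dt.to_period('M'))
```

This avoids the row-by-row apply, which may matter on larger dataframes.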

So, here are yet more ways to leverage Pandas with your timestamps!