The end of the year is a traditional time to reflect on and assess one’s actions over the past twelve months. So, what better time to do a little analysis of what I’ve been posting on this blog?

Getting my blog data

As far as I can tell, I have no way to download summary information on my posts from the WordPress console; however, some information–title, category, tags, publishing date, etc.–is available in a table in the Posts section of the console. So, I used the handy Table-to-Excel browser extension to copy the contents of the table to a CSV file that I could later process with Python.

Parsing the raw data

The blog data from my administration console didn’t copy over very cleanly. Here’s some code I wrote to clean up the data and get it into a dataframe that’s easier to work with later:

import pandas as pd

blog_data = []

with open('./data/raw_post_data.txt', 'rb') as f:
    for raw_line in f:
        line = raw_line.decode("utf-8")
        title = line.split('false')[0]  # do some initial trimming of the row
        data_part = line[line.find('Brad')+4:]  # splitting on the "author" value
        data_list = data_part.split('\t')
        blog_data.append([title.strip(), data_list[1].strip(), data_list[2].strip(), data_list[5].strip()])
    
df_blog_data = pd.DataFrame(blog_data[1:], columns=['title', 'categories', 'tags', 'published'])
df_blog_data = df_blog_data[df_blog_data.title!='All']  # remove the header row from the dataframe
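The loop above trusts every row to contain both the 'false' marker and the author name, and to have at least six tab-separated fields after it. Here is a minimal sketch of the same parsing logic pulled into a helper that skips rows that don't fit. The helper name `parse_post_row` and the sample rows are invented for illustration; the real export's layout may differ.

```python
def parse_post_row(line, author='Brad'):
    """Split one copied table row into [title, categories, tags, published].

    Returns None when the row lacks the expected markers or columns.
    """
    if 'false' not in line or author not in line:
        return None
    title = line.split('false')[0].strip()
    # Everything after the author value is tab-separated column data.
    fields = line[line.find(author) + len(author):].split('\t')
    if len(fields) < 6:
        return None
    return [title, fields[1].strip(), fields[2].strip(), fields[5].strip()]


# Invented rows: one header-like line and one post line (assumed layout).
header = 'All\tTitle\tAuthor\tCategories\tTags\n'
sample = 'My First Post false\tBrad\tTechnology\tpython, tools\t0\t0\tPublished 2019/12/30\n'

rows = [parse_post_row(raw) for raw in (header, sample)]
rows = [row for row in rows if row is not None]
```

With a guard like this, header and malformed rows are dropped during parsing rather than filtered out of the dataframe afterward.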

Afterward, I cleaned up my dataframe a little and added a few more columns:

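from datetime import datetime

# the date is the second whitespace-separated token of the "published" value
df_blog_data['publish_date'] = df_blog_data.published.apply(lambda p: datetime.strptime(p.split()[1], '%Y/%m/%d'))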
df_blog_data['year'] = df_blog_data.publish_date.apply(lambda p: p.year)
df_blog_data['month'] = df_blog_data.publish_date.apply(lambda p: p.month)
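The same three columns can also be built without `apply`, using pandas' vectorized string and datetime accessors. This is only a sketch: the sample values below are invented, and they assume the "published" text looks like a status word followed by a `YYYY/MM/DD` date, which is what the parsing code above implies.

```python
import pandas as pd

# Invented sample standing in for the real "published" column.
sample = pd.DataFrame({'published': ['Published 2019/12/30', 'Published 2018/06/02']})

# Take the second whitespace-separated token, then parse it as a date.
date_text = sample['published'].str.split().str[1]
sample['publish_date'] = pd.to_datetime(date_text, format='%Y/%m/%d')

# The .dt accessor exposes the date parts for the whole column at once.
sample['year'] = sample['publish_date'].dt.year
sample['month'] = sample['publish_date'].dt.month
```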

Time for some analysis

With a relatively manageable dataframe, I can generate some charts and do a little analysis. With the following code, I take a look at how prolific I’ve been with blogging:

import matplotlib.pyplot as plt

width = 0.3
fig, ax = plt.subplots(figsize=(10, 6))

df_blog_data[df_blog_data.year==2019].groupby(['month']).count().iloc[:,[0]].plot(kind='bar', ax=ax, width=width, position=0, color='orange')
df_blog_data[df_blog_data.year==2018].groupby(['month']).count().iloc[:,[0]].plot(kind='bar', ax=ax, width=width, position=1, color='blue')

_ = ax.set_title('Number of Blog Posts: 2018 - 2019')
_ = ax.set_ylabel('Number of Blog Posts')
l = ax.legend()
l.get_texts()[0].set_text('2019')
l.get_texts()[1].set_text('2018')

…and the results:

The number of blog posts I’ve written over the last two years

Well, I clearly peaked six months into the life of this website, and it’s been downhill from there. At least in 2019, I think I’ve pretty consistently delivered three posts a month.
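One caveat about the chart code above: pandas draws bar charts at positions 0, 1, 2, and so on in row order and uses the index only for tick labels. If one year has no posts in some months, two separate `plot` calls on the same axes can put different months side by side. A minimal sketch of one way around that is to build a single month-by-year table and plot it once. The sample data and the `posts_per_month` helper are invented for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented sample standing in for df_blog_data; 2018 has no January here.
sample = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd', 'e'],
    'year':  [2018, 2018, 2019, 2019, 2019],
    'month': [3, 3, 1, 3, 12],
})


def posts_per_month(df):
    """Return a 12-row table of post counts with one column per year."""
    counts = df.groupby(['month', 'year']).size().unstack('year', fill_value=0)
    # Reindex so every month 1..12 is present, even those with zero posts.
    return counts.reindex(range(1, 13), fill_value=0)


counts = posts_per_month(sample)

fig, ax = plt.subplots(figsize=(10, 6))
# One call draws both years side by side and labels the legend from the columns.
counts.plot(kind='bar', ax=ax, width=0.6, color=['blue', 'orange'])
ax.set_title('Number of Blog Posts: 2018 - 2019')
ax.set_xlabel('month')
ax.set_ylabel('Number of Blog Posts')
```

Because the legend entries come from the column names, there is no need to rewrite the legend text by hand.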

So, what sort of content have I been delivering? Categories and tags should tell this story. For the most part, I’ve tried to assign only one category per blog post, but not always. So, to try to get an idea of how often I’ve used each category on the site, I had to do a little gymnastics to pull out each category separately and report each count. Here’s the code I came up with:

df_cats = pd.DataFrame( ','.join( df_blog_data.categories.tolist()).replace(' ', '').split(','), columns=['category'])
fig, ax = plt.subplots(figsize=(10, 6))

_ = df_cats.groupby('category').size().plot(kind='barh', ax=ax, color='mediumpurple')
_ = ax.set_title('Categories used for blog posts: 2018 - 2019')

This blog is clearly heavily weighted toward technology. I also have an Uncategorized category in there, which means I forgot to categorize one of my previous posts. I definitely need to work on adding more general and genealogy-type posts just to keep things interesting.
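A side note on the category-splitting line above: `.replace(' ', '')` strips every space, so a multi-word name would be squashed into one word. Here is a minimal sketch of an alternative that splits on commas and trims only the surrounding whitespace, using `Series.explode` (available in pandas 0.25 and later). The sample categories and the `count_labels` helper are invented for illustration.

```python
import pandas as pd

# Invented sample standing in for df_blog_data.categories.
sample = pd.DataFrame({
    'categories': ['Technology', 'General, Technology', 'Data Science'],
})


def count_labels(series):
    """Count comma-separated labels, keeping spaces inside each label."""
    labels = series.str.split(',').explode().str.strip()
    # Drop empty strings left behind by stray or trailing commas.
    labels = labels[labels != '']
    return labels.value_counts()


cat_counts = count_labels(sample['categories'])
```

The resulting counts can be charted with `cat_counts.plot(kind='barh')`, and the same helper would work on the tags column.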

To analyze my use of tags, I wrote roughly the same sort of code:

df_tags = pd.DataFrame( ','.join( df_blog_data.tags.tolist()).replace(' ', '').split(','), columns=['tag'])
fig, ax = plt.subplots(figsize=(10, 6))

_ = df_tags.groupby('tag').size().sort_values().plot(kind='barh', ax=ax, color='green')
_ = ax.set_title('Tags used for blog posts: 2018 - 2019')

Well, I do like tools–especially the software kind! I had feared that Python would be a dominating topic, but it’s not as bad as I thought, and even the parenting topic is a close fourth. In the future, I would like to write more about the college experience, as I have recently become the parent of a college student and will add another to that list in the not-too-distant future. I must also write more on the podcast topic, as I make heavy use of that medium during my lengthy commutes to and from work. And so, here’s to more quality posts in 2020!