Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Tracking your reading time

I’ve alluded to my interest in reading a few times in the past. Several years ago, I made the switch from physical books to digital and use an Amazon Kindle as my main reading vehicle.

One frustration I have with the Kindle, though, is that it either doesn't track the reading-time metrics I'm interested in collecting or does a poor job of sharing those metrics with data nerds like me.

Earlier in the year, I decided to spend more than five minutes solving this problem and discovered Kindle FreeTime. Kindle FreeTime is an application on Kindle devices whose primary focus is getting kids to read. Parents can use FreeTime to decide what books their children can read and to set minimum daily reading goals for them to meet. A side benefit of FreeTime, though, is that it captures many of the metrics I'm interested in and stores them in a SQLite database: all you have to do is plug your Kindle into your workstation, download the database at system\freetime\freetime.db, and start exploring.
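
Before diving into pandas, a quick way to see what's inside the database is to ask SQLite itself for its table names via the sqlite_master table. Here's a sketch; since I can't ship my freetime.db, it builds a throwaway in-memory stand-in, but the query itself works unchanged against the real file:

```python
import sqlite3

# stand-in for freetime.db: an in-memory database with a dayinfo table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE dayinfo (accessdate TEXT, timeread INTEGER)')

# this same query lists the tables available in the real FreeTime database
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table';")]
print(tables)  # → ['dayinfo']
conn.close()
```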

Dayinfo

One of the tables in the FreeTime database is dayinfo. This is probably a good place to start gathering some general reading metrics. Here’s how I went about digging into the data.

Load all the standard packages

In my notebook, I started by loading all the normal packages I use including the sqlite3 package:

import pandas as pd
import numpy as np
import sqlite3
from datetime import datetime
import matplotlib.pyplot as plt

%matplotlib inline

Load and clean the data from the table

Next, I queried the data from the dayinfo table and added a few helpful columns:

conn = sqlite3.connect('./data/freetime.db')
query = "SELECT * FROM dayinfo;"

df_dayinfo = pd.read_sql_query(query,conn)

# clean up fields and do some feature engineering
df_dayinfo['accessdate'] = pd.to_datetime(df_dayinfo.accessdate)
df_dayinfo['access_month'] = df_dayinfo.accessdate.dt.month
df_dayinfo['access_dow'] = df_dayinfo.accessdate.dt.dayofweek
df_dayinfo['read_mins'] = df_dayinfo.timeread / 60
df_dayinfo['read_hours'] = df_dayinfo.timeread / 3600

Calculate some preliminary metrics

Finally, I wanted to calculate my total reading time for the year 2019 and my average daily reading time. I only started using FreeTime in March 2019, so I had to pro-rate some of my calculations. Here’s what I came up with:

df_dayinfo_2019 = df_dayinfo[(df_dayinfo.accessdate > datetime(2019, 1, 1)) & (df_dayinfo.accessdate < datetime(2020, 1, 1))]
days_in_2019 = (df_dayinfo_2019.accessdate.max() - df_dayinfo_2019.accessdate.min()).days

print('From {0:%d %b %Y} to {1:%d %b %Y} ({2} days):'.format(df_dayinfo_2019.accessdate.min(), 
                                                             df_dayinfo_2019.accessdate.max(), days_in_2019))
print('I read {0:.2f} hours'.format(df_dayinfo_2019.read_hours.sum()))
print("That's an average of {0:.2f} minutes per day".format((df_dayinfo_2019.read_mins.sum())/days_in_2019))
From 10 Mar 2019 to 29 Dec 2019 (294 days):
I read 111.18 hours
That's an average of 22.69 minutes per day

Bah! Only 22 minutes of reading time per day on average?! Well, I know one goal I'll need to work on for 2020. Let's see what this data looks like in some charts:

fig, ax = plt.subplots(figsize=(10, 8))
df_dayinfo_2019[['access_month', 'read_hours']].sort_values('access_month').groupby('access_month').sum().plot.bar(ax=ax)
_ = ax.set_title('Hours Read by Month: {0:%d %b %Y} to {1:%d %b %Y}'.format(df_dayinfo_2019.accessdate.min(), 
                                                                            df_dayinfo_2019.accessdate.max()))
_ = ax.set_xlabel('Month')
_ = ax.set_ylabel('Hours')

My monthly reading totals starting in March: May was a good month

fig, ax = plt.subplots(figsize=(10, 8))
df_dayinfo_2019[['access_dow', 'read_hours']].sort_values('access_dow').groupby('access_dow').sum().plot.bar(ax=ax)
_ = ax.set_title('Hours Read by Day of Week: {0:%d %b %Y} to {1:%d %b %Y}'.format(df_dayinfo_2019.accessdate.min(), 
                                                                                  df_dayinfo_2019.accessdate.max()))
_ = ax.set_xlabel('Day of Week')
_ = ax.set_ylabel('Hours')

_ = ax.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

I read the most on Wednesdays. That makes sense because most of my Wednesday evenings in Spring are sitting in the parking lot outside a bandroom while my kid practices with his middle school orchestra. I get a lot of reading time in on those days.

There are other tables in the database including details on each of the books that I’ve read over the year. Hopefully, at some point, I’ll dig in to those details, as well. But for now, this data is sufficient to get me motivated to read more in 2020.

Assessing my Posts

The end of the year is a traditional time to reflect and assess one's actions over the past twelve months. So, what better time to do a little analysis of what I've been posting on this blog?

Getting my blog data

As far as I can tell, I have no way to download summary information on my posts from the WordPress console; however, some information (title, category, tags, publishing date, and so on) is available in a table in the Posts section of the console. So, I used the handy Table-to-Excel browser extension to copy the contents of that table to a CSV file that I could later process with Python.

Parsing the raw data

The blog data from my administration console didn't copy down very cleanly. Here's some code I wrote to tidy the data and get it into a dataframe for easier analysis later:

blog_data = []

with open('./data/raw_post_data.txt', 'rb') as f:
    for raw_line in f:
        line = raw_line.decode("utf-8")
        title = line.split('false')[0]  # do some initial trimming of the row
        data_part = line[line.find('Brad')+4:]  # splitting on the "author" value
        data_list = data_part.split('\t')
        blog_data.append([title.strip(), data_list[1].strip(), data_list[2].strip(), data_list[5].strip()])
    
df_blog_data = pd.DataFrame(blog_data[1:], columns=['title', 'categories', 'tags', 'published'])
df_blog_data = df_blog_data[df_blog_data.title!='All']  # remove the header row from the dataframe

Afterward, I cleaned up my dataframe a little and added a few more columns:

df_blog_data['publish_date'] = df_blog_data.published.apply(lambda p: datetime.strptime(p.split()[1], '%Y/%m/%d'))
df_blog_data['year'] = df_blog_data.publish_date.apply(lambda p: p.year)
df_blog_data['month'] = df_blog_data.publish_date.apply(lambda p: p.month)
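
As an aside, the same columns can be derived without Python-level lambdas by parsing the date once with pd.to_datetime and leaning on the .dt accessor. A sketch with a made-up published value (the real strings come from the copied WordPress table, but the second whitespace-separated token is the date either way):

```python
import pandas as pd

# hypothetical sample mimicking the copied WordPress "published" strings
df = pd.DataFrame({'published': ['Published 2019/12/30 at 10:15 am']})

# grab the second token (the date), parse it once, then use the .dt accessor
df['publish_date'] = pd.to_datetime(df.published.str.split().str[1], format='%Y/%m/%d')
df['year'] = df.publish_date.dt.year
df['month'] = df.publish_date.dt.month
```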

Time for some analysis

With a relatively manageable dataframe, I can generate some charts and do a little analysis. With the following code, I take a look at how prolific I’ve been with blogging:

width = 0.3
fig, ax = plt.subplots(figsize=(10, 6))

df_blog_data[df_blog_data.year==2019].groupby(['month']).count().iloc[:,[0]].plot(kind='bar', ax=ax, width=width, position=0, color='orange')
df_blog_data[df_blog_data.year==2018].groupby(['month']).count().iloc[:,[0]].plot(kind='bar', ax=ax, width=width, position=1, color='blue')

_ = ax.set_title('Number of Blog Posts: 2018 - 2019')
_ = ax.set_ylabel('Number of Blog Posts')
l = ax.legend()
l.get_texts()[0].set_text('2019')
l.get_texts()[1].set_text('2018')

…and the results:

The number of blog posts I’ve written over the last two years

Well, I clearly peaked six months into the life of this website and it’s been downhill from there. At least in 2019 I think I’ve pretty consistently delivered three posts a month.

So, what sort of content have I been delivering? Categories and tags should tell this story. For the most part, I’ve tried to assign only one category per blog post, but not always. So, to try to get an idea of how often I’ve used each category on the site, I had to do a little gymnastics to pull out each category separately and report each count. Here’s the code I came up with:

df_cats = pd.DataFrame( ','.join( df_blog_data.categories.tolist()).replace(' ', '').split(','), columns=['category'])
fig, ax = plt.subplots(figsize=(10, 6))

_ = df_cats.groupby('category').size().plot(kind='barh', ax=ax, color='mediumpurple')
_ = ax.set_title('Categories used for blog posts: 2018 - 2019')

This blog is clearly heavily weighted toward technology. I also have an Uncategorized category in there which means I forgot to categorize one of my previous posts. I definitely need to work on adding more general and genealogy-type posts just to keep things interesting.
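
Tracking down that stray uncategorized post is a one-line filter. A sketch against a made-up dataframe standing in for the df_blog_data built earlier (the titles here are hypothetical):

```python
import pandas as pd

# hypothetical stand-in for the df_blog_data dataframe built earlier
df_blog_data = pd.DataFrame({
    'title': ['Post A', 'Post B', 'Post C'],
    'categories': ['Technology', 'Uncategorized', 'Genealogy'],
})

# find any posts I forgot to categorize
stray = df_blog_data[df_blog_data.categories.str.contains('Uncategorized')]
print(stray.title.tolist())  # → ['Post B']
```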

To analyze my use of tags, I wrote roughly the same sort of code:

df_tags = pd.DataFrame( ','.join( df_blog_data.tags.tolist()).replace(' ', '').split(','), columns=['tag'])
fig, ax = plt.subplots(figsize=(10, 6))

_ = df_tags.groupby('tag').size().sort_values().plot(kind='barh', ax=ax, color='green')
_ = ax.set_title('Tags used for blog posts: 2018 - 2019')

Well, I do like tools, especially the software kind! I had feared that python would be a dominating topic, but it's not as bad as I thought, and even the parenting topic is a close fourth. In the future, I would like to write more about the college experience, as I have recently become the parent of a college student and will add another to that list in the not-too-distant future. I should also write more on the podcast topic, as I make heavy use of that medium during my lengthy commutes to and from work. So, here's to more quality posts in 2020!

divmod, for the win!

I had a situation recently where I had a list of values laid out in a grid like so:

I had to figure out the row and column positions for each value.

So, let’s start with a list of numbers:

some_list = list(range(15))

First, how can I easily figure out what row each number belongs to? If you said "mod," you'd be right! You take the number mod the group size, in this case, 5:

group_size = 5
for n in some_list:
    print('Number {0} belongs to row {1}'.format(n, n % group_size))
Number 0 belongs to row 0
Number 1 belongs to row 1
Number 2 belongs to row 2
Number 3 belongs to row 3
Number 4 belongs to row 4
Number 5 belongs to row 0
Number 6 belongs to row 1
Number 7 belongs to row 2
Number 8 belongs to row 3
Number 9 belongs to row 4
Number 10 belongs to row 0
Number 11 belongs to row 1
Number 12 belongs to row 2
Number 13 belongs to row 3
Number 14 belongs to row 4

Now, how do I figure out what column each value belongs to? For that, I need to divide each number by the group size and keep only the integer portion of the result. The easiest way to do that is Python's floor division operator:

group_size = 5
for n in some_list:
    print('Number {0} belongs to column {1}'.format(n, n // group_size))
Number 0 belongs to column 0
Number 1 belongs to column 0
Number 2 belongs to column 0
Number 3 belongs to column 0
Number 4 belongs to column 0
Number 5 belongs to column 1
Number 6 belongs to column 1
Number 7 belongs to column 1
Number 8 belongs to column 1
Number 9 belongs to column 1
Number 10 belongs to column 2
Number 11 belongs to column 2
Number 12 belongs to column 2
Number 13 belongs to column 2
Number 14 belongs to column 2

But I really need both the row and column values together. Sure, I could write my mod operation on one line and my floor division operation on another, but Python has a cool function to do both at the same time, divmod:

group_size = 5
for n in some_list:
    col, row = divmod(n, group_size)
    print('Number {0} belongs at row {1}, column {2}'.format(n, row, col))
Number 0 belongs at row 0, column 0
Number 1 belongs at row 1, column 0
Number 2 belongs at row 2, column 0
Number 3 belongs at row 3, column 0
Number 4 belongs at row 4, column 0
Number 5 belongs at row 0, column 1
Number 6 belongs at row 1, column 1
Number 7 belongs at row 2, column 1
Number 8 belongs at row 3, column 1
Number 9 belongs at row 4, column 1
Number 10 belongs at row 0, column 2
Number 11 belongs at row 1, column 2
Number 12 belongs at row 2, column 2
Number 13 belongs at row 3, column 2
Number 14 belongs at row 4, column 2

But now let’s get more real and use this feature to write out one of the greatest catalogs of all time: the albums of “Weird Al” Yankovic:

import matplotlib.style as style
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline
style.use('seaborn-poster')

group_size = 5
albums = ['"Weird Al" Yankovic (1983)', 
          '"Weird Al" Yankovic in 3-D (1984)',
          'Dare to Be Stupid (1985)',
          'Polka Party! (1986)',
          'Even Worse (1988)',
          'Peter and the Wolf (1988)',
          'UHF - Original Motion Picture\nSoundtrack and Other Stuff (1989)',
          'Off the Deep End (1992)',
          'Alapalooza (1993)',
          'Bad Hair Day (1996)',
          'Running with Scissors (1999)',
          'Poodle Hat (2003)',
          'Straight Outta Lynwood (2006)',
          'Alpocalypse (2011)',
          'Mandatory Fun (2014)']

# set up my grid chart
fig, ax = plt.subplots()
ax.set_xticks(np.arange(0, (len(albums)/group_size) + 1))
ax.set_yticks(np.arange(0, group_size + 1))
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_title('The Catalog of "Weird Al" Yankovic')
plt.grid()

# now, enumerate through the album list and use divmod to get row and column values to write out the album names
for i, album in enumerate(albums):
    col, row = divmod(i, group_size)
    ax.annotate(album, xy=(col+.1, row+.4), xytext=(col+.1, row+.4))

plt.show()
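
As a bonus, divmod is also handy for unit conversions, like turning a raw seconds count (think of the timeread values from my Kindle post) into hours, minutes, and seconds. A quick sketch with a hypothetical total:

```python
# divmod chains nicely for unit conversions: seconds -> hours/minutes/seconds
total_seconds = 81834  # hypothetical reading time in seconds

hours, remainder = divmod(total_seconds, 3600)   # whole hours, leftover seconds
minutes, seconds = divmod(remainder, 60)         # whole minutes, leftover seconds
print('{0} hours, {1} minutes, {2} seconds'.format(hours, minutes, seconds))
# → 22 hours, 43 minutes, 54 seconds
```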

So, divmod: love it, use it! Check out my full code here.


© 2025 DadOverflow.com
