Here’s a rather unconventional commencement address from Mike Rowe:
I like both his gentle irreverence and the simple truths of his points.
How do you succeed professionally? Mr. Rowe makes some suggestions:
Practice your craft every day.
Become indispensable to your employer.
Show up early.
Stay late.
Distinguish yourself on the job at every opportunity.
No matter what your child does after high school, these are solid principles for success.
If you’re a skilled tradesperson with an entrepreneurial spirit, a willingness to get dirty, a disposition to travel, and a burning curiosity to learn all that you can…I’m telling you, your opportunity to prosper has never been better.
Mike Rowe
I think it’s great that he references an “entrepreneurial spirit”…I wish he had repeated that point a few more times. It shouldn’t be a given that we must all toil for an employer when there are opportunities to be our own. Along with giving equal voice to the trades, it would be great if our educational institutions gave ample voice to entrepreneurship as well.
Occasionally, I have to work with more complicated forms of data where a given column might contain an array of values instead of conventional data types like strings, integers, and timestamps. Often, such complex fields can easily be processed with the excellent literal_eval function from Python’s Abstract Syntax Trees (ast) module. For example, take this movie database from Kaggle:
import pandas as pd
from ast import literal_eval

# load the credits data and parse the stringified cast column into real Python objects
df = pd.read_csv('./data/tmdb_5000_credits.csv')
df['cast'] = df.cast.apply(literal_eval)
df.head()
And when we take a look at that cast column, specifically:
type(df.loc[0,'cast'])
# this prints out "list" as the type
Basically, literal_eval converted each string of brackets and curly braces into a list of dictionary objects. Very cool and handy.
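To see what literal_eval does in isolation, here’s a minimal sketch with a made-up cast string (the field names below are just for illustration, not taken from the actual dataset):

from ast import literal_eval

# a made-up string in the same shape as the cast column: a list of dictionaries
sample = "[{'cast_id': 1, 'name': 'Some Actor'}, {'cast_id': 2, 'name': 'Another Actor'}]"
parsed = literal_eval(sample)
print(type(parsed))       # list
print(type(parsed[0]))    # dict
print(parsed[0]['name'])  # Some Actor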
However, in real life, you don’t always have such clean data to work with. Recently, I was working with some complex data, tried to use literal_eval on it, and got an error like the following:
ValueError: malformed node or string: <_ast.Name object at 0x000001F62A851FC8>
Whoa. There’s something about this data that literal_eval doesn’t like. I wonder what it is? Let me do a little exception handling around literal_eval to see if I can find the row(s) where the function is breaking:
def try_the_eval(row):
    try:
        literal_eval(row.cast)
    except Exception:  # literal_eval raises ValueError/SyntaxError on malformed input
        print('Found bad data at row: {0}'.format(row.name))

df_dirty = pd.read_csv('./data/dirty_data.csv')
_ = df_dirty.apply(try_the_eval, axis=1)
This code told me I had badly formatted data in row 32 of my dataset. I can take a closer look at that data like so:
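# row 32 is the row flagged by try_the_eval above
print(df_dirty.loc[32, 'cast'])
The culprit here is a bare, JSON-style null inside the field; null is not a valid Python literal, so literal_eval chokes on it.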
So, how do you fix such an issue? I can think of at least two ways:
Wrap null values like these in double quotes, or
Replace null values like these with None.
Here, I go with Option #2:
df_dirty['cast'] = df_dirty.cast.str.replace('null', 'None')
df_dirty['cast'] = df_dirty.cast.apply(literal_eval)
# no exceptions are thrown and the below line prints out "list"
print(type(df_dirty.loc[32,'cast']))
One caveat here is that you need to be careful about how you search and replace the errant data. For example, the word null without surrounding quotation marks might appear legitimately in other fields, like descriptions. You don’t want to accidentally replace such legitimate uses of the word with None or surround them with quotation marks, as that will introduce new problems. So, you may need to come up with slightly more rigorous ways of finding and replacing your errant data, such as using regular expressions. Once you deal with that data, though, literal_eval works like a charm and saves you lots of time working with complex datasets!
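For instance, here’s one possible sketch of a more targeted, regex-based replacement. It assumes the bare nulls only ever appear as dictionary values (i.e., following a colon), which is an assumption you’d want to verify against your own data:

import pandas as pd
from ast import literal_eval

# start over from the raw file so the column is still a string
df_dirty = pd.read_csv('./data/dirty_data.csv')
# replace null only when it appears as a bare dictionary value, e.g. 'character': null,
# (illustrative pattern; check it against your own data before trusting it)
df_dirty['cast'] = df_dirty.cast.str.replace(r':\s*null\s*(?=[,}\]])', ': None', regex=True)
df_dirty['cast'] = df_dirty.cast.apply(literal_eval)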
Like most people, I use a great many of the free applications provided by Google. Unlike a lot of people, I like to occasionally collect data on how I’m using some of these applications, especially Gmail.
Unfortunately, there do not appear to be easy ways to collect usage data from Gmail; however, Google does provide a scripting platform, Google Apps Script, that allows you to code against Google’s various APIs and get at most of the data you might be interested in.
Recently, I found myself interested in analyzing the categories of email coming into my inbox. How could I collect simple email properties like date, subject line, and category and write the results to a CSV for further analysis?
Step 1: Launch a new spreadsheet
Navigate to Google Docs in your browser and launch a new spreadsheet. This tutorial was excellent help.
Step 2: Launch the script editor
Following that tutorial, open up a new instance of the Script Editor under the Tools menu. This step is pretty important because it seems to “bind” your scripting work to the new spreadsheet. Initially, I simply started a new scripting project in Apps Script apart from any spreadsheet and found no way to get that work to write to any of my spreadsheets.
Step 3: Code away
In this endeavor, my goal was pretty simple: grab particular email metadata and write it to a spreadsheet. This required coding against both the GmailApp and SpreadsheetApp APIs. Here’s the code I ultimately came up with (I believe this is vanilla JavaScript):
function write_gmail_stats() {
  // get reference to the active spreadsheet
  var sheet = SpreadsheetApp.getActiveSheet();

  // get collection of email categorized as "promotions" sent after 31 Mar 2020
  var promos = GmailApp.search('category:promotions after:2020/03/31');
  // iterate through the email collection and write date, subject, and category to the spreadsheet
  for (var i = 0; i < promos.length; i++) {
    var m = promos[i];
    var d = m.getLastMessageDate();
    var s = m.getFirstMessageSubject();
    sheet.appendRow([d, s, "promotions"]);
  }

  var social = GmailApp.search('category:social after:2020/03/31');
  // iterate through the email collection and write date, subject, and category to the spreadsheet
  for (var i = 0; i < social.length; i++) {
    var m = social[i];
    var d = m.getLastMessageDate();
    var s = m.getFirstMessageSubject();
    sheet.appendRow([d, s, "social"]);
  }

  var primary = GmailApp.search('category:primary after:2020/03/31');
  // iterate through the email collection and write date, subject, and category to the spreadsheet
  for (var i = 0; i < primary.length; i++) {
    var m = primary[i];
    var d = m.getLastMessageDate();
    var s = m.getFirstMessageSubject();
    sheet.appendRow([d, s, "primary"]);
  }
}
This code takes the date and subject of email in Gmail’s three default categories, along with the category name, and writes the results to my spreadsheet. To keep my results modest, I crafted a search query to only look at email from April 2020 on (this support page was very helpful in figuring out the query I needed). The results worked out pretty well:
Sweet! This certainly beats screen scraping or manual data collecting.
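As for the “further analysis” part, once the sheet is downloaded as a CSV (File > Download in Google Sheets), a few lines of pandas will tally the categories. The gmail_stats.csv filename and the lack of a header row are my own assumptions here, matching the three columns the script appends:

import pandas as pd

# hypothetical export of the spreadsheet; the script writes date, subject, category per row
df = pd.read_csv('./data/gmail_stats.csv', names=['date', 'subject', 'category'], parse_dates=['date'])
print(df['category'].value_counts())         # email volume per category
print(df.groupby('category')['date'].max())  # most recent email in each category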