

Pulling public data into dataframes

Some of what I write about here is inspired by challenges I encounter at work. Often the hardest part in describing those challenges is substituting public data and scenarios for my work-specific ones. Sites like kaggle.com and wikipedia.org really come to the rescue.

Recently, I had a circumstance where I needed to process a dataframe in which one of the columns contained a list of values for each field. I think I came up with a clever way of dealing with that obstacle and would like to discuss it in these pages, but first…what sort of public data can I collect to replicate my scenario? How about some English football?! Let’s take some football titles–say, the UEFA Champions League champion, the UEFA Europa League champion, and the UEFA Super Cup champion–and build a dataframe that shows which English football clubs have ever won any of these titles.

Step 1: Collect the raw data for each of these titles

I’ve written a few times before on how to use Python to collect public data from the Internet. This time around, I went with a slightly more manual approach. I went to each of the three Wikipedia pages, hit F12 to bring up my developer tools, and used those tools to copy just the HTML code for the data tables I was interested in–the ones listing each club team and the number of times they won the particular title. I simply copied the HTML to three different data files for subsequent processing.
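
As an aside, this step could be scripted, too: read_html will happily take a URL and return every table on the page as a dataframe. A minimal sketch of that approach–the URL and the table index here are assumptions you’d verify against the live page:

import pandas as pd

# read_html fetches the page and returns a list of dataframes, one per HTML table;
# the right list index depends on the page's current layout
tables = pd.read_html('https://en.wikipedia.org/wiki/UEFA_Champions_League')
df_champions = tables[0]  # hypothetical index--inspect the list to find the titles-by-club table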

Step 2: Read in the data files

The pandas read_html function is such a time saver here. Here’s the code I came up with to read in my data files:

import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# render plots inline (Jupyter notebook)
%matplotlib inline
matplotlib.style.use('seaborn')

df_champions = pd.read_html('./data/uefa_champions.txt')[0]
df_champions = df_champions[df_champions.Titles>0]
df_champions['title'] = 'UEFA Champions League champ'

df_europa = pd.read_html('./data/uefa_europa_league.txt')[0]
df_europa = df_europa[df_europa.Winners>0]
df_europa['title'] = 'UEFA Europa League champ'

df_super = pd.read_html('./data/uefa_super.txt')[0]
df_super = df_super[df_super.Winners>0]
df_super['title'] = 'UEFA Super Cup champ'

The df_champions dataframe looks like this:

The last five records in the df_champions dataframe. Which of these are English teams?

These dataframes are looking good, but how do I know which teams are the English teams? Wikipedia identifies each team’s nationality with a flag icon, but pandas isn’t pulling in that data. Time for a little HTML parsing with BeautifulSoup.

Step 3: Collect the names of the English teams

Since pandas didn’t pull in the nationality information, I had to revisit each of the HTML data files and parse out that information with BeautifulSoup:

epl_teams = []
for filepath in ['./data/uefa_champions.txt', './data/uefa_europa_league.txt', './data/uefa_super.txt']:
    with open(filepath, 'r') as f:
        soup = BeautifulSoup(f, 'html.parser')  # name a parser explicitly to avoid bs4's warning

        # each club name lives in a <th> containing a flag icon; the icon's
        # link title carries the club's nationality
        for th in soup.find_all('th'):
            span = th.find('span', {'class': 'flagicon'})
            if span:
                a = span.find('a', {'title': 'England'})
                if a:
                    epl_teams.append(th.text.strip())

Step 4: Filter each title list down to just the English teams

With the names of the English teams, I can filter my dataframes down accordingly:

df_champions = df_champions[df_champions.Club.isin(epl_teams)]
df_europa = df_europa[df_europa.Club.isin(epl_teams)]
df_super = df_super[df_super.Club.isin(epl_teams)]

Step 5: Merge the dataframes together

I need to merge my three dataframes into a single one. The pandas merge function did the trick:

df_combo = df_champions[['Club','title']].merge(df_europa[['Club','title']], on='Club', how='outer')
df_combo = df_combo.merge(df_super[['Club','title']], on='Club', how='outer')

And the results:

Merging dataframes with columns of the same name forces pandas to add suffixes to those column names
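
By the way, those _x and _y suffixes are just pandas defaults: merge takes a suffixes argument if you’d rather pick your own names. A small sketch of that option (the suffix values here are arbitrary):

# optional: control the suffixes pandas appends to clashing column names
df_combo = df_champions[['Club','title']].merge(df_europa[['Club','title']], on='Club', how='outer',
                                                suffixes=('_ucl', '_uel'))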

Step 6: Combine the “title” columns together into one

The final step–to just get this public data into a shape to replicate my original problem–is to merge the three “title” columns into a single one. Two lines do the deed:

df_combo['title'] = df_combo.apply(lambda r: [t for t in [r.title_x, r.title_y, r.title] if pd.notna(t)], axis=1)  # pd.notna drops the NaNs pandas filled in for non-winners
df_combo = df_combo[['Club', 'title']]
A dataframe with a list of values in the “title” column

Phew. That was a fair amount of work just to pull together some public data to replicate one of my work datasets. And the actual code I wrote to analyze a dataframe containing lists in a column? Well, that will have to wait for a future post.

Iterating over a date range

I leverage a number of different programming and scripting tools. Recently, I found myself in a situation where I had to write code to loop through a range of dates to do some operations, by month, in not one, not two…but three different languages: Scala, Python, and Bash. The coding principles are the same across the technologies, but the syntax sure is different.

Here are code examples in four technologies–I threw in PowerShell for good measure–for looping through a range of dates. I loop by month, but these could easily be adapted to loop by day or year or whatever increment fits your needs.

Scala

import java.time.LocalDate
import java.time.format.DateTimeFormatter

val start = LocalDate.of(2020, 1, 1) // inclusive in loop
val end = LocalDate.of(2020, 9, 1)   // excluded from loop

val template = "This loop is for Year %1$d and Month (zero padded) %2$s\n"

// lazily generate month-start dates until we reach the end date
val date_range = Iterator.iterate(start) { _.plusMonths(1) }.takeWhile(_.isBefore(end))
while (date_range.hasNext) {
	val d = date_range.next()
	val s = template.format(d.getYear, d.format(DateTimeFormatter.ofPattern("MM")))
	print(s)
}

Python

import datetime
import calendar

start = datetime.date(2020, 1, 1)
end = datetime.date(2020, 9, 1)
template = "This loop is for Year {0} and Month (zero padded) {1:%m}"

# step forward a month at a time by adding the number of days in the current month
while start != end:
	s = template.format(start.year, start)
	print(s)
	days_in_month = calendar.monthrange(start.year, start.month)[1]
	start = start + datetime.timedelta(days=days_in_month)

Bash

# zero-pad the dates so the strings match the output of date -I below;
# otherwise the comparison never succeeds and the loop runs forever
start=2020-01-01
end=2020-09-01

while [ "$start" != "$end" ]; do
	s="`date -d "$start" +"This loop is for Year %Y and Month (zero padded) %m"`"
	echo "$s"
	start=$(date -I -d "$start + 1 month")  # requires GNU date
done

PowerShell

$start = Get-Date "2020-1-1"
$end = Get-Date "2020-9-1"

while($start -ne $end){
    "This loop is for Year {0:yyyy} and Month (zero padded) {0:MM}" -f $start
    $start = $start.AddMonths(1)
}
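
Bonus: since pandas figures so heavily in my Python work anyway, it’s worth noting that pd.date_range can produce the same monthly sequence without any manual loop bookkeeping. A quick sketch–note that, unlike the loops above, date_range includes both endpoints by default, so I stop one month early:

import pandas as pd

# freq='MS' generates month-start dates; the range is inclusive on both ends
for d in pd.date_range('2020-01-01', '2020-08-01', freq='MS'):
    print('This loop is for Year {0} and Month (zero padded) {1:%m}'.format(d.year, d))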

Thinning out your tick labels

Have you ever rendered a chart with Pandas and/or Matplotlib where one or both of your axes (axises?) rendered as a smear of overlapping, unreadable black text?

https://youtu.be/d9kuDizrBPc?t=70
If you can read this, you don’t need glasses

As an example, let’s create a bar chart of COVID-19 data. [As an aside: I’ve noticed that line charts seem to automatically thin out any overlapping tick labels and tend not to fall prey to this problem.]

Load and clean up the data

After downloading the CSV data, I wrote the following code to load the data and prepare it for visualization:

import pandas as pd
import matplotlib.pyplot as plt

df_covid_confirmed_us = pd.read_csv('./data/time_series_covid19_confirmed_US_20200720.csv')
df_covid_deaths_us = pd.read_csv('./data/time_series_covid19_deaths_US_20200720.csv')

# keep the county/state columns plus the date columns, all of which end in '20'
cols_to_keep1 = [i for i, v in enumerate(df_covid_confirmed_us.columns) if v in ['Admin2', 'Province_State'] or v.endswith('20')]
cols_to_keep2 = [i for i, v in enumerate(df_covid_deaths_us.columns) if v in ['Admin2', 'Province_State'] or v.endswith('20')]
df_covid_confirmed_ohio = df_covid_confirmed_us[df_covid_confirmed_us.Province_State=='Ohio'].iloc[:,cols_to_keep1].copy()
df_covid_deaths_ohio = df_covid_deaths_us[df_covid_deaths_us.Province_State=='Ohio'].iloc[:,cols_to_keep2].copy()

df_covid_confirmed_ohio.head()

Tidy up the dataframes

The data is still a bit untidy, so I wrote this additional code to transform it into a more proper format:

date_cols = df_covid_confirmed_ohio.columns.tolist()[2:]
rename_cols_confirmed = {'variable': 'obs_date', 'value': 'confirmed_cases'}
rename_cols_deaths = {'variable': 'obs_date', 'value': 'deaths'}

df_covid_confirmed_ohio = pd.melt(df_covid_confirmed_ohio.reset_index(), id_vars=['Admin2', 'Province_State'], 
                                  value_vars=date_cols).rename(columns=rename_cols_confirmed)
df_covid_deaths_ohio = pd.melt(df_covid_deaths_ohio.reset_index(), id_vars=['Admin2', 'Province_State'], 
                               value_vars=date_cols).rename(columns=rename_cols_deaths)

df_covid_confirmed_ohio['obs_date'] = pd.to_datetime(df_covid_confirmed_ohio.obs_date)
df_covid_deaths_ohio['obs_date'] = pd.to_datetime(df_covid_deaths_ohio.obs_date)

print(df_covid_confirmed_ohio.head())
print(df_covid_deaths_ohio.head())

Concatenate the two dataframes together

I’d like to do a nice, side-by-side comparison, in bar chart form, of these two datasets. One way to do that is to concatenate both dataframes together and then render your chart from the single result. Here’s the code I wrote to concatenate both datasets together:

df_covid_confirmed_ohio['data_type'] = 'confirmed cases'
df_covid_confirmed_ohio['cnt'] = df_covid_confirmed_ohio.confirmed_cases
df_covid_deaths_ohio['data_type'] = 'deaths'
df_covid_deaths_ohio['cnt'] = df_covid_deaths_ohio.deaths
drop_cols = ['confirmed_cases', 'deaths', 'Admin2', 'Province_State']

df_combined_data = pd.concat([df_covid_confirmed_ohio[df_covid_confirmed_ohio.obs_date>='2020-5-1'], 
               df_covid_deaths_ohio[df_covid_deaths_ohio.obs_date>='2020-5-1']], sort=False).drop(columns=drop_cols)

Now, render the chart

Ok, I’m finally ready to create my chart:

fig, ax = plt.subplots(figsize=(12,8))
_ = df_combined_data.groupby(['obs_date', 'data_type']).sum().unstack().plot(kind='bar', ax=ax)

# draws the tick labels at an angle
fig.autofmt_xdate()

title = 'Number of COVID-19 cases/deaths in Ohio: {0:%d %b %Y} - {1:%d %b %Y}'.format(df_combined_data.obs_date.min(), 
                                                                                      df_combined_data.obs_date.max())
_ = ax.set_title(title)
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Count')

# clean up the legend
original_legend = [t.get_text() for t in ax.legend().get_texts()]
new_legend = [t.replace('(cnt, ', '').replace(')', '') for t in original_legend]
_ = ax.legend(new_legend)
Wow! Those dates along the X axis are completely unreadable!

The X axis is a mess! Fortunately, there are a variety of ways to fix this problem; I particularly like the approach mentioned in this solution. Basically, I’m going to thin out the labels at a designated frequency–in my case, showing only every fourth date/label. So, here’s my new code, with the label fix in the commented block at the end:

fig, ax = plt.subplots(figsize=(12,8))
_ = df_combined_data.groupby(['obs_date', 'data_type']).sum().unstack().plot(kind='bar', ax=ax)

# draws the tick labels at an angle
fig.autofmt_xdate()

title = 'Number of COVID-19 cases/deaths in Ohio: {0:%d %b %Y} - {1:%d %b %Y}'.format(df_combined_data.obs_date.min(), 
                                                                                     df_combined_data.obs_date.max())
_ = ax.set_title(title)
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Count')

# clean up the legend
original_legend = [t.get_text() for t in ax.legend().get_texts()]
new_legend = [t.replace('(cnt, ', '').replace(')', '') for t in original_legend]
_ = ax.legend(new_legend)

# tick label fix
tick_labels = [l.get_text().replace(' 00:00:00', '') for l in ax.get_xticklabels()]
new_tick_labels = [''] * len(tick_labels)
new_tick_labels[::4] = tick_labels[::4]
_ = ax.set_xticklabels(new_tick_labels)
Much better!

That X axis is much more readable now thanks to the power of Python list slicing.
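
In case that slicing trick is unfamiliar: assigning to an extended slice replaces every nth element of a list in place, which is exactly what blanks out three of every four labels above. A toy illustration:

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
thinned = [''] * len(labels)
thinned[::4] = labels[::4]  # copy over every fourth label, leave the rest blank
print(thinned)              # ['a', '', '', '', 'e', '', '', '']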
