
Finding sub-ranges in my dataset

File this under: there-has-to-be-a-simpler-way-to-do-this-in-pandas-but-I-haven’t-found-what-that-is

Recently, I’ve been playing with some financial data to get a better understanding of the yield curve. Related to yield and inverted yield curves are the periods of recession in the US economy. In my work, I wanted to first build a chart that indicated the periods of recession and ultimately overlay that with yield curve data. Little did I realize the challenge of just coding that first part.

I downloaded a dataset of recession data, which contains a record for every calendar quarter from the 1960s to the present day and a 0 or 1 indicating whether the economy was in recession for that quarter (“1” meaning it was). What I needed to do was pull all the records with a “1” indicator and find the start and end dates for each of those ranges so that I could paint them onto a chart.

I’ve heard it said before that any time you have to write a loop over your pandas dataframe, you’re probably doing it wrong. I’m certainly doing a loop here and I have a nagging suspicion there’s probably a more elegant way to achieve the solution. Nevertheless, here’s what I came up with to solve my recession chart problem:

Step 1: Bring in the necessary packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import date  # used later to set the chart's x-axis limits

# for easy chart display in a jupyter notebook
%matplotlib inline

Step 2: Load in my downloaded recession dataset and take a peek

# recession dates: https://fred.stlouisfed.org/series/JHDUSRGDPBR
df_recessions = pd.read_csv('./data/JHDUSRGDPBR_20220327.csv')

df_recessions['DATE'] = pd.to_datetime(df_recessions.DATE)
df_recessions.head()
The first records of the Recession dataset
df_recessions[df_recessions.JHDUSRGDPBR==1.0].head()
The first records in the dataset where the economy was in recession

Step 3: Mark the start of every period of recession in the dataset

So, now I’m asking myself, “how do I extract the start and stop dates for every period of recession identified in the dataset? Let’s start by just finding the start dates of recessions.” That shouldn’t be too difficult. If I filter down to just the recession quarters and calculate the date difference from one row to the next, then any difference greater than three months (I estimated 93 days as three months) means there was a gap in quarters before the current record, which makes the current record the start of a new recession. Here’s what I came up with (one further note: my yield curve data only starts in 1990, so I filtered the recession data for 1990 to present):

# keep only the recession quarters from 1990 on
df_spans = df_recessions[(df_recessions.DATE.dt.year>=1990) & (df_recessions.JHDUSRGDPBR==1.0)].copy()
# days elapsed since the previous recession quarter
df_spans['days_elapsed'] = df_spans.DATE - df_spans.shift(1).DATE
# a gap of more than ~3 months (93 days) means this row starts a new recession
df_spans['ind'] = df_spans.days_elapsed.dt.days.apply(lambda d: 's' if d > 93 else '')
df_spans.iloc[0, 3] = 's'  # mark first row as a recession start
df_spans
“s” indicates the start of a new recession

Step 4: Find the end date of each recession

Here’s where my approach starts to go off the rails a little. The only way I could think to find the end dates of each recession is to:

  1. Loop through a list of the start dates
  2. In each loop, get the next start date and then grab the date of the record immediately before that one
  3. When I hit the last loop, just consider the last record to be the end date of the most recent recession
  4. With every stop date, add three months since the stop date is only the first day of the quarter and, presumably, the recession more or less lasts the entire quarter

Confusing? Here’s my code:

start_stop_dates = []
start_dates = df_spans.loc[df_spans.ind=='s', ].DATE.tolist()

for i, start_date in enumerate(start_dates):
    if i < len(start_dates)-1:
        stop_date = df_spans.loc[df_spans.DATE < start_dates[i+1]].iloc[-1].DATE
    else:
        stop_date = df_spans.iloc[-1].DATE
        
    # add 3 months to the end of each stop date to stretch the value to the full quarter
    start_stop_dates.append((start_date, stop_date + np.timedelta64(3,'M')))
    
start_stop_dates
Recessions from 1990 to the beginning of 2022
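
One note on that np.timedelta64(3, 'M'): numpy and pandas treat month-unit timedeltas as an average month length (and newer versions may reject them entirely), so the padded stop dates may not land exactly on a quarter boundary. If that ever gives you trouble, pandas' calendar-aware DateOffset is an alternative; a minimal sketch of the idea (not what I used above):

import pandas as pd

quarter_start = pd.Timestamp('2020-01-01')
# DateOffset does calendar math, so adding three months always lands on the next quarter boundary
quarter_end = quarter_start + pd.DateOffset(months=3)  # Timestamp('2020-04-01 00:00:00')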

Step 5: Build my chart

With that start/stop list, I can build my underlying recession chart:

fig, ax = plt.subplots(figsize=(12,6))

_ = ax.plot()
_ = ax.set_xlim([date(1990, 1, 1), date(2022, 4, 1)])
_ = ax.set_ylim([0, 10])

for st, sp in start_stop_dates:
    _ = ax.axvspan(st, sp, alpha=0.2, color='gray')
US Recessions: 1990 – 2021

Phew. All that work and I’m only at the starting point of my yield curve exploration, but that will have to wait for a future post. However, if you can think of a more elegant way to identify these date ranges without having to resort to looping, I’d love to hear it!
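
For what it’s worth, one loop-free idea I may explore later: flag each row that starts a new run of quarters, turn those flags into a group id with cumsum, and then take the min and max date of each group. A minimal sketch (the small DataFrame below is just a hypothetical stand-in for df_spans):

import pandas as pd

# hypothetical stand-in for the filtered recession quarters (df_spans above)
quarters = pd.DataFrame({'DATE': pd.to_datetime(
    ['1990-07-01', '1990-10-01', '1991-01-01',      # one recession
     '2001-01-01', '2001-04-01', '2001-07-01'])})   # another recession

# True wherever the gap to the previous quarter exceeds ~3 months (the first row counts as a start)
new_span = quarters.DATE.diff().dt.days.fillna(999) > 93
span_id = new_span.cumsum()  # running count = one id per contiguous run of quarters

spans = quarters.groupby(span_id).DATE.agg(['min', 'max'])
spans['max'] = spans['max'] + pd.DateOffset(months=3)  # pad each stop date to the end of its quarter
start_stop_dates = list(spans.itertuples(index=False, name=None))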

Parsing Word documents with Python

If you ever had a need to programmatically examine the text in a Microsoft Word document, getting the text out in the first place can be challenging. Sure, you can manually save your document to a plain text file that’s much easier to process, but if you have multiple documents to examine, that can be painful.

Recently I had such a need and found this Toward Data Science article quite helpful. But let’s take the challenge a little further: suppose you had a document with multiple sections and need to pull the text from specific sections.

Page 1 has my table of contents
Page 2 contains a variety of sections

Let’s suppose I need to pull just the text from the “sub-sections”. In my example, I have three sub-sections: Sub-Section 1, Sub-Section 2, and Sub-Section 3. In my Word document, I’ve styled these headers as “Heading 2” text. Here’s how I went about pulling out the text for each of these sections.

Step 1: Import your packages

For my needs, I only need to import zipfile and ElementTree, which is nice as I didn’t need to install any third party packages:

import zipfile
import xml.etree.ElementTree as ET

Step 2: Parse the document XML

# a .docx file is really a zip archive; the main document content lives in word/document.xml
doc = zipfile.ZipFile('./data/test.docx').read('word/document.xml')
root = ET.fromstring(doc)

Step 3: Explore the XML for the sections and text you want

You’ll spend most of your time here, trying to figure out what elements hold the contents in which you are interested. The XML of Microsoft documents follows the WordprocessingML standard, which can be quite complicated. I spent a lot of time manually reviewing my XML looking for the elements I needed. You can write out the XML like so:

ET.tostring(root)
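
ET.tostring returns one long bytes blob, which is tough to scan by eye. A small trick I find helpful (still standard library only) is to pretty-print it with indentation and write it to a scratch file; the output path here is arbitrary:

import xml.dom.minidom

# re-parse and pretty-print the XML so the element hierarchy is easier to review
pretty_xml = xml.dom.minidom.parseString(ET.tostring(root)).toprettyxml(indent='  ')
with open('./data/document_pretty.xml', 'w', encoding='utf-8') as f:
    f.write(pretty_xml)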

Step 4: Find all the paragraphs

To solve my problem, I first decided to pull together a collection of all the paragraphs in the document so that I could later iterate across them and make decisions. To make that work a little easier, I also declared a namespace object used by Microsoft’s WordprocessingML standard:

# Microsoft's XML makes heavy use of XML namespaces; thus, we'll need to reference that in our code
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
body = root.find('w:body', ns)  # find the XML "body" tag
p_sections = body.findall('w:p', ns)  # under the body tag, find all the paragraph sections

It can be helpful to actually see the text in each of these sections. Through researching Microsoft’s XML standard, I know that document text is usually contained in “t” elements. So, if I write an XPath query to find all the “t” elements within a given section, I can join the text of all those elements together to get the full text of the paragraph. This code does that:

for p in p_sections:
    text_elems = p.findall('.//w:t', ns)
    print(''.join([t.text for t in text_elems]))
    print()

Step 5: Find all the “Heading 2” sections

Now, let’s iterate through each paragraph section and see if we can figure out which sections have been styled with “Heading 2”. If we can find those Heading 2 sections, we’ll then know that the subsequent text is the text we need.

Through more research into the XML standard, I found that if I search for pStyle elements that contain the value “Heading2”, those will be the sections I’m after. To make my code a little cleaner, I wrote functions to both evaluate each section for the Heading 2 style and extract the full text of the section:

def is_heading2_section(p):
    """Returns True if the given paragraph section has been styled as a Heading2"""
    return_val = False
    heading_style_elem = p.find(".//w:pStyle[@w:val='Heading2']", ns)
    if heading_style_elem is not None:
        return_val = True
    return return_val


def get_section_text(p):
    """Returns the joined text of the text elements under the given paragraph tag"""
    return_val = ''
    text_elems = p.findall('.//w:t', ns)
    if text_elems is not None:
        return_val = ''.join([t.text for t in text_elems])
    return return_val


section_labels = [get_section_text(s) if is_heading2_section(s) else '' for s in p_sections]

Now, if I print out my section_labels list, I see this:

My section_labels list

Step 6: Finally, extract the Heading 2 headers and subsequent text

Now, I can use a simple list comprehension to glue together both the section headers and the associated text of the three sub-sections I’m after:

section_text = [{'title': t, 'text': get_section_text(p_sections[i+1])} for i, t in enumerate(section_labels) if len(t) > 0]

And that list looks like this:

My section_text list
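
One caveat: that list comprehension assumes each sub-section’s body is exactly the one paragraph that immediately follows its heading. If a sub-section might span several paragraphs, here is a sketch of one way to handle it, reusing the same p_sections, is_heading2_section, and get_section_text helpers (it naively keeps collecting until the next Heading 2, so in a real document you’d probably also want to stop at other heading levels):

multi_para_sections = []
current = None

for p in p_sections:
    if is_heading2_section(p):
        current = {'title': get_section_text(p), 'text': ''}  # a new sub-section starts here
        multi_para_sections.append(current)
    elif current is not None:
        current['text'] += get_section_text(p)  # keep appending paragraphs until the next heading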

You can download my code here.

Pandas and distance calculations

If you’ve ever had to calculate distances between sets of coordinates, this article is pretty helpful. The author covers a few different approaches, focusing a lot of attention on the Haversine distance calculation. He offers a handy function and an example of calculating the kilometers between different cities in India:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# handy function from the article
def haversine_vectorize(lon1, lat1, lon2, lat2):
    """Returns distance, in kilometers, between one set of longitude/latitude coordinates and another"""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    newlon = lon2 - lon1
    newlat = lat2 - lat1

    haver_formula = np.sin(newlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(newlon/2.0)**2

    dist = 2 * np.arcsin(np.sqrt(haver_formula))
    km = 6367 * dist  # 6367 for distance in KM; for miles, use 3958
    return km


# the article's example
orig_dest_df = pd.DataFrame({
    'origin_city':['Bangalore','Mumbai','Delhi','Kolkatta','Chennai','Bhopal'],
    'orig_lat':[12.9716,19.076,28.7041,22.5726,13.0827,23.2599],
    'orig_lon':[77.5946,72.877,77.1025,88.639,80.2707,77.4126],
    'dest_lat':[23.2599,12.9716,19.076,13.0827,28.7041,22.5726],
    'dest_lon':[77.4126,77.5946,72.877,80.2707,77.1025,88.639],
    'destination_city':['Bhopal','Bangalore','Mumbai','Chennai','Delhi','Kolkatta']})

orig_dest_df['haversine_dist'] = haversine_vectorize(orig_dest_df.orig_lon, orig_dest_df.orig_lat, 
                                                     orig_dest_df.dest_lon, orig_dest_df.dest_lat)

orig_dest_df.head()

This was all very helpful to me in some recent work; however, my two sets of coordinates are usually not in the same row. Typically, my coordinates will be in separate rows in my dataframe and then I’ll have to calculate the differences between rows. So let’s take the author’s data and reshape it to have only one city per row. For added measure, I’ll add an “arrival time” datetime value:

data = {'city': ['Bangalore', 'Bhopal', 'Mumbai', 'Delhi', 'Kolkatta', 'Chennai'], 
        'lat': [12.9716, 23.2599, 19.0760, 28.7041, 22.5726, 13.0827], 
        'lon': [77.5946, 77.4126, 72.8770, 77.1025, 88.6390, 80.2707], 
        'arrival_time': [datetime(2021,12,1,12,0,0), datetime(2021,12,3,13,30,0), datetime(2021,12,6,8,0,0), 
                         datetime(2021,12,7,20,30,0), datetime(2021,12,9,12,30,0), datetime(2021,12,15,7,30,0)]}
one_loc_per_row_df = pd.DataFrame(data)
one_loc_per_row_df.head()

Pandas has a fantastic diff function that lets you calculate the difference in an element from row to row. For example, I can use it to calculate the travel times between each city:

one_loc_per_row_df['travel_time'] = one_loc_per_row_df.arrival_time.diff()
one_loc_per_row_df.head()

But to calculate my travel distances, I have to take two elements, latitude and longitude, from each row and run them through my haversine_vectorize function to get the distance between rows. So far, I’ve found no way to extend the Pandas diff function to do this. No worries, though: with Pandas, there are often several ways to solve your problems. Enter the shift function.

The Pandas shift function allows you to offset your dataframe in one direction or another. For my purposes, I need to “shift” a copy of my coordinate columns down by one row so that each row can be processed against the row that came before it, like so:

one_loc_per_row_df['travel_dist'] = haversine_vectorize(one_loc_per_row_df.lon, one_loc_per_row_df.lat, 
                                                        one_loc_per_row_df.lon.shift(1), one_loc_per_row_df.lat.shift(1))
one_loc_per_row_df.head()

So that’s a way to calculate distances between coordinates when your beginning and ending coordinates are in separate records in your dataframe.

But wait, there’s more…

The above work assumes a single traveler, but what if you have data for multiple people in your dataset? Imagine this:

larry_data = {'traveler': ['Larry']*6, 
              'city': ['Bangalore', 'Bhopal', 'Mumbai', 'Delhi', 'Kolkatta', 'Chennai'], 
              'lat': [12.9716, 23.2599, 19.0760, 28.7041, 22.5726, 13.0827], 
              'lon': [77.5946, 77.4126, 72.8770, 77.1025, 88.6390, 80.2707], 
              'arrival_time': [datetime(2021,12,1,12,0,0), datetime(2021,12,3,13,30,0), datetime(2021,12,6,8,0,0), 
                               datetime(2021,12,7,20,30,0), datetime(2021,12,9,12,30,0), datetime(2021,12,15,7,30,0)]}
moe_data = {'traveler': ['Moe']*6,
            'city': ['Miami', 'Atlanta', 'Auburn', 'New Orleans', 'Dallas', 'Houston'], 
            'lat': [25.7616798, 33.7489954, 47.3073228, 29.951065, 32.779167, 29.749907], 
            'lon': [-80.1917902, -84.3879824, -122.2284532, -90.071533, -96.808891, -95.358421], 
            'arrival_time': [datetime(2021,12,1,9,15,0), datetime(2021,12,4,23,30,0), datetime(2021,12,5,8,0,0), 
                             datetime(2021,12,7,14,30,0), datetime(2021,12,10,12,30,0), datetime(2021,12,12,7,30,0)]}
curly_data = {'traveler': ['Curly']*6,
              'city': ['London', 'Liverpool', 'Cambridge', 'Birmingham', 'Oxford', 'Southampton'], 
              'lat':[51.509865, 53.400002, 52.205276, 52.489471, 51.752022, 50.909698], 
              'lon': [-0.118092, -2.983333, 0.119167, -1.898575, -1.257677, -1.404351], 
              'arrival_time': [datetime(2021,12,1,9,0,0), datetime(2021,12,2,13,30,0), datetime(2021,12,4,8,30,0), 
                               datetime(2021,12,6,18,30,0), datetime(2021,12,8,12,30,0), datetime(2021,12,9,7,30,0)]}

travelers_df = pd.concat([pd.DataFrame(larry_data), pd.DataFrame(moe_data), pd.DataFrame(curly_data)]).reset_index(drop=True)
travelers_df.head(20)

To get travel time differences for each of the travelers, we can still use the “diff” function, but first make sure we group by the traveler:

travelers_df['travel_time'] = travelers_df.groupby('traveler').arrival_time.diff()
travelers_df.head(20)

Calculating the travel distance, though, is slightly more complicated. Grouping by the traveler and then applying the haversine_vectorize function via a lambda expression yields this:

travel_dist = travelers_df.groupby('traveler').\
    apply(lambda r: haversine_vectorize(r.lon, r.lat, r.lon.shift(1), r.lat.shift(1)))

travel_dist

The result set is indexed by both the traveler value and each row’s index from the original dataframe. To add these values back to the original dataframe, then, all I need to do is get rid of the first index:

travelers_df['travel_dist'] = travel_dist.reset_index(level=0, drop=True)
travelers_df

Now, with these calculations, you can figure out each traveler’s average speed, look for anomalies such as someone covering a distance faster than seems plausible, and so on.
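
For instance, average speed is just the distance column divided by the hours elapsed; a minimal sketch using the travelers_df built above:

# km traveled divided by hours elapsed between stops (each traveler's first stop is NaN)
travelers_df['kph'] = travelers_df.travel_dist / (travelers_df.travel_time.dt.total_seconds() / 3600)

# unusually fast legs bubble to the top
travelers_df.sort_values('kph', ascending=False).head()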
