
Pandas and distance calculations

If you’ve ever had to calculate distances between sets of coordinates, this article is pretty helpful. The author covers a few different approaches, focusing a lot of attention on the haversine distance calculation.
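For reference, the haversine formula gives the great-circle distance between two points on a sphere of radius r, where φ is latitude and λ is longitude:

d = 2r · arcsin( √( sin²(Δφ/2) + cos φ₁ · cos φ₂ · sin²(Δλ/2) ) )

The author offers a handy function implementing this and an example of calculating the kilometers between different cities in India: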

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# handy function from the article
def haversine_vectorize(lon1, lat1, lon2, lat2):
    """Returns distance, in kilometers, between one set of longitude/latitude coordinates and another"""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    newlon = lon2 - lon1
    newlat = lat2 - lat1

    haver_formula = np.sin(newlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(newlon/2.0)**2

    dist = 2 * np.arcsin(np.sqrt(haver_formula))
    km = 6367 * dist  # 6367 for distance in km; use 3958 for miles
    return km


# the article's example
orig_dest_df = pd.DataFrame({
    'origin_city':['Bangalore','Mumbai','Delhi','Kolkatta','Chennai','Bhopal'],
    'orig_lat':[12.9716,19.076,28.7041,22.5726,13.0827,23.2599],
    'orig_lon':[77.5946,72.877,77.1025,88.639,80.2707,77.4126],
    'dest_lat':[23.2599,12.9716,19.076,13.0827,28.7041,22.5726],
    'dest_lon':[77.4126,77.5946,72.877,80.2707,77.1025,88.639],
    'destination_city':['Bhopal','Bangalore','Mumbai','Chennai','Delhi','Kolkatta']})

orig_dest_df['haversine_dist'] = haversine_vectorize(orig_dest_df.orig_lon, orig_dest_df.orig_lat, 
                                                     orig_dest_df.dest_lon, orig_dest_df.dest_lat)

orig_dest_df.head()

This was all very helpful to me in some recent work; however, my beginning and ending coordinates are usually not in the same row. Typically, they land in separate rows of my dataframe, and I have to calculate the differences between rows. So let’s take the author’s data and reshape it to have only one city per row. For good measure, I’ll add an “arrival time” datetime value:

data = {'city': ['Bangalore', 'Bhopal', 'Mumbai', 'Delhi', 'Kolkatta', 'Chennai'], 
        'lat': [12.9716, 23.2599, 19.0760, 28.7041, 22.5726, 13.0827], 
        'lon': [77.5946, 77.4126, 72.8770, 77.1025, 88.6390, 80.2707], 
        'arrival_time': [datetime(2021,12,1,12,0,0), datetime(2021,12,3,13,30,0), datetime(2021,12,6,8,0,0), 
                         datetime(2021,12,7,20,30,0), datetime(2021,12,9,12,30,0), datetime(2021,12,15,7,30,0)]}
one_loc_per_row_df = pd.DataFrame(data)
one_loc_per_row_df.head()

Pandas has a fantastic diff function that lets you calculate the difference of an element from row to row. For example, I can use it to calculate the travel times between each city:

one_loc_per_row_df['travel_time'] = one_loc_per_row_df.arrival_time.diff()
one_loc_per_row_df.head()

But to calculate my travel distances, I have to take two elements, latitude and longitude, from each row and run them through my haversine_vectorize function to get the distance. So far, I’ve found no way to extend the Pandas diff function to do this. No worries, though: with Pandas, there are often several ways to solve your problems. Enter the shift function.

The Pandas shift function allows you to offset your dataframe in one direction or another. For my purposes, I need to “shift” a copy of my dataframe down by one row so that I can process each row against the previous one, like so:

one_loc_per_row_df['travel_dist'] = haversine_vectorize(one_loc_per_row_df.lon, one_loc_per_row_df.lat, 
                                                        one_loc_per_row_df.lon.shift(1), one_loc_per_row_df.lat.shift(1))
one_loc_per_row_df.head()

So that’s a way to calculate distances between coordinates when your beginning and ending coordinates are in separate records in your dataframe. (Note that the first row’s distance is NaN, since there’s no prior city to measure from.)

But wait, there’s more…

The above work assumes a single traveler, but what if you have data for multiple people in your dataset? Imagine this:

larry_data = {'traveler': ['Larry']*6, 
              'city': ['Bangalore', 'Bhopal', 'Mumbai', 'Delhi', 'Kolkatta', 'Chennai'], 
              'lat': [12.9716, 23.2599, 19.0760, 28.7041, 22.5726, 13.0827], 
              'lon': [77.5946, 77.4126, 72.8770, 77.1025, 88.6390, 80.2707], 
              'arrival_time': [datetime(2021,12,1,12,0,0), datetime(2021,12,3,13,30,0), datetime(2021,12,6,8,0,0), 
                               datetime(2021,12,7,20,30,0), datetime(2021,12,9,12,30,0), datetime(2021,12,15,7,30,0)]}
moe_data = {'traveler': ['Moe']*6,
            'city': ['Miami', 'Atlanta', 'Auburn', 'New Orleans', 'Dallas', 'Houston'], 
            'lat': [25.7616798, 33.7489954, 47.3073228, 29.951065, 32.779167, 29.749907], 
            'lon': [-80.1917902, -84.3879824, -122.2284532, -90.071533, -96.808891, -95.358421], 
            'arrival_time': [datetime(2021,12,1,9,15,0), datetime(2021,12,4,23,30,0), datetime(2021,12,5,8,0,0), 
                             datetime(2021,12,7,14,30,0), datetime(2021,12,10,12,30,0), datetime(2021,12,12,7,30,0)]}
curly_data = {'traveler': ['Curly']*6,
              'city': ['London', 'Liverpool', 'Cambridge', 'Birmingham', 'Oxford', 'Southampton'], 
              'lat':[51.509865, 53.400002, 52.205276, 52.489471, 51.752022, 50.909698], 
              'lon': [-0.118092, -2.983333, 0.119167, -1.898575, -1.257677, -1.404351], 
              'arrival_time': [datetime(2021,12,1,9,0,0), datetime(2021,12,2,13,30,0), datetime(2021,12,4,8,30,0), 
                               datetime(2021,12,6,18,30,0), datetime(2021,12,8,12,30,0), datetime(2021,12,9,7,30,0)]}

travelers_df = pd.concat([pd.DataFrame(larry_data), pd.DataFrame(moe_data), pd.DataFrame(curly_data)]).reset_index(drop=True)
travelers_df.head(20)

To get travel time differences for each of the travelers, we can still use the “diff” function, but first make sure we group by the traveler:

travelers_df['travel_time'] = travelers_df.groupby('traveler').arrival_time.diff()
travelers_df.head(20)

Calculating the travel distance, though, is slightly more complicated. Grouping by the traveler and then applying the haversine_vectorize function via a lambda expression yields this:

travel_dist = travelers_df.groupby('traveler').\
    apply(lambda r: haversine_vectorize(r.lon, r.lat, r.lon.shift(1), r.lat.shift(1)))

travel_dist

The result set is indexed by both the traveler value and each row’s index from the original dataframe. To add these values back to the original dataframe, then, all I need to do is drop the first index level:

travelers_df['travel_dist'] = travel_dist.reset_index(level=0, drop=True)
travelers_df
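As an aside, pandas’ GroupBy.shift can skip the index juggling entirely. Here’s a minimal sketch (the prev name is just for illustration):

# an alternative sketch: GroupBy.shift keeps the original index, so no reset needed
prev = travelers_df.groupby('traveler')[['lon', 'lat']].shift(1)
travelers_df['travel_dist'] = haversine_vectorize(travelers_df.lon, travelers_df.lat,
                                                  prev.lon, prev.lat)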

Now, with these calculations, you can figure out each person’s average travel speed, look for anomalies such as someone covering a distance faster than normally possible, and so on.
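For example, converting the travel_time timedeltas to hours gives each leg’s speed in km/h (the kmh column name here is just illustrative):

# a minimal sketch: leg speed in km/h, then each traveler's average
travelers_df['kmh'] = travelers_df.travel_dist / (travelers_df.travel_time.dt.total_seconds() / 3600)
travelers_df.groupby('traveler').kmh.mean()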

Lorem Ipsum

If you’ve done any sort of user interface software development, you’re probably familiar with the use of “lorem ipsum”. The basic idea is that you have a page that will, eventually, display some word content; however, at the moment, you don’t have any of that content available. So, to at least get an idea of what that content will eventually look like on your page, you fill in the space with a bunch of nonsense words and sentences.

The internet makes available a number of “lorem ipsum” word generators, so it’s pretty easy to fill in your content holes with filler words when needed.

(As an aside, many years ago, I attended a security conference in Las Vegas. The main stage for the keynote speakers was huge and decked out like a concert hall:

[Photo: a shot of the stage at the conference I attended]

One of the keynote speakers was enumerating all the security breaches of the previous year with a montage of images from news websites relaying the events, each crime made clear in bold headlines. The funny part was, on most of the slides, if you tried to read the parts of the stories under the headlines, the text was simply lorem ipsum! I wish I’d had the foresight to take a photo!)

As with all things on the internet, there are many great and hilarious takes on lorem ipsum. One of my favorites is Bacon Ipsum, a pork-laden interpretation of the concept. If you find yourself in need of some nonsense text, check out this article that lists 25 great lorem ipsum generators.

One generator I don’t see listed there but I do use frequently is Corporate Ipsum, which injects all the latest corporate speak and buzzwords into your text. Even if you have no need for fake content like this, you’ll at least get a good laugh from these.

Building clock-style radar charts

When you’re dealing with event data, one neat visualization option is to depict your data over a twenty-four hour period on a radar chart built to look like the face of a clock. Matplotlib’s polar chart capabilities make this relatively simple.

As an example, I’ll chart crime incident data from the city of Cincinnati.

Step 1: Bring in the data

For starters, set up your standard package import statements and read in the dataset:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime, timedelta, date


# data from : https://data.cincinnati-oh.gov/Safety/PDI-Police-Data-Initiative-Crime-Incidents/k59e-2pvf
df_crime = pd.read_csv('./data/PDI__Police_Data_Initiative__Crime_Incidents.csv')
df_crime['DATE_REPORTED'] = pd.to_datetime(df_crime.DATE_REPORTED)

Let’s try to find the day with the highest number of incidents in 2019:

df_crime[df_crime.DATE_REPORTED.dt.year==2019][['DATE_REPORTED','INSTANCEID']].groupby(df_crime.DATE_REPORTED.dt.date).\
    count().sort_values('INSTANCEID', ascending=False)

It looks like the most incidents took place on January 15. Now, when did those incidents occur over the course of the day?

df_crime[df_crime.DATE_REPORTED.dt.date==date(2019,1,15)][['DATE_REPORTED']].groupby(df_crime.DATE_REPORTED.dt.hour).count()

Grouping by hour, we can see that the vast majority of incidents occurred during the 9am hour. Now, let’s visualize that.

Step 2: Create a handy chart function

To make my work a little more portable, I created a “render_chart” function that takes as parameters the dataframe of hour data, the axis in which to place the chart, and the title:

def render_chart(df, axis, title):
    # one wedge per hour, spaced evenly around the circle
    theta = np.arange(df.shape[0])/float(df.shape[0]) * 2 * np.pi
    _ = axis.bar(theta + theta[1]/2, df.event_count, width=theta[1], color='red')
    # label each wedge with its hour, e.g. '1am', '2pm'
    ticklabels = [(timedelta(hours=h) + datetime(2021,1,1)).strftime('%#I%p').lower() for h in range(0,24)]
    _ = axis.set_xticks(theta)
    _ = axis.set_xticklabels(ticklabels)
    _ = axis.set_yticklabels([])
    _ = axis.set_title(title)

    # render clockwise starting at the top, like a clock face
    axis.set_theta_direction(-1)
    axis.set_theta_zero_location('N')

Some things to note with my function:

  1. The function expects the dataframe to contain a column called “event_count” that is a count of events for each hour over the day
  2. Finding the right time format string so that I could display 1am instead of 01am was actually a bit difficult. That hash mark (#) did the trick, though it’s Windows-specific; see the sketch after this list for a portable alternative.
  3. Matplotlib polar charts, by default, render counter-clockwise. Setting set_theta_direction to -1 lets you reverse that behavior, and setting set_theta_zero_location to North ('N') starts rendering the chart like a clock, at the top.
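Since that hash mark only works on Windows (glibc and macOS builds use a dash instead), a small platform check, as a sketch, keeps the tick labels portable:

import platform
from datetime import datetime, timedelta

# '%#I' strips the leading zero on Windows; '%-I' does the same on Linux/macOS
hour_fmt = '%#I%p' if platform.system() == 'Windows' else '%-I%p'
ticklabels = [(datetime(2021,1,1) + timedelta(hours=h)).strftime(hour_fmt).lower() for h in range(0,24)]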

Step 3: Pad your data

For the January 15 data, there are several hours of the day with no reported incidents (e.g., from 4am to 6am). In order to get the chart to render correctly, I need to pad those empty periods with 0. I solved that problem by creating an “empty hours” dataframe (a dataframe of 24 hours with 0 event counts) and then merging the real data into the empty one:

# get my real data
df_chart = df_crime[df_crime.DATE_REPORTED.dt.date==date(2019,1,15)][['DATE_REPORTED']].\
    groupby(df_crime.DATE_REPORTED.dt.hour).count().rename(columns={'DATE_REPORTED':'event_count'})
# create an "empty hour" dataframe
df_empty_hrs = pd.DataFrame(np.zeros(24), index=range(0, 24))
# merge the two together
df_chart = df_empty_hrs.join(df_chart, how='left').fillna(0).drop(columns=[0])
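As an aside, reindex can accomplish the same padding in one step, assuming df_chart is indexed by hour of day:

# an alternative sketch: reindex to all 24 hours, filling missing hours with 0
df_chart = df_chart.reindex(range(24), fill_value=0)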

Step 4: Finally, render the chart

Now, we can produce the chart:

fig, ax = plt.subplots(figsize=(12, 7), subplot_kw={'projection': 'polar'})

render_chart(df_chart, ax, 'Cincinnati Crime Incidents: 15 Jan 2019')

It is interesting that an overwhelming majority of incidents on this day occurred at the 9am hour.

It would be further interesting to see what an average day in 2019 looked like: maybe weekday versus weekend, or maybe an average Monday through Sunday. Pandas makes it pretty simple to do these calculations and, with my function, you can easily visualize the results!
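As a parting sketch (the variable and column names here are purely illustrative), averaging each day-of-week’s hourly counts might look something like this:

# a hypothetical sketch: average hourly incident counts per day of week in 2019
dr = df_crime.loc[df_crime.DATE_REPORTED.dt.year==2019, 'DATE_REPORTED']
counts = dr.groupby([dr.dt.dayofweek.rename('dow'),
                     dr.dt.date.rename('day'),
                     dr.dt.hour.rename('hour')]).size()
# average over the individual days to get a typical hourly profile per day of week
avg_profile = counts.groupby(['dow', 'hour']).mean()

# chart an average Monday (dow == 0), padding missing hours with 0
df_avg_monday = avg_profile.loc[0].reindex(range(24), fill_value=0).to_frame('event_count')
fig, ax = plt.subplots(figsize=(12, 7), subplot_kw={'projection': 'polar'})
render_chart(df_avg_monday, ax, 'Cincinnati Crime Incidents: Average Monday, 2019')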

