Pandas and distance calculations

If you’ve ever had to calculate distances between sets of coordinates, this article is pretty helpful. The author covers a few different approaches, focusing a lot of attention on the Haversine distance calculation. He offers a handy function and an example of calculating the kilometers between different cities in India:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# handy function from the article
def haversine_vectorize(lon1, lat1, lon2, lat2):
    """Returns distance, in kilometers, between one set of longitude/latitude coordinates and another"""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    newlon = lon2 - lon1
    newlat = lat2 - lat1

    haver_formula = np.sin(newlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(newlon/2.0)**2

    dist = 2 * np.arcsin(np.sqrt(haver_formula ))
    km = 6367 * dist  # 6367 is the Earth's radius in km; use 3958 instead to get miles
    return km


# the article's example
orig_dest_df = pd.DataFrame({
    'origin_city':['Bangalore','Mumbai','Delhi','Kolkatta','Chennai','Bhopal'],
    'orig_lat':[12.9716,19.076,28.7041,22.5726,13.0827,23.2599],
    'orig_lon':[77.5946,72.877,77.1025,88.639,80.2707,77.4126],
    'dest_lat':[23.2599,12.9716,19.076,13.0827,28.7041,22.5726],
    'dest_lon':[77.4126,77.5946,72.877,80.2707,77.1025,88.639],
    'destination_city':['Bhopal','Bangalore','Mumbai','Chennai','Delhi','Kolkatta']})

orig_dest_df['haversine_dist'] = haversine_vectorize(orig_dest_df.orig_lon, orig_dest_df.orig_lat, 
                                                     orig_dest_df.dest_lon, orig_dest_df.dest_lat)

orig_dest_df.head()
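Before trusting the function on real data, I like a quick sanity check. The sketch below repeats the function so that it runs on its own, then tries two cases that can be verified by hand: the distance from a point to itself should be zero, and one degree of longitude along the equator should be 6367 × π/180, or roughly 111 km. The test coordinates are my own picks for illustration, not from the article.

```python
import numpy as np

def haversine_vectorize(lon1, lat1, lon2, lat2):
    """Same formula as above, repeated so this check is self-contained."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    newlon = lon2 - lon1
    newlat = lat2 - lat1
    haver_formula = np.sin(newlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(newlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(haver_formula))

# a point compared against itself: no distance at all
same_point = haversine_vectorize(77.5946, 12.9716, 77.5946, 12.9716)

# one degree of longitude on the equator: radius * (pi / 180)
one_degree = haversine_vectorize(0.0, 0.0, 1.0, 0.0)

print(same_point)
print(one_degree, 6367 * np.pi / 180)
```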

This was all very helpful to me in some recent work; however, my two sets of coordinates are usually not in the same row. Typically, my coordinates sit in separate rows of my dataframe and I have to calculate the differences between rows. So let’s take the author’s data and reshape it to have only one city per row. For good measure, I’ll add an “arrival time” datetime value:

data = {'city': ['Bangalore', 'Bhopal', 'Mumbai', 'Delhi', 'Kolkatta', 'Chennai'], 
        'lat': [12.9716, 23.2599, 19.0760, 28.7041, 22.5726, 13.0827], 
        'lon': [77.5946, 77.4126, 72.8770, 77.1025, 88.6390, 80.2707], 
        'arrival_time': [datetime(2021,12,1,12,0,0), datetime(2021,12,3,13,30,0), datetime(2021,12,6,8,0,0), 
                         datetime(2021,12,7,20,30,0), datetime(2021,12,9,12,30,0), datetime(2021,12,15,7,30,0)]}
one_loc_per_row_df = pd.DataFrame(data)
one_loc_per_row_df.head()

Pandas has a fantastic diff function that lets you calculate the difference between a value in one row and the value in the row before it. For example, I can use it to calculate the travel times between each city:

one_loc_per_row_df['travel_time'] = one_loc_per_row_df.arrival_time.diff()
one_loc_per_row_df.head()
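To see exactly what diff does, here is a minimal sketch on a standalone Series that holds the first three arrival times from above. The names arrivals and gaps are just for this illustration. Note that the first row has nothing before it, so its difference comes back as NaT.

```python
import pandas as pd

# first three arrival times from the dataframe above
arrivals = pd.Series(pd.to_datetime(['2021-12-01 12:00',
                                     '2021-12-03 13:30',
                                     '2021-12-06 08:00']))

# each row minus the row before it; the first row has no predecessor
gaps = arrivals.diff()
print(gaps)
```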

But to calculate my travel distances, I have to take two elements from each row, latitude and longitude, and run them through my haversine_vectorize function together with the same two elements from the neighboring row. So far, I’ve found no way to extend the Pandas diff function to do this. No worries, though: with Pandas, there are often several ways to solve a problem. Enter the shift function.

The Pandas shift function allows you to offset a column, or a whole dataframe, by some number of rows in either direction. For my purposes, I need to “shift” copies of my lon and lat columns down by one row so that each row lines up with the coordinates from the previous row, like so:

one_loc_per_row_df['travel_dist'] = haversine_vectorize(one_loc_per_row_df.lon, one_loc_per_row_df.lat, 
                                                        one_loc_per_row_df.lon.shift(1), one_loc_per_row_df.lat.shift(1))
one_loc_per_row_df.head()
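If the direction of the shift is hard to picture, this small sketch may help. It puts a column next to its shifted copies, using the first three latitudes from above; the prev_lat and next_lat column names are made up for the illustration. With shift(1), each row receives the value from the row before it, so the first row ends up as NaN, which is also why the first travel_dist above is NaN. With shift(-1), each row receives the value from the row after it.

```python
import pandas as pd

lat = pd.Series([12.9716, 23.2599, 19.0760], name='lat')

side_by_side = pd.DataFrame({
    'lat': lat,
    'prev_lat': lat.shift(1),    # value from the previous row
    'next_lat': lat.shift(-1),   # value from the next row
})
print(side_by_side)
```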

So that’s a way to calculate distances between coordinates when your beginning and ending coordinates are in separate records in your dataframe.

But wait, there’s more…

The above work assumes a single traveler, but what if you have data for multiple people in your dataset? Imagine this:

larry_data = {'traveler': ['Larry']*6, 
              'city': ['Bangalore', 'Bhopal', 'Mumbai', 'Delhi', 'Kolkatta', 'Chennai'], 
              'lat': [12.9716, 23.2599, 19.0760, 28.7041, 22.5726, 13.0827], 
              'lon': [77.5946, 77.4126, 72.8770, 77.1025, 88.6390, 80.2707], 
              'arrival_time': [datetime(2021,12,1,12,0,0), datetime(2021,12,3,13,30,0), datetime(2021,12,6,8,0,0), 
                               datetime(2021,12,7,20,30,0), datetime(2021,12,9,12,30,0), datetime(2021,12,15,7,30,0)]}
moe_data = {'traveler': ['Moe']*6,
            'city': ['Miami', 'Atlanta', 'Auburn', 'New Orleans', 'Dallas', 'Houston'], 
            'lat': [25.7616798, 33.7489954, 47.3073228, 29.951065, 32.779167, 29.749907], 
            'lon': [-80.1917902, -84.3879824, -122.2284532, -90.071533, -96.808891, -95.358421], 
            'arrival_time': [datetime(2021,12,1,9,15,0), datetime(2021,12,4,23,30,0), datetime(2021,12,5,8,0,0), 
                             datetime(2021,12,7,14,30,0), datetime(2021,12,10,12,30,0), datetime(2021,12,12,7,30,0)]}
curly_data = {'traveler': ['Curly']*6,
              'city': ['London', 'Liverpool', 'Cambridge', 'Birmingham', 'Oxford', 'Southampton'], 
              'lat':[51.509865, 53.400002, 52.205276, 52.489471, 51.752022, 50.909698], 
              'lon': [-0.118092, -2.983333, 0.119167, -1.898575, -1.257677, -1.404351], 
              'arrival_time': [datetime(2021,12,1,9,0,0), datetime(2021,12,2,13,30,0), datetime(2021,12,4,8,30,0), 
                               datetime(2021,12,6,18,30,0), datetime(2021,12,8,12,30,0), datetime(2021,12,9,7,30,0)]}

travelers_df = pd.concat([pd.DataFrame(larry_data), pd.DataFrame(moe_data), pd.DataFrame(curly_data)]).reset_index(drop=True)
travelers_df.head(20)

To get travel time differences for each of the travelers, we can still use the “diff” function, but first make sure we group by the traveler:

travelers_df['travel_time'] = travelers_df.groupby('traveler').arrival_time.diff()
travelers_df.head(20)
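The grouping matters because, without it, the first row of one traveler would be compared against the last row of the previous traveler. Here is a cut-down sketch with just two rows each for Larry and Moe, taken from the data above (the trips name is only for this example). The difference restarts at NaT on each traveler’s first row.

```python
import pandas as pd

trips = pd.DataFrame({
    'traveler': ['Larry', 'Larry', 'Moe', 'Moe'],
    'arrival_time': pd.to_datetime(['2021-12-01 12:00', '2021-12-03 13:30',
                                    '2021-12-01 09:15', '2021-12-04 23:30'])})

# diff is computed within each traveler, never across travelers
trips['travel_time'] = trips.groupby('traveler').arrival_time.diff()
print(trips)
```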

Calculating the travel distance, though, is slightly more complicated. Grouping by the traveler and then applying the haversine_vectorize function via a lambda expression yields this:

travel_dist = travelers_df.groupby('traveler').\
    apply(lambda r: haversine_vectorize(r.lon, r.lat, r.lon.shift(1), r.lat.shift(1)))

travel_dist

The result set is indexed by both the traveler value and each row’s index from the original dataframe. To add these values back to the original dataframe, then, all I need to do is get rid of the first index:

travelers_df['travel_dist'] = travel_dist.reset_index(level=0, drop=True)
travelers_df
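As an aside, there may be a way to skip the apply and the index cleanup altogether: a groupby object has its own shift method, and its result keeps the original row index. The sketch below is an alternative I’m offering under that assumption, not the approach used above. It uses a cut-down two-traveler dataframe, and the trips and prev names are made up for the example.

```python
import numpy as np
import pandas as pd

def haversine_vectorize(lon1, lat1, lon2, lat2):
    """Same formula as above, repeated so this sketch is self-contained."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    newlon = lon2 - lon1
    newlat = lat2 - lat1
    haver_formula = np.sin(newlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(newlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(haver_formula))

trips = pd.DataFrame({
    'traveler': ['Larry', 'Larry', 'Moe', 'Moe'],
    'lat': [12.9716, 23.2599, 25.7616798, 33.7489954],
    'lon': [77.5946, 77.4126, -80.1917902, -84.3879824]})

# previous coordinates within each traveler; index matches trips row for row
prev = trips.groupby('traveler')[['lon', 'lat']].shift(1)

trips['travel_dist'] = haversine_vectorize(trips.lon, trips.lat, prev.lon, prev.lat)
print(trips)
```

Because prev shares its index with trips, the distances can be assigned straight back as a new column, and each traveler’s first row is NaN just as before.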

Now, with these calculations, you can figure out each person’s average travel speed, look for anomalies such as someone covering a distance faster than normal, and so on.
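Here is a rough sketch of the average speed idea. The distances are invented round numbers rather than real haversine results, and the 25 km/h cutoff is a hypothetical threshold chosen only to show the flagging step. The legs, avg_speed_kmh and too_fast names are likewise made up.

```python
import pandas as pd

legs = pd.DataFrame({
    'arrival_time': pd.to_datetime(['2021-12-01 12:00',
                                    '2021-12-03 13:30',
                                    '2021-12-03 23:30']),
    'travel_dist': [float('nan'), 990.0, 300.0]})   # made-up kilometers

legs['travel_time'] = legs.arrival_time.diff()

# convert the timedelta to hours, then divide distance by time
hours = legs.travel_time.dt.total_seconds() / 3600
legs['avg_speed_kmh'] = legs.travel_dist / hours

# hypothetical cutoff: flag any leg faster than 25 km/h
legs['too_fast'] = legs.avg_speed_kmh > 25
print(legs)
```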
