
Parsing inconsistent file data with Python

I’ve posted a few of my data-file parsing challenges in the past (here and here, for example), and here’s a slight variation on those examples.

A friend of mine was challenged with parsing data from a file where not every field was guaranteed to be present. Take this for example:

A file with not-so-consistent data

In the above file, the “Nickname” field only appears in one record. Plus, the fields in the last record are in a different order than the others. All we know for sure is that we have a string (/****start record) that indicates when a new record begins and field keys and values are separated by “: “. How would I go about parsing such a file?
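To make this concrete, here is a hypothetical reconstruction of such a file, based on the parsed results shown later in the post:

```
/****start record
Fname: al
LName: bundy
Address: chicago

/****start record
Fname: marcy
LName: darcy
Address: chicago

/****start record
Fname: theodore
LName: cleaver
Address: mayfield
Nickname: the beaver

/****start record
Address: Hill Valley
LName: mcfly
Fname: marty
```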

Step 1: Read in the file with pandas

I could pull in the file with Python’s open and read capabilities, but it’s probably even easier to do it with the pandas read_csv function (even though this is not a true CSV file):

import pandas as pd

df_raw = pd.read_csv('./data/another_odd_format.txt', names=['col1'])

Now we have a “raw” dataframe that looks like so:

Step 2: Find where your records start and end

Finding where your records start is easy enough: find all the rows containing the start-record indicator (/****start record) and add 1. Finding where your records end is easy, too: each record ends just before the next start-record marker, so subtract 2 from every “start” row from the second one on (2 rather than 1, because those start rows already carry the +1). Note that read_csv drops the blank line at the end of each record by default (skip_blank_lines=True), so the blanks don’t figure into the arithmetic. Finally, tack the dataframe’s last index onto the end of the list. Here’s what I did:

new_rec_str = '/****start record'
start_rows = [i+1 for i in df_raw[df_raw.col1.str.startswith(new_rec_str)].index.tolist()]
end_rows = [s-2 for s in start_rows[1:]] + [df_raw.index[-1]]

Step 3: Loop through each start/end record pair and parse away

recs = []  # list to hold the parsed records
for start_row, end_row in zip(start_rows, end_rows):
    new_rec = {}
    for i in range(start_row, end_row + 1):
        line = df_raw.loc[i, 'col1']
        key, value = line.split(': ', 1)  # split on the first ': ' only
        new_rec[key] = value
    recs.append(new_rec)

This gives you a nice list of dictionary objects:

[{'Fname': 'al', 'LName': 'bundy', 'Address': 'chicago'},
 {'Fname': 'marcy', 'LName': 'darcy', 'Address': 'chicago'},
 {'Fname': 'theodore',
  'LName': 'cleaver',
  'Address': 'mayfield',
  'Nickname': 'the beaver'},
 {'Address': 'Hill Valley', 'LName': 'mcfly', 'Fname': 'marty'}]

Step 4: Load your list into a new dataframe

df_clean = pd.DataFrame(recs)

This gives you a neat and tidy dataframe:

If you don’t like the NaN in records missing a “Nickname” value, you can always chain a fillna call onto the dataframe line and replace the NaN values with something like an empty string (''). So, that’s how I would tackle this sort of challenge.
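For instance (a minimal sketch, reusing two of the records from above):

```python
import pandas as pd

recs = [
    {'Fname': 'al', 'LName': 'bundy', 'Address': 'chicago'},
    {'Fname': 'theodore', 'LName': 'cleaver', 'Address': 'mayfield',
     'Nickname': 'the beaver'},
]

# fillna swaps the NaN in the missing Nickname for an empty string
df_clean = pd.DataFrame(recs).fillna('')
print(df_clean.loc[0, 'Nickname'])  # prints an empty line
```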

Filtering dataframes with tuples

Often when I’m working with data, I need to filter it down into sub-groups for more detailed analysis. Usually my needs are simple and I only need to filter on fields independently. Sometimes, though, I need to filter on field combinations, and that work can get complicated.

Even Mr. Hitchcock finds filtering by field combinations challenging

For example, suppose I’m working with a movie dataset and I want to compare Francis Ford Coppola dramas to Alfred Hitchcock mysteries to Martin Scorsese biography pictures. Initially, I might try something like this:

import pandas as pd


# data from: https://www.kaggle.com/mysarahmadbhat/imdb-top-1000-movies
df = pd.read_csv('./data/regex_imdb.csv').fillna(0)
# for simplicity, I'm just going to use the first, comma-delimited genre value
df['Genre'] = df.Genre.apply(lambda g: g.split(',')[0])

directors = ['Francis Ford Coppola', 'Alfred Hitchcock', 'Martin Scorsese']
genres = ['Drama', 'Mystery', 'Biography']
df[(df.Director.isin(directors)) & (df.Genre.isin(genres))]

This code allowed in Scorsese “mysteries”, for example, when I only wanted Scorsese “biographies”.
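A toy example (hypothetical rows standing in for the IMDB data) shows why: isin checks each column independently, so any allowed director paired with any allowed genre slips through:

```python
import pandas as pd

df = pd.DataFrame({
    'Director': ['Alfred Hitchcock', 'Martin Scorsese'],
    'Genre':    ['Mystery',          'Mystery'],
})
directors = ['Alfred Hitchcock', 'Martin Scorsese']
genres = ['Mystery', 'Biography']

# both rows pass, including the unwanted Scorsese "mystery"
hits = df[(df.Director.isin(directors)) & (df.Genre.isin(genres))]
print(len(hits))  # 2
```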

Ok. Let’s then filter the hard way:

df[((df.Director=='Francis Ford Coppola') & (df.Genre=='Drama')) | 
   ((df.Director=='Alfred Hitchcock') & (df.Genre=='Mystery')) | 
   ((df.Director=='Martin Scorsese') & (df.Genre=='Biography'))]

This code does get me the data I want, but it’s long and a little unreadable. Is there a better way? Yes: tuples!

First, let’s take the director/genre combinations we want and put them together in a single list of tuples:

dirs_genres = [('Francis Ford Coppola', 'Drama'), ('Alfred Hitchcock', 'Mystery'), ('Martin Scorsese', 'Biography')]

Now, we can use the apply function to create an on-the-fly director/genre tuple field that we can filter on:

df[df.apply(lambda row: (row.Director, row.Genre), axis=1).isin(dirs_genres)]

We get the same results as before, but with slightly more readable code. Here’s a neat alternative option using Python’s tuple function:

df[df[['Director', 'Genre']].apply(tuple, axis=1).isin(dirs_genres)]

If you anticipate that you might use this director/genre combination a lot in your work, you could consider adding this tuple as a new column to your dataframe:

df['pg'] = df[['Director', 'Genre']].apply(tuple, axis=1)
df.head()

Then, you could do all your filtering directly against that single field.
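For example, continuing with the dirs_genres list from above (sketched here on a toy dataframe rather than the full IMDB file):

```python
import pandas as pd

df = pd.DataFrame({
    'Director': ['Alfred Hitchcock', 'Martin Scorsese', 'Martin Scorsese'],
    'Genre':    ['Mystery',          'Mystery',         'Biography'],
})
df['pg'] = df[['Director', 'Genre']].apply(tuple, axis=1)

dirs_genres = [('Alfred Hitchcock', 'Mystery'), ('Martin Scorsese', 'Biography')]

# filter directly on the pre-built tuple column
matches = df[df.pg.isin(dirs_genres)]
print(len(matches))  # 2: the Scorsese "mystery" row is excluded
```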

Happy filtering!

Positioning Subplots

I’m certainly a big fan of visualizing data. Often, I like to present multiple types of visualizations together to offer a variety of perspectives on the data. For example, I might provide both a bar chart and scatter plot together to provide deeper insight than a single visual would:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

%matplotlib inline


# data from: https://www.kaggle.com/mysarahmadbhat/imdb-top-1000-movies
df = pd.read_csv('./data/regex_imdb.csv').fillna(0)

fig, ax = plt.subplots(1, 2, figsize=(10,6))

d1 = df[df.Year==2019][['Name', 'Gross']].sort_values('Gross').tail(10)
_ = ax[0].barh(d1.Name, d1.Gross)
_ = ax[0].set_xlabel('Gross Earnings')

d2 = df[df.Year==2019][['Run_time', 'Gross', 'Genre']].copy()
d2['Genre'] = d2.Genre.apply(lambda g: g.split(',')[0])

_ = sns.scatterplot(data=d2, x='Run_time', y='Gross', hue='Genre', ax=ax[1])
_ = ax[1].set_xlabel('Runtime (minutes)')

_ = fig.suptitle('Analysis of Movies from 2019')

In this sort of work, I will target specific axes to display specific charts. Thus, in my above example, I explicitly pushed a bar chart to ax[0] and a scatter plot to ax[1].

However, on occasion, circumstances demand that I write the same type of chart to multiple subplots where I change one variable for each. For example, suppose I want to get a quick view of the top 10 movies by gross earnings from 2010 to 2019:

I could write code to target each of these axes explicitly, but that would mean a lot of code and a lot of copy/paste. Instead, I’d rather just write a loop to iterate through the years and write the appropriate bar chart to the appropriate axis.

Looping and rendering the charts come relatively easily to me. What usually trips me up is targeting the right row and column; I often spend most of my time trying to remember how I solved that problem in the past.

Well no more! Hopefully this post will serve as a reference any time I need to do this type of work in the future. Ultimately, my solution is just three lines of code:

nbr_of_rows = 5
nbr_of_cols = 2

coords = [(r, c) for r in range(nbr_of_rows) for c in range(nbr_of_cols)]
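To sanity-check the ordering, the same comprehension on a smaller 2×2 grid yields coordinates in reading order (left to right, top to bottom):

```python
# a smaller 2x2 example of the same comprehension
coords_2x2 = [(r, c) for r in range(2) for c in range(2)]
print(coords_2x2)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```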

Here, I set the number of rows and columns I want in my visual and do some list comprehension to pair those coordinates together in a list. Now, I have a nice, pre-built list of coordinates to leverage in my loop:

fig, ax = plt.subplots(nbr_of_rows, nbr_of_cols, figsize=(12,12))

for i, yr in enumerate(range(2010, 2020)):
    r,c = coords[i]  # grab the pre-built coordinates
    d = df[df.Year==yr][['Name', 'Gross']].sort_values('Gross').tail(10)
    _ = ax[r][c].barh(d.Name, d.Gross)
    _ = ax[r][c].set_title('Top 10 grossing movies in {0}'.format(yr))
    
fig.tight_layout()
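As an aside (not part of the original approach), the axes object that plt.subplots returns is a NumPy array, so its flatten method can pair charts with axes in the same reading order without a coordinate list. A sketch:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots(5, 2, figsize=(12, 12))
axes = ax.flatten()  # reading order: (0,0), (0,1), (1,0), ...

for i, yr in enumerate(range(2010, 2020)):
    axes[i].set_title('Top 10 grossing movies in {0}'.format(yr))

fig.tight_layout()
```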

Awesome: one less thing I have to think about!


© 2024 DadOverflow.com
