I’ve posted a few of my data file parsing challenges in the past (here and here, for example), and here’s a slight variation on those examples.

A friend of mine was challenged with parsing data from a file where not every field was guaranteed to be present. Take this file, for example:

A file with not-so-consistent data

In the above file, the “Nickname” field only appears in one record. Plus, the fields in the last record are in a different order than in the others. All we know for sure is that a marker string (/****start record) indicates where a new record begins, and that field keys and values are separated by “: ”. How would I go about parsing such a file?
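For reference, a file with the same quirks would look something like this (reconstructed from the parsed output further down, so the blank lines and exact formatting are approximate):

/****start record
Fname: al
LName: bundy
Address: chicago

/****start record
Fname: marcy
LName: darcy
Address: chicago

/****start record
Fname: theodore
LName: cleaver
Address: mayfield
Nickname: the beaver

/****start record
Address: Hill Valley
LName: mcfly
Fname: marty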

Step 1: Read in the file with pandas

I could pull in the file with Python’s open and read capabilities, but it’s probably even easier to do it with the pandas read_csv function (even though this is not a true CSV file):

import pandas as pd

df_raw = pd.read_csv('./data/another_odd_format.txt', names=['col1'])  # each line lands as a row in a single 'col1' column

Now we have a “raw” dataframe that looks like so:
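Based on the reconstruction above, the first few rows come out roughly like this (read_csv skips blank lines by default, so they don’t appear as rows):

df_raw.head(6)

                col1
0  /****start record
1          Fname: al
2       LName: bundy
3   Address: chicago
4  /****start record
5       Fname: marcy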

Step 2: Find where your records start and end

Finding where your records start is easy enough: find all the rows containing your start record indicator (/****start record) and add 1. Finding where they end is easy, too: each record ends on the row just before the next marker, so subtract 1 from each marker row from the second one onward. (Technically, I subtracted 2 from the start rows, since those already sit one row past their markers; the empty line at the end of each record doesn’t factor in, because read_csv skips blank lines by default.) Then tack the dataframe’s final index onto the end of the list to close out the last record. Here’s what I did:

new_rec_str = '/****start record'
# each record's data starts one row past its marker row
start_rows = [i+1 for i in df_raw[df_raw.col1.str.startswith(new_rec_str)].index.tolist()]
# each record ends on the row just before the next marker; the last record ends at the final row
end_rows = [s-2 for s in start_rows[1:]] + [df_raw.index[-1]]
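With the reconstructed sample file, those lists work out to something like this (your indices will differ if the file does):

start_rows  # [1, 5, 9, 14] -- first data row of each record
end_rows    # [3, 7, 12, 16] -- last data row of each record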

Step 3: Loop through each start/end record pair and parse away

recs = []  # list to contain my parsed records
for start_row, end_row in zip(start_rows, end_rows):
    new_rec = {}
    for i in range(start_row, end_row + 1):
        line = df_raw.loc[i, 'col1']
        key, value = line.split(': ', 1)  # split on the first ': ' only, in case a value contains one
        new_rec[key] = value
    recs.append(new_rec)

This gives you a nice list of dictionary objects:

[{'Fname': 'al', 'LName': 'bundy', 'Address': 'chicago'},
 {'Fname': 'marcy', 'LName': 'darcy', 'Address': 'chicago'},
 {'Fname': 'theodore',
  'LName': 'cleaver',
  'Address': 'mayfield',
  'Nickname': 'the beaver'},
 {'Address': 'Hill Valley', 'LName': 'mcfly', 'Fname': 'marty'}]

Step 4: Load your list into a new dataframe

df_clean = pd.DataFrame(recs)

This gives you a neat and tidy dataframe:
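In plain text, it comes out roughly like this:

      Fname    LName      Address    Nickname
0        al    bundy      chicago         NaN
1     marcy    darcy      chicago         NaN
2  theodore  cleaver     mayfield  the beaver
3     marty    mcfly  Hill Valley         NaN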

If you don’t like the NaN in records missing a “Nickname” value, you can always chain a fillna call onto that dataframe line and replace the NaN values with something like an empty string ('').
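For example:

df_clean = pd.DataFrame(recs).fillna('')  # swap NaN for empty strings

So, that’s how I would tackle this sort of challenge.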