I’ve posted a few of my data file parsing challenges in the past–here and here, for example–and here’s a slight variation on those examples.
A friend of mine was challenged with parsing data from a file where not every field was guaranteed to be present. Take this for example:

In the above file, the “Nickname” field only appears in one record. Plus, the fields in the last record are in a different order than the others. All we know for sure is that we have a string (/****start record) that indicates when a new record begins and field keys and values are separated by “: “. How would I go about parsing such a file?
Step 1: Read in the file with pandas
I could pull in the file with Python’s open and read capabilities, but it’s probably even easier to do it with the pandas read_csv function (even though this is not a true CSV file):
1 2 3 | import pandas as pd df_raw = pd.read_csv( './data/another_odd_format.txt' , names = [ 'col1' ]) |
Now we have a “raw” dataframe that looks like so:

Step 2: Find where your records start and end
Finding out where your records start is easy enough: find all the rows with your start record indicator (/****start record) and add 1. After that, finding where your records end is easy, too: subtract 1 from each of the “start record” rows beginning with the second start line. (Technically, I subtracted 2 to account for the empty line at the end of each record.) Then, you can just add that last index of your dataframe on at the end. Here’s what I did:
1 2 3 | new_rec_str = '/****start record' start_rows = [i + 1 for i in df_raw[df_raw.col1. str .startswith(new_rec_str)].index.tolist()] end_rows = [s - 2 for s in start_rows[ 1 :]] + [df_raw.index[ - 1 ]] |
Step 3: Loop through each start/end record pair and parse away
1 2 3 4 5 6 7 | recs = [] # list to contain my parsed records for start_row, end_row in zip (start_rows, end_rows): new_rec = {} for i in range (start_row, end_row + 1 ): line = df_raw.loc[i, 'col1' ] new_rec[line.split( ': ' )[ 0 ]] = line.split( ': ' )[ 1 ] recs.append(new_rec) |
This gives you a nice list of dictionary objects:
1 2 3 4 5 6 7 | [{'Fname': 'al', 'LName': 'bundy', 'Address': 'chicago'}, {'Fname': 'marcy', 'LName': 'darcy', 'Address': 'chicago'}, {'Fname': 'theodore', 'LName': 'cleaver', 'Address': 'mayfield', 'Nickname': 'the beaver'}, {'Address': 'Hill Valley', 'LName': 'mcfly', 'Fname': 'marty'}] |
Step 4: Load your list into a new dataframe
1 | df_clean = pd.DataFrame(recs) |
This gives you a neat and tidy dataframe:

If you don’t like the NaN in records missing a “Nickname” value, you can always add a fillna function to the end of the dataframe line and replace the NaN values with something like empty string (”). So, that’s how I would tackle this sort of challenge.
Recent Comments