Python and pandas works well with conventionally formatted spreadsheets like this:
But how do you deal with spreadsheets formatted in unconventional ways, like this?
Here’s my approach to massaging this data into a dataframe I can work with.
Step 1: Go ahead and read in the wonky spreadsheet
Go ahead and read in the spreadsheet, warts and all, into a dataframe. I went ahead and skipped rows 0 and 1 as they were unnecessary:
import pandas as pd
df_raw = pd.read_excel('./data/odd_format.xlsx', skiprows=1).fillna('')
As you’d expect, the results are not very pretty:
Step 2: Figure out where each record starts and stops
Looking at the spreadsheet, I determined that each record starts with a field named “First Name:” and ends with a field named “State:”. If I can put together a list of row indexes that lets me know where each record begins and ends, I should be able to iterate through that list and reformat each record uniformly. Pandas filtering can help with that. To get a list of each “start” row, I can use this code:
df_raw[df_raw['Unnamed: 1']=='First Name:'].index.tolist()
To get a list of each “end” row, I can do this:
df_raw[df_raw['Unnamed: 1']=='State:'].index.tolist()
Finally, I can use Python’s handy zip function to glue both together in a list of tuples that I can easily loop through:
for start_row, end_row in zip(df_raw[df_raw['Unnamed: 1']=='First Name:'].index.tolist(), df_raw[df_raw['Unnamed: 1']=='State:'].index.tolist()):
# loop through each record
Step 3: Collect all key/value pairs per record
Now that I’m able to iterate over each record, I need to be able to capture each key/value pair in each record: each person’s first name, middle name (if available), last name, etc. I can use Python’s range function to loop from the starting row to the ending row of the record and pandas iloc function to zero in on each key and associated value:
person = {} # I need some place to store the keys/values, so let's use a dictionary
for i in range(start_row, end_row+1):
k = df_raw.iloc[i, 1] # the keys are in column 1
v = df_raw.iloc[i, 2] # the values are in column 2
Each record has an empty row in the middle of it, separating “name” properties from “address” properties. I don’t need those empty rows, so I do a quick check before writing the keys and values to my dictionary object:
if len(k.strip()) > 0:
person[k.strip().replace(':', '')] = v
Of course, I need to be writing each of these person objects to a master list, so I do that by appending each object:
people_list.append(person)
Step 4: Create a new dataframe from the people list
Finally, I can take that clean list of dictionaries and generate a new dataframe from it:
df_clean = pd.DataFrame(people_list)
Which renders a nice dataframe from which I can start my analysis:
So, putting it all together, my full code looks like this:
import pandas as pd
df_raw = pd.read_excel('./data/odd_format.xlsx', skiprows=1).fillna('')
people_list = []
for start_row, end_row in zip(df_raw[df_raw['Unnamed: 1']=='First Name:'].index.tolist(), df_raw[df_raw['Unnamed: 1']=='State:'].index.tolist()):
person = {}
for i in range(start_row, end_row+1):
k = df_raw.iloc[i, 1]
v = df_raw.iloc[i, 2]
if len(k.strip()) > 0:
person[k.strip().replace(':', '')] = v
people_list.append(person)
df_clean = pd.DataFrame(people_list)
So, should you encounter similarly unconventionally formatted spreadsheets in the future, hopefully this code will help you find a solution to deal with them!
Recent Comments