Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

Parsing Oddly Formatted Spreadsheets

Python and pandas work well with conventionally formatted spreadsheets like this:

Conventional spreadsheet easily parsed in Python

But how do you deal with spreadsheets formatted in unconventional ways, like this?

Can you use pandas to parse an oddly formatted spreadsheet?

Here’s my approach to massaging this data into a dataframe I can work with.

Step 1: Go ahead and read in the wonky spreadsheet

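Read the spreadsheet, warts and all, into a dataframe. I passed skiprows=1 so that pandas skips the first row; it then treats the next row as the header. Where that header row has empty cells, pandas names the columns “Unnamed: 0”, “Unnamed: 1”, and so on. Neither of those first two rows held anything I needed: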

import pandas as pd

df_raw = pd.read_excel('./data/odd_format.xlsx', skiprows=1).fillna('')

As you’d expect, the results are not very pretty:

Step 2: Figure out where each record starts and stops

Looking at the spreadsheet, I determined that each record starts with a field named “First Name:” and ends with a field named “State:”. If I can put together a list of row indexes that lets me know where each record begins and ends, I should be able to iterate through that list and reformat each record uniformly. Pandas filtering can help with that. To get a list of each “start” row, I can use this code:

df_raw[df_raw['Unnamed: 1']=='First Name:'].index.tolist()

To get a list of each “end” row, I can do this:

df_raw[df_raw['Unnamed: 1']=='State:'].index.tolist()
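
To see what those two filters return, here is a minimal sketch on a small, made-up stand-in for df_raw. The names, cities, and the toy variable are assumptions for illustration only; the real spreadsheet isn't reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for df_raw: labels in column 1, values in column 2,
# with blank rows inside and between records (as fillna('') would leave them).
rows = [
    ['', 'First Name:', 'Ann'],
    ['', 'Last Name:', 'Lee'],
    ['', '', ''],
    ['', 'City:', 'Dayton'],
    ['', 'State:', 'OH'],
    ['', '', ''],
    ['', 'First Name:', 'Bob'],
    ['', 'Middle Name:', 'Q'],
    ['', 'Last Name:', 'Ray'],
    ['', '', ''],
    ['', 'City:', 'Austin'],
    ['', 'State:', 'TX'],
]
toy = pd.DataFrame(rows, columns=['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2'])

# A boolean mask keeps only the matching rows; .index gives their row numbers.
start_rows = toy[toy['Unnamed: 1'] == 'First Name:'].index.tolist()
end_rows = toy[toy['Unnamed: 1'] == 'State:'].index.tolist()

print(start_rows)  # [0, 6]
print(end_rows)    # [4, 11]
```

Each start row lines up with the end row in the same position of the other list, which is what makes pairing them possible.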

Finally, I can use Python’s handy zip function to glue both together in a list of tuples that I can easily loop through:

for start_row, end_row in zip(df_raw[df_raw['Unnamed: 1']=='First Name:'].index.tolist(), df_raw[df_raw['Unnamed: 1']=='State:'].index.tolist()):
    pass  # placeholder: each record gets processed here

Step 3: Collect all key/value pairs per record

Now that I’m able to iterate over each record, I need to be able to capture each key/value pair in each record: each person’s first name, middle name (if available), last name, etc. I can use Python’s range function to loop from the starting row to the ending row of the record and pandas’ iloc indexer to zero in on each key and its associated value:

person = {}  # I need some place to store the keys/values, so let's use a dictionary
for i in range(start_row, end_row+1):
    k = df_raw.iloc[i, 1]  # the keys are in column 1
    v = df_raw.iloc[i, 2]  # the values are in column 2

Each record has an empty row in the middle of it, separating “name” properties from “address” properties. I don’t need those empty rows, so I do a quick check before writing the keys and values to my dictionary object:

if len(k.strip()) > 0:
    person[k.strip().replace(':', '')] = v
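
Put together for a single record, the loop and the blank-row check might look like the sketch below. The one-record frame and its values are made up, and wrapping the key in str() is my own defensive assumption in case a label cell ever holds a number rather than text.

```python
import pandas as pd

# Hypothetical one-record stand-in for df_raw (values are made up).
toy = pd.DataFrame(
    [
        ['', 'First Name:', 'Ann'],
        ['', 'Last Name:', 'Lee'],
        ['', '', ''],            # the blank separator row inside a record
        ['', 'City:', 'Dayton'],
        ['', 'State:', 'OH'],
    ],
    columns=['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2'],
)

start_row, end_row = 0, 4
person = {}
for i in range(start_row, end_row + 1):
    k = str(toy.iloc[i, 1]).strip()  # key from column 1
    v = toy.iloc[i, 2]               # value from column 2
    if k:                            # an empty key means a separator row
        person[k.replace(':', '')] = v

print(person)
# {'First Name': 'Ann', 'Last Name': 'Lee', 'City': 'Dayton', 'State': 'OH'}
```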

Of course, I need to collect each of these person dictionaries in a master list (people_list, created as an empty list before the loop), so I append each one once its rows have been processed:

people_list.append(person)

Step 4: Create a new dataframe from the people list

Finally, I can take that clean list of dictionaries and generate a new dataframe from it:

df_clean = pd.DataFrame(people_list)

Which renders a nice dataframe from which I can start my analysis:

That’s a little more like it!
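
One detail worth knowing: the dictionaries don't all need the same keys. In this small sketch (the two people are made up), only the second record has a middle name, and pandas fills the gap with a missing value rather than raising an error.

```python
import pandas as pd

# Made-up records; only the second one has a 'Middle Name' key.
people_list = [
    {'First Name': 'Ann', 'Last Name': 'Lee', 'State': 'OH'},
    {'First Name': 'Bob', 'Middle Name': 'Q', 'Last Name': 'Ray', 'State': 'TX'},
]

df_clean = pd.DataFrame(people_list)

# The column set is the union of all keys; absent values show up as NaN.
print(df_clean)
print(df_clean['Middle Name'].isna().tolist())  # [True, False]
```

If an empty string is easier to work with than NaN for the analysis that follows, a fillna('') on the clean dataframe would even things out.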

So, putting it all together, my full code looks like this:

import pandas as pd

df_raw = pd.read_excel('./data/odd_format.xlsx', skiprows=1).fillna('')
people_list = []
for start_row, end_row in zip(df_raw[df_raw['Unnamed: 1']=='First Name:'].index.tolist(), df_raw[df_raw['Unnamed: 1']=='State:'].index.tolist()):
    person = {}
    for i in range(start_row, end_row+1):
        k = df_raw.iloc[i, 1]
        v = df_raw.iloc[i, 2]
        
        if len(k.strip()) > 0:
            person[k.strip().replace(':', '')] = v
            
    people_list.append(person)
    
df_clean = pd.DataFrame(people_list)

So, should you encounter similarly unconventionally formatted spreadsheets in the future, hopefully this code will help you find a solution to deal with them!
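
If you need this more than once, the same steps could be wrapped in a function. The sketch below is one possible variation, not the code above verbatim: the records_to_frame name, its parameters, and the toy data are all hypothetical. It also adds a sanity check, because zip quietly stops at the shorter list, so a record missing its “State:” row would otherwise shift or drop records without any warning.

```python
import pandas as pd


def records_to_frame(df_raw, key_col=1, val_col=2,
                     first_key='First Name:', last_key='State:'):
    """Hypothetical helper: reshape stacked key/value records into one row each."""
    # Assumes blanks were already filled with '' (as fillna('') does in the post).
    labels = df_raw.iloc[:, key_col].astype(str).str.strip()
    starts = [i for i, label in enumerate(labels) if label == first_key]
    ends = [i for i, label in enumerate(labels) if label == last_key]
    # zip() silently stops at the shorter list, so fail loudly on a mismatch.
    if len(starts) != len(ends):
        raise ValueError(f'{len(starts)} start rows but {len(ends)} end rows')

    people = []
    for start, end in zip(starts, ends):
        person = {}
        for i in range(start, end + 1):
            key = labels.iloc[i]
            if key:  # skip the blank separator rows
                person[key.replace(':', '')] = df_raw.iloc[i, val_col]
        people.append(person)
    return pd.DataFrame(people)


# Toy data standing in for the real spreadsheet (values are made up).
toy = pd.DataFrame([
    ['', 'First Name:', 'Ann'], ['', 'Last Name:', 'Lee'], ['', '', ''],
    ['', 'State:', 'OH'],
    ['', 'First Name:', 'Bob'], ['', 'Last Name:', 'Ray'], ['', '', ''],
    ['', 'State:', 'TX'],
])
df_clean = records_to_frame(toy)
print(df_clean)
```

The helper works by row position rather than index label, so it should behave the same even if the raw dataframe's index isn't the default 0, 1, 2, and so on.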

Easy window positioning with PowerToys

A few years ago, I wrote about a solution I developed for neatly positioning windows, especially command shell windows, in a particular monitor of my multi-monitor setup. The script I wrote positioned windows evenly across the width of the screen. Recently, though, I bought one of those rather wide, curvy screens and decided that, instead of stretching my windows evenly across that width, I’d rather place them in a grid pattern. I sat down to rewrite my script and then remembered Microsoft PowerToys.

When I wrote about PowerToys in the past, it was still pretty fledgling. For example, the FancyZones tool didn’t play well with monitors that sat to the left of your primary monitor (the X coordinate was a negative number and that likely threw off the tool). To my delight, though, these issues have been addressed, and now PowerToys, and FancyZones in particular, is my tool of choice for positioning windows on all my monitors.

The other option worth mentioning is Windows Terminal. Windows Terminal houses most or all of the command shells you probably use: the standard command prompt, PowerShell, Windows Subsystem for Linux, etc. It also lets you lay out these shells however you wish, sort of like a FancyZones for just command shells. I’ve yet to experiment with Windows Terminal, though, so until then, PowerToys will do.

Random numbers with PowerShell

Recently, I was writing some unit tests for a data transformation application I had been developing. I had a sample file of pre-transformed data and decided I wanted my unit tests to just test a few, randomly selected records from the file. My tests would pull in the data file as a list and would iterate through a list of randomly determined indices and test the transformation of each data row. Something like this:

val randomRows = Seq(1, 2, 3, 4, 5)
for (randomRow <- randomRows) {
  // dataList holds the rows read from the sample file
  val dataToTest = dataList(randomRow)
  // transform the data; assert the results
}

But, instead of “1, 2, 3, 4, 5”, I wanted random indices like “432, 260, 397, 175, 98.” How could I quickly achieve this and get different sets of random numbers for the different unit tests I was writing?

Random.org is certainly a good option for picking random numbers. Suppose I had 10 unit tests to write, each needing to test 5 random rows of data. I could generate 50 random numbers like so:

108	62	221	275	342
303	475	234	283	343
184	42	454	102	423
48	348	289	37	493
258	471	461	212	278
175	56	224	405	354
374	124	328	17	171
416	266	415	436	414
93	155	140	382	235
83	382	449	302	170

That’s great, but, annoyingly, I still have to edit these numbers and type commas in between each when I paste them into my code. Is there a way to generate the random numbers I need and automatically format them with commas so that I can easily paste them into my unit test code? PowerShell can do that!

The Get-Random cmdlet

PowerShell has a fantastic cmdlet called Get-Random that allows you easy access to Microsoft’s random number generator features. To use Get-Random to randomly select 5 indices to use in one of my unit tests, I can execute this command at a PowerShell prompt:

0..500 | Get-Random -Count 5

Here, I’m piping a list of numbers–from 0 to 500–to Get-Random and telling the cmdlet to randomly select 5 of them. The result is this:

283
331
212
397
459

The problem is that I’m still no better off than with Random.org: I still must manually comma-delimit these numbers so that they can fit into my code.

Formatting my random numbers

Fortunately, PowerShell includes a handy join operator to make joining my list of random numbers a breeze. All I need to do is surround my original PowerShell command with parentheses and apply a join operation to that result set:

(0..500 | Get-Random -Count 5) -join ", "

And the result:

121, 123, 231, 45, 70

Easy-peasy! I can now drag my mouse over that result, right-click on it to copy the formatted numbers to my clipboard, and then paste the results into my unit test.

But wait, there’s more

That mouse highlighting and right-clicking still seems like a bit of work. Is there anything else I can do to shorten my steps further? Absolutely! PowerShell has another great cmdlet called Set-Clipboard that lets you push PowerShell results right into your clipboard. So, I can just pipe my formatted, random numbers right into the Windows clipboard:

(0..500 | Get-Random -Count 5) -join ", " | Set-Clipboard

Now, once I run the PowerShell command, I can just hop right into my code editor, position my cursor at the appropriate position, and paste in my random numbers. Quite a convenient little command!
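
If PowerShell isn't handy, roughly the same idea can be sketched with Python's standard library. This is an alternative I'm assuming is close enough for the purpose, not a translation of Get-Random: the random_indices helper and its seed parameter are hypothetical names of my own. It picks distinct numbers from an inclusive range and joins them with commas, leaving the copying to the clipboard to you.

```python
import random


def random_indices(low, high, count, seed=None):
    """Hypothetical helper: pick `count` distinct integers from low..high inclusive."""
    rng = random.Random(seed)  # a fixed seed gives the same picks on every run
    return rng.sample(range(low, high + 1), count)


picks = random_indices(0, 500, 5)
formatted = ", ".join(str(n) for n in picks)
print(formatted)  # five comma-separated numbers, different on each run
```

Passing a seed is optional, but it can be useful if you ever want to regenerate the exact same set of indices for a given test.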

© 2025 DadOverflow.com
