
Parsing unconventional data files

One of these data files is not like the others

Recently, a friend presented me with an interesting challenge. He had a data file that he wanted to pull into a pandas dataframe; however, the file was substantially different from the CSVs and TSVs he normally parses. Here’s a representation of his data file:

Example of the unconventional data file
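Based on the parsing code below, the file interleaves timestamp headers and price/commodity pairs between dashed delimiter lines, something like this (these particular values are invented for illustration):

-----------------------------------------------
2020-01-02 09:30:00
1520.25 gold
17.85 silver
-----------------------------------------------
2020-01-03 09:30:00
1528.50 gold
18.10 silver
-----------------------------------------------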

So, how would someone such as myself go about parsing such a file into a dataframe? Well, I would do some pre-processing with plain Python first. Here’s the solution I came up with.

Step 1: Import the packages

import re
import datetime
import pandas as pd

Step 2: Import the file as text and parse it into a list

To start with, I split the file on those dashed delimiter lines. Then, I iterate over each entry line-by-line. With each iteration, I use a regular expression to find the timestamp value; then I look for the other properties. Ultimately, I append a list of the timestamp, the price, and the commodity name to my master list.

delim = '-----------------------------------------------'
ld = re.compile(r'\d{4}-\d{2}-\d{2}')  # lines that start with a date
log_list = []

with open('commodities.txt', 'r') as f:
    log = f.read()

for entry in log.split(delim):
    for line in entry.split('\n'):
        if ld.match(line):
            # timestamp line: hold onto it for the data lines that follow
            d = datetime.datetime.strptime(line, '%Y-%m-%d %H:%M:%S')
        elif len(line.strip()) > 0:
            # data line: first token is the price, second is the commodity
            price = line.strip().split()[0]
            commodity = line.strip().split()[1]
            log_list.append([d, price, commodity])

Step 3: Read the master list into a new dataframe

Once I finish iterating over the file and building out my log list, I can then properly pull it into a dataframe.

df1 = pd.DataFrame(log_list, columns=['log_date', 'price', 'commodity'])
df1.head()

But, wait, there’s more!

My friend happens to really like awk. While I was off coding my Python solution, he was busy writing an awk script to do the same. It occurred to me, though, that even if he wanted to solve his problem with awk, he could code it up and run it in Jupyter Notebook. Here’s how you might solve this same problem with awk.

Step 1: Develop the awk script and write it to disk

The %%writefile magic command basically turns your Jupyter Notebook cell into a text editor where you can easily save your work to a file. Here, I’m coding the awk script and then writing it to the file my_awk_script.awk.

%%writefile my_awk_script.awk
BEGIN {
    FS = " "
    OFS = ","
}
{
    # timestamp lines look like "2020-01-02 09:30:00"
    if ( /^[0-9]{4}-[0-9]{2}-[0-9]{2}/ )
    {
        dtstamp = $0
    }
    # data lines hold exactly two fields: a price and a commodity name
    else if ( NF == 2 )
    {
        price = $1
        commodity = $2
        print dtstamp,price,commodity
    }
}
END {}

Step 2: Run the awk file in a bash shell

With my awk script done, I can execute it right from Jupyter Notebook with the help of the bash magic command. One cool thing about this magic is that you can capture the cell output in a variable for later processing. Here, I’m piping the results of the awk script to the variable awk_output.

%%bash --out awk_output
gawk -f my_awk_script.awk commodities.txt
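Incidentally, if you ever need to do this outside of Jupyter, the standard subprocess module can capture the same output. A minimal sketch, assuming gawk is on your PATH:

import subprocess

# run the awk script against the data file and capture stdout as text
result = subprocess.run(['gawk', '-f', 'my_awk_script.awk', 'commodities.txt'],
                        capture_output=True, text=True)
awk_output = result.stdout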

Step 3: Clean up the output and load it into a dataframe

The output is one long string with carriage-return and newline characters (\r\n) denoting each new line. I can do some list comprehension work on that string, though, and easily get it ready for reading into a new dataframe:

parsed_log = [l.split(',') for l in awk_output.split('\r\n')]
df2 = pd.DataFrame(parsed_log, columns=['log_date', 'price', 'commodity'])
df2.head()
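One small caveat: splitting on '\r\n' assumes Windows-style line endings in the captured output. If your platform emits bare newlines instead, Python’s splitlines() handles either case:

parsed_log = [l.split(',') for l in awk_output.splitlines() if l]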

And there you have it: two ways to parse unconventional data files into a pandas dataframe. Check out my complete Jupyter Notebook here!

Filtering on the most recent version of groups in Pandas

Not the catchiest blog title, but stick with me here.

Recently, a friend came to me with an interesting pandas problem. He basically had a set of survey questions and answers. The surveys were of different software applications, but each application had one or more versions of survey results. His goal was to filter on the most recent survey results of each application.

Spinal Tap: too many drummer changes to keep track of

For demo purposes, consider a dataset of different rock bands and their lineup changes over the years. Van Halen has had about four lineup changes: from the early David Lee Roth years, to the Sammy Hagar years, to the Gary Cherone era, and back to David Lee Roth. ZZ Top, on the other hand, has had only one lineup: the original three members.

The challenge: given a dataset of different bands and their different lineup changes over the years, how might you filter the dataframe to only show the most recent lineup of each band? (Here’s the dataset I’m using.)
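The dataset appears to have one row per band member per lineup version, with band, version, and member columns. Here’s a rough reconstruction of a few rows (the lineups are real, but treat the exact file contents as my approximation):

band,version,member
Van Halen,1,David Lee Roth
Van Halen,1,Eddie Van Halen
Van Halen,1,Alex Van Halen
Van Halen,1,Michael Anthony
Van Halen,2,Sammy Hagar
Van Halen,2,Eddie Van Halen
Van Halen,2,Alex Van Halen
Van Halen,2,Michael Anthony
ZZ Top,1,Billy Gibbons
ZZ Top,1,Dusty Hill
ZZ Top,1,Frank Beard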

I found two ways to solve this problem and a third potential option that bears some consideration. Let’s take a look at these solutions!

Of course, we must first import pandas and load our dataset

import pandas as pd
df = pd.read_csv('bands.csv')

Solution 1: idxmax and merge

The idxmax function returns the row locations where a given column contains the maximum value. With our dataset, then, we can group by the band then run idxmax on the version column. Unfortunately, though, idxmax only returns a single row per band, while a full lineup spans multiple rows (one per member); otherwise, it might have solved our problem straight-away:

df.loc[df.groupby('band').version.idxmax()]

With idxmax, we can still build a “most recent” dataframe: as in the most recent version numbers for each band. Then, we can use pandas merge to join our full dataset to the one representing the most recent versions of each band. This will effectively filter down our full dataframe to one that only shows the most recent versions of each band and, thus, solve our problem:

df_most_recent = df.loc[df.groupby('band').version.idxmax()]
df.merge(df_most_recent, how='right', on=['band', 'version'])
You would drop the “member_y” column, of course
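Alternatively, you can sidestep the duplicate column altogether by merging on just the key columns of the “most recent” dataframe:

df.merge(df_most_recent[['band', 'version']], on=['band', 'version'])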

Solution 2: max and clever filtering

Similar to idxmax, we can use the pandas max function to also find the latest version numbers of each band, but this time as a Series.

df.groupby('band').version.max()

Series objects can easily be cast to dictionary objects, which will be handy in a minute:

most_recent = df.groupby('band').version.max()
dict(most_recent).items()

This next part is really cool: I was not aware you could do this, but you can use the apply function to convert dataframe columns into a Series of tuples. Then, you can use the isin function and some clever list comprehension to match the band/version tuple of the main dataframe to the band/version tuple of the “most recent” Series. The result is the solution we seek: a dataframe of the most recent band lineups:

most_recent = dict(df.groupby('band').version.max())
df[ (df[['band','version']].apply(tuple, axis=1)).isin([(b,v) for b, v in most_recent.items()]) ]
No columns to drop for this solution

A third, half-baked idea

The above two approaches seem to solve this problem decently. One other option to explore is re-indexing the dataframe with a multi-index on the band and lineup version:

df_indexed = df.set_index(['band', 'version'])

Pandas can do some interesting slicing based on indices, including making use of the “cross section” (xs) function. In my example below, I’m filtering the dataframe on the second index level and requesting all rows where it has a value of 1. Basically, give me a dataframe of all the original band lineups:

df_indexed.xs(1, level=1, drop_level=False)

Pretty slick! Now, how do I use this multi-index approach to return the most recent band lineups? Unfortunately, I have not found a way to do that with the xs function. Nevertheless, it would not surprise me if such a way were possible with indexing and multi-indexing.
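That said, here’s one half-baked sketch that seems to work: pull the version level back out of the index and compare it to each band’s maximum with transform (no promises this is the most elegant route):

# pull the version level out of the multi-index as a Series aligned to the frame
versions = df_indexed.index.get_level_values('version').to_series(index=df_indexed.index)

# keep only the rows where the version equals the band's maximum version
mask = versions == versions.groupby(level='band').transform('max')
df_indexed[mask.values]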

You can download my full code solution here.

How do you transpose a Question/Answer dataset?

Recently, a friend came to me with an interesting challenge. He had a dataset of questions and answers where each record contained a single question and the answer to the question. Arguably, this dataset was already in a tidy format, but my friend wanted to transpose the data such that each unique question became a column of its own with the answers as values.

Before I could come to his aid, my friend already found a great answer at Medium.com using the pandas function pivot_table.

Here’s what he did:

Let’s suppose you have this table of question/answer, tab-delimited data:

person	question	answer
Sir Robin	What is your name?	Sir Robin of Camelot
Sir Robin	What is your quest?	To seek the Holy Grail
Sir Robin	What is the capital of Assyria?	I don't know that
Sir Lancelot	What is your name?	Sir Lancelot of Camelot
Sir Lancelot	What is your quest?	To seek the Holy Grail
Sir Lancelot	What is your favorite colour?	Blue
Sir Galahad	What is your name?	Sir Galahad of Camelot
Sir Galahad	What is your quest?	I seek the Grail
Sir Galahad	What is your favorite colour?	"Blue, no Yellow"
King Arthur	What is your name?	"Arthur, King of the Britons"
King Arthur	What is your quest?	I seek the Holy Grail
King Arthur	What is the air speed of an unladened swallow?	What do you mean?  An African or European swallow?

Step 1: Import pandas and read in your data

import pandas as pd

df = pd.read_csv('questions.csv', sep='\t')

Step 2: pivot_table

df_pivotted = df.pivot_table(index='person', values=['answer'], 
                             columns=['question'], aggfunc=lambda x: ' '.join(str(v) for v in x))
df_pivotted.head()
pivot_table does the job nicely

The trick here is the aggfunc operation. The aggfunc parameter is normally used to sum, average, or perform some other type of numeric operation on your values columns. Interestingly, though, you can apparently supply your own custom function to this parameter instead. Here, the Medium.com author’s lambda loops over the answer values in each person/question group and re-joins them with spaces; since each group contains just one answer, this effectively returns the original answer.
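As an aside, since each group holds just one answer, I suspect a built-in aggregator like 'first' would do the same job with less ceremony:

df.pivot_table(index='person', values=['answer'], columns=['question'], aggfunc='first')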

That seems pretty complicated

The use of pivot_table certainly works in this example and it’s pretty sweet to see that you can pass your own custom function to it. However, pandas also has a more generic pivot function. Could that have worked here?

The answer is: yes. When you google pandas pivot vs pivot_table, one of the top responses is this Stackoverflow.com post that suggests pivot_table only allows numerically-typed columns in the values parameter while pivot will take strings. I don’t think this is quite true, since the above example passed a string column to the values parameter, but it does suggest that pivot might be more disposed to working with strings than pivot_table. Let’s give it a try:

df.pivot(index='person', values='answer', columns='question')
Whaddya know?! Pivot can do the job, too!

Not only can pivot do the transformation, it certainly seems less complicated. (One caveat: pivot will raise an error if the same person answered the same question more than once, whereas pivot_table would aggregate the duplicates.) Check out my full code here.
