One of these data files is not like the others

Recently, a friend presented me with an interesting challenge. He had a data file that he wanted to pull into a pandas dataframe; however, the file was substantially different from the CSVs and TSVs he normally parses. Here’s a representation of his data file:

(Image: example of the unconventional data file)
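
The image isn't reproduced here, but based on the parsing logic below, each entry is a dashed delimiter line, then a timestamp, then one price/commodity pair per line. A sample with made-up values, just for illustration:

-----------------------------------------------
2021-05-01 09:30:00

4.07 gold
2.25 silver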

So, how would someone such as myself go about parsing such a file into a dataframe? Well, I would start with some plain Python parsing before handing things off to pandas. Here’s the solution I came up with.

Step 1: Import the packages

import re
import datetime
import pandas as pd

Step 2: Import the file as text and parse it into a list

To start, I split the file on those dashed delimiter lines. Then I iterate over each entry line by line. With each iteration, I use a regular expression to find the timestamp value; then I look for the other properties. Finally, I append a list of the timestamp, the price, and the commodity name to my master list.

delim = '-----------------------------------------------'
ld = re.compile(r'\d{4}-\d{2}-\d{2}')  # a line starting with a date is a timestamp line
log_list = []

with open('commodities.txt', 'r') as f:
    log = f.read()
    
for entry in log.split(delim):
    for line in entry.split('\n'):
        if ld.match(line):
            # Timestamp line: remember it for the data lines that follow
            d = datetime.datetime.strptime(line.strip(), '%Y-%m-%d %H:%M:%S')
        elif line.strip():
            # Data line: first token is the price, second is the commodity name
            fields = line.split()
            log_list.append([d, fields[0], fields[1]])

Step 3: Read the master list into a new dataframe

Once I finish iterating over the file and building out my log list, I can pull it into a dataframe.

df1 = pd.DataFrame(log_list, columns=['log_date', 'price', 'commodity'])
df1.head()
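
One thing to keep in mind: the price column comes through as strings, since I never cast the values during parsing. If you want to do math on it, a one-liner with pandas' to_numeric sorts that out:

df1['price'] = pd.to_numeric(df1['price'])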

But, wait, there’s more!

My friend happens to really like awk. While I was off coding my Python solution, he was busy writing an awk script to do the same. It occurred to me, though, that even if he wanted to solve his problem with awk, he could code it up and run it in Jupyter Notebook. Here’s how you might solve this same problem with awk.

Step 1: Develop the awk script and write it to disk

The %%writefile cell magic basically turns your Jupyter Notebook cell into a text editor where you can easily save your work to a file. Here, I’m coding the awk script and then writing it out to the file my_awk_script.awk.

%%writefile my_awk_script.awk
BEGIN {
    FS = " "     # fields are whitespace-separated
    OFS = ","    # emit comma-separated output
}
{
    # A line starting with the year is the timestamp for the entries below it
    if ( /^20/ )
    {
        dtstamp = $0
    }
    # Any other two-field line is a price/commodity pair
    else if ( NF == 2 )
    {
        price = $1
        commodity = $2
        # Only print when the second field is a commodity name, not a number
        if ( commodity !~ /^[0-9]/ )
        {
            print dtstamp, price, commodity
        }
    }
}
END {}

Step 2: Run the awk file in a bash shell

With my awk script done, I can execute it right from Jupyter Notebook with the help of the %%bash cell magic. One cool thing about this magic is that you can pipe the cell output to a variable for later processing. Here, I’m piping the results of the awk script to the variable awk_output.

%%bash --out awk_output
gawk -f my_awk_script.awk commodities.txt
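
Before cleaning it up, it's worth peeking at the raw capture; wrapping a slice of it in repr makes the line-ending characters visible:

print(repr(awk_output[:200]))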

Step 3: Clean up the output and load it into a dataframe

The output is one long string, with carriage-return and newline characters marking each line break. With a little list comprehension work, though, I can easily get that string ready for reading into a new dataframe:

# splitlines() handles '\r\n' and '\n' alike; the filter drops the trailing empty line
parsed_log = [l.split(',') for l in awk_output.splitlines() if l.strip()]
df2 = pd.DataFrame(parsed_log, columns=['log_date', 'price', 'commodity'])
df2.head()
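
Note that every column in df2 is a string at this point, because everything came through awk as plain text. To line df2 up with df1, the standard pandas converters do the trick:

df2['log_date'] = pd.to_datetime(df2['log_date'])
df2['price'] = pd.to_numeric(df2['price'])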

And there you have it: two ways to parse unconventional data files into a pandas dataframe. Check out my complete Jupyter Notebook here!