Recently, I asked a co-worker for a list of data on which I needed to work. Instead of sending me his spreadsheet as an email attachment, he pasted his spreadsheet directly into the body of an email. How in the world am I supposed to work with that? Pandas can help!
I saved his email out to disk as an HTML file. Outlook converted his pasted spreadsheet into an HTML table. Then, I just used Pandas’ read_html function to read the HTML file. It automatically found the table and converted it into a dataframe for me. Problem solved!
Step 1: Save your file as an HTML file
If the data you want to process is in a table in the body of an email, about your only option is to save that email to disk as an HTML file. Save the email, then I’d recommend opening the file in a text editor like Notepad++ and making sure the data you want to process was saved within a table element. In my example here, I simply grabbed three tables of data from the Internet and pasted them all into a single HTML file.
Step 2: Import pandas
import pandas as pd
Step 3: Read in your HTML file
Note that the read_html function returns a list of dataframes:
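To make that concrete, here is a minimal, self-contained sketch. The HTML is inlined so the example runs anywhere; in practice you would pass the path to your saved email (a hypothetical name like 'saved_email.html'). Note that read_html needs a parser library such as lxml or html5lib installed alongside pandas.

```python
import pandas as pd
from io import StringIO

# A tiny stand-in for the HTML file saved from the email
html = """
<table>
  <tr><th>name</th><th>count</th></tr>
  <tr><td>alpha</td><td>3</td></tr>
  <tr><td>beta</td><td>5</td></tr>
</table>
"""

# read_html returns a list of dataframes, one per <table> element it finds;
# with a real file you'd call pd.read_html('saved_email.html') instead
list_of_dfs = pd.read_html(StringIO(html))

print(len(list_of_dfs))
print(list_of_dfs[0])
```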
Now, with your list of dataframes, you can iterate over it, find the dataframe of the data you want to work with, and have at it.
for df in list_of_dfs:
    print(df.head())
Your data might not be in quite the shape you want, but pandas has lots of ways to shape a dataframe to your particular specifications. The important point is that pandas was able to read in your data in seconds versus the time it would have taken to transform the data into a CSV or some other arrangement for parsing.
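As a hedged sketch of that kind of reshaping (the column names and values here are invented for illustration), typical first steps are renaming columns and fixing the dtypes that came out of the HTML as strings:

```python
import pandas as pd

# Hypothetical stand-in for a dataframe pulled out of the email
df = pd.DataFrame({"First Name": ["Ada", "Grace"], "Hours": ["10", "12"]})

# Normalize the column names and convert the string column to numbers
df = df.rename(columns={"First Name": "first_name", "Hours": "hours"})
df["hours"] = pd.to_numeric(df["hours"])

print(df.dtypes)
```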
Python includes a standard logging API that provides all the basic functionality you usually need for logging information about your application. For the most part, I’ve implemented the API as follows:
Step 1: Do the standard imports
Not only do I import the logging package, I also import the os package to map the full path to my log file and the uuid package to generate a semi-unique value for each run.
import logging
import uuid
import os
Step 2: Set up some global variables
I usually set up three global variables that I use for logging: current_dir, log_id, and extra. To provide the logging API a full path to my log file, I create a current_dir string that represents the full path to the current directory where my program is running.
Often, after my program has been running for a few weeks, I like to download the log file and gather different metrics on the program. One metric I’m always interested in is how long my program takes to perform its task (for programs that perform ETL tasks and the like) and whether the script is speeding up, slowing down, or running about the same over time. The way I do this is by generating a semi-unique value every time the program runs. I include this unique value, which I call log_id, in every log entry. When I do my analysis, I can group by this log_id, easily get the start and end times of the script, calculate the total run time per run, and determine how my script has been doing over time. The easy way to include that log_id in my log entries is to add my own user-defined LogRecord attribute. I do this by creating a dictionary called extra with my log_id key/value pair.
current_dir = os.path.dirname(os.path.realpath(__file__))
log_id = str(uuid.uuid4())[-12:]
extra = {'log_id': log_id}
My log_id helps separate one run of my program from another
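The analysis described above can be sketched with pandas itself. The ids and timestamps here are made up; in practice you would parse them out of the log file first:

```python
import pandas as pd

# Hypothetical parsed log entries: two runs ('a1' and 'b2'), each with
# a first and a last entry sharing the same log_id
log = pd.DataFrame({
    "log_id": ["a1", "a1", "b2", "b2"],
    "timestamp": pd.to_datetime([
        "2023-01-01 10:00:00", "2023-01-01 10:05:00",
        "2023-01-02 10:00:00", "2023-01-02 10:07:00",
    ]),
})

# Group by run id; last timestamp minus first gives each run's duration
run_times = log.groupby("log_id")["timestamp"].agg(lambda s: s.max() - s.min())
print(run_times)
```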
Step 3: Create the configuration
Next, I create my configuration by setting the filename to the full path of my log file, my preferred date/time format, the format of the log file itself, and the minimum logging level to log. Traditionally, I’ve always just set up my configuration in the code.
Finally, I can start writing my log events. Since I’m including a user-defined LogRecord attribute, I have to always make sure to include the “extra” argument and pass my extra dictionary to it.
logging.debug('this is a debug statement', extra=extra)
logging.info('Did something', extra=extra)
A better way to do this
So, that approach to logging is fine, but I’d like to improve upon it in at least two ways:
I’d like to move my configuration out to a separate properties file so that I can more easily change aspects of the logging configuration, especially the logging level and
I’d like to implement rotating logs so that I can more easily manage the growth of my log files.
I’ve been able to achieve both goals by improving my code as follows:
Improvement Step 1: import logging.config
The first step in moving my logging configuration outside of my code and into a properties file is importing logging.config:
import logging.config
Improvement Step 2: reference my configuration file
Next, I have to point the logging API to my logging configuration file. In this example, my configuration file is in the same directory as my Python script, so I don’t need to provide a full path to the file:
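In code, that is a single fileConfig call. To keep this sketch self-contained it writes a minimal configuration file first; in practice the file (a hypothetical name like 'logging.cfg') already sits next to the script, so only the fileConfig() line is needed:

```python
import logging.config
from pathlib import Path

# For a runnable sketch only: write a minimal config file. Normally this
# file already exists alongside the script.
Path('logging.cfg').write_text("""\
[loggers]
keys=root

[handlers]
keys=fileHandler

[formatters]
keys=plain

[logger_root]
level=INFO
handlers=fileHandler

[handler_fileHandler]
class=FileHandler
formatter=plain
args=('logging_cfg_demo.log', 'a')

[formatter_plain]
format=%(asctime)s %(levelname)s %(message)s
""")

# Point the logging API at the configuration file
logging.config.fileConfig('logging.cfg')
logging.getLogger().info('configured from file')
```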
Check out the documentation to learn more about the configuration file format, what sections are required, and so forth. A few features of this configuration file are worth pointing out:
In this example, I’ve set the logging level to INFO but I can now easily change that to any level I wish by editing this file.
The RotatingFileHandler class takes four arguments, and I can easily pass those to the class in my configuration file with the “args” key. Here, I’m telling the class to write logs to the file logger_cfg_example.log, open that file for “writing”, rotate the file every 2000 bytes, and keep only three archived log files. Note that the log size argument is in bytes. In practice, you’d probably want to roll your log file after so many megabytes; for my testing purposes, I’m just rolling my log file after 2 kilobytes.
I can now move my log file format to my configuration file by setting the “format” key. Note that I can still include my user-defined attribute, log_id.
Finally, reading the documentation, I discovered that the default date/time format basically matches the format I use most often, so I’ve opted to leave that formatting blank.
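Pulling those features together, a configuration file along these lines might look like the following. This is my reconstruction for illustration; the section and formatter names are invented, and the author’s actual file may differ:

```ini
[loggers]
keys=root

[handlers]
keys=rotatingHandler

[formatters]
keys=logFormatter

[logger_root]
# The minimum level to log; easy to change by editing this file
level=INFO
handlers=rotatingHandler

[handler_rotatingHandler]
class=handlers.RotatingFileHandler
formatter=logFormatter
# args: log file, mode, maxBytes (rotate every 2000 bytes), backupCount
args=('logger_cfg_example.log', 'a', 2000, 3)

[formatter_logFormatter]
# Note the user-defined log_id attribute; datefmt is omitted so the
# default date/time format applies
format=%(asctime)s %(levelname)s %(log_id)s %(message)s
```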
Improvement Step 3: Nothing else, just log
With that configuration file, my code looks a little cleaner. I can go ahead and log like normal:
logging.debug('this is a debug statement', extra=extra)
logging.info('Did something', extra=extra)
Conclusion
So, going forward, I will start taking advantage of Python’s rotating log file capabilities and the configuration file option. Check out my github project for the full code examples. That’s not quite the end of the story, though. Recently, I was listening to a Python Bytes podcast where the hosts were discussing a Python package called loguru. The hosts seemed pretty excited about the different features of the API. The Python community has authored other logging packages, as well. Clearly, people have found issues with the core API such that they’ve spent time crafting alternatives. Some day, I should explore some of these alternatives and decide whether they’re worth making a change.
I’m continually trying to strengthen my data science skills, in particular making heavy use of the excellent DataCamp.com. Obviously, data science is steeped in math and any discussion of a machine learning algorithm will inevitably touch on the underlying mathematical concepts.
As I progress in my learning, I’ve been taking notes in Jupyter Notebook because…why not? With its markdown and code capabilities, Jupyter Notebook is a fantastic medium for taking notes on machine learning topics, as those topics are full of both prose explaining the concepts and code executing the algorithms.
In my note taking, when my training material displays a formula, I’ve been trying to rewrite the formula in LaTeX in my notebook. For the most part, I’ve been successful reproducing those formulas, thanks in large part to sites like Overleaf.com. However, my LaTeX still didn’t quite represent the formulas I would see in the training slides. Here’s how I represented the Ridge Regression calculation:
My LaTeX notation looked like this:
$\alpha * \sum_{i=1}^{n}a^{2}_i$
The lower and upper bounds of sigma aren’t…quite…right.
As I was looking through some of Overleaf.com’s content, though, I ran across this statement:
Note, that integral expression may seems a little different in inline and display math mode – in inline mode the integral symbol and the limits are compressed.
I also noted that some of their LaTeX expressions used double dollar signs. So, I changed my Ridge Regression expression to this:
$$\alpha * \sum_{i=1}^{n}a^{2}_i$$
And my expression rendered as such:
This fixed my lower and upper bounds problem, but it shifted the expression to the center of the cell. If you look at the rendered HTML, you’ll see that Jupyter Notebook adds a “text-align: center” style to the expression by default. Changing that style to “left” makes the formula a little more readable:
But not a whole lot. At any rate, it’s interesting to note these two different behaviors of LaTeX and the pros and cons of each option. If you’re so inclined, you can find my LaTeX examples here.