Recently, I asked a co-worker for a list of data on which I needed to work. Instead of sending me his spreadsheet as an email attachment, he pasted his spreadsheet directly into the body of an email. How in the world am I supposed to work with that? Pandas can help!
I saved his email out to disk as an HTML file. Outlook converted his pasted spreadsheet into an HTML table. Then, I just used Pandas’ read_html function to read the HTML file. It automatically found the table and converted it into a dataframe for me. Problem solved!
Step 1: Save your file as an HTML file
If the data you want to process is in a table in the body of an email, about your only option is to save that email to disk as an HTML file. Save the email, then I’d recommend opening the file in a text editor like Notepad++ and making sure the data you want to process was saved within a table element. In my example here, I simply grabbed three tables of data from the Internet and pasted them all into a single HTML file.
Step 2: Import pandas
import pandas as pd
Step 3: Read in your HTML file
Note that the read_html function returns a list of dataframes:
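To make that concrete, here is a minimal, self-contained sketch. The HTML is inlined so the example runs anywhere; in practice you would pass the path to your saved email (a hypothetical name like 'saved_email.html'). Note that read_html needs a parser library such as lxml or html5lib installed alongside pandas.

```python
import pandas as pd
from io import StringIO

# A tiny stand-in for the HTML file saved from the email
html = """
<table>
  <tr><th>name</th><th>count</th></tr>
  <tr><td>alpha</td><td>3</td></tr>
  <tr><td>beta</td><td>5</td></tr>
</table>
"""

# read_html returns a list of dataframes, one per <table> element it finds;
# with a real file you'd call pd.read_html('saved_email.html') instead
list_of_dfs = pd.read_html(StringIO(html))

print(len(list_of_dfs))
print(list_of_dfs[0])
```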
Now, with your list of dataframes, you can iterate over it, find the dataframe of the data you want to work with, and have at it.
for df in list_of_dfs:
    print(df.head())
Your data might not be in quite the shape you want, but pandas has lots of ways to shape a dataframe to your particular specifications. The important point is that pandas was able to read in your data in seconds versus the time it would have taken to transform the data into a CSV or some other arrangement for parsing.
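As a hedged sketch of that kind of reshaping (the column names and values here are invented for illustration), typical first steps are renaming columns and fixing the dtypes that came out of the HTML as strings:

```python
import pandas as pd

# Hypothetical stand-in for a dataframe pulled out of the email
df = pd.DataFrame({"First Name": ["Ada", "Grace"], "Hours": ["10", "12"]})

# Normalize the column names and convert the string column to numbers
df = df.rename(columns={"First Name": "first_name", "Hours": "hours"})
df["hours"] = pd.to_numeric(df["hours"])

print(df.dtypes)
```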
Python includes a standard logging API that provides all the basic functionality you usually need for logging information about your application. For the most part, I’ve implemented the API as follows:
Step 1: Do the standard imports
Not only do I import the logging package, I also import the os package to map the full path to my log file and the uuid package to generate a semi-unique value for each run.
import logging
import uuid
import os
Step 2: Set up some global variables
I usually set up three global variables that I use for logging: current_dir, log_id, and extra. To provide the logging API a full path to my log file, I create a current_dir string that represents the full path to the current directory where my program is running.
Often, after my program has been running for a few weeks, I like to download the log file and gather different metrics on the program. One metric I’m always interested in is how long my program takes to perform its task (for programs that perform ETL tasks and the like) and whether the script is speeding up, slowing down, or running about the same over time. The way I do this is by generating a semi-unique value every time the program runs. I include this unique value, which I call log_id, in every log entry. When I do my analysis, I can group by this log_id, easily get the start and end times of the script, calculate the total run time per run, and determine how my script has been doing over time. The easy way to include that log_id in my log entries is to add my own user-defined LogRecord attribute. I do this by creating a dictionary called extra with my log_id key/value pair.
current_dir = os.path.dirname(os.path.realpath(__file__))
log_id = str(uuid.uuid4())[-12:]
extra = {'log_id': log_id}
My log_id helps separate one run of my program from another
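The analysis described above can be sketched with pandas itself. The ids and timestamps here are made up; in practice you would parse them out of the log file first:

```python
import pandas as pd

# Hypothetical parsed log entries: two runs ('a1' and 'b2'), each with
# a first and a last entry sharing the same log_id
log = pd.DataFrame({
    "log_id": ["a1", "a1", "b2", "b2"],
    "timestamp": pd.to_datetime([
        "2023-01-01 10:00:00", "2023-01-01 10:05:00",
        "2023-01-02 10:00:00", "2023-01-02 10:07:00",
    ]),
})

# Group by run id; last timestamp minus first gives each run's duration
run_times = log.groupby("log_id")["timestamp"].agg(lambda s: s.max() - s.min())
print(run_times)
```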
Step 3: Create the configuration
Next, I create my configuration by setting the filename to the full path of my log file, my preferred date/time format, the format of the log file itself, and the minimum logging level to log. Traditionally, I’ve always just set up my configuration in the code.
Finally, I can start writing my log events. Since I’m including a user-defined LogRecord attribute, I have to always make sure to include the “extra” argument and pass my extra dictionary to it.
logging.debug('this is a debug statement', extra=extra)
logging.info('Did something', extra=extra)
A better way to do this
So, that approach to logging is fine, but I’d like to improve upon it in at least two ways:
I’d like to move my configuration out to a separate properties file so that I can more easily change aspects of the logging configuration, especially the logging level and
I’d like to implement rotating logs so that I can more easily manage the growth of my log files.
I’ve been able to achieve both goals by improving my code as follows:
Improvement Step 1: import logging.config
The first step in moving my logging configuration outside of my code and into a properties file is importing logging.config:
import logging.config
Improvement Step 2: reference my configuration file
Next, I have to point the logging API to my logging configuration file. In this example, my configuration file is in the same directory as my Python script, so I don’t need to provide a full path to the file:
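In code, that is a single fileConfig call. To keep this sketch self-contained it writes a minimal configuration file first; in practice the file (a hypothetical name like 'logging.cfg') already sits next to the script, so only the fileConfig() line is needed:

```python
import logging.config
from pathlib import Path

# For a runnable sketch only: write a minimal config file. Normally this
# file already exists alongside the script.
Path('logging.cfg').write_text("""\
[loggers]
keys=root

[handlers]
keys=fileHandler

[formatters]
keys=plain

[logger_root]
level=INFO
handlers=fileHandler

[handler_fileHandler]
class=FileHandler
formatter=plain
args=('logging_cfg_demo.log', 'a')

[formatter_plain]
format=%(asctime)s %(levelname)s %(message)s
""")

# Point the logging API at the configuration file
logging.config.fileConfig('logging.cfg')
logging.getLogger().info('configured from file')
```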
Check out the documentation to learn more about the configuration file format, what sections are required, and so forth. A few features of this configuration file are worth pointing out:
In this example, I’ve set the logging level to INFO but I can now easily change that to any level I wish by editing this file.
The RotatingFileHandler class takes four arguments, and I can easily pass those to the class in my configuration file with the “args” key. Here, I’m telling the class to write logs to the file logger_cfg_example.log, open that file for “writing”, rotate the file every 2000 bytes, and keep only three archived log files. Note that the log size argument is in bytes. In practice, you’d probably want to roll your log file after so many megabytes; for my testing purposes, I’m just rolling my log file after 2 kilobytes.
I can now move my log file format to my configuration file by setting the “format” key. Note that I can still include my user-defined attribute, log_id.
Finally, reading the documentation, I discovered that the default date/time format basically matches the format I use most often, so I’ve opted to leave that formatting blank.
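Pulling those features together, a configuration file along these lines might look like the following. This is my reconstruction for illustration; the section and formatter names are invented, and the author’s actual file may differ:

```ini
[loggers]
keys=root

[handlers]
keys=rotatingHandler

[formatters]
keys=logFormatter

[logger_root]
# The minimum level to log; easy to change by editing this file
level=INFO
handlers=rotatingHandler

[handler_rotatingHandler]
class=handlers.RotatingFileHandler
formatter=logFormatter
# args: log file, mode, maxBytes (rotate every 2000 bytes), backupCount
args=('logger_cfg_example.log', 'a', 2000, 3)

[formatter_logFormatter]
# Note the user-defined log_id attribute; datefmt is omitted so the
# default date/time format applies
format=%(asctime)s %(levelname)s %(log_id)s %(message)s
```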
Improvement Step 3: Nothing else, just log
With that configuration file, my code looks a little cleaner. I can go ahead and log like normal:
logging.debug('this is a debug statement', extra=extra)
logging.info('Did something', extra=extra)
Conclusion
So, going forward, I will start taking advantage of Python’s rotating log file capabilities and the configuration file option. Check out my github project for the full code examples. That’s not quite the end of the story, though. Recently, I was listening to a Python Bytes podcast where the hosts were discussing a Python package called loguru. The hosts seemed pretty excited about the different features of the API. The Python community has authored other logging packages, as well. Clearly, people have found issues with the core API such that they’ve spent time crafting alternatives. Some day, I should explore some of these alternatives and decide whether they’re worth making a change.
I’m continually trying to strengthen my data science skills, in particular making heavy use of the excellent DataCamp.com. Obviously, data science is steeped in math and any discussion of a machine learning algorithm will inevitably touch on the underlying mathematical concepts.
As I progress in my learning, I’ve been taking notes in Jupyter Notebook because…why not? With its markdown and code capabilities, Jupyter Notebook is a fantastic medium for taking notes on machine learning topics, as those topics are full of both prose explaining the concepts and code executing the algorithms.
In my note taking, when my training material displays a formula, I’ve been trying to rewrite the formula in LaTeX in my notebook. For the most part, I’ve been successful reproducing those formulas, thanks in large part to sites like Overleaf.com. However, my LaTeX still didn’t quite represent the formulas I would see in the training slides. Here’s how I represented the Ridge Regression calculation:
My LaTeX notation looked like this:
$\alpha * \sum_{i=1}^{n}a^{2}_i$
The lower and upper bounds of sigma aren’t…quite…right.
As I was looking through some of Overleaf.com’s content, though, I ran across this statement:
Note, that integral expression may seems a little different in inline and display math mode – in inline mode the integral symbol and the limits are compressed.
I also noted that some of their LaTeX expressions used double dollar signs. So, I changed my Ridge Regression expression to this:
$$\alpha * \sum_{i=1}^{n}a^{2}_i$$
And my expression rendered as such:
This fixed my lower and upper bounds problem, but it shifted the expression to the center of the cell. If you look at the rendered HTML, you’ll see that Jupyter Notebook adds a “text-align: center” style to the expression by default. Changing that style to “left” makes the formula a little more readable:
But not a whole lot. At any rate, it’s interesting to note these two different behaviors of LaTeX and the pros and cons of each option. If you’re so inclined, you can find my LaTeX examples here.