
Reading HTML into Dataframes, Part 1

Recently, I asked a co-worker for a list of data on which I needed to work. Instead of sending me his spreadsheet as an email attachment, he pasted his spreadsheet directly into the body of an email. How in the world am I supposed to work with that? Pandas can help!

I saved his email out to disk as an HTML file. Outlook converted his pasted spreadsheet into an HTML table. Then, I just used Pandas’ read_html function to read the HTML file. It automatically found the table and converted it into a dataframe for me. Problem solved!

Step 1: Save your file as an HTML file

If the data you want to process is in a table in the body of an email, about your only option is to save that email to disk as an HTML file. Save the email, then I’d recommend opening the file in a text editor like Notepad++ and making sure the data you want to process was saved within a table element. In my example here, I simply grabbed three tables of data from the Internet and pasted them all into a single HTML file.
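
Once saved, the data should sit inside markup roughly like this (a minimal sketch; Outlook’s actual output will be far noisier, but read_html only cares that a table element is present):

<table>
  <tr><th>name</th><th>amount</th></tr>
  <tr><td>widget</td><td>42</td></tr>
</table>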

Step 2: Import pandas

import pandas as pd

Step 3: Read in your HTML file

Note that the read_html function returns a list of dataframes:

list_of_dfs = pd.read_html('multiple_tables.html')

Now, with your list of dataframes, you can iterate over it, find the dataframe of the data you want to work with, and have at it.

for df in list_of_dfs:
    print(df.head())

Your data might not be in quite the shape you want, but pandas has lots of ways to shape a dataframe to your particular specifications. The important point is that pandas was able to read in your data in seconds versus the time it would have taken to transform the data into a CSV or some other arrangement for parsing.
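
For instance, a couple of typical clean-up steps might look like this (a sketch; which table you keep and what tidying it needs will vary):

# suppose the second table is the one we want
df = list_of_dfs[1]

# normalize the column names and drop any fully-empty rows
df.columns = [str(col).strip().lower() for col in df.columns]
df = df.dropna(how='all')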

Logging in Python

Python includes a standard logging API that provides all the basic functionality you usually need for logging information about your application. For the most part, I’ve used the API as follows:

Step 1: Do the standard imports

Not only do I import the logging package, I also import the os package, which I use to build the full path to my log file, and the uuid package, which I use to generate a semi-unique ID for each run.

import logging
import uuid
import os

Step 2: Set up some global variables

I usually set up three global variables that I use for logging: current_dir, log_id, and extra. To provide the logging API with the full path to my log file, I create a current_dir string that represents the full path to the current directory where my program is running.

Often, after my program has been running for a few weeks, I like to download the log file and gather different metrics on the program. One metric I’m always interested in is how long my program takes to perform its task (for programs that perform ETL tasks and the like) and whether the script is speeding up, slowing down, or running about the same over time. The way I do this is by generating a semi-unique value every time the program runs. I include this unique value, which I call log_id, in every log entry. When I do my analysis, I can group by this log_id, easily get the start and end times of the script, calculate the total run time per run, and determine how my script has been doing over time. The easy way to include that log_id in my log entries is to add my own user-defined LogRecord attribute. I do this by creating a dictionary called extra with my log_id key/value pair.

current_dir = os.path.dirname(os.path.realpath(__file__))
log_id = str(uuid.uuid4())[-12:]
extra = {'log_id': log_id}
My log_id helps separate one run of my program from another
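
Later, when it’s time to gather those run-time metrics, a minimal sketch of the analysis might look like this (assuming the pipe-delimited log format configured in the next step):

import pandas as pd

# read the pipe-delimited log back into a dataframe
logs = pd.read_csv('logger_example.log', sep='|',
                   names=['timestamp', 'log_id', 'level', 'message'],
                   parse_dates=['timestamp'])

# total run time per run: last entry minus first entry for each log_id
run_times = logs.groupby('log_id')['timestamp'].agg(lambda s: s.max() - s.min())
print(run_times.sort_values())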

Step 3: Create the configuration

Next, I create my configuration by setting the filename to the full path of my log file, my preferred date/time format, the format of the log file itself, and the minimum logging level to log. Traditionally, I’ve always just set up my configuration in the code.

logging.basicConfig(filename=current_dir + '/logger_example.log',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    format='%(asctime)s|%(log_id)s|%(levelname)s|%(message)s',
                    level=logging.INFO)

Step 4: Write your log events

Finally, I can start writing my log events. Since I’m including a user-defined LogRecord attribute, I have to always make sure to include the “extra” argument and pass my extra dictionary to it.

    logging.debug('this is a debug statement', extra=extra)
    logging.info('Did something', extra=extra)
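
With this configuration, each line written to logger_example.log should look something like the line below (the timestamp and log_id here are made up, and the debug statement won’t actually be written since the level is set to INFO):

2019-06-01 08:30:00|4266ac180af7|INFO|Did something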

A better way to do this

So, that approach to logging is fine, but I’d like to improve upon it in at least two ways:

  1. I’d like to move my configuration out to a separate properties file so that I can more easily change aspects of the logging configuration, especially the logging level and
  2. I’d like to implement rotating logs so that I can more easily manage the growth of my log files.

I’ve been able to achieve both goals by improving my code as follows:

Improvement Step 1: import logging.config

The first step in moving my logging configuration outside of my code and into a properties file is by importing logging.config:

import logging.config

Improvement Step 2: reference my configuration file

Next, I have to point the logging API to my logging configuration file. In this example, my configuration file is in the same directory as my Python script, so I don’t need to provide a full path to the file:

logging.config.fileConfig('logger_cfg_example.cfg')

Improvement Step 3: Set up my configuration file

Now that I’ve referenced my configuration file, I actually need to set it up.

[loggers]
keys=root

[handlers]
keys=fileHandler

[formatters]
keys=pipeFormatter

[logger_root]
level=INFO
handlers=fileHandler

[handler_fileHandler]
class=handlers.RotatingFileHandler
level=NOTSET
formatter=pipeFormatter
args=('logger_cfg_example.log', 'w', 2000, 3)

[formatter_pipeFormatter]
format=%(asctime)s|%(log_id)s|%(levelname)s|%(message)s
datefmt=
class=logging.Formatter

Check out the documentation to learn more about the configuration file format, what sections are required, and so forth. Five lines of this configuration file deserve special attention:

  • In this example, I’ve set the logging level to INFO, but I can now easily change that to any level I wish by editing this file.
  • I’m now able to achieve rotating logs by instructing Python to use the handlers.RotatingFileHandler class.
  • The RotatingFileHandler class takes four arguments, and I can easily pass those to the class in my configuration file with the “args” key. Here, I’m telling the class to write logs to the file logger_cfg_example.log, open that file for “writing”, roll the file once it reaches 2000 bytes, and keep only three archived log files. Note that the log size argument is in bytes. In practice, you’d probably want to roll your log file after so many megabytes; for my testing purposes, I’m just rolling mine after 2 kilobytes (the sketch after this list shows how those arguments map onto the class).
  • I can now move my log file format to my configuration file by setting the “format” key. Note that I can still include my user-defined attribute, log_id.
  • Finally, reading the documentation, I discovered that the default date/time format basically matches the format I use most often, so I’ve opted to leave that formatting blank.
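
For reference, that args line maps onto the handler’s constructor roughly like this (a sketch of the equivalent programmatic setup, not something you need when using the config file):

from logging.handlers import RotatingFileHandler

# args=('logger_cfg_example.log', 'w', 2000, 3) translates to:
handler = RotatingFileHandler('logger_cfg_example.log', mode='w',
                              maxBytes=2000, backupCount=3)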

Improvement Step 4: Nothing else, just log

With that configuration file, my code looks a little cleaner. I can go ahead and log like normal:

    logging.debug('this is a debug statement', extra=extra)
    logging.info('Did something', extra=extra)

Conclusion

So, going forward, I will start taking advantage of Python’s rotating log file capabilities and the configuration file option. Check out my GitHub project for the full code examples. That’s not quite the end of the story, though. Recently, I was listening to a Python Bytes podcast where the hosts were discussing a Python package called loguru. The hosts seemed pretty excited about the different features of the API. The Python community has authored other logging packages as well. Clearly, people have found enough issues with the core API to spend time crafting alternatives. Some day, I should explore some of these alternatives and see whether they’re worth making a change.

The goodness of Notepad++

Notepad++ was #7 on my list of awesome, free tools and for good reason: it just rocks!

The other day, I found myself working through some online training with Pandas dataframes and friends.  Part of the course included working exercises in the site’s web-based Python console.  When I work such exercises, I like to copy off the code I come up with to local Jupyter Notebooks that I can easily reference in the future.  I also like to copy down whatever test data the exercises have me working with, so I can make sure I calculate the same results locally that I do in the exercises.

Here’s the challenge: how do you copy the dataframes from the online exercises to your local Jupyter Notebook?  Usually, when I want to copy off a dataframe, I’ll call the to_csv() function to save the contents of the dataframe to a CSV file that I can easily transport.  That’s not really an option with online exercises, though.  Here’s a thought: what about the to_dict() function to write the contents of the dataframe to the standard out of the online console as a dictionary?  Then, I can copy that dictionary over to my Notebook.  Let’s see what that looks like:

The view from the online training console

Let’s pretend for a moment that we’re in the online training console and we’re working with some dataset (for simplicity, I’m using the Iris dataset).  The dataframe in the console might look like this:
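
(Here, seaborn’s copy of Iris stands in for the course’s data.)

import seaborn as sns

# load the Iris dataset as a stand-in for the course's data
df = sns.load_dataset('iris')
print(df.head())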

Now, we can use to_dict() to write the dataframe out to the console (note: for brevity, I’m only writing out the first 5 records):
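
# dump the first 5 rows to the console as a dictionary
print(df.head().to_dict())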

Copy the dictionary locally

With the dictionary written out to the online console, let’s copy that output to our clipboard, paste it into a local Notebook, and see if we can now load it into a local dataframe:
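
Something like this, with the dictionary truncated to two columns for readability:

import pandas as pd

# the dictionary copied from the online console, pasted as-is
data = {'sepal_length': {0: 5.1, 1: 4.9, 2: 4.7, 3: 4.6, 4: 5.0},
        'sepal_width': {0: 3.5, 1: 3.0, 2: 3.2, 3: 3.1, 4: 3.6}}
df = pd.DataFrame(data)
print(df.head())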

Well, waddya know?  That worked!  Wait a minute…where’s Notepad++ in all of this?

Yes, this approach works when you’re copying the dictionary directly into a code cell in a local Jupyter Notebook, but what if, instead, you copy that dictionary into a JSON file that you then load into a dataframe?

That approach doesn’t end well.  The reason is that Pandas doesn’t consider that copied dictionary to be valid JSON.  Specifically, it wants all the key names to be surrounded by quotation marks, but the row-index keys in the copied output are bare numbers (e.g. 0: 5.1 instead of "0": 5.1).

Incidentally, you might be asking, why would you copy the dictionary to a file when I’ve already demonstrated that you can copy it directly into a local Notebook?  The big reason is size: copying a dictionary of five rows is fine, but what if the dataframe you’re working with has 200 rows?  That becomes a very long dictionary that really muddies up your local Notebook.  To keep things clean, I find it best to copy such dictionaries to a JSON file.

So, how do you format this copied dictionary so Pandas can load it successfully?  Notepad++!  Notepad++ has a great find/replace feature that lets you use regular expressions.  What I need to do is find all numbers that serve as key names in my dictionary and make sure these keys are surrounded by quotation marks.

For my “find” regular expression, I’ll use this: (\d+):

With this expression, I look for digits that are followed by a colon.  I’ll group those digits so that I can reference them in the “replace”.

For my “replace” expression, I’ll use this: \"\1\":

The \1 refers to my group of digits.  I’m surrounding that group with quotation marks and then making sure the colon follows.  That yields the following:
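
In other words, every bare numeric key gets wrapped in quotation marks, e.g. 0: 5.1, 1: 4.9 becomes "0": 5.1, "1": 4.9 throughout the file.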

And when we load that local JSON file into our local dataframe:
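
For example (iris_sample.json is a hypothetical name for the cleaned-up file):

import pandas as pd

# read the now-valid JSON file into a dataframe
df = pd.read_json('iris_sample.json')
print(df.head())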

…we get success!  So, just one clever way Notepad++ has really helped me out.
