DadOverflow.com

Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

Tag: python (page 1 of 7)

Reading HTML into Dataframes, Part 2

In a previous post, I provided a simple example of using pandas to read tables from a static HTML file you have on disk. This is certainly valid for some use cases. However, if you’re like me, you’ll have other use cases where you’ll want to read tables live from the Internet. Here are some steps for doing that.

Step 1: Select an appropriate “web scraping” package

My go-to Python package for reading files from the Internet is requests. Indeed, I started this example with requests, but quickly found it wouldn’t work with the particular page I wanted to read. Some pages on the internet already contain their data pre-loaded in the HTML. Requests will work great for such pages. Increasingly, though, web developers are using Javascript to load data on their pages. Unfortunately, requests isn’t savvy enough to pick up data loaded with Javascript. So, I had to turn to a slightly more sophisticated approach. Selenium proved to be the solution I needed.

To get Selenium to work for me, I had to perform two operations:

  1. pip/conda install the selenium package
  2. download Mozilla’s gecko driver to my hard drive

Step 2: Import the packages I need

Obviously, you’ll need to import the selenium package, but I also import an Options library and Python’s time package for reasons I’ll explain later:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time

Step 3: Set up some Options

This is…optional (pun completely intended)…but something I like to do for more aesthetic reasons. By default, when you run selenium, a new instance of your browser will launch and run all the commands you programmatically issue to it. This can be very helpful debugging your code, but can also get annoying after a while, so I suppress the launch of the browser window with the Options library:

options = Options()
options.headless = True  # stop the browser from popping up

Step 4: Retrieve your page

Next, instantiate a selenium driver and retrieve the page with the data you want to process. Note that I pass the file path of the gecko driver I downloaded to selenium’s driver:

driver = webdriver.Firefox(options=options, executable_path="C:\geckodriver-v0.24.0-win64\geckodriver.exe")
driver.get("https://www.federalreserve.gov/monetarypolicy/bst_recenttrends_accessible.htm")

Step 5: Take a nap

The website you’re scraping might take a few seconds to load the data you want, so you might need to slow down your code a little while the page loads. Selenium includes a variety of techniques to wait for the page to load. For me, I’ll just go the easy route and make my program sleep for five seconds:

time.sleep(5)  # wait 5 seconds for the page to load the data

Step 6: Pipe your table data into a dataframe

Now we get to the good part: having pandas create a dataframe from the data on the web page. As I explained in Part 1, the data you want must be loaded in a table node on the page you’re scraping. Sometimes pages load data in div tags and the like and use CSS to make it look like the data are in a table, so make sure you view the source of the web page and verify that the data is contained in a table node.

Initially in my example, I tried to pass the entire HTML to the read_html function, but the function was unable to find the tables. I suspect the tables may be too deeply nested in the HTML for pandas to find, but I don’t know for sure. So, I used other features of selenium to find the table elements I wanted and passed that HTML into the read_html function. There are several tables on this page that I’ll probably want to process, so I’ll probably have to write a loop to grab them all. This code only shows me grabbing the first table:

df_total_assets = pd.read_html(driver.find_element_by_tag_name("table").get_attribute('outerHTML'))[0]

Step 7: Keep things neat and tidy

A good coder cleans up his resources when he’s done, so make sure you close your selenium driver once you’ve populated your dataframe:

driver.quit()

Again, the data you’ve scraped into the dataframe may not be in quite the shape you want it to be, but that’s easily remedied with clever pandas coding. The point is that you’ve saved much time piping this data from its web page directly into your dataframe. To see my full example, check out my code here.

Reading HTML into Dataframes, Part 1

Recently, I asked a co-worker for a list of data on which I needed to work. Instead of sending me his spreadsheet as an email attachment, he pasted his spreadsheet directly into the body of an email. How in the world am I supposed to work with that? Pandas can help!

I saved his email out to disk as an HTML file. Outlook converted his pasted spreadsheet into a HTML table. Then, I just used Pandas’ read_html function to read the HTML file. It automatically found the table and converted it into a dataframe for me. Problem solved!

Step 1: Save your file as an HTML file

If the data you want to process is in a table in the body of an email, about your only option is to save that email to disk as an HTML file. Save the email, then I’d recommending opening the file in a text editor like Notepad++ and making sure the data you want to process was saved within a table element. In my example here, I simply grabbed three tables of data from the Internet and pasted them all into a single HTML file.

Step 2: Import pandas

import pandas as pd

Step 3: Read in your HTML file

Note that the read_html function returns a list of dataframes:

list_of_dfs = pd.read_html('multiple_tables.html')

Now, with your list of dataframes, you can iterate over it, find the dataframe of the data you want to work with, and have at it.

for df in list_of_dfs:
    print(df.head())

Your data might not be in quite the shape you want, but pandas has lots of ways to shape a dataframe to your particular specifications. The important point is that pandas was able to read in your data in seconds versus the time it would have taken to transform the data into a CSV or some other arrangement for parsing.

Logging in Python

Python includes a standard logging API that provides all the basic functionality you usually need for logging information about your application. For the most part, I’ve implemented the API as follows:

Step 1: Do the standard imports

Not only do I import the logging package, I also import the os package to map the full path to my log file and the uuid package.

import logging
import uuid
import os

Step 2: Set up some global variables

I usually set up three global variables that I use for logging: current_dir, log_id, and extra. To provide the logging API a full path to my log file, I create a current_dir string that represents the full path to the current directory where my program is running.

Often, after my program has been running for a few weeks, I like to download the log file and gather different metrics on the program. One metric I’m always interested in is how long it takes for my program to run to perform its task (for programs that perform ETL tasks and the like) and is the script speeding up, slowing down, or running about the same over time. The way I do this is by generating a semi-unique value for every time the program runs. I include this unique value–I call it log_id–in every log entry. When I do my analysis, I can group by this log id, easily get the start and end times of the script, calculate the total run time per run, and determine how my script has been doing over time. The easy way to include that log_id in my log entries is to add my own user-defined LogRecord attribute. I do this by creating a dictionary called extra with my log_id key/value pair.

current_dir = os.path.dirname(os.path.realpath(__file__))
log_id = str(uuid.uuid4())[-12:]
extra = {'log_id': log_id}
My log_id helps separate one run of my program from another

Step 3: Create the configuration

Next, I create my configuration by setting the filename to the full path of my log file, my preferred date/time format, the format of the log file itself, and the minimum logging level to log. Traditionally, I’ve always just set up my configuration in the code.

logging.basicConfig(filename=current_dir + '/logger_example.log', datefmt='%Y-%m-%d %H:%M:%S', format='%(asctime)s|%(log_id)s|%(levelname)s|%(message)s', level=logging.INFO)

Step 4: Write your log events

Finally, I can start writing my log events. Since I’m including a user-defined LogRecord attribute, I have to always make sure to include the “extra” argument and pass my extra dictionary to it.

    logging.debug('this is a degug statement', extra=extra)
    logging.info('Did something', extra=extra)

A better way to do this

So, that approach to logging is fine, but I’d like to improve upon it in at least two ways:

  1. I’d like to move my configuration out to a separate properties file so that I can more easily change aspects of the logging configuration, especially the logging level and
  2. I’d like to implement rotating logs so that I can more easily manage the growth of my log files.

I’ve been able to achieve both goals by improving my code as follows:

Improvement Step 1: import logging.config

The first step in moving my logging configuration outside of my code and into a properties file is by importing logging.config:

import logging.config

Improvement Step 2: reference my configuration file

Next, I have to point the logging API to my logging configuration file. In this example, my configuration file is in the same directory as my Python script, so I don’t need to provide a full path to the file:

logging.config.fileConfig('logger_cfg_example.cfg')

Improvement Step 3: Setup my configuration file

Now that I’ve referenced my configuration file, I actually need to set it up.

[loggers]
keys=root

[handlers]
keys=fileHandler

[formatters]
keys=pipeFormatter

[logger_root]
level=INFO
handlers=fileHandler

[handler_fileHandler]
class=handlers.RotatingFileHandler
level=NOTSET
formatter=pipeFormatter
args=('logger_cfg_example.log', 'w', 2000, 3)

[formatter_pipeFormatter]
format=%(asctime)s|%(log_id)s|%(levelname)s|%(message)s
datefmt=
class=logging.Formatter

Check out the documentation to learn more about the configuration file format, what sections are required, and so forth. I’ve highlighted five lines of this configuration file to point out five interesting features:

  • In this example, I’ve set the logging level to INFO but I can now easily change that to any level I wish by editing this file.
  • I’m now able to achieve rotating logs by instructing Python to use the handlers.RotatingHandler class.
  • The RotatingHandler class takes four arguments and I can easily pass those to the class in my configuration file with the “args” key. Here, I’m telling the class to write logs to the file logger_cfg_example.log, open that file for “writing”, rotate the file every 2000 bytes, and only keep three archived log files. Note that log size argument is in bytes. In practice, you’d probably want to roll your log file after so many megabytes. For my testing purposes, I’m just rolling my log file after 2 kilobytes.
  • I can now move my log file format to my configuration file by setting the “format” key. Note that I can still include my user-defined attribute, log_id.
  • Finally, reading the documentation, I discovered that the default date/time format basically matches the format I use most often, so I’ve opted to leave that formatting blank.

Improvement Step 4: Nothing else, just log

With that configuration file, my code looks a little cleaner. I can go ahead and log like normal:

    logging.debug('this is a degug statement', extra=extra)
    logging.info('Did something', extra=extra)

Conclusion

So, going forward, I will start taking advantage of Python’s rotating log file capabilities and the configuration file option. Check out my github project for the full code examples. That’s not quite the end of the story, though. Recently, I was listening to a Python Bytes podcast where the hosts were discussing a Python package called loguru. The hosts seemed pretty excited about the different features of the API. The Python community has authored other logging packages, as well. Clearly, people have found issue with the core API such that they’ve spent time crafting alternatives. Some day, I should explore some of these alternatives and if they’re worth making a change.

« Older posts

© 2019 DadOverflow.com

Theme by Anders NorenUp ↑