
Reading HTML into Dataframes, Part 2

In a previous post, I provided a simple example of using pandas to read tables from a static HTML file you have on disk. This is certainly valid for some use cases. However, if you’re like me, you’ll have other use cases where you’ll want to read tables live from the Internet. Here are some steps for doing that.

Step 1: Select an appropriate “web scraping” package

My go-to Python package for reading files from the Internet is requests. Indeed, I started this example with requests but quickly found it wouldn't work with the particular page I wanted to read. Some pages on the internet arrive with their data already present in the HTML, and requests works great for those. Increasingly, though, web developers are using JavaScript to load data onto their pages after the initial HTML arrives. Unfortunately, requests isn't savvy enough to pick up data loaded with JavaScript, so I had to turn to a slightly more sophisticated approach. Selenium proved to be the solution I needed.

To get Selenium to work for me, I had to perform two operations:

  1. pip/conda install the selenium package
  2. download Mozilla’s gecko driver to my hard drive
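If you need those pieces, something along these lines should do the trick (the conda channel, driver version, and path are just examples; grab whatever release of the gecko driver is current):

pip install selenium   # or: conda install -c conda-forge selenium
# download the gecko driver from https://github.com/mozilla/geckodriver/releases
# and unzip it somewhere on your hard drive, e.g. C:\geckodriver-v0.24.0-win64\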

Step 2: Import the packages I need

Obviously, you'll need to import the selenium package, but I also import an Options library and Python's time package for reasons I'll explain later. Since pandas does the dataframe work in Step 6, I import it here as well:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import pandas as pd
import time

Step 3: Set up some Options

This is…optional (pun completely intended)…but something I like to do for mostly aesthetic reasons. By default, when you run selenium, a new instance of your browser will launch and run all the commands you programmatically issue to it. That can be very helpful when you're debugging your code, but it can also get annoying after a while, so I suppress the launch of the browser window with the Options library:

options = Options()
options.headless = True  # stop the browser from popping up

Step 4: Retrieve your page

Next, instantiate a selenium driver and retrieve the page with the data you want to process. Note that I pass the file path of the gecko driver I downloaded to selenium’s driver:

driver = webdriver.Firefox(options=options, executable_path=r"C:\geckodriver-v0.24.0-win64\geckodriver.exe")  # raw string keeps Windows backslashes from being treated as escape characters
driver.get("https://www.federalreserve.gov/monetarypolicy/bst_recenttrends_accessible.htm")

Step 5: Take a nap

The website you’re scraping might take a few seconds to load the data you want, so you might need to slow down your code a little while the page loads. Selenium includes a variety of techniques to wait for the page to load. For me, I’ll just go the easy route and make my program sleep for five seconds:

time.sleep(5)  # wait 5 seconds for the page to load the data
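If guessing at a sleep duration bothers you, one of those fancier waiting techniques is an explicit wait. Here's a rough sketch that polls for up to ten seconds until a table element shows up; the extra imports and the ten-second timeout are my own choices, not part of the original example:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait (up to 10 seconds) for at least one table element to appear in the page
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "table")))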

Step 6: Pipe your table data into a dataframe

Now we get to the good part: having pandas create a dataframe from the data on the web page. As I explained in Part 1, the data you want must be loaded in a table node on the page you’re scraping. Sometimes pages load data in div tags and the like and use CSS to make it look like the data are in a table, so make sure you view the source of the web page and verify that the data is contained in a table node.

Initially in my example, I tried to pass the entire HTML to the read_html function, but the function was unable to find the tables. I suspect the tables may be too deeply nested in the HTML for pandas to find, but I don't know for sure. So, I used other features of selenium to find the table elements I wanted and passed that HTML into the read_html function. There are several tables on this page that I'll probably want to process, so eventually I'll have to write a loop to grab them all; this code only grabs the first table:

df_total_assets = pd.read_html(driver.find_element_by_tag_name("table").get_attribute('outerHTML'))[0]
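Just for illustration, a loop over all the tables might look something like this sketch, using find_elements_by_tag_name (the plural cousin of the call above) to collect every table into its own dataframe:

# grab every table node on the page and turn each one into a dataframe
tables = driver.find_elements_by_tag_name("table")
dfs = [pd.read_html(t.get_attribute("outerHTML"))[0] for t in tables]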

Step 7: Keep things neat and tidy

A good coder cleans up his resources when he’s done, so make sure you close your selenium driver once you’ve populated your dataframe:

driver.quit()

Again, the data you’ve scraped into the dataframe may not be in quite the shape you want it to be, but that’s easily remedied with clever pandas coding. The point is that you’ve saved much time piping this data from its web page directly into your dataframe. To see my full example, check out my code here.

Reading HTML into Dataframes, Part 1

Recently, I asked a co-worker for a list of data on which I needed to work. Instead of sending me his spreadsheet as an email attachment, he pasted his spreadsheet directly into the body of an email. How in the world am I supposed to work with that? Pandas can help!

I saved his email out to disk as an HTML file. Outlook converted his pasted spreadsheet into an HTML table. Then, I just used pandas' read_html function to read the HTML file. It automatically found the table and converted it into a dataframe for me. Problem solved!

Step 1: Save your file as an HTML file

If the data you want to process is in a table in the body of an email, about your only option is to save that email to disk as an HTML file. Save the email, then I'd recommend opening the file in a text editor like Notepad++ and making sure the data you want to process was saved within a table element. In my example here, I simply grabbed three tables of data from the Internet and pasted them all into a single HTML file.

Step 2: Import pandas

import pandas as pd

Step 3: Read in your HTML file

Note that the read_html function returns a list of dataframes:

list_of_dfs = pd.read_html('multiple_tables.html')

Now, with your list of dataframes, you can iterate over it, find the dataframe of the data you want to work with, and have at it.

for df in list_of_dfs:
    print(df.head())

Your data might not be in quite the shape you want, but pandas has lots of ways to shape a dataframe to your particular specifications. The important point is that pandas was able to read in your data in seconds versus the time it would have taken to transform the data into a CSV or some other arrangement for parsing.
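For example, here's a hypothetical bit of cleanup on the first scraped table: promote its first row to be the column headers and drop any completely empty rows (your own tables will no doubt need different treatment):

df = list_of_dfs[0]
df.columns = df.iloc[0]  # use the first row as the header row
df = df.drop(index=0).dropna(how="all").reset_index(drop=True)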

Some of my favorite Linux commands

At work, I administrate several Linux systems. That, plus the bundling of the Windows Subsystem for Linux, has me in the Linux environment quite a lot. Here is a list of several of my go-to Linux commands:

ssh

I use ssh to remotely connect to endpoints that I administrate. A command like:

ssh -i ~/my_key.pem someuser@someendpoint

lets me use a private key to quickly gain access to a server I need to work on.

scp

Once I connect to a remote system, I often have to upload or download files to the system. The “secure copy” (scp) utility does the trick! To, say, download a log file from a remote system to my workstation for analysis, I can run a command like this:

scp -i ~/my_key.pem someuser@someendpoint:my_app.log ./my_app.log
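Going the other direction, an upload just flips the source and destination (the config file name here is only a placeholder):

scp -i ~/my_key.pem ./my_app.conf someuser@someendpoint:my_app.conf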

ps

I often administrate long-running processes, so getting a report of the running processes with the ps command comes in quite handy. Piping that report to grep is even better. Here’s what I do to check for running Python processes:

ps -ef|grep python

df/du

Even in this age of disk space abundance, I still have to pay close attention to disk space on the systems I manage (even my own workstations). Commands like df and du help in this regard. With df, I'll run a simple command like this to get a quick snapshot of the space available to the main drives mounted to my system:

df -h

Occasionally, I’ll have one or a few directories or files larger than others that I should focus on for freeing up disk space. The du command helps me drill down to the problem areas with a command like this:

du -h --max-depth=1

find

Find is a great command for helping me find directories or files that meet certain criteria. Some of the systems I manage write hundreds or thousands of data files to a single directory. I usually archive such data files by month in case I ever need to refer back to them. I’ll use find to find all the files produced in a particular month and pipe those files to a command like tar to archive them. For example, suppose I need to archive all the CSV and TSV files produced in January 2019. I’ll run a command like so:

find . \( -name "*.csv" -o -name "*.tsv" \) -type f -newermt "2019-01-01" ! -newermt "2019-02-01" -print0 | tar -czvf Jan2019.tar.gz --null -T -

For a nice explanation of some of those arguments, check out this article.

Of course, now that I've archived those files, I don't want to leave the originals lying around taking up disk space, so I need to remove them. I will now reuse my find command, this time piping it to xargs and the remove (rm) command:

find . \( -name "*.csv" -o -name "*.tsv" \) -type f -newermt "2019-01-01" ! -newermt "2019-02-01" -print0 | xargs -0 rm

cp/mv/rm/touch

Working with files means creating, copying, moving, renaming, and deleting them…among other tasks. Linux has commands for all these tasks and more with tools like copy (cp), move (mv), remove (rm), and touch, a nice command to quickly create a new, blank file.
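If you've never used them, here's a quick, made-up session showing each of those in action:

touch notes.txt             # create a new, blank file
cp notes.txt notes.bak      # copy it
mv notes.bak old_notes.txt  # move/rename the copy
rm old_notes.txt            # delete it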

cat/head/tail/less/more

There are plenty of tools in Linux for viewing files. For example, cat writes the entire contents of a file to the screen. That’s fine if the file is small, but if it’s large, you might spend the next minute or two watching data fill up and scroll across your screen. If I’m looking for a particular word or phrase, I might pipe cat to grep like so:

cat my_app.log|grep error

I tend to use tail a lot, too, for looking at the last few lines of a log file. With this command, I can look at the last 20 lines of my log file:

tail -20 my_app.log

The great thing now with WSL is that you can use all these powerful Linux commands against your own Windows file system, although Microsoft does caution about getting too crazy with that.

For more interesting actions you can do with Linux commands, check out this article.
