

Two techniques to replace text with Python

Python makes it pretty darn easy to replace text in a string:

s = 'The quick brown studebaker jumps over the lazy dog'
print(s.replace('studebaker', 'fox'))

However, when I’m cleaning larger data sets, I find myself performing multiple replacement operations. You can chain replace operations together:

s = 'The quick brown studebaker jumps over the lazy chupacabra'
print(s.replace('studebaker', 'fox').replace('chupacabra', 'dog'))

…but if you have a lot of text to replace, chaining replace operations together can go sideways pretty fast. Here are two techniques I’ve found to replace a large amount of text in a cleaner way.

Using the Zip Function

The zip function is a neat way to pair up the elements of two or more iterables for easy iteration. In the case of replacing text, we have one tuple of the text needing to be replaced and a second tuple of the text that will serve as the substitutes:

s = 'The quick blue jackalope jumps under the lazy chupacabra'

old_words = ('blue', 'jackalope', 'under', 'chupacabra')
new_words = ('brown', 'fox', 'over', 'dog')

for check, rep in zip(old_words, new_words):
    s = s.replace(check, rep)
    
print(s)
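If keeping two parallel tuples in sync feels fragile, the same loop works just as well over a dictionary, where each old word sits right beside its substitute. A minimal sketch of that variation, using the same sentence:

s = 'The quick blue jackalope jumps under the lazy chupacabra'

replacements = {'blue': 'brown', 'jackalope': 'fox', 'under': 'over', 'chupacabra': 'dog'}
for check, rep in replacements.items():
    s = s.replace(check, rep)

print(s)  # The quick brown fox jumps over the lazy dog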

Using Replace in a Pandas Dataframe

Often, I’ll have text in a pandas dataframe that I need to replace. For such circumstances, pandas provides a variety of solutions. I’ve found using a dictionary can be a clean way to solve this problem:

import pandas as pd

s = 'The quick blue jackalope jumps under the lazy chupacabra'
df = pd.DataFrame(s.split(' '), columns=['word'])
print(df)  # print the error-laden dataframe

replacements = {'blue': 'brown', 'jackalope': 'fox', 'under': 'over', 'chupacabra': 'dog'}
df['word'] = df.word.replace(replacements)
print(df)  # now print the cleaned up dataframe
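One design note on replace: with a plain dictionary, pandas only swaps cells whose entire value matches a key. If your cells hold longer strings and you need substring replacement instead, replace also accepts a regex=True flag that treats the dictionary keys as patterns; a minimal sketch:

df['word'] = df.word.replace(replacements, regex=True)  # keys now match substrings within each cell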

Happy replacing!

Two convenient techniques to collect financial data for analysis

As I stare college bills in the face and know that retirement awaits in the not-too-distant future, I’m working hard to improve my financial literacy. One way I’m trying to do this and work on my programming and data analysis techniques at the same time is to download financial data directly and do some direct analysis with tools like pandas. Right from the start, I’ve found two convenient ways to download the financial data you wish to examine.

Option 1: quandl

Quandl is a great source for datasets, and they make accessing their data even easier with their API. One big drawback I’ve encountered with the API is that I have yet to get it to work behind my company’s firewall. The only other point to note: if you intend to make more than 50 calls in one day, you’ll need to get a free API key (I note where to set it in the code below).

import quandl
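# if you plan to exceed 50 calls in a day, set your free API key first
# (the key below is just a placeholder):
# quandl.ApiConfig.api_key = "YOUR_API_KEY"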

df_amzn1 = quandl.get("WIKI/AMZN", start_date="2018-01-01", end_date="2019-01-01")
df_amzn1.head()
[Image: the quandl result set]

Option 2: pandas-datareader

Pandas-datareader wraps a lot of interesting APIs and hands the results back to you in the form of a pandas dataframe. In my example, I’m using pandas-datareader to call the Yahoo Finance API to get Amazon stock price information. Apparently, the Yahoo API has changed so much, and so frequently, that the pandas-datareader folks have said “enough, already” and deprecated their support for it. Not content to let go just yet, others have offered up the aptly named fix-yahoo-finance package that can be used to plug the Yahoo hole in pandas-datareader. One other note: unlike quandl, I have successfully used pandas-datareader behind my company’s firewall, so if you find yourself hitting SSL and timeout exceptions at work, you may want to give it a try.

from pandas_datareader import data as pdr
import fix_yahoo_finance as yf

yf.pdr_override()
df_amzn2 = pdr.get_data_yahoo("AMZN", start="2018-01-01", end="2019-01-01")
df_amzn2.head()
[Image: the pandas-datareader result set]

Reading HTML into Dataframes, Part 2

In a previous post, I provided a simple example of using pandas to read tables from a static HTML file you have on disk. This is certainly valid for some use cases. However, if you’re like me, you’ll have other use cases where you’ll want to read tables live from the Internet. Here are some steps for doing that.

Step 1: Select an appropriate “web scraping” package

My go-to Python package for reading files from the Internet is requests. Indeed, I started this example with requests, but quickly found it wouldn’t work with the particular page I wanted to read. Some pages on the Internet already contain their data pre-loaded in the HTML. Requests will work great for such pages. Increasingly, though, web developers are using JavaScript to load data on their pages. Unfortunately, requests isn’t savvy enough to pick up data loaded with JavaScript. So, I had to turn to a slightly more sophisticated approach. Selenium proved to be the solution I needed.

To get Selenium to work for me, I had to perform two operations:

  1. pip/conda install the selenium package
  2. download Mozilla’s gecko driver to my hard drive

Step 2: Import the packages I need

Obviously, you’ll need to import the selenium package, but I also import the Options class, Python’s time package, and pandas itself for reasons I’ll explain later:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time
import pandas as pd  # used later to pipe the table HTML into a dataframe

Step 3: Set up some Options

This is…optional (pun completely intended)…but something I like to do for aesthetic reasons. By default, when you run selenium, a new instance of your browser will launch and run all the commands you programmatically issue to it. This can be very helpful when debugging your code, but it can also get annoying after a while, so I suppress the launch of the browser window with the Options class:

options = Options()
options.headless = True  # stop the browser from popping up

Step 4: Retrieve your page

Next, instantiate a selenium driver and retrieve the page with the data you want to process. Note that I pass the file path of the gecko driver I downloaded to selenium’s driver:

driver = webdriver.Firefox(options=options, executable_path=r"C:\geckodriver-v0.24.0-win64\geckodriver.exe")
driver.get("https://www.federalreserve.gov/monetarypolicy/bst_recenttrends_accessible.htm")

Step 5: Take a nap

The website you’re scraping might take a few seconds to load the data you want, so you might need to slow down your code a little while the page loads. Selenium includes a variety of techniques to wait for the page to load. For me, I’ll just go the easy route and make my program sleep for five seconds:

time.sleep(5)  # wait 5 seconds for the page to load the data
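If guessing at a sleep time bothers you, selenium’s explicit waits will pause only until a specific element shows up. Here’s a minimal sketch that waits up to ten seconds (my arbitrary choice) for the first table node to appear:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block until at least one table element is present in the DOM, or raise after 10 seconds
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'table')))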

Step 6: Pipe your table data into a dataframe

Now we get to the good part: having pandas create a dataframe from the data on the web page. As I explained in Part 1, the data you want must be loaded in a table node on the page you’re scraping. Sometimes pages load data in div tags and the like and use CSS to make it look like the data are in a table, so make sure you view the source of the web page and verify that the data is contained in a table node.

Initially in my example, I tried to pass the entire HTML to the read_html function, but the function was unable to find the tables. I suspect the tables may be too deeply nested in the HTML for pandas to find, but I don’t know for sure. So, I used other features of selenium to find the table elements I wanted and passed that HTML into the read_html function. There are several tables on this page that I’ll want to process eventually, so I’ll have to write a loop to grab them all (see the sketch just after this code). For now, this code only grabs the first table:

df_total_assets = pd.read_html(driver.find_element_by_tag_name("table").get_attribute('outerHTML'))[0]
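When I do get around to that loop, the plural find_elements_by_tag_name should do the trick. A sketch of reading every table on the page into its own dataframe:

# one dataframe per table node on the page
all_tables = [pd.read_html(t.get_attribute('outerHTML'))[0]
              for t in driver.find_elements_by_tag_name('table')]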

Step 7: Keep things neat and tidy

A good coder cleans up his resources when he’s done, so make sure you close your selenium driver once you’ve populated your dataframe:

driver.quit()
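If you’re worried an exception might skip that quit call, plain old try/finally guarantees the cleanup runs no matter what. A sketch of the whole fetch wrapped that way:

driver = webdriver.Firefox(options=options, executable_path=r"C:\geckodriver-v0.24.0-win64\geckodriver.exe")
try:
    driver.get("https://www.federalreserve.gov/monetarypolicy/bst_recenttrends_accessible.htm")
    time.sleep(5)  # or use the explicit wait shown earlier
    df_total_assets = pd.read_html(driver.find_element_by_tag_name("table").get_attribute('outerHTML'))[0]
finally:
    driver.quit()  # runs even if the scrape above throws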

Again, the data you’ve scraped into the dataframe may not be in quite the shape you want it to be, but that’s easily remedied with clever pandas coding. The point is that you’ve saved much time piping this data from its web page directly into your dataframe. To see my full example, check out my code here.
