Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Two techniques to replace text with Python

Python makes it pretty darn easy to replace text in a string:

s = 'The quick brown studebaker jumps over the lazy dog'
print(s.replace('studebaker', 'fox'))

However, when I’m cleaning larger data sets, I find myself performing multiple replacement operations. You can chain replace operations together:

s = 'The quick brown studebaker jumps over the lazy chupacabra'
print(s.replace('studebaker', 'fox').replace('chupacabra', 'dog'))

…but if you have a lot of text to replace, chaining replace operations together can go sideways pretty fast. Here are two techniques I’ve found to make a large number of text replacements in a cleaner way.

Using the Zip Function

The zip function is a neat way to pair up the items of two tuples for easy iteration. In the case of replacing text, we have a tuple of the text needing to be replaced and a tuple of the substitute text:

s = 'The quick blue jackalope jumps under the lazy chupacabra'

old_words = ('blue', 'jackalope', 'under', 'chupacabra')
new_words = ('brown', 'fox', 'over', 'dog')

for check, rep in zip(old_words, new_words):
    s = s.replace(check, rep)
    
print(s)
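
One thing to watch with the loop above: the replacements run one after another, so a later replacement can rewrite the output of an earlier one (swapping 'cat' and 'dog', for instance, would not work). Here is a minimal sketch of a single-pass alternative that swaps str.replace for re.sub. The replace_all helper is a hypothetical name used only for illustration:

```python
import re

def replace_all(text, replacements):
    """Replace every key of `replacements` found in `text` in a single pass."""
    if not replacements:
        return text
    # Try longer keys first so overlapping keys prefer the longer match
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Each match is looked up in the dictionary exactly once
    return pattern.sub(lambda m: replacements[m.group(0)], text)

s = 'The quick blue jackalope jumps under the lazy chupacabra'
replacements = {'blue': 'brown', 'jackalope': 'fox', 'under': 'over', 'chupacabra': 'dog'}
print(replace_all(s, replacements))
```

Because the text is scanned only once, replaced text is never scanned again, which is what makes a swap like {'cat': 'dog', 'dog': 'cat'} behave as expected.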

Using Replace in a Pandas Dataframe

Often, I’ll have text in a pandas dataframe that I need to replace. For such circumstances, pandas provides a variety of solutions. I’ve found using a dictionary can be a clean way to solve this problem:

import pandas as pd

s = 'The quick blue jackalope jumps under the lazy chupacabra'
df = pd.DataFrame(s.split(' '), columns=['word'])
print(df)  # print the error-laden dataframe

replacements = {'blue': 'brown', 'jackalope': 'fox', 'under': 'over', 'chupacabra': 'dog'}
df['word'] = df.word.replace(replacements)
print(df)  # now print the cleaned up dataframe
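
Note that a dictionary replace like the one above only matches entire cell values, which works here because each row holds a single word. If your cells hold longer text, passing regex=True treats each dictionary key as a regular expression and replaces matches inside the cell. A small sketch, with a made-up 'phrase' column for illustration:

```python
import pandas as pd

# Hypothetical data: each cell holds several words instead of one
df = pd.DataFrame({'phrase': ['quick blue jackalope', 'lazy chupacabra']})
replacements = {'blue': 'brown', 'jackalope': 'fox', 'chupacabra': 'dog'}

# Without regex=True, only whole-cell matches are replaced, so nothing changes
unchanged = df['phrase'].replace(replacements)

# With regex=True, each key is matched as a pattern inside the cell text
cleaned = df['phrase'].replace(replacements, regex=True)
print(cleaned.tolist())
```

Since the keys become regular expressions, any key containing special characters (a period or parenthesis, say) would need escaping first.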

Happy replacing!

Two convenient techniques to collect financial data for analysis

As I stare college bills in the face and know that retirement awaits in the not-too-distant future, I’m working hard to improve my financial literacy. One way I’m trying to do this and work on my programming and data analysis techniques at the same time is to download financial data directly and do some direct analysis with tools like pandas. Right from the start, I’ve found two convenient ways to download the financial data you wish to examine.

Option 1: quandl

Quandl is a great source for datasets, and they make accessing their data even easier with their API. One big drawback I’ve encountered with the API is that I have yet to get it to work behind my company’s firewall. The only other point to note is that if you intend to make more than 50 calls in one day, you’ll need to get a free API key.

import quandl

df_amzn1 = quandl.get("WIKI/AMZN", start_date="2018-01-01", end_date="2019-01-01")
df_amzn1.head()

The quandl result set

Option 2: pandas-datareader

Pandas-datareader wraps a lot of interesting APIs and hands the results back to you in the form of a pandas dataframe. In my example, I’m using pandas-datareader to call the Yahoo Finance API to get Amazon stock price information. Apparently, the Yahoo API has changed so much and so frequently that the pandas-datareader folks have said “enough, already” and deprecated their support of the API. Not content to let go just yet, others have offered up the aptly named fix-yahoo-finance package, which can be used to plug the Yahoo hole in pandas-datareader. One other note: unlike quandl, I have successfully used pandas-datareader behind my company’s firewall. If you find yourself with SSL and timeout exceptions at work, you may want to give pandas-datareader a try.

from pandas_datareader import data as pdr
import fix_yahoo_finance as yf

yf.pdr_override()
df_amzn2 = pdr.get_data_yahoo("AMZN", start="2018-01-01", end="2019-01-01")
df_amzn2.head()

The pandas-datareader result set

Scanning slides, Part 2

Believe it or not, you used to be able to walk right up to the White House

A while back, I wrote about some PowerShell tricks I use as I scan the thousands of slides my dad has amassed over the last five decades. I did leave one small trick out, though, that I wish to share now.

As I scan old photos, I do my best to document every detail I can about the picture: when it was taken, where, who are the people in the photo, etc. One of these days, I’ll figure out a more robust way to store these details, but for now, I write them to text files that I keep in the very folders of the images they describe.

When I first began genealogy back in the 1990s, a lot of the software I worked with professionally used INI configuration files in which a section would begin left-aligned in the file and all other lines for the section would be tab-indented underneath. This is the format I adopted back then and, for consistency’s sake, have continued with ever since:

Example of my current image documentation file

So, you might be asking yourself, “self, what does any of this have to do with PowerShell?” Well, as I scan a set of slides, I’ll house them in their own folder. Then, I’ll run PowerShell like the following to quickly generate a readme/documentation text file to describe the images:

# generate a readme file for the directory
$dir = "C:\my_path\slides\grp_007"
$desc_line1 = "Slide appears to be dated "
$desc_line2 = "Slide is number #.  Photographer was likely John Jones. Slide is labeled 'Kodachrome II Transparency'. Slide was part of a metal container labeled magazine number '2'. Handwritten label on slide case reads, "
Get-ChildItem $dir |
    Where-Object { $_.Extension.ToLower() -eq ".jpg" } |
    ForEach-Object { "{0}`r`n`t{1}`r`n`t{2}`r`n" -f $_.Name, $desc_line1, $desc_line2 } |
    Out-File ("{0}\readme_grp007.txt" -f $dir)

This code will quickly scaffold out my documentation file and save me a lot of typing. Typically, each slide has some sort of handwritten label that I’ll also want to capture in the readme file, so I’ll still have to go through each slide and type out the label corresponding to the image, but most of the slides share many of the same properties and being able to capture all those common properties at once is a great time saver.
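
For anyone who would rather stay in Python, the same scaffolding idea can be sketched roughly as follows. This is an approximation of the PowerShell above, not a drop-in port: scaffold_readme is a hypothetical helper name, the description lines are placeholders, and the demo runs against a temporary folder rather than a real slide directory:

```python
import tempfile
from pathlib import Path

def scaffold_readme(folder, readme_name, desc_lines):
    """Write one left-aligned entry per .jpg, with tab-indented description lines."""
    folder = Path(folder)
    entries = []
    for image in sorted(folder.iterdir()):
        # Match .jpg regardless of case, like the ToLower() check in the PowerShell
        if image.suffix.lower() == ".jpg":
            indented = "".join(f"\t{line}\n" for line in desc_lines)
            entries.append(f"{image.name}\n{indented}")
    readme_path = folder / readme_name
    # A blank line separates one image's entry from the next
    readme_path.write_text("\n".join(entries), encoding="utf-8")
    return readme_path

# Demo with throwaway files; the file names here are made up
with tempfile.TemporaryDirectory() as tmp:
    for name in ("slide_001.jpg", "slide_002.JPG", "notes.txt"):
        (Path(tmp) / name).touch()
    readme = scaffold_readme(tmp, "readme_grp007.txt",
                             ["Slide appears to be dated ", "Slide is number #."])
    print(readme.read_text(encoding="utf-8"))
```

Only the two image files get entries; notes.txt is skipped because of the extension check.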


© 2024 DadOverflow.com
