Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

Category: technology

The goodness of Notepad++

Notepad++ was #7 on my list of awesome, free tools and for good reason: it just rocks!

The other day, I found myself working through some online training with Pandas dataframes and friends.  Part of the course included working exercises in the site’s web-based Python console.  When I work such exercises, I like to copy off the code I come up with to local Jupyter Notebooks that I can easily reference in the future.  I also like to copy down whatever test data the exercises have me working with, so I can make sure I calculate the same results locally that I do in the exercises.

Here’s the challenge: how do you copy the dataframes from the online exercises to your local Jupyter Notebook?  Usually, when I want to copy off a dataframe, I’ll call the to_csv() function to save the contents of the dataframe to a CSV file that I can easily transport.  That’s not really an option with online exercises, though.  Here’s a thought: what about the to_dict() function to write the contents of the dataframe to the standard out of the online console as a dictionary?  Then, I can copy that dictionary over to my Notebook.  Let’s see what that looks like:

The view from the online training console

Let’s pretend for a moment that we’re in the online training console and we’re working some dataset (for simplicity, I’m using the Iris dataset).  The dataframe in the console might look like this:

Now, we can use to_dict() to write the dataframe out to the console (note: for brevity, I’m only writing out the first 5 records):
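The console output isn't reproduced here, so here is a minimal sketch of that step. It assumes a small hand-typed frame holding the first five Iris rows; the column names are my own choice and may not match the ones used in the online exercise.

```python
import pandas as pd

# First five rows of the Iris dataset (three columns only, to keep it short).
# The column names are an assumption, not taken from the training site.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "species": ["setosa"] * 5,
})

# to_dict() defaults to orient="dict", i.e. {column -> {index -> value}}
as_dict = df.head(5).to_dict()
print(as_dict)
```

The printed dictionary is what you would select and copy out of the console.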

Copy the dictionary locally

With the dictionary written out to the online console, let’s copy that output to our clipboard, paste it into a local Notebook, and see if we can now load it into a local dataframe:
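As a rough sketch of the paste-and-load step, assuming the copied output looked something like the trimmed dictionary below (hypothetical column names, three rows only):

```python
import pandas as pd

# Dictionary pasted from the console output (trimmed for space).
copied = {
    'sepal_length': {0: 5.1, 1: 4.9, 2: 4.7},
    'species': {0: 'setosa', 1: 'setosa', 2: 'setosa'},
}

# The outer keys become columns and the inner keys become the index.
df_local = pd.DataFrame(copied)
print(df_local)
```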

Well, waddya know?  That worked!  Wait a minute…where’s Notepad++ in all of this?

Yes, this approach works when you’re copying the dictionary directly into a code cell in a local Jupyter Notebook, but what if, instead, you copy that dictionary into a JSON file that you then load into a dataframe?

That approach doesn’t end well.  The reason is that Pandas doesn’t consider the copied dictionary to be valid JSON.  Specifically, JSON requires every key name to be a string surrounded by double quotation marks, and the numeric index keys that to_dict() writes out are bare numbers.
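To see the problem in isolation, here is a small sketch that uses Python's built-in json module as a stand-in for the Pandas loader. The sample strings are hypothetical, trimmed versions of a copied dictionary.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Bare integer keys, as written out by to_dict(): not valid JSON
bare_keys = '{"sepal_length": {0: 5.1, 1: 4.9}}'
# Same data with the keys quoted: valid JSON
quoted_keys = '{"sepal_length": {"0": 5.1, "1": 4.9}}'

print(is_valid_json(bare_keys))
print(is_valid_json(quoted_keys))
```

One related thing to watch for: Python prints string keys and values with single quotes, and JSON rejects those too, so they may need swapping for double quotes as well.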

Incidentally, you might be asking, why would you copy the dictionary to a file when I’ve already demonstrated that you can copy it directly into a local Notebook?  The big reason is size: copying a dictionary of five rows is fine, but what if the dataframe you’re working with has 200 rows?  That becomes a very long dictionary that really muddies up your local Notebook.  To keep things clean, I find it best to copy such dictionaries to a JSON file.

So, how do you format this copied dictionary so Pandas can load it successfully?  Notepad++!  Notepad++ has a great find/replace feature that lets you use regular expressions.  What I need to do is find all numbers that serve as key names in my dictionary and make sure these keys are surrounded by quotation marks.

For my “find” regular expression, I’ll use this: (\d+):

With this expression, I look for digits that are followed by a colon.  I’ll group those digits so that I can reference them in the “replace”.

For my “replace” expression, I’ll use this: \"\1\":

The \1 refers to my group of digits.  I’m surrounding that group with quotation marks and then making sure the colon follows.  That yields the following:
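If you'd rather stay in Python, the same substitution can be sketched with the re module. This is an equivalent, not what I did in Notepad++, and the sample text is a hypothetical trimmed paste. Like the Notepad++ version, the pattern quotes any run of digits followed by a colon, so it would also touch values such as times.

```python
import json
import re

# Hypothetical copied text with bare integer keys
copied_text = '{"sepal_length": {0: 5.1, 1: 4.9}, "sepal_width": {0: 3.5, 1: 3.0}}'

# Find: digits followed by a colon, digits captured as group 1
# Replace: the same digits wrapped in double quotes, then the colon
fixed_text = re.sub(r'(\d+):', r'"\1":', copied_text)
print(fixed_text)

# The result now parses as JSON
parsed = json.loads(fixed_text)
```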

And when we load that local JSON file into our local dataframe:
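A minimal sketch of the load, assuming the cleaned-up JSON looks like the hypothetical string below. With a real file you would pass its path to read_json instead of the StringIO wrapper.

```python
import io

import pandas as pd

# Stand-in for the contents of the cleaned-up JSON file (hypothetical data)
json_text = (
    '{"sepal_length": {"0": 5.1, "1": 4.9},'
    ' "species": {"0": "setosa", "1": "setosa"}}'
)

# read_json accepts a path or a file-like object
df_local = pd.read_json(io.StringIO(json_text))
print(df_local)
```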

…we get success!  So, just one clever way Notepad++ has really helped me out.

Documenting your Jupyter notebooks

A recent episode of the excellent podcast Talk Python to Me discussed an effort to collect and analyze some one million Jupyter Notebooks on GitHub.  Unsurprisingly, one conclusion drawn by the analyst is that notebook authors are not good at documenting their work.  I find that a little sad, given how rich Jupyter markdown is.

I have found the markdown syntax to be a little confusing, but recently I found this great “cheatsheet” that has helped:

I haven’t had a whole lot of opportunity to work with LaTeX, but when I have, it has been a challenge.  Here’s a cheatsheet that’s been helpful in the past:

 

Annotating the War on Poverty

The other day, I was listening to the Contra Krugman episode entitled “How to Unwind the Welfare State”. Toward the end of the discussion, the hosts began listing examples of private organizations in the free market solving social problems, only to be stymied when the federal government inserted itself into the situation. Host Bob Murphy referenced an article he wrote for FEE arguing that, in the 1950s and 60s, the free market was already lifting people out of poverty at a pretty good clip. Then Lyndon Johnson and the federal government jumped on the bandwagon halfway through and claimed that it was their legislation, not the free market, that did all the heavy lifting.

I couldn’t find the article Bob was referencing (maybe it was this?); nevertheless, it occurred to me this might be an opportunity to improve my matplotlib skills. Maybe I could find the official US poverty numbers, plot them out, then annotate the plot with markers indicating when key legislation in the War on Poverty was enacted. Would this convey the point Bob was making?  Here are highlights of what I did (the full code is available on my GitHub page):

Step 1: Get the data

Is it just me, or is it confusing to download the data you want from the US government?  The US Census Bureau publishes the poverty numbers, but I had a hard time working out which numbers I needed and for which time period.  I finally found a dataset I could use on the page Historical Poverty Tables: People and Families – 1959 to 2016.

Step 2: Load the data

Here’s a snippet of the spreadsheet I downloaded:

Makes sense, I guess, but it took me a while to figure out an optimal way to load the spreadsheet into a dataframe with Pandas.  In the end, though, it only took two lines of code:


import pandas as pd

df_pov = pd.read_excel('./hstpov9.xls', header=[3, 4, 5], index_col=0)
df_pov = df_pov[:-1]  # drop the last row as it's just a footnote

Step 3: Get some legislation dates

Wikipedia to the rescue!  Wikipedia called out four major pieces of legislation in the War on Poverty:

  • The Economic Opportunity Act of 1964 – August 20, 1964
  • Food Stamp Act of 1964 – August 31, 1964
  • Elementary and Secondary Education Act – April 11, 1965
  • Social Security Act 1965 (Created Medicare and Medicaid) – July 30, 1965

Step 4: Plot time?  Not so fast!

So, the major pieces of legislation happened in 1964 and 1965.  Now, I can plot the poverty rate from the dataset I have and then add annotations at years 1964 and 1965.  Er, wait a minute…the dataset is missing the poverty rate from those years!  In fact, it’s missing all the years between 1960 and 1969.  Weird!  How will I know, on the plot, where to place my annotations?  Well, Pandas can figure that out with its handy interpolate function!  Only two lines of code to do the calculation!


# create a dataframe for the data I'm missing
df_gap_data = df_pov.loc[[1960, 1969], ('Total', 'Below poverty')]
# create rows for the missing years and let Pandas interpolate a best guess at
# what the poverty rate was during those years
full_years = pd.RangeIndex(df_gap_data.index.min(), df_gap_data.index.max() + 1)
df_gap_data = df_gap_data.reindex(full_years).interpolate()
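To make the reindex-then-interpolate idea easier to follow, here is a stripped-down sketch on a two-point Series. The rates below are round, made-up numbers chosen so the arithmetic is easy to check; they are not the Census figures.

```python
import pandas as pd

# Hypothetical poverty rates for the two known years (NOT the real data)
known = pd.Series({1960: 22.0, 1969: 13.0})

# reindex() adds a NaN row for every missing year in between...
all_years = pd.RangeIndex(known.index.min(), known.index.max() + 1)
with_gaps = known.reindex(all_years)

# ...and interpolate() fills those NaNs; its default method is linear
filled = with_gaps.interpolate()
print(filled.loc[[1964, 1965]])
```

With nine equal steps between the two endpoints, each year drops by one point, so 1964 and 1965 land at roughly 18 and 17 in this toy example.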

Step 5: Now, plot time!

Now that I know where to place my annotations, here’s what I came up with for the plot:


from matplotlib.patches import Ellipse

laws = [('Education Act', 1965), ('Social Security Act', 1965),
        ('Economic Opportunity Act', 1964), ('Food Stamp Act', 1964)]
title = 'Total Below Poverty Percentage, United States, with annotations'
y_offset = 0  # offset counter for the text block annotations

# plot the poverty rate
ax = df_pov.sort_index().loc[:, ('Total', 'Below poverty', 'Percent')].plot(title=title, figsize=(12, 10))
ax.set_xlabel('Year')
ax.set_ylabel('Percent below poverty')

# loop through the legislation so I can add those annotations
for name, year in laws:
    y_offset += 30
    # interpolated poverty rate for the year the law was enacted
    percent = df_gap_data.loc[year, 'Percent']
    # small marker on the line for the arrow to point at
    ci = Ellipse((year, percent), width=0.5, height=0.1, color='black', zorder=5)
    ax.add_patch(ci)

    ax.annotate(name,
                xy=(year, percent), xycoords='data',
                xytext=(175, 300 + y_offset), textcoords='axes points',
                size=20,
                bbox=dict(boxstyle="round", fc="0.8"),
                arrowprops=dict(arrowstyle="->", color='black', patchB=ci,
                                connectionstyle="angle3,angleA=0,angleB=-90"))

 

And that rendered the plot at the top of this post.  Does that chart illustrate the point Bob Murphy was trying to make in the podcast?  I think so, but take a listen for yourself and let me know.  The big takeaway is all the cool annotations you can do in matplotlib.
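If you want to experiment with annotations without downloading the Census spreadsheet, here is a pared-down sketch of the same idea. The data points and the label are hypothetical, and I've swapped in the Agg backend so it runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a Notebook
import matplotlib.pyplot as plt

years = [1960, 1964, 1965, 1969]
rates = [22.0, 18.0, 17.0, 13.0]  # made-up values, for illustration only

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(years, rates)
ax.set_xlabel("Year")
ax.set_ylabel("Percent below poverty")

# xy is the point being annotated (in data coordinates);
# xytext is where the label sits (here, as a fraction of the axes)
note = ax.annotate(
    "Example law",
    xy=(1964, 18.0), xycoords="data",
    xytext=(0.6, 0.8), textcoords="axes fraction",
    bbox=dict(boxstyle="round", fc="0.8"),
    arrowprops=dict(arrowstyle="->", color="black"),
)
fig.canvas.draw()
```

The split between xycoords and textcoords is the handy part: the arrow tip follows the data while the label stays put relative to the axes.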


© 2025 DadOverflow.com
