Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

Tag: jupyter_notebook (Page 10 of 17)

Slope Charts in Python

I continue to explore the different charts from Machine Learning Plus’s Top 50 matplotlib visualizations post and look for good opportunities to recreate them with data sets I care about. Recently, I thought it might be interesting to create a slope chart where I simply match objects on one side of the chart to objects on the other side, without using the Y axis to convey any meaning. For my data set, I grabbed CollegeChoice.net’s 25 Best Colleges in Ohio. I didn’t dig into how they decide one college is better than another, although they do provide a description of their methodology. What I thought was interesting was that they provide the 4-5 most popular majors at each of the colleges. So, I thought I could create a slope chart where I write the top 10 Ohio Colleges on one side (all 25 would make the chart too cluttered), their most popular majors on the other side, and draw lines in between. How common are these majors among the top 10? My chart should be able to tell that story.

Step 1: Bring in all the packages I’ll need

Since I’m pulling in a parsing a web page for its data, requests and BeautifulSoup are in. numpy and math will help with spacing out the points in my chart and, of course, matplotlib will render the chart:

import requests
from bs4 import BeautifulSoup
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from matplotlib import cm
import math

Step 2: Grab the page and parse out the data

# grab the page
result = requests.get("https://www.collegechoice.net/rankings/best-colleges-in-ohio/")
soup = BeautifulSoup(result.content, 'lxml')

# parse out just the top 10 schools and their popular majors
ranking_divs = soup.find_all('div', 'ranking-box')
top_10_schools = {}

for ranking_div in ranking_divs[:10]:
    school = ranking_div.select_one('div.rb-list-title h3').text
    majors = [maj.text for maj in ranking_div.select('div.rb-ranking-body ul li')]
    top_10_schools[school] = majors

Step 3: Do some data cleanup

As I scanned the results, I noticed that CollegeChoice used two slightly different names for the same major: Visual and Performing Arts and Visual & Performing Arts. I had to write some code to clean that up:

for school, majors in top_10_schools.items():
    if 'Visual and Performing Arts' in majors:
        top_10_schools[school] = ['Visual & Performing Arts' if maj=='Visual and Performing Arts' else maj for maj in majors]

Step 4: Finally, build the chart

Now, I can build the chart, and that Machine Learning Plus article really helped out. The one difference was that, in their slope chart, they used GDP dollars on their Y axis. My Y axis wouldn’t have any sort of meaning: just a list of colleges. So, I used the numpy and math packages to help me evenly space out my points along the axis. Here’s what I came up with:

# if, say, you have a count of 11 and you want to round up to the nearest 5, this will return 15
def roundupto(your_count, round_up_to_nearest):
    return int(math.ceil(your_count / round_up_to_nearest)) * round_up_to_nearest

# draws a line between points
def newline(p1, p2, color='black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0], p2[0]], [p1[1], p2[1]], color=color, marker='o', markersize=6)
    ax.add_line(l)
    return l
    

fig, ax = plt.subplots(1, 1, figsize=(14, 14))

# get school and major lists and calculate the scale of the chart
school_list = list(top_10_schools.keys())
school_list.reverse()  # matplotlib will then put the #1 school at the top of the chart
major_list = list(set(sum(top_10_schools.values(), [])))
major_list.sort(); major_list.reverse()  # to help matplotlib list majors alphabetically down the chart
scale = roundupto(max(len(school_list), len(major_list)), 5)

# write the vertical lines
ax.vlines(x=1, ymin=0, ymax=scale, color='black', alpha=0.7, linewidth=1, linestyles='dotted')
ax.vlines(x=3, ymin=0, ymax=scale, color='black', alpha=0.7, linewidth=1, linestyles='dotted')

# plot the points; unlike the slope chart in the MachineLearningPlus.com article, my Y axis has no meaning, so
# I use numpy's linspace function to help me evenly space each point
school_y_vals = np.linspace(1, scale-1, num=len(school_list))
major_y_vals = np.linspace(1, scale-1, num=len(major_list))
ax.scatter(y=school_y_vals, x=np.repeat(1, len(school_list)), s=10, color='black', alpha=0.7)
ax.scatter(y=major_y_vals, x=np.repeat(3, len(major_list)), s=10, color='black', alpha=0.7)

# write the lines and annotation
for school, school_y_val in zip(school_list, school_y_vals):
    ax.text(1-0.05, school_y_val, school, horizontalalignment='right', verticalalignment='center', fontdict={'size':10})
    for major in top_10_schools[school]:
        major_y_val = major_y_vals[major_list.index(major)]
        newline([1, school_y_val], [3, major_y_val], color=cm.get_cmap('tab20')(school_list.index(school)))
        
for major, major_y_val in zip(major_list, major_y_vals):
    ax.text(3+0.05, major_y_val, major, horizontalalignment='left', verticalalignment='center', fontdict={'size':10})
    
# vertical line annotations
ax.text(1-0.05, scale-0.25, 'College', horizontalalignment='right', verticalalignment='center', 
        fontdict={'size':14, 'weight':700})
ax.text(3+0.05, scale-0.25, 'Major', horizontalalignment='left', verticalalignment='center', 
        fontdict={'size':14, 'weight':700})

# misc cleanup
ax.set(xlim=(0, 4), ylim=(0, scale))
ax.axis('off')

plt.title("Most Popular Majors at Ohio's Top 10 Colleges")
plt.show()

And the result is at the top of this post. Get my full code here.

Conclusions?

Well, I did note that over half of the popular majors are popular at only one of the Top 10 schools. I expected to see many of the same majors appear repeatedly across multiple schools. I guess maybe that’s a good thing: if, say, you want to study Finance, it would seem The Ohio State University and only The Ohio State University is the best place to study the discipline.

More importantly, the slope chart is now another cool visual I (and you) can add to your tool box. Happy sloping!

Ten things I like to do in Jupyter Markdown

One of the great things about Jupyter Notebook is how you can intersperse your code blocks with markdown blocks that you can use to add comments or simply more context around your code. Here are ten ways I like to use markdown in my Jupyter Notebooks.

1. Use hashes for easy titles

In your markdown cell, enter a line like this:

# This becomes a H1 header/title

That line will render as a header (h1 element).

2. Use asterisks and hyphens for bullet points

Try these lines in your markdown cell:

* this is one way to do a bullet point
- this is another way to do a bullet point

Both render as bullet point lists.

3. Use asterisks and underscores for emphasis

Next, try this:

*these words become italicized*
__these words become bold__
Wait…that didn’t render quite as expected

The phrase I wanted to italicize italicized and the phrase I wanted to bold went bold, but both phrases rendered on the same line. What gives? I’ve noticed that some markdown behaves like this, but here’s a simple solution: add a <br> (HTML for line break) at the end of each line where you want a line break. So, write this in your markdown cell:

*these words become italicized*<br>
__these words become bold__

4. Center my headers with some HTML

Instead of using the hashtag shortcut, code your header elements directly and style them to center:

<h1 style="text-align: center">This header is centered</h1>

Interestingly, I’ve noticed that my centering works in Jupyter Notebook, but not in Jupyter Lab.

5. Create thick dividing lines with HTML

My notebooks that do a lot of exploratory data analysis before jumping into data modeling can get quite lengthy. I find that a nice, thick dividing line between sections can be a great visual indicator of the changing focus of my notebook. In a markdown cell, give this a try:

<hr style="border-top: 5px solid purple; margin-top: 1px; margin-bottom: 1px"></hr>

6. Write mathematical formulas

I’m more coder than math guy, but a formula or two can sometimes be helpful explaining your solution to a problem. Jupyter markdown cells support LaTeX, so give this a whirl:

linear regression: $y = ax + b$
two dimensions: $y = a_{1}x_{1} + a_{2}x_{2} + b$
Ridge Regression: standard OLS loss function + $\alpha \times \sum_{i=1}^{n} a^{2}_i$

7. Create hyperlinks

Hyperlinks are easy in markdown:

[Google](https://google.com)

8. Drop in images with HTML

A picture is worth a thousand words:

<img src="mind_blown.gif" style="max-width:50%; max-height:50%"></img>

9. Create nice tables

Use pipes and dashes to create a table in your markdown:

|| sepal length (cm) | sepal width (cm) |
|----|----|----|
|0|5.1|3.5|
|1|4.9|3.0|

10. Escape text with three tick marks

Occasionally, I’ll want to show a code snippet in my markdown or other kind of escaped text. You can do that by surrounding your snippet with three back-tick characters:

```
sample code goes here
```

Bonus: change the background color of your markdown cells

It never occurred to me until recently, but Notebooks bring with them a variety of style classes that you can leverage in your own markdown. Here are four examples (note: this is yet another markdown trick that works in Jupyter Notebook, but not in Jupyter Lab…at least the version I’m presently running):

<div class="alert alert-block alert-info">
This is a blue background
</div>
<div class="alert alert-block alert-warning">
This is a yellow background
</div>
<div class="alert alert-block alert-success">
This is a green background
</div>
<div class="alert alert-block alert-danger">
This is a red background
</div>

For all of this code, check out my notebook here. Also, here are two other great posts on more markdown tips and tricks.

Python bingo

Have a road trip planned this summer? Want to keep the kids from driving you crazy as you drive to Walley World? How about playing the ol’ standard License Plate game but with a twist: License Plate Bingo!

Run this code a couple of times and print out the chart on separate pieces of paper. There are your bingo cards. Give one to each kid and/or adult. Now, hit the road! If you see a license plate from, say, Texas, and you have a “Texas” square, mark it off on your bingo card. If you can mark off a row horizontally, vertically, or diagonally, you win!

Step 1: Import your packages

import matplotlib.pyplot as plt
import matplotlib.style as style
import numpy as np
import random

%matplotlib inline
style.use('seaborn-poster')

Step 2: Get your State names

# compliments of this forum: https://gist.github.com/JeffPaine/3083347
states = ["AL - Alabama", "AK - Alaska", "AZ - Arizona", "AR - Arkansas", "CA - California", "CO - Colorado",
"CT - Connecticut", "DC - Washington DC", "DE - Deleware", "FL - Florida", "GA - Georgia",
"HI - Hawaii", "ID - Idaho", "IL - Illinios", "IN - Indiana", "IA - Iowa",
"KS - Kansas", "KY - Kentucky", "LA - Louisiana", "ME - Maine", "MD - Maryland",
"MA - Massachusetts", "MI - Michigan", "MN - Minnesota", "MS - Mississippi",
"MO - Missouri", "MT - Montana", "NE - Nebraska", "NV - Nevada", "NH - New Hampshire",
"NJ - New Jersey", "NM - New Mexico", "NY - New York", "NC - North Carolina",
"ND - North Dakota", "OH - Ohio", "OK - Oklahoma", "OR - Oregon", "PA - Pennsylvania",
"RI - Rhode Island", "SC - South Carolina", "SD - South Dakota", "TN - Tennessee",
"TX - Texas", "UT - Utah", "VT - Vermont", "VA - Virgina", "WA - Washington", "WV - West Virginia",
"WI - Wisconsin", "WY - Wyoming"]
state_names = [s.split('-')[1].strip() for s in states]

Step 3: Generate your bingo card

random.shuffle(state_names)
rowlen= 4  # make any size card you'd like

fig = plt.figure()
ax = fig.gca()
ax.set_xticks(np.arange(0, rowlen + 1))
ax.set_yticks(np.arange(0, rowlen + 1))
plt.grid()

for i, word in enumerate(state_names[:rowlen**2]):
    x = (i % rowlen) + 0.4
    y = int(i / rowlen) + 0.5
    ax.annotate(word, xy=(x, y), xytext=(x, y))
    
plt.show()
Python Bingo, FTW!

Grab my full source code here.

« Older posts Newer posts »

© 2024 DadOverflow.com

Theme by Anders NorenUp ↑