Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

Tag: python

Choosing the best coffee

Here’s another post in my quest to recreate many of the charts from Machine Learning Plus’s Top 50 matplotlib visualizations:

The perfect cup of coffee

Back in March, an article was published that analyzed a coffee dataset from the Coffee Quality Institute (sounds like a great place to work!). Since I’m always looking for cool datasets to work with and since I love coffee, I thought this would be a great dataset to pull down and visualize in some fashion.

In the article, the author visualizes median coffee data from several countries around the world in polar charts. The polar charts worked well to get all 11 features on the chart at the same time, but every polar chart, from Ethiopia to the United States, looked the same. It was difficult to see how one country’s coffee differed from another’s. I wondered whether there might be a better way to show the subtle variations between the countries’ coffees. Enter another article I talked about previously: Top 50 matplotlib Visualizations. I thought one chart in particular from that article, the Diverging Bars Chart, might do the trick.

Since each country can produce dozens of different brands of coffee, I followed the lead of the original article and grabbed the median value from each country. I then applied the Diverging Bars technique to plot how far each country’s median sat from the mean across all countries, measured in standard deviations.
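To make the arithmetic concrete, here’s a minimal sketch of that "median per country, then distance from the mean" calculation using only the standard library. The country names and ratings are made up for illustration, and `median_z_scores` is a hypothetical helper, not the code from my notebook.

```python
from statistics import mean, median, stdev

# Hypothetical flavor ratings per country (made-up numbers, not from the dataset)
ratings = {
    "Country A": [7.5, 7.75, 8.0],
    "Country B": [7.0, 7.25, 7.5],
    "Country C": [7.25, 7.5, 8.5],
}

def median_z_scores(ratings_by_country):
    """Return {country: z-score of that country's median rating}."""
    medians = {c: median(v) for c, v in ratings_by_country.items()}
    values = list(medians.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, like pandas' default .std()
    return {c: (m - mu) / sigma for c, m in medians.items()}

z = median_z_scores(ratings)

# Negative scores sit left of zero (red), the rest sit right of zero (green)
for country, score in sorted(z.items(), key=lambda kv: kv[1]):
    color = "red" if score < 0 else "green"
    print(f"{country}: {score:+.2f} ({color})")
```

A score of zero means a country’s median equals the average of all the medians; the sign decides which side of the chart the bar lands on.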

One thing that puzzles me, though: in several of the categories, Papua New Guinea comes out on top. Yet if you look at the original article, the author lists the median Ethiopian coffee as coming out on top more often than not. What’s the reason for this discrepancy? I’m not really sure. I think I calculated the medians correctly; my Ethiopian values certainly match the author’s. Perhaps I’m working from a newer dataset than he did?

At any rate, I accomplished my main goal of creating some cool diverging bar charts. Enjoy with your favorite cup of java!

Step 1: Load the data

import pandas as pd
df_coffee = pd.read_csv('./data/arabica_data_cleaned.csv')

Step 2: Code the chart

Since the dataset has multiple features, each of which I’d like to chart, I decided to place my chart-generation code in a function so that I could easily reuse it from feature to feature:

import matplotlib.pyplot as plt

def generate_chart(feature_to_chart, xlabel, title):
    # median of the chosen feature per country; selecting the column first avoids
    # taking medians of non-numeric columns, which newer pandas versions reject
    df_chart = df_coffee.groupby('Country.of.Origin')[feature_to_chart].median().reset_index()
    # z-score: how far each country's median sits from the mean, in standard deviations
    df_chart['z'] = (df_chart[feature_to_chart] - df_chart[feature_to_chart].mean()) / df_chart[feature_to_chart].std()

    df_chart['colors'] = ['red' if x < 0 else 'green' for x in df_chart['z']]
    # sort by z and renumber the rows so the bars are drawn in sorted order
    df_chart = df_chart.sort_values('z').reset_index(drop=True)

    # draw plot
    plt.figure(figsize=(14,10), dpi=80)
    plt.hlines(y=df_chart.index, xmin=0, xmax=df_chart.z, color=df_chart.colors, alpha=0.4, linewidth=5)

    # decorations
    plt.gca().set(ylabel='$Country$', xlabel=xlabel)
    plt.yticks(df_chart.index, df_chart['Country.of.Origin'], fontsize=12)
    plt.title(title, fontdict={'size':20})
    plt.grid(linestyle='--', alpha=0.5)

Step 3: Generate the chart

Finally, I can call my function and generate the chart:

feature_to_chart = 'Flavor'
xlabel = '${0}$ $Variation$'.format(feature_to_chart)
title = 'Diverging Bars of Median Coffee {0} Rating'.format(feature_to_chart)

generate_chart(feature_to_chart, xlabel, title)
Median Coffee Flavors

Two other interesting charts:

Divergence of the “balance” feature
Divergence of the “acidity” feature

Check out my complete code here and look for more cool charts to come!

College Tuition vs. Starting Salary

A few months ago, Machine Learning Plus published a great article demonstrating the power of matplotlib by showcasing 50 cool visuals you can accomplish with the package. Inspired, I wanted to see if I could replicate some of these visuals, but with data I’m interested in.

So, I started with their bubble chart, but instead of using the strange Midwest data they used, I thought I’d work in a space that’s been preoccupying me of late: college tuition. What sort of bubble chart could I craft that depicted college tuition in some way? What about a bubble chart depicting the intersection of college tuitions and their corresponding average starting salaries? That might help parents and students better understand the return on investment associated with various colleges. Here’s what I came up with:

First, I decided to narrow down my work to just Ohio colleges. I found a dataset of median starting salaries by Ohio college for 2018.

Unfortunately, the dataset did not include college tuition prices. However, I did find a separate dataset of Ohio college tuition prices for 2018-2019.

Much of my work revolved around cleaning up these data sources and merging them together for the final visual. As you might imagine, the two datasets tended to have slight variations in school names. For example, one dataset had an entry for Kettering College whereas the other calls that school Kettering College of Medical Arts. So, I had to do a fair amount of work making sure both datasets used the same name for each school so that I could properly match on those names.
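As a rough illustration of that name-matching chore, here’s a minimal sketch that maps name variants onto one canonical name before joining the two sources. The Kettering names come from the example above; everything else (the `ALIASES` table, the `canonical` and `merge_on_name` helpers, the other schools, and all the dollar figures) is hypothetical.

```python
# Hypothetical alias table: maps a name variant to the canonical name used for matching
ALIASES = {
    "Kettering College of Medical Arts": "Kettering College",
}

def canonical(name):
    """Collapse stray whitespace and map known variants onto one canonical name."""
    name = " ".join(name.split())
    return ALIASES.get(name, name)

def merge_on_name(salaries, tuitions):
    """Inner-join the two sources on the canonical school name."""
    sal = {canonical(k): v for k, v in salaries.items()}
    tui = {canonical(k): v for k, v in tuitions.items()}
    return {name: {"salary": sal[name], "tuition": tui[name]}
            for name in sal.keys() & tui.keys()}

# Made-up figures for illustration only
salaries = {"Kettering College": 50000, "School B": 40000}
tuitions = {"Kettering College of Medical Arts": 12000, "School C": 9000}

merged = merge_on_name(salaries, tuitions)

# Names that appear in only one source are the ones still needing attention
unmatched = set(map(canonical, salaries)) ^ set(map(canonical, tuitions))
print(merged)
print(sorted(unmatched))
```

Listing the unmatched names is the handy part: each one is either a school missing from one source or a spelling variant that still needs an entry in the alias table. With pandas, the same idea would probably be a `replace` on the name column followed by a `merge`.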

The data included some language to differentiate public schools from private, which I used to color my bubbles blue and red, respectively. It also included the school size, which I used to size each bubble.

Machine Learning Plus’s bubble chart includes a cool “encircling” device that draws a circle around certain datapoints to draw the user’s attention to those points. Instead of doing that, I thought it’d be interesting to draw a “break even” line. All things being equal, if you pay, say, $10,000 a year in tuition for 4 years, your tuition investment would break even if your first job out of school paid $40,000. I drew a line to that effect on the graph: datapoints above that line would have a positive return on investment whereas datapoints below that line would have a negative return on investment.

I didn’t want to muddy up the chart by labeling each bubble with the name of the college, but I still thought it’d be fun to calculate which schools are above and below the line. So I found a way to do that, added the calculation as a column to the dataframe, and printed out the Top 5 “Best” returns on investment and the Top 5 “Worst” returns on investment.

Top 5 biggest ROI schools: 
68              Central State University
40        Kent State University at Salem
45     Kent State University at Trumbull
56    Kent State University at Ashtabula
46              Shawnee State University
Name: School Name, dtype: object

Top 5 least ROI schools: 
8              Oberlin College
1               Kenyon College
3           Denison University
18      The College of Wooster
6     Ohio Wesleyan University
Name: School Name, dtype: object
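For the curious, the break-even idea boils down to comparing starting salary against four years of tuition. Here’s a minimal sketch of that calculation with made-up schools and numbers; `roi_gap` is a hypothetical helper and may not match how my notebook computes the column.

```python
YEARS = 4  # assumed length of the degree

def roi_gap(annual_tuition, starting_salary, years=YEARS):
    """Starting salary minus total tuition; positive means above the break-even line."""
    return starting_salary - years * annual_tuition

# Hypothetical schools with made-up numbers
schools = [
    {"name": "School A", "tuition": 10000, "salary": 40000},
    {"name": "School B", "tuition": 8000, "salary": 45000},
    {"name": "School C", "tuition": 50000, "salary": 55000},
    {"name": "School D", "tuition": 30000, "salary": 50000},
]

for school in schools:
    school["gap"] = roi_gap(school["tuition"], school["salary"])

# Largest gap first: the top of the list is the best ROI, the bottom is the worst
ranked = sorted(schools, key=lambda s: s["gap"], reverse=True)
best = [s["name"] for s in ranked[:2]]
worst = [s["name"] for s in ranked[::-1][:2]]
print("Best:", best)
print("Worst:", worst)
```

School A sits exactly on the line (a gap of zero), which matches the $10,000-a-year example above. If tuition is on the x-axis and salary on the y-axis, the line itself is simply salary = 4 × tuition.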

Obviously, my “break even” assessment is very simplistic. There are many other variables I don’t account for: room and board, fees, financial aid, merit scholarships, taxes, and the like. The median starting salaries are across all graduates from a given school, from Philosophy majors to Computer Science majors. So, your mileage will certainly vary. For me, the bigger take-aways were 1) the challenge of obtaining, cleaning, and merging the datasets, 2) charting out the results in a cool way, and 3) calculating the datapoints above and below my break-even line. All my work is here in case you want to check it out. Look for more matplotlib charts inspired by the Machine Learning Plus article in the future!

Watermarking Matplotlib charts

A few weeks ago, I read an interesting article about watermarking your ggplot charts in R. R is certainly a fantastic tool, but as my go-to language for visualizations these days is Python, I had to ask myself, “self, how would you watermark your matplotlib charts?” Well, one answer is the text method of the Figure object.

Consider this polar chart I wrote some time ago:

import numpy as np
import matplotlib.pyplot as plt

colleges = ['College A', 'College B', 'College C', 'College D', 'College E']
scores = [76, 54, 58, 63, 65]

# one angle per college, evenly spaced around the circle
theta = np.arange(len(colleges))/float(len(colleges)) * 2 * np.pi
fig = plt.figure(figsize=(8, 8))
ax = plt.subplot(111, projection='polar')
ax.plot(theta, scores, color='green', marker='o')
ax.plot([theta[0], theta[-1]], [scores[0], scores[-1]], 'g-')  # hack to complete the circle
ax.set_rmax(max(scores) + 5)
ax.set_rticks(np.arange(0, max(scores) + 5, step=10))
ax.set_xticks(theta)  # one tick at each college's angle so the labels line up with the points
labels = ax.set_xticklabels(colleges)

# hack to get the labels to show nicely
for label in labels:
    if label.get_text() in ['College C', 'College D']:
        label.set_ha('right')
    elif label.get_text() in ['College A']:
        label.set_ha('left')

for i, txt in enumerate(scores):
    ax.annotate(txt, (theta[i], scores[i]))

ax.set_title('College Scorecard')

# watermarking my chart (the watermark text goes in the empty string)
fig.text(0.95, 0.06, '',
         fontsize=12, color='gray',
         ha='right', va='bottom', alpha=0.5)

The fig.text call at the end of the code is the part that watermarks the chart. This code produces the following chart:

Hey, how about that cool watermark in the lower right-hand corner? Snazzy, right? One challenge I’ve found is that as you change the size of your chart, you’ll have to play around a little with the x and y coordinates of your watermark. Nevertheless, this seems to me like a great way to brand your charts.
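One way to cut down on that fiddling: by default, fig.text treats x and y as fractions of the figure, so a fixed margin in inches works out to a different fraction for every figure size. Here’s a minimal sketch of a helper that does the conversion. The name `watermark_position` and the 0.4-inch default margin are my own assumptions, so treat it as a starting point rather than a recipe.

```python
def watermark_position(fig_width_in, fig_height_in, margin_in=0.4):
    """Return (x, y) as figure fractions for a watermark in the lower right-hand corner.

    A fixed margin in inches maps to a different fraction for each figure size,
    so the fractions are recomputed from the figure's dimensions.
    """
    x = 1 - margin_in / fig_width_in
    y = margin_in / fig_height_in
    return x, y

# The 8x8 figure from the polar chart above
x, y = watermark_position(8, 8)
print(round(x, 3), round(y, 3))

# A wider, shorter figure gets different fractions for the same physical margin
x_wide, y_wide = watermark_position(14, 10)
print(round(x_wide, 3), round(y_wide, 3))
```

For the 8x8 figure this lands close to the hand-tuned 0.95 and 0.06 used above. You could feed it the values from fig.get_size_inches() and pass the result to fig.text along with ha='right' and va='bottom'.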


© 2025
