Here’s another post in my quest to recreate many of the charts from Machine Learning Plus’s Top 50 matplotlib visualizations:

The perfect cup of coffee

Back in March, TowardsDataScience.com published an article that analyzed a coffee dataset from the Coffee Quality Institute (sounds like a great place to work!). Since I’m always looking for cool datasets to work with and since I love coffee, I thought this would be a great dataset to pull down and visualize in some fashion.

In the article, the author visualizes median coffee data from several countries around the world in polar charts. The polar charts worked well to get all 11 features on the chart at the same time, but every polar chart–from Ethiopia to the United States–looked the same. It was difficult to see how one country’s coffee differed from another’s. I wonder if there might be a better way to show the subtle variations among each country’s coffee? Enter in another article I talked about previously: Top 50 matplotlib Visualizations. I thought one chart in particular from that article, the Diverging Bars Chart, might do the trick.

Since each country can produce tens of different brands of coffee, I followed the lead of the original article and grabbed the median value from each country. I then applied the Diverging Bars technique to plot how far each country’s coffee varied from the mean.

One thing that puzzles me, though: in several of the categories, Papua New Guinea comes out on top. Yet if you look at the original article, the author lists the median Ethiopian coffee as coming out on top more often than not. What’s the reason for this discrepancy? I’m not really sure. I think I calculated the medians correctly–my Ethiopian values certainly match the author’s. Perhaps I’m working from a newer dataset than he did?

At any rate, I accomplished my main goal of creating some cool diverging bar charts. Enjoy with your favorite cup of java!

Step 1: Load the data

# https://github.com/jldbc/coffee-quality-database
df_coffee = pd.read_csv('./data/arabica_data_cleaned.csv')
df_coffee.head()

Step 2: Code the chart

Since the dataset has multiple features, each of which I’d like to chart, I decided to place my chart-generation code in a function so that I could easily reuse it from feature to feature:

def generate_chart(feature_to_chart, xlabel, title):
    df_chart = df_coffee.groupby('Country.of.Origin').median().loc[:, [feature_to_chart]].reset_index()
    df_chart['z'] = (df_chart[feature_to_chart] - df_chart[feature_to_chart].mean()) / df_chart[feature_to_chart].std()

    df_chart['colors'] = ['red' if x < 0 else 'green' for x in df_chart['z']]
    df_chart.sort_values('z', inplace=True)
    df_chart.reset_index(inplace=True)

    # draw plot
    plt.figure(figsize=(14,10), dpi=80)
    plt.hlines(y=df_chart.index, xmin=0, xmax=df_chart.z, color=df_chart.colors, alpha=0.4, linewidth=5)

    # decorations
    plt.gca().set(ylabel='$Country$', xlabel=xlabel)
    plt.yticks(df_chart.index, df_chart['Country.of.Origin'], fontsize=12)
    plt.title(title, fontdict={'size':20})
    plt.grid(linestyle='--', alpha=0.5)
    plt.show()

Step 3: Generate the chart

Finally, I can call my function and generate the chart:

feature_to_chart = 'Flavor'
xlabel = '${0}$ $Variation$'.format(feature_to_chart)
title = 'Diverging Bars of Median Coffee {0} Rating'.format(feature_to_chart)

generate_chart(feature_to_chart, xlabel, title)
Median Coffee Flavors

Two other interesting charts:

Divergence of the “balance” feature
Divergence of the “acidity” feature

Check out my complete code here and look for more cool charts to come!