Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Pictures and Words

A chart may be worth 1000 words, but sometimes embedding a few words in your chart can convey additional, helpful information. For example, take this chart that I built from the incredible COVID-19 data collected by Johns Hopkins University:

Pretty telling as it is. Now, let’s add some words to it:

Adding some words to the chart 1) conveys additional, helpful information and 2) fills in some awkward whitespace. Seems like a win to me. For completeness’ sake, here’s what I did to build this chart:

Step 1: Import the packages

import pandas as pd
import matplotlib.pyplot as plt

Step 2: Load up the JHU dataset

# county-level time series of confirmed US cases from the JHU CSSE COVID-19 repository
df = pd.read_csv('./data/time_series_covid19_confirmed_US.csv')

Step 3: Trim the data down to just counties in Ohio

# keep the county/state name columns plus the 2020 date columns
cols = [i for i, v in enumerate(df.columns) if v in ['Admin2', 'Province_State'] or v.endswith('2020')]
df_ohio = df[df.Province_State=='Ohio'].iloc[:,cols].copy()
df_ohio['county'] = df_ohio.Admin2 + ', ' + df_ohio.Province_State  # combine county and state into a single field
df_ohio = df_ohio.drop(columns=['Admin2', 'Province_State']).set_index('county')

Step 4: Build the chart

fig, ax = plt.subplots(figsize=(12,10))
title = 'Top 10 Ohio Counties with Confirmed COVID-19 Cases as of ' + df.columns[-1]
# grab the county with the most confirmed cases and its case count
worst_county, worst_co_cases = list(df_ohio['3/30/2020'].sort_values().tail(1).items())[0]

inset = """
There are {0} counties and other Ohio 
entities in this dataset.  As of {1}, 
there are {2:,} confirmed cases of COVID-19.  
{3} represents {4:.1f}% of that population.
""".format(df_ohio.shape[0], df.columns[-1], df_ohio['3/30/2020'].sum(), worst_county, 
           (worst_co_cases/df_ohio['3/30/2020'].sum())*100)

_ = df_ohio['3/30/2020'].sort_values().tail(10).plot(kind='barh', ax=ax, title=title)
_ = ax.set_ylabel('Ohio counties')
_ = ax.set_xlabel('Confirmed Cases')

# you have to experiment a little with the x, y positioning to get your word inset positioned just right
text = fig.text(0.30, 0.35, inset, va='center', ha='left', size=18)
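One note on that last line: fig.text positions the inset in figure-fraction coordinates, where (0, 0) is the lower-left of the whole figure and (1, 1) is the upper-right. If you’d rather anchor the text to the plot area itself, matplotlib’s transAxes transform does the trick; here’s a minimal sketch (the 0.95/0.05 coordinates are just my guess at a lower-right placement):

# alternative: coordinates relative to the axes rather than the whole figure
text = ax.text(0.95, 0.05, inset, transform=ax.transAxes, va='bottom', ha='right', size=18)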

Pretty darn slick!

Loguru, ftw

In virtually all the applications and scripts I write, I try to incorporate some level of logging so that my code can be adequately supported, particularly in Production environments. Some time ago, I wrote about how I generally log in my Python applications. Lately, though, I’ve switched from that approach to Loguru, and I must say I’m rather satisfied with its ease of use. Here’s a quick example of the package that I put together recently:

Step 1: Do your standard imports

As I explained in other posts on logging, I like adding a “run id” to each log line so that I can easily group lines together belonging to a single instance of my application, so I import the uuid package to help in that regard:

import os
import sys
import uuid
from loguru import logger

Step 2: Setup/customize my logging context

In one line, I can customize how each log line is written and set logging behavior, like rolling the file when it hits 10 MB in size:

runid = str(uuid.uuid4()).split('-')[-1]  # short, per-run id: the last segment of a UUID
logger.add('loguru_example.log', format='{time}|{extra[runid]}|{level}|{message}', level='INFO', rotation='10 MB')
logger_ctx = logger.bind(runid=runid)
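The add() function takes other housekeeping options beyond rotation, too. For example, something along these lines (the retention and compression values here are just illustrative) will prune old log files after 30 days and zip each file as it rolls:

# illustrative only: prune logs after 30 days and compress rotated files
logger.add('loguru_example.log', format='{time}|{extra[runid]}|{level}|{message}',
           level='INFO', rotation='10 MB', retention='30 days', compression='zip')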

Step 3: Start logging

def main(argv):
    logger_ctx.info('Starting run of the loguru_example.py script')
    # do some stuff
    logger_ctx.info('Completing run of the loguru_example.py script')


if __name__ == '__main__':
    main(sys.argv[1:])

And just like that, you have a nice, easy-to-read log for your application.
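One more Loguru nicety worth mentioning: the catch decorator, which logs any uncaught exception, traceback and all. A minimal sketch (the failing function is just a stand-in to show the idea):

# any exception raised in here gets logged with a full traceback
@logger_ctx.catch
def risky_step():
    return 1 / 0  # hypothetical failure

Decorate the functions you care about and Loguru takes care of the rest.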

Pretty darn simple! So now there’s no excuse: start logging today!

Wordclouds and Domains

I’m not a big fan of wordclouds, but management seems to like them. Recently, I was working on a wordcloud of domains and generated some unexpected results. For demonstration purposes, I grabbed the domains of stories Firefox Pocket recommended to me.
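These snippets assume a few imports up front (pandas and matplotlib as before, plus the wordcloud package):

import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

With those in place, I shoved the domains into a dataframe: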

# 'domains' is a plain list of domain-name strings pulled from my Pocket recommendations
df_domains = pd.DataFrame(domains, columns=['domain'])
df_domains.head()

Then, I took the list of domains and preprocessed them in the conventional way you do for the package: you join them together like words of text with spaces in between:

text = ' '.join(df_domains.domain.tolist())

Finally, I loaded those words into a wordcloud object:

wordcloud = WordCloud().generate(text)
_ = plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
_ = plt.title('Random Domains')

Which produced this wordcloud:

Anything missing?

Excellent…except…where are the top-level domains? All the .coms, .nets, etc.? By Jove, they’re not there! If you check out the frequency map the wordcloud created, you can start to get a clue about what happened:

wordcloud.words_
{'theatlantic': 1.0,
 'theguardian': 0.6666666666666666,
 'outsideonline': 0.6666666666666666,
 'bbc': 0.6666666666666666,
 'theoutline': 0.6666666666666666,
 'washingtonpost': 0.3333333333333333,
 'mentalfloss': 0.3333333333333333,
 'citylab': 0.3333333333333333,
 'bloomberg': 0.3333333333333333,
 'popsci': 0.3333333333333333,
 'espn': 0.3333333333333333,
 'nytimes': 0.3333333333333333,
 'rollingstone': 0.3333333333333333,
 'inverse': 0.3333333333333333,
 'livescience': 0.3333333333333333,
 'newyorker': 0.3333333333333333,
 'nautil': 0.3333333333333333,
 'us': 0.3333333333333333,
 'theconversation': 0.3333333333333333,
 'vox': 0.3333333333333333,
 'hbr': 0.3333333333333333,
 'org': 0.3333333333333333,
 'wired': 0.3333333333333333,
 'lifehacker': 0.3333333333333333,
 'dariusforoux': 0.3333333333333333,
 'atlasobscura': 0.3333333333333333}

The “generate” function removed all the .coms, .nets, and so on when it built the frequency map. A little more digging and we can see that the default regular expression is the problem: "\w[\w']+". It’s looking for words (and even apostrophes) but stopping at punctuation marks like periods. Now, you can futz around with providing your own regular expression to capture the full domain (I tried that; see the aside at the end of this post for a sketch), but regular expressions are hard and there’s actually a better way: the pandas value_counts function.

The value_counts function will let you generate your own frequency map that you can provide to the wordcloud package directly. First, let’s take a look at what value_counts produces. We’ll pipe the results to a dictionary so that the data is in the form the wordcloud package needs:

df_domains.domain.value_counts().to_dict()
{'theatlantic.com': 3,
 'theoutline.com': 2,
 'outsideonline.com': 2,
 'bbc.com': 2,
 'theguardian.com': 2,
 'bloomberg.com': 1,
 'rollingstone.com': 1,
 'theconversation.com': 1,
 'wired.com': 1,
 'inverse.com': 1,
 'popsci.com': 1,
 'atlasobscura.com': 1,
 'mentalfloss.com': 1,
 'newyorker.com': 1,
 'espn.com': 1,
 'nytimes.com': 1,
 'hbr.org': 1,
 'nautil.us': 1,
 'washingtonpost.com': 1,
 'lifehacker.com': 1,
 'livescience.com': 1,
 'vox.com': 1,
 'citylab.com': 1,
 'dariusforoux.com': 1}

True, the values are integers and not floats, but wordcloud doesn’t care:

freqs = df_domains.domain.value_counts().to_dict()

wordcloud = WordCloud().generate_from_frequencies(freqs)
_ = plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
_ = plt.title('Random Domains')

And now we have a wordcloud of domains, complete with their top-level domain parts. Nice!
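About that regexp aside from earlier: the WordCloud constructor accepts a regexp parameter that overrides its default tokenizer. A rough sketch of the idea (the pattern is my own guess and will happily grab stray trailing periods, so test before trusting it):

# allow dots inside tokens so full domains like 'nautil.us' survive tokenization
wordcloud = WordCloud(regexp=r"\w[\w'.]+").generate(text)

It works, but for this job, value_counts is still the cleaner route.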
