Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

Month: February 2020

Wordclouds and Domains

I’m not a big fan of wordclouds, but management seems to like them. Recently, I was working on a wordcloud of domains and generated some unexpected results. For demonstration purposes, I grabbed the domains of stories Firefox Pocket recommended to me and shoved them into a dataframe:

import pandas as pd

df_domains = pd.DataFrame(domains, columns=['domain'])  # domains: a list of domain strings
df_domains.head()

Then, I took a list of the domains and preprocessed them in the conventional way for the wordcloud package: you join them together like words of text, with spaces in between:

text = ' '.join(df_domains.domain.tolist())

Finally, I loaded those words into a wordcloud object:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

wordcloud = WordCloud().generate(text)
_ = plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
_ = plt.title('Random Domains')

Which produced this wordcloud:

Anything missing?

Excellent…except…where are the top-level domains? All the .coms, .nets, etc? By Jove, they’re not there! If you check out the frequency map the wordcloud created, you can start to get a clue about what happened:

wordcloud.words_
{'theatlantic': 1.0,
 'theguardian': 0.6666666666666666,
 'outsideonline': 0.6666666666666666,
 'bbc': 0.6666666666666666,
 'theoutline': 0.6666666666666666,
 'washingtonpost': 0.3333333333333333,
 'mentalfloss': 0.3333333333333333,
 'citylab': 0.3333333333333333,
 'bloomberg': 0.3333333333333333,
 'popsci': 0.3333333333333333,
 'espn': 0.3333333333333333,
 'nytimes': 0.3333333333333333,
 'rollingstone': 0.3333333333333333,
 'inverse': 0.3333333333333333,
 'livescience': 0.3333333333333333,
 'newyorker': 0.3333333333333333,
 'nautil': 0.3333333333333333,
 'us': 0.3333333333333333,
 'theconversation': 0.3333333333333333,
 'vox': 0.3333333333333333,
 'hbr': 0.3333333333333333,
 'org': 0.3333333333333333,
 'wired': 0.3333333333333333,
 'lifehacker': 0.3333333333333333,
 'dariusforoux': 0.3333333333333333,
 'atlasobscura': 0.3333333333333333}

The “generate” function split the top-level domains off every domain when it built the frequency map: the .coms are gone entirely, while “org” and “us” show up as words of their own. A little more digging and we can see that the default regular expression is the problem: “\w[\w']+”. It’s looking for word characters (and even apostrophes), but stopping at punctuation marks like periods. Now, you can futz around with providing your own regular expression that will include the full domain (I tried that), but regular expressions are hard and there’s actually a better way: the pandas value_counts method. The value_counts method will let you generate your own frequency map that you can provide to the wordcloud package directly. First, let’s just take a look at what value_counts produces. We’ll convert the results to a dictionary so that the data is in the form the wordcloud package needs:

df_domains.domain.value_counts().to_dict()
{'theatlantic.com': 3,
 'theoutline.com': 2,
 'outsideonline.com': 2,
 'bbc.com': 2,
 'theguardian.com': 2,
 'bloomberg.com': 1,
 'rollingstone.com': 1,
 'theconversation.com': 1,
 'wired.com': 1,
 'inverse.com': 1,
 'popsci.com': 1,
 'atlasobscura.com': 1,
 'mentalfloss.com': 1,
 'newyorker.com': 1,
 'espn.com': 1,
 'nytimes.com': 1,
 'hbr.org': 1,
 'nautil.us': 1,
 'washingtonpost.com': 1,
 'lifehacker.com': 1,
 'livescience.com': 1,
 'vox.com': 1,
 'citylab.com': 1,
 'dariusforoux.com': 1}

True, the values are integers and not floats, but wordcloud doesn’t care:

freqs = df_domains.domain.value_counts().to_dict()

wordcloud = WordCloud().generate_from_frequencies(freqs)
_ = plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
_ = plt.title('Random Domains')

And now we have a wordcloud of domains, complete with their top-level domain parts. Nice!
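If you're curious about the splitting itself, here's a minimal sketch that reproduces it with nothing but Python's re module. DEFAULT_PATTERN is the expression quoted above; DOMAIN_PATTERN is just a hypothetical alternative I made up for illustration, not something the wordcloud package ships with:

```python
import re

# the default expression quoted above: a word character followed by
# one or more word characters or apostrophes
DEFAULT_PATTERN = r"\w[\w']+"
# hypothetical alternative: also allow dots and hyphens inside a token
DOMAIN_PATTERN = r"[\w.-]+"

text = "theatlantic.com hbr.org nautil.us"

# the period is not a word character, so each domain splits in two
default_tokens = re.findall(DEFAULT_PATTERN, text)
# allowing the period keeps each domain in one piece
domain_tokens = re.findall(DOMAIN_PATTERN, text)

print(default_tokens)
print(domain_tokens)
```

WordCloud does accept a regexp argument if you want to go this route, but the package's other processing steps may still change the result, so I find the frequency map approach more predictable.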

Converting file size values

Lately, I’ve been challenged with performing calculations and charting of file size values in different units of measure. For example, I’ll have file size values in gigabytes but will have to plot those values against terabytes of disk capacity. I’m a little surprised that Python doesn’t have a ready way to solve this problem. There is the hurry.filesize package, but that requires that you pass a bytes value into the function. What if you only have a gigabytes value to pass? Well, I came up with my own solution, largely inspired by similar solutions. Here’s my function:

import re


def convert_filesize(size, desired_uom, factor=1024):
    """Converts a provided computer data storage value to a different unit of measure.
    
    Keyword arguments:
    size -- a string of the current size (ex. '1.5 GB')
    desired_uom -- a string of the new unit of measure (ex. 'TB')
    factor -- the factor used in the conversion (default 1024)
    """
    uom_options = ['B', 'KB', 'MB', 'GB', 'TB']
    # pull the unit of measure (the alphabetic part) out of the size string
    supplied_uom = re.search(r'[a-zA-Z]+', size)
    if supplied_uom:
        supplied_uom = supplied_uom.group()
    else:
        raise ValueError('size argument did not contain expected unit of measure')

    # a unit's position in uom_options is the power of the factor to apply
    supplied_size = float(size.replace(supplied_uom, ''))
    supplied_size_in_bytes = supplied_size * (factor ** (uom_options.index(supplied_uom)))
    converted_size = supplied_size_in_bytes / (factor ** (uom_options.index(desired_uom)))
    # return both the raw number and a formatted string
    return converted_size, '{0:,.2f} {1}'.format(converted_size, desired_uom)

Then, you’ll use it like so:

# non SI conversion
print('Using default conversion factor of 1024:')
print(convert_filesize('1024 B', 'KB'))
print(convert_filesize('1.5 GB', 'MB'))
print(convert_filesize('59.3 GB', 'TB'))

print('\nUsing this IEC/SI conversion factor of 1000:')
# conversion recommended by IEC (https://www.convertunits.com/from/MB/to/GB)
print(convert_filesize('1024 B', 'KB', factor=1000))
print(convert_filesize('1.5 GB', 'MB', factor=1000))
print(convert_filesize('59.3 GB', 'TB', factor=1000))

Which produces the following results:

Using default conversion factor of 1024:
(1.0, '1.00 KB')
(1536.0, '1,536.00 MB')
(0.05791015625, '0.06 TB')

Using this IEC/SI conversion factor of 1000:
(1.024, '1.02 KB')
(1500.0, '1,500.00 MB')
(0.0593, '0.06 TB')

I’m sure there’s much room for improvement, but this routine seems to meet my needs for now.
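One possible simplification, offered only as a sketch: since each unit is just a power of the factor, converting between two units comes down to the difference of their positions in the list. The names UNITS and scale_between are hypothetical ones I picked for this example, and upper-casing the units is an assumption about how forgiving you want the input handling to be:

```python
# same unit ordering as uom_options in the function above
UNITS = ['B', 'KB', 'MB', 'GB', 'TB']


def scale_between(from_uom, to_uom, factor=1024):
    """Return the multiplier that converts a value in from_uom to to_uom."""
    # a positive exponent means moving to a smaller unit (bigger number)
    exponent = UNITS.index(from_uom.upper()) - UNITS.index(to_uom.upper())
    return factor ** exponent


# GB -> MB is one step down the list, so multiply by 1024
print(1.5 * scale_between('GB', 'MB'))
# GB -> TB is one step up the list, so multiply by 1024 ** -1
print(59.3 * scale_between('GB', 'TB'))
```

This skips the intermediate bytes value, which is handy when you already have the number and the unit as separate columns.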

Making Music with Pandas

This year I’ve started taking guitar lessons. While I’m anxious to jump into learning a bunch of songs, my instructor is keen on me developing foundational knowledge in music theory, scales, modes, and so forth–which I’m perfectly fine with, as well.

So far, we’ve covered several ways to play major scales, the pentatonic minor scale, and the natural minor scale. We also talked about scale “relatives”: how every major scale has a relative minor scale built from the same notes, and every natural minor scale has a relative major, the two being relatives of each other.

My instructor then gave me this assignment: play any major scale from the low E string to the high E string, transition into the scale’s relative minor by dropping down three frets, and finish playing out the relative minor scale.

As I’ve been practicing this task, though, I often find myself off by a fret. I have to ask myself, “self, what major scale did you start in? C major? So why are you playing the G# natural minor scale?”

What would really help my practice is to have a handy cheatsheet to show me all the notes in each major scale and highlight the relative minor scale of each major. I could write it all out by hand, but why do that when I have Python and Pandas at my disposal! Here’s what I came up with:

Import my packages

I really only need pandas for this work:

import pandas as pd

Generate the twelve major scales

Here’s the code I came up with to calculate all the notes in each scale. Each scale consists of 15 notes spanning two octaves:

# make up my list of notes
chromatic_scale_ascending = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
# since I usually start on the low E string, rearrange the notes starting on E
scale_from_e = (chromatic_scale_ascending + chromatic_scale_ascending)[4:16]

# the major scale pattern, counted in half steps (frets) from each note to the next:
# whole, whole, half, whole, whole, whole (the final half step returns to the root)
key_steps = [2, 2, 1, 2, 2, 2]  # on the guitar, a whole step is two frets
# repeat the chromatic notes so that stepping up from any root never runs off the end
three_octaves = scale_from_e * 3
major_keys = []
for root in scale_from_e:
    steps_from_root = three_octaves.index(root)
    major_scale = [root]
    # construct the unique notes in the scale
    for step in key_steps:
        steps_from_root += step
        major_scale.append(three_octaves[steps_from_root])

    # repeat the 7 notes and close on the root: 15 notes spanning two octaves
    major_keys.append(major_scale * 2 + [root])

Drop the scales into Pandas for the looks

Writing my list of lists to a pandas dataframe and then displaying that dataframe in a Jupyter notebook makes everything look nice. More importantly, I can use the style property in pandas to highlight the relative minor scale within each major scale:

df_major_keys = pd.DataFrame(major_keys)

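# use this function to highlight the relative minor scales in orange
def highlight_natural_minor(data):
    # start from a frame of empty styles so only columns 5-12 carry any CSS
    df = pd.DataFrame('', index=data.index, columns=data.columns)
    df.iloc[:, 5:13] = 'background-color: orange'
    return df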

df_major_keys.style.apply(highlight_natural_minor, axis=None)

…and here’s my handy major/minor scale cheatsheet:

My major-relative-minor-scale cheatsheet

Column 0 is the tonic/root of the major scale while columns 5 through 12 represent the relative minor scale of that major. So we can see that the E major scale contains the C# minor scale. For example, Ozzy’s Crazy Train apparently moves between A major and F# minor scales, which sound just great together, assuming you ignore Ozzy’s personal eccentricities.
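To double-check the cheatsheet without a notebook, here's a small pandas-free sketch. The function names major_scale and relative_minor are hypothetical ones for this example. It relies on two things described above: the relative minor starts on the sixth note of the major scale, and that note sits three frets below the major root:

```python
CHROMATIC = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
MAJOR_STEPS = [2, 2, 1, 2, 2, 2]  # frets between successive notes


def major_scale(root):
    """Return the seven notes of the major scale starting on root."""
    idx = CHROMATIC.index(root)
    notes = [root]
    for step in MAJOR_STEPS:
        idx = (idx + step) % 12  # wrap around past B back to C
        notes.append(CHROMATIC[idx])
    return notes


def relative_minor(root):
    """Return the relative minor's root and its seven notes."""
    scale = major_scale(root)
    # the minor scale is the major scale rotated to start on its sixth note
    return scale[5], scale[5:] + scale[:5]


minor_root, minor_notes = relative_minor('E')
print(minor_root, minor_notes)

# sanity check: the minor root is three frets below the major root
three_frets_down = CHROMATIC[(CHROMATIC.index('E') - 3) % 12]
print(three_frets_down)
```

Swapping 'E' for any other root should line up with the matching row of the cheatsheet.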

So here’s a cool way to merge my interests in music and Python and Pandas into one large mash of goodness.


© 2025 DadOverflow.com
