DadOverflow.com

Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Converting file size values

Lately, I’ve been challenged with performing calculations and charting of file size values in different units of measure. For example, I’ll have file size values in gigabytes but will have to plot those values against terabytes of disk capacity. I’m a little surprised that Python doesn’t have a ready way to solve this problem. There is the hurry.filesize package, but that requires that you pass a bytes value into the function. What if you only have a gigabytes value to pass? Well, I came up with my own solution, largely inspired by similar solutions. Here’s my function:

import re


def convert_filesize(size, desired_uom, factor=1024):
    """Converts a provided computer data storage value to a different unit of measure.
    
    Keyword arguments:
    size -- a string of the current size (ex. '1.5 GB')
    desired_uom -- a string of the new unit of measure (ex. 'TB')
    factor -- the factor used in the conversion (default 1024)
    """
    uom_options = ['B', 'KB', 'MB', 'GB', 'TB']
    supplied_uom = re.search(r'[a-zA-Z]+', size)
    if supplied_uom:
        supplied_uom = supplied_uom.group()
    else:
        raise ValueError('size argument did not contain expected unit of measure')
        
    supplied_size = float(size.replace(supplied_uom, ''))
    supplied_size_in_bytes = supplied_size * (factor ** (uom_options.index(supplied_uom)))
    converted_size = supplied_size_in_bytes / (factor ** (uom_options.index(desired_uom)))
    return converted_size, '{0:,.2f} {1}'.format(converted_size, desired_uom)

Then, you’ll use it like so:

# non SI conversion
print('Using default conversion factor of 1024:')
print(convert_filesize('1024 B', 'KB'))
print(convert_filesize('1.5 GB', 'MB'))
print(convert_filesize('59.3 GB', 'TB'))

print('\nUsing this IEC/SI conversion factor of 1000:')
# conversion recommended by IEC (https://www.convertunits.com/from/MB/to/GB)
print(convert_filesize('1024 B', 'KB', factor=1000))
print(convert_filesize('1.5 GB', 'MB', factor=1000))
print(convert_filesize('59.3 GB', 'TB', factor=1000))

Which produces the following results:

Using default conversion factor of 1024:
(1.0, '1.00 KB')
(1536.0, '1,536.00 MB')
(0.05791015625, '0.06 TB')

Using this IEC/SI conversion factor of 1000:
(1.024, '1.02 KB')
(1500.0, '1,500.00 MB')
(0.0593, '0.06 TB')

I’m sure there’s much room for improvement, but this routine seems to meet my needs for now.
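As one possible improvement, here is a minimal sketch of a stricter variant. It is an illustration rather than a replacement: the name `convert_filesize_strict` and the `UNITS` constant are hypothetical, and it assumes you want case-insensitive units, an explicit error for unknown units, and only the numeric result. Instead of converting to bytes and back, it multiplies by the factor raised to the difference between the two unit positions.

```python
import re

# Hypothetical variant of convert_filesize; these names are illustrative only.
UNITS = ['B', 'KB', 'MB', 'GB', 'TB']


def convert_filesize_strict(size, desired_uom, factor=1024):
    """Convert a size string such as '1.5 gb' to desired_uom and return a float."""
    match = re.fullmatch(r'\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z]+)\s*', size)
    if not match:
        raise ValueError('expected a number followed by a unit, got {!r}'.format(size))

    value = float(match.group(1))
    supplied_uom = match.group(2).upper()
    desired_uom = desired_uom.upper()
    for uom in (supplied_uom, desired_uom):
        if uom not in UNITS:
            raise ValueError('unsupported unit of measure: {}'.format(uom))

    # one exponent difference replaces the round trip through bytes
    exponent = UNITS.index(supplied_uom) - UNITS.index(desired_uom)
    return value * factor ** exponent


print(convert_filesize_strict('1.5 gb', 'MB'))
print(convert_filesize_strict('1.5 GB', 'MB', factor=1000))
```

The regular expression only accepts plain non-negative decimals, so values with thousands separators (such as '1,024 B') would still need cleaning up first.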

Making Music with Pandas

This year I’ve started taking guitar lessons. While I’m anxious to jump into learning a bunch of songs, my instructor is keen on me developing foundational knowledge in music theory, scales, modes, and so forth–which I’m perfectly fine with, as well.

So far, we’ve covered several ways to play major scales, the pentatonic minor scale, and the natural minor scale. We also talked about scale “relatives”: every major scale has a relative minor scale built from the same notes, and every natural minor scale has a relative major in the same way.

My instructor then gave me this assignment: play any major scale from the low E string to the high E string, transition into the scale’s relative minor by dropping down three frets, and finish playing out the relative minor scale.

As I’ve been practicing this task, though, I often find myself off by a fret. I have to ask myself, “self, what major scale did you start in? C major? So why are you playing the G# natural minor scale?”
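The three-fret rule is easy to check in code. Here is a small sketch, assuming sharps-only note names; the `CHROMATIC` list and the `relative_minor_root` helper are names made up for this illustration:

```python
CHROMATIC = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']


def relative_minor_root(major_root):
    """Return the root of the relative minor: three frets (half steps) below the major root."""
    position = CHROMATIC.index(major_root)
    # the modulo wraps around the start of the list, so C (position 0) lands on A
    return CHROMATIC[(position - 3) % len(CHROMATIC)]


for major in ['C', 'E', 'A']:
    print(major, 'major ->', relative_minor_root(major), 'minor')
```

So if I start in C major, I should be finishing in A minor, not G# minor.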

What would really help my practice is to have a handy cheatsheet to show me all the notes in each major scale and highlight the relative minor scale of each major. I could write it all out by hand, but why do that when I have Python and Pandas at my disposal! Here’s what I came up with:

Import my packages

I really only need pandas for this work:

import pandas as pd

Generate the twelve major scales

Here’s the code I came up with to calculate all the notes in each scale. Each scale consists of 15 notes spanning two octaves:

# make up my list of notes
chromatic_scale_ascending = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
# since I usually start on the low E string, rearrange the notes starting on E
scale_from_e = (chromatic_scale_ascending + chromatic_scale_ascending)[4:16]

# the scale pattern:
# root, whole step, whole step, half step, whole step, whole step, whole step, half step
# (the final half step just returns to the root, so it is left out of key_steps)
key_steps = [2, 2, 1, 2, 2, 2]  # on the guitar, a whole step is two frets
major_keys = []
for root in scale_from_e:
    three_octaves = scale_from_e * 3
    steps_from_root = three_octaves.index(root)
    major_scale = [root]
    # construct the unique notes in the scale
    for step in key_steps:
        steps_from_root += step
        major_scale.append(three_octaves[steps_from_root])
        
    # repeat the seven notes twice and end on the root: 15 notes across two octaves
    major_keys.append(major_scale * 2 + [root])

Drop the scales into Pandas for the looks

Writing my list of lists to a pandas dataframe and then displaying that dataframe in a Jupyter notebook makes everything look nice. More importantly, I can use the style property in pandas to highlight the relative minor scales of each major scale:

df_major_keys = pd.DataFrame(major_keys)

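# use this function to highlight the relative minor scales in orange
def highlight_natural_minor(data):
    # style.apply expects a dataframe of CSS strings, so start with empty styles
    # rather than a copy of the note names
    df = pd.DataFrame('', index=data.index, columns=data.columns)
    df.iloc[:, 5:13] = 'background-color: orange'
    return df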

df_major_keys.style.apply(highlight_natural_minor, axis=None)

…and here’s my handy major/minor scale cheatsheet:

My major-relative-minor-scale cheatsheet

Column 0 is the tonic/root of the major scale, while columns 5 through 12 represent the relative minor scale of that major. So we can see that the E major scale contains the C# minor scale. Likewise, Ozzy’s Crazy Train apparently moves between the A major and F# minor scales, which sound just great together, assuming you ignore Ozzy’s personal eccentricities.
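As a sanity check on that column range, here is a short standalone sketch that rebuilds the E major row and compares columns 5 through 12 against a C# natural minor scale built from its own step pattern. The `build_scale` helper and the constants are hypothetical names for this illustration, not part of the cheatsheet code:

```python
CHROMATIC = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
MAJOR_STEPS = [2, 2, 1, 2, 2, 2]          # the final half step returns to the root
NATURAL_MINOR_STEPS = [2, 1, 2, 2, 1, 2]  # the final whole step returns to the root


def build_scale(root, steps):
    """Walk the chromatic scale from root and return the seven unique notes."""
    position = CHROMATIC.index(root)
    notes = [root]
    for step in steps:
        position = (position + step) % len(CHROMATIC)
        notes.append(CHROMATIC[position])
    return notes


e_major = build_scale('E', MAJOR_STEPS)
two_octaves = e_major * 2 + ['E']    # same 15-note layout as a cheatsheet row
relative_minor = two_octaves[5:13]   # columns 5 through 12

print(relative_minor)
print(relative_minor == build_scale('C#', NATURAL_MINOR_STEPS) + ['C#'])
```

The slice and the independently built C# natural minor scale come out identical, which is what the orange highlighting relies on.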

So here’s a cool way to merge my interests in music and Python and Pandas into one large mash of goodness.

Dealing with missing dates in your dataframes

I work with a lot of time-based data and occasionally run into datasets with chunks of missing time. For example, consider my Kindle “free time” reading data. May 2019 was a good reading month for me, so I’d like to take a closer look at that month and chart out the ebb and flow of my reading time by day. To start with, I’ll just take a look at my reading times by day for the month:

cols = ['accessdate', 'read_mins']
df_may = df_dayinfo[(df_dayinfo.accessdate >= '2019-05-01') & (df_dayinfo.accessdate <= '2019-05-31')][cols]
df_may

My reading minutes for May 2019: there are a few days when I didn’t read at all

If I were to chart that data out, it wouldn’t quite be accurate as there are gaps in the month:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(15, 6))
_ = df_may.groupby('accessdate').sum().plot(ax=ax, marker='o')
_ = ax.axhline(y=0.0, xmin=0.0, xmax=1.0, color='gray', ls='--')
_ = ax.set_title('My Reading Time for May 2019')
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Reading Time (minutes)')

In the past, to account for these missing days, I’d build a second dataframe containing every day in the month with 0.0 read minutes. Then, I’d combine the two dataframes so that I would have entries for all the days of the month:

from datetime import datetime, timedelta

# create a dataframe with all the days of the month and 0.0 read times
start = datetime(2019, 5, 1)
end = datetime(2019, 6, 1)
may_zeros = [[start + timedelta(days=x), 0.0] for x in range(0, (end-start).days)]
df_may_zeros = pd.DataFrame(may_zeros, columns=['accessdate', 'read_mins'])

# now, merge my 0.0 read time df with my actual data to get a full representation of the month
df_may1 = pd.concat([df_may, df_may_zeros]).groupby('accessdate').sum()

# finally, create the chart
fig, ax = plt.subplots(figsize=(15, 6))
_ = df_may1.plot(ax=ax)  # df_may1 is already grouped and summed by accessdate
_ = ax.axhline(y=0.0, xmin=0.0, xmax=1.0, color='gray', ls='--')
_ = ax.set_title('My Reading Time for May 2019')
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Reading Time (minutes)')

Reading times in May including missed days

So, problem solved, but it turns out Pandas has an even better way to solve this problem: use Pandas’ date_range function along with reindex:

# use date_range to create an index for every day in May 2019
# (ISO-formatted dates avoid any month/day ambiguity)
idx = pd.date_range('2019-05-01', '2019-05-31')
# now, group my real data by day, reindex it with the days in May, and fill any missing values with 0
df_may2 = df_may.groupby('accessdate').sum().reindex(idx, fill_value=0)

# now, we can create the chart
fig, ax = plt.subplots(figsize=(15, 6))

# i can overlay my original chart to see the differences if I want
# _ = df_may.groupby('accessdate').sum().plot(ax=ax, color='r')
_ = df_may2.plot(ax=ax, marker='o')
_ = ax.axhline(y=0.0, xmin=0.0, xmax=1.0, color='gray', ls='--')
_ = ax.set_title('My Reading Time for May 2019')
_ = ax.set_xlabel('Date')
_ = ax.set_ylabel('Reading Time (minutes)')

The same results with the aid of date_range

If you don’t want to assume your missing days are simply 0.0 values, reindex will, by default, fill the values with NaN. You could then run interpolate over your dataframe and calculate some other value.
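Here is a minimal sketch of that NaN-then-interpolate approach. The reading minutes below are made up for illustration (they are not my Kindle data), and the example assumes the index already holds real datetime values:

```python
import pandas as pd

# made-up reading minutes with May 3 and May 4 missing
data = pd.DataFrame(
    {'read_mins': [30.0, 60.0, 15.0]},
    index=pd.to_datetime(['2019-05-01', '2019-05-02', '2019-05-05']),
)

idx = pd.date_range('2019-05-01', '2019-05-05')
with_gaps = data.reindex(idx)     # no fill_value, so the missing days become NaN
filled = with_gaps.interpolate()  # linear interpolation is the default method

print(with_gaps)
print(filled)
```

Whether interpolated values make sense depends on the data: for reading minutes, a missing day most likely really was a zero, so fill_value=0 is probably the more honest choice there.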

So now you have two ways to fill in missing dates in your dataframes! In addition to date_range, Pandas has lots of other general-purpose functions worth checking out.
