Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Is Third Normal Form obsolete?

On a recent episode of .NET Rocks, hosts Carl and Richard along with guest Julie Lerman have an interesting discussion–right at the beginning of the episode–on how important it is these days to normalize the table structures in your relational databases.

Richard goes so far as to suggest it might be an “obsolete concept.” That it is more important to persist “the truth at the time.”

“I was taught Third Normal Form decades ago by C.J. Date…and so it’s been a real struggle to say it’s my instinct and I think it’s wrong!”

Richard Campbell, 14 November 2019

Personally, I’ve felt a little guilty contemplating denormalized database solutions to solve my problems on different occasions. It’s certainly a relief to know that a) I’m not alone and b) denormalized solutions might be more the norm than the exception.

divmod, for the win!

I had a situation recently where I had a list of values laid out in a grid like so:

I had to figure out the row and column positions for each value.

So, let’s start with a list of numbers:

some_list = [i for i in range(15)]

First, how can I easily figure out what row each number belongs to? If you said “mod,” you’d be right! Take each number modulo the size of the group (in this case, 5); the remainder is the row:

group_size = 5
for n in some_list:
    print('Number {0} belongs to row {1}'.format(n, n % group_size))
Number 0 belongs to row 0
Number 1 belongs to row 1
Number 2 belongs to row 2
Number 3 belongs to row 3
Number 4 belongs to row 4
Number 5 belongs to row 0
Number 6 belongs to row 1
Number 7 belongs to row 2
Number 8 belongs to row 3
Number 9 belongs to row 4
Number 10 belongs to row 0
Number 11 belongs to row 1
Number 12 belongs to row 2
Number 13 belongs to row 3
Number 14 belongs to row 4

Now, how do I figure out what column each value belongs to? For that, I need to divide each number by the group size and take the int portion of the value. An easier way to do that is to use Python floor division:

group_size = 5
for n in some_list:
    print('Number {0} belongs to column {1}'.format(n, n // group_size))
Number 0 belongs to column 0
Number 1 belongs to column 0
Number 2 belongs to column 0
Number 3 belongs to column 0
Number 4 belongs to column 0
Number 5 belongs to column 1
Number 6 belongs to column 1
Number 7 belongs to column 1
Number 8 belongs to column 1
Number 9 belongs to column 1
Number 10 belongs to column 2
Number 11 belongs to column 2
Number 12 belongs to column 2
Number 13 belongs to column 2
Number 14 belongs to column 2
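One caveat worth a quick sketch: floor division and "take the int portion" agree for the non-negative numbers used here, but they differ for negative ones, because // rounds toward negative infinity while int() truncates toward zero. The negative value below is only an illustration and does not come from the grid problem.

```python
group_size = 5

# For non-negative numbers, floor division matches truncating the quotient
for n in range(15):
    assert n // group_size == int(n / group_size)

# For negative numbers the two differ: // rounds down, int() truncates toward zero
floored = -7 // group_size         # -2
truncated = int(-7 / group_size)   # -1
print(floored, truncated)
```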

But I really need both the row and column values together. Sure, I could write my mod operation on one line and my floor division operation on another, but Python has a cool function to do both at the same time, divmod:

group_size = 5
for n in some_list:
    col, row = divmod(n, group_size)
    print('Number {0} belongs at row {1}, column {2}'.format(n, row, col))
Number 0 belongs at row 0, column 0
Number 1 belongs at row 1, column 0
Number 2 belongs at row 2, column 0
Number 3 belongs at row 3, column 0
Number 4 belongs at row 4, column 0
Number 5 belongs at row 0, column 1
Number 6 belongs at row 1, column 1
Number 7 belongs at row 2, column 1
Number 8 belongs at row 3, column 1
Number 9 belongs at row 4, column 1
Number 10 belongs at row 0, column 2
Number 11 belongs at row 1, column 2
Number 12 belongs at row 2, column 2
Number 13 belongs at row 3, column 2
Number 14 belongs at row 4, column 2
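As a quick sanity check (a sketch, not part of the original notebook), divmod(n, group_size) returns the same pair as (n // group_size, n % group_size), and the original position can be rebuilt from that pair:

```python
group_size = 5
some_list = list(range(15))

for n in some_list:
    col, row = divmod(n, group_size)
    # divmod returns (quotient, remainder), in that order
    assert (col, row) == (n // group_size, n % group_size)
    # the position can be recovered from its column and row
    assert col * group_size + row == n

positions = [divmod(n, group_size) for n in some_list]
print(positions[7])  # (1, 2): column 1, row 2
```

The order matters: the quotient comes first, which is why the unpacking reads col, row rather than row, col.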

But now let’s get more real and use this feature to write out one of the greatest catalogs of all time: the albums of “Weird Al” Yankovic:

import matplotlib.style as style
import matplotlib.pyplot as plt
import numpy as np

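# Jupyter magic: drop the next line when running as a plain script
%matplotlib inline
# on matplotlib 3.6 and later this style is named 'seaborn-v0_8-poster'
style.use('seaborn-poster')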

group_size = 5
albums = ['"Weird Al" Yankovic (1983)', 
          '"Weird Al" Yankovic in 3-D (1984)',
          'Dare to Be Stupid (1985)',
          'Polka Party! (1986)',
          'Even Worse (1988)',
          'Peter and the Wolf (1988)',
          'UHF - Original Motion Picture\nSoundtrack and Other Stuff (1989)',
          'Off the Deep End (1992)',
          'Alapalooza (1993)',
          'Bad Hair Day (1996)',
          'Running with Scissors (1999)',
          'Poodle Hat (2003)',
          'Straight Outta Lynwood (2006)',
          'Alpocalypse (2011)',
          'Mandatory Fun (2014)']

# set up my grid chart
fig, ax = plt.subplots()
ax.set_xticks(np.arange(0, (len(albums)/group_size) + 1))
ax.set_yticks(np.arange(0, group_size + 1))
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_title('The Catalog of "Weird Al" Yankovic')
plt.grid()

# now, enumerate through the album list and use divmod to get row and column values to write out the album names
for i, album in enumerate(albums):
    col, row = divmod(i, group_size)
    ax.annotate(album, xy=(col+.1, row+.4), xytext=(col+.1, row+.4))

plt.show()
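The loop above fills the grid one column at a time, with each column holding group_size albums. If you would rather fill across the rows first, one possible variation is to divide by the number of columns instead. The helper name grid_position and its fill parameter are my own inventions for illustration, not anything from matplotlib:

```python
def grid_position(index, n_rows, n_cols, fill='columns'):
    """Return (row, col) for a flat index in an n_rows x n_cols grid.

    fill='columns' completes one column before starting the next (what
    the album chart does); fill='rows' completes one row before starting
    the next. Hypothetical helper, for illustration only.
    """
    if not 0 <= index < n_rows * n_cols:
        raise IndexError('index {0} does not fit in the grid'.format(index))
    if fill == 'columns':
        col, row = divmod(index, n_rows)
    elif fill == 'rows':
        row, col = divmod(index, n_cols)
    else:
        raise ValueError("fill must be 'columns' or 'rows'")
    return row, col


# 15 albums in a grid of 5 rows by 3 columns
print(grid_position(6, 5, 3))               # (1, 1)
print(grid_position(6, 5, 3, fill='rows'))  # (2, 0)
```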

So, divmod: love it, use it! Check out my full code here.

Visualizing Six Degrees of Kevin Bacon

As I try to find new and interesting ways to visualize data, I have made occasional use of the Network Graph. I’ve written some network graphs in D3, but Python has a great tool, networkx, that makes building and visualizing your network graphs a breeze.

Now…what sort of data can I graph? Well, back in the day, my friends and I would play the Six Degrees of Kevin Bacon game. How about I visualize that?

To start with, I downloaded a movie and actor dataset from Kaggle.

Step 1: Load my packages

import pandas as pd
import ast  # fantastic package (https://docs.python.org/3/library/ast.html)
import networkx as nx
import random
import matplotlib.pyplot as plt

%matplotlib inline

Step 2: Load the CSV into a dataframe

This is pretty boilerplate, but the thing to note here is my use of the ast package. In the CSV, the cast and crew columns are lists of lists of dictionaries (say that five times fast). Normally, pandas would just see those brackets and curly braces as characters in a string and would cast the entire column as a string. However, you can use the literal_eval function of the ast package to get pandas to see those columns as lists of lists of dictionaries.

df = pd.read_csv('../../../tmdb_5000_credits.csv')
df['cast'] = df.cast.apply(ast.literal_eval)
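To see what literal_eval does for a single cell, here is a minimal sketch. The sample string is made up for illustration (it is not a row from the Kaggle file) and only mimics the shape described above:

```python
import ast

# A made-up cell value: to pandas this is just one long string
raw_cast = '[{"name": "Actor One", "order": 0}, {"name": "Actor Two", "order": 1}]'
print(type(raw_cast).__name__)   # str

# literal_eval parses Python literals (lists, dicts, strings, numbers)
# without executing arbitrary code the way eval() would
cast = ast.literal_eval(raw_cast)
names = [member['name'] for member in cast]
print(type(cast).__name__)       # list
print(names)                     # ['Actor One', 'Actor Two']
```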

Step 3: Load the database into a network graph

Here, I iterate through the list of movies, adding each movie as a node in the graph. With each movie, I iterate through the cast list adding each actor to the graph, as well. I make sure to track the actors as I add them so I only add them once. As I add both movie and cast members, I add an “edge” (connection) between the movie and the actor that worked in it. One cool thing about networkx graphs is that you can also add a data payload to each node and edge. For each movie and actor, I add an associated “type” and “color”.

G = nx.Graph()
added_actor = []

def add_movie_and_actors_to_graph(row):
    # networkx 2.x takes node attributes as keyword arguments, not as a positional dict
    G.add_node(row.title, type='movie', color='blue')
    for actor in row.cast:
        if actor['name'] not in added_actor:
            # Kevin Bacon gets his own color so he stands out in the chart
            color = 'red' if actor['name'] == 'Kevin Bacon' else 'green'
            G.add_node(actor['name'], type='actor', color=color)
            added_actor.append(actor['name'])
        G.add_edge(row.title, actor['name'])


_ = df.apply(add_movie_and_actors_to_graph, axis=1)

Step 4: Test the theory

Now the fun stuff. Let’s first pick five random actors from the dataset:

random_actors = random.sample(added_actor, 5)

What are the Bacon numbers for these actors?

NetworkX has an excellent function, shortest_path, that will tell me the shortest path between the randomly selected actor and Kevin Bacon:

for a in random_actors:
    path = nx.shortest_path(G,source=a,target='Kevin Bacon')
    print('{0} has a Bacon score of: {1}'.format(a, int(len(path)/2)))
    print(path)
Veriano Ginesi has a Bacon score of: 2
['Veriano Ginesi', 'The Good, the Bad and the Ugly', 'Eli Wallach', 'Mystic River', 'Kevin Bacon']
Alisha Boe has a Bacon score of: 3
['Alisha Boe', 'Paranormal Activity 4', 'Stephen Dunham', 'Catch Me If You Can', 'Tom Hanks', 'Apollo 13', 'Kevin Bacon']
Sue Pierce has a Bacon score of: 3
['Sue Pierce', 'The Hangover', 'Bradley Cooper', 'Guardians of the Galaxy', 'Michael Rooker', 'JFK', 'Kevin Bacon']
Wallace Wolodarsky has a Bacon score of: 2
['Wallace Wolodarsky', 'Fantastic Mr. Fox', 'Meryl Streep', 'The River Wild', 'Kevin Bacon']
Sébastien Faglain has a Bacon score of: 4
['Sébastien Faglain', 'The Country Doctor', 'François Cluzet', 'A Monster in Paris', 'Vanessa Paradis', 'Yoga Hosers', 'Johnny Depp', 'Black Mass', 'Kevin Bacon']
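The len(path)/2 arithmetic works because every path alternates actor, movie, actor, movie, and so on, so the number of movies in the path is the Bacon number. A small sketch using the first path printed above:

```python
# First path from the output above: actors sit at even positions, movies at odd ones
path = ['Veriano Ginesi', 'The Good, the Bad and the Ugly', 'Eli Wallach',
        'Mystic River', 'Kevin Bacon']

actors = path[0::2]
movies = path[1::2]

# An alternating path with k movies has 2k + 1 nodes, so floor division recovers k
bacon_number = len(path) // 2
assert bacon_number == len(movies)

print(actors)        # ['Veriano Ginesi', 'Eli Wallach', 'Kevin Bacon']
print(movies)        # ['The Good, the Bad and the Ugly', 'Mystic River']
print(bacon_number)  # 2
```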

Sure enough, they’re all under seven degrees of separation!

Step 5: Graph the test

G_chart = nx.Graph()  # a new graph I spin up just for charting purposes

# populate the new graph with the random actors and their paths to Kevin Bacon
for a in random_actors:
    nodes_in_path = nx.shortest_path(G, source=a, target='Kevin Bacon')
    for n in nodes_in_path:
        if not G_chart.has_node(n):
            # copy the node and its data payload into the graph i'll use in my chart
            G_chart.add_node(n, **G.nodes[n])
    # Graph.add_path was removed in networkx 2.4; the module-level function replaces it
    nx.add_path(G_chart, nodes_in_path)
    
fig, ax = plt.subplots(figsize=(15, 15))

# networkx layouts can be really tricky: something you just have to play with
pos = nx.spring_layout(G_chart, scale=0.25)
#pos = nx.circular_layout(G_chart)

color_map = [n[1]['color'] for n in G_chart.nodes(data=True)]
labels = {n:n for n in G_chart.nodes()}

plt.title('Six Degrees of Kevin Bacon')
ax.axis('off')
nx.draw_networkx(G_chart, pos, node_color=color_map, alpha=0.5, labels=labels, with_labels=True, ax=ax)

from matplotlib.lines import Line2D
custom_legend = [Line2D([0], [0], marker='o', markerfacecolor='g', markersize=10, color='w', label='Actor'), 
                 Line2D([0], [0], marker='o', markerfacecolor='b', markersize=10, color='w', label='Movie'),
                 Line2D([0], [0], marker='o', markerfacecolor='r', markersize=10, color='w', label='The Man Himself')]
ax.legend(handles=custom_legend, loc='lower right')

And now the network graph:

A little clunky, but you get the gist

The lingering question

So, is the Bacon theory really true? I iterated through all the actors in the dataset and calculated their paths. Out of over 54,000 actors, every one with a path to Kevin Bacon was within six degrees of separation, so the theory’s pretty spot-on. However, over 600 had no path to him at all.

more_than_six_degrees = []
no_path_at_all = []

for a in added_actor:
    try:
        path = nx.shortest_path(G,source=a,target='Kevin Bacon')
        if int(len(path)/2) > 6:
            #print('Oh-oh: Looks like actor {0} has {1} degrees of separation from Kevin Bacon.'.format(a, int(len(path)/2)))
            more_than_six_degrees.append(a)
    except nx.NetworkXNoPath as e:
        #print('Woh: it appears actor {0} has no path at all to Kevin Bacon!'.format(a))
        no_path_at_all.append(a)
        
print('In this database...out of {0:,} actors...'.format(len(added_actor)))
print('There were {0} actors with more than 6 degrees of separation from Kevin Bacon.'.format(len(more_than_six_degrees)))
print('There were {0} actors with no path to Kevin Bacon at all!'.format(len(no_path_at_all)))
In this database...out of 54,201 actors...
There were 0 actors with more than 6 degrees of separation from Kevin Bacon.
There were 673 actors with no path to Kevin Bacon at all!
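Calling shortest_path once per actor repeats a lot of work. A possible alternative is a single breadth-first search starting from Kevin Bacon, which yields the distance to every reachable node in one pass; anything missing from the result has no path. The sketch below uses plain Python and a tiny toy graph (the edges come from two of the paths printed earlier, plus one made-up disconnected pair), so it illustrates the idea rather than reproducing the notebook's code. The function names are hypothetical. On a real graph, I believe networkx's single_source_shortest_path_length does the same job.

```python
from collections import deque

def distances_from(adjacency, source):
    """Breadth-first search: map each reachable node to its hop count from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def build_adjacency(edges):
    """Build an undirected adjacency mapping from (movie, actor) pairs."""
    adjacency = {}
    for movie, actor in edges:
        adjacency.setdefault(movie, set()).add(actor)
        adjacency.setdefault(actor, set()).add(movie)
    return adjacency

edges = [
    ('Mystic River', 'Kevin Bacon'),
    ('Mystic River', 'Eli Wallach'),
    ('The Good, the Bad and the Ugly', 'Eli Wallach'),
    ('The Good, the Bad and the Ugly', 'Veriano Ginesi'),
    ('The River Wild', 'Kevin Bacon'),
    ('The River Wild', 'Meryl Streep'),
    ('Made-Up Movie', 'Made-Up Actor'),  # hypothetical, disconnected from the rest
]
actors = {actor for _, actor in edges}

hops = distances_from(build_adjacency(edges), 'Kevin Bacon')
# each actor-to-actor step crosses one movie, so halve the hop count
bacon_numbers = {a: hops[a] // 2 for a in actors if a in hops}
no_path_at_all = sorted(a for a in actors if a not in hops)

print(bacon_numbers['Veriano Ginesi'])  # 2
print(no_path_at_all)                   # ['Made-Up Actor']
```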

You can check out my complete source code here.


© 2025 DadOverflow.com
