Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

Tag: jupyter_notebook (Page 4 of 17)

Poor man’s database modeling

What can Donald do if he can’t afford an ERD modeling tool?

Although I’m no database administrator, every-so-often I need to model out table relationships as part of some new project. Most folks would use a tool like Microsoft Visio to model their entity relationships. However, Visio costs money and I don’t have it…er, I don’t have Visio. (I don’t have much money, either, given the exorbitant costs of college these days.)

So, what’s a cheapskate like myself to do? Why, use free tools, that’s what! Lately, I’ve found some joy in the combination of SQLite and DBeaver.

Step 1: Code out your database

In a text editor like Notepad++, start coding out your database. Let’s take the canonical example of an orders database. You might rough out some of those tables with the following script:

DROP TABLE IF EXISTS items;
CREATE TABLE items(
	item_key VARCHAR(64) NOT NULL,
	item_desc VARCHAR(100),
	unit_price DECIMAL(5,2),
	CONSTRAINT item_pk PRIMARY KEY (item_key)
);

DROP TABLE IF EXISTS customers;
CREATE TABLE customers(
	customer_key VARCHAR(64) NOT NULL,
	customer_name VARCHAR(100),
	CONSTRAINT customer_pk PRIMARY KEY (customer_key)
);

DROP TABLE IF EXISTS orders;
CREATE TABLE orders(
	order_key VARCHAR(64) NOT NULL,
	customer_fk VARCHAR(64) NOT NULL,
	order_date DATETIME DEFAULT CURRENT_TIMESTAMP,
	CONSTRAINT order_pk PRIMARY KEY (order_key),
	CONSTRAINT customer_foreign_key FOREIGN KEY (customer_fk) REFERENCES customers(customer_key)
);

DROP TABLE IF EXISTS order_details;
CREATE TABLE order_details(
	order_fk VARCHAR(64) NOT NULL,
	item_fk VARCHAR(64) NOT NULL,
	quantity INTEGER,
	CONSTRAINT order_detail_pk PRIMARY KEY (order_fk, item_fk),
	CONSTRAINT order_foreign_key FOREIGN KEY (order_fk) REFERENCES orders(order_key),
	CONSTRAINT item_foreign_key FOREIGN KEY (item_fk) REFERENCES items(item_key)
);

Save your script to a file like orders_db.sql.

Step 2: Generate a SQLite database

Now, pipe your script into SQLite to generate your orders database. In Windows, you can open a command shell and run a command like the following:

C:\sqlite-tools-win32-x86-3300100\sqlite3.exe orders.db < orders_db.sql

Step 3: Open your database in DBeaver

DBeaver supports tens of different types of databases, including SQLite. Creating a connection to your newly created SQLite database is relatively easy. Follow the guide to creating a new connection. When it comes time to select your database type, you can click the “SQL” tab and scroll down until you find the selection for SQLite. All your connection needs is a path to the database file you created in Step 2–in my example, I called the database orders.db.

Step 4: Create an ER diagram from the tables in your database

DBeaver includes a cool feature called Entity Relation Diagrams where it allows you to easily visualize your database. In the application, you can select File > New from the menu and find the ER Diagram selection under the DBeaver folder. From there, you should be presented with a list of available databases including the one you setup in Step 2. Go to your orders database, find your tables, and select all of them. Once you name your ER Diagram, you should be able to click the Finish button. If all goes well, you should see a visual like the following:

My ER Diagram in DBeaver (in dark mode)

Having a visual of your database helps you identify missing columns, bad or missing relationships, etc. You should now be able to identify these problems, go back to Step 1 with the code to fix them, and repeat the steps again until your model starts to look more appropriate. With my tables visualized in an ER diagram, I find it easier to share and discuss my ideas with teammates rather than trying to talk through the code. ER diagrams also make for good images in documentation.

Bonus: your ER Diagrams in Jupyter Notebooks

Recently, I was looking for more information on using markdown in Apache Zeppelin notebooks when I happened upon this discussion of pegdown parsers in Zeppelin. The article mentioned the YUML plugin. YUML seems to be a handy way of creating simple UML diagrams.

So, on my laptop, I opened up a new Jupyter notebook, created a new “markdown” cell, and wrote out the following HTML:

<img src="http://yuml.me/diagram/scruffy/class/[customers]<>1-*>[orders], [orders]<>1-*>[order_details], [items]<>1-*>[order_details], [note: My database model {bg:cornsilk}]" >

This code then rendered this UML diagram:

My ERD created with yuml.me

Certainly not as detailed as my SQLite/DBeaver solution, but quicker to achieve and still helpful with team discussions and documentation. In the comments, feel free to add your “poor person” approach to database modeling.

Learning Guitar with Python, Part 2

Here’s another installment in my occasional series on using Python to help me learn guitar.

At the start of my practice sessions, I play scales to both help me get warmed up and help me know where the notes in a given key are in relation to each other. I’m learning that scale patterns are important to soloing, so I’m starting to commit a lot of these scales to memory. Having them easily available and all together in a Jupyter Notebook makes this process a lot easier for me.

So, here’s the code I wrote and accompanying image of the modal scales I practice on a regular basis:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display_html

%matplotlib inline


Ionian = {'Low E': ['', '1', '', '2'], 'A': ['3', '4', '', '5'], 'D': ['6', '', '7', '1'],
          'G': ['2', '', '3', '4'], 'B': ['', '5', '', '6'], 'High E': ['7', '1', '', '']}
Dorian = {'Low E': ['', '2', '', '3', '4'], 'A': ['', '5', '', '6', ''], 'D': ['7', '1', '', '2', ''],
          'G': ['3', '4', '', '5', ''], 'B': ['', '6', '', '7', '1'], 'High E': ['', '2', '', '', '']}
Phrygian = {'Low E': ['3', '4', '', '5'], 'A': ['6', '', '7', '1'], 'D': ['2', '', '3', '4'],
            'G': ['5', '', '6', ''], 'B': ['7', '1', '', '2'], 'High E': ['3', '', '', '']}
Lydian = {'Low E': ['', '4', '', '5'], 'A': ['6', '', '7', '1'], 'D': ['2', '', '3', '4'],
          'G': ['5', '', '6', ''], 'B': ['7', '1', '', '2'], 'High E': ['3', '4', '', '']}
Myxolydian = {'Low E': ['', '5', '', '6', ''], 'A': ['7', '1', '', '2', ''], 'D': ['3', '4', '', '5', ''],
          'G': ['6', '', '7', '1', ''], 'B': ['', '2', '', '3', '4'], 'High E': ['', '5', '', '', '']}
Aeolian = {'Low E': ['', '6', '', '7', '1'], 'A': ['', '2', '', '3', '4'], 'D': ['', '5', '', '6', ''],
          'G': ['7', '1', '', '2', ''], 'B': ['', '3', '4', '', '5'], 'High E': ['', '6', '', '', '']}
Locrian = {'Low E': ['7', '1', '', '2'], 'A': ['3', '4', '', '5'], 'D': ['6', '', '7', '1'],
            'G': ['2', '', '3', '4'], 'B': ['', '5', '', '6'], 'High E': ['7', '', '', '']}

modes = {'Ionian':Ionian, 'Dorian':Dorian, 'Phrygian':Phrygian, 'Lydian':Lydian, 'Myxolydian':Myxolydian, 
         'Aeolian':Aeolian, 'Locrian':Locrian}
modes_html = '<h3>Scales for Each Mode</h3>'

for i, mode_name in enumerate(modes):
    nbr_of_frets = len(modes[mode_name]['Low E'])
    df_mode = pd.DataFrame(modes[mode_name], index=np.arange(1, nbr_of_frets+1))
    # https://stackoverflow.com/a/50899244
    df_mode_styler = df_mode.style.set_table_attributes("style='display:inline'").set_caption(mode_name)
    modes_html += df_mode_styler._repr_html_() + '&nbsp;&nbsp;&nbsp;'
    if (i+1) % 3 == 0:
        modes_html += '<p></p>'

display_html(modes_html, raw=True)

This code in a Jupyter Notebook produces this graphic:

The numbers for each note represent the note’s position in the key in which you’re playing. Ionian pushes the first note of the key. For example, if you’re playing in the Key of C major, the “1” would be C. You’d find C on your Low E string (8th fret) and start playing the scale. If you wanted to play the Dorian mode of that key–D Dorian–you’d find D on your Low E string–a whole step up to the 10th fret–and start your Dorian scale there.

When I practice these scales, I’ve been taking a few approaches. For one, I play the scale going up and then turning around and coming back. I’ll start with the first note on the Low E string and play all the way up to the last note on the High E string and then play back down to where I started. To keep things interesting, sometimes I’ll reverse this process so that I start on that last High E string note, play down to the first note on the Low E and then turn around and play back to where I started.

Next, I’ll play each scale for every fret up the neck until I reach the 12th fret. So, for the Ionian scale, I’ll start the scale on the 2nd fret, play it through, move to the 3rd fret, play it through, and repeat this process until I reach the 12th fret. I do this for all the scales, so I play each scale about 10 or so times in a given practice session.

Finally, as I get these scales under my belt, I’m trying to play them faster and faster. I’ll usually try to power through the first four modes as fast and accurate as I can, take a small break, and then finish the last three. Then, I’ll move on to the rest of my practice.

More Python and music notes to follow!

Pulling public data into dataframes

Some of what I write about here is inspired by challenges I encounter at work. Often the hardest part in describing those challenges is substituting public data and scenarios for my work-specific ones. Sites like kaggle.com and wikipedia.org really come to the rescue.

Recently, I had a circumstance where I need to process a dataframe in which one of the columns contained a list of values for each field. I think I came up with a clever way of dealing with that obstacle and would like to discuss it in these pages, but first…what sort of public data can I collect to replicate my scenario? How about some English football?! Let’s take some some football titles–say, the UEFA Champions League champion, the UEFA Europa League champion, and the UEFA Super Cup champion–and build a dataframe that shows which English football clubs have ever won any of these titles.

Step 1: Collect the raw data for each of these titles

I’ve written a few times before on how to use Python to collect public data from the Internet. This time around, I went with a slightly more manual approach. I went to each of the three Wikipedia pages, hit F12 to bring up my developer tools, and used those tools to copy just the HTML code for the data tables I was interested in–the ones listing each club team and the number of times they won the particular title. I simply copied the HTML to three different data files for subsequent processing.

Step 2: Read in the data files

The pandas read_html function is such a time saver here. Here’s the code I came up with to read in my data files:

import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline  # for jupyter notebook
matplotlib.style.use('seaborn')

df_champions = pd.read_html('./data/uefa_champions.txt')[0]
df_champions = df_champions[df_champions.Titles>0]
df_champions['title'] = 'UEFA Champions League champ'

df_europa = pd.read_html('./data/uefa_europa_league.txt')[0]
df_europa = df_europa[df_europa.Winners>0]
df_europa['title'] = 'UEFA Europa League champ'

df_super = pd.read_html('./data/uefa_super.txt')[0]
df_super = df_super[df_super.Winners>0]
df_super['title'] = 'UEFA Super Cup champ'

The df_champions dataframe looks like this:

The last five records in the df_champions dataframe. Which of these are English teams?

These dataframes are looking good, but how do I know which teams are the English teams? Wikipedia identifies each team’s nationality with a flag icon, but pandas isn’t pulling in that data. Time for a little HTML parsing with BeautifulSoup.

Step 3: Collect the names of the English teams

Since pandas didn’t pull in the nationality information, I had to revisit each of the HTML data files and parse out that information with BeautifulSoup:

epl_teams = []
for filepath in ['./data/uefa_champions.txt', './data/uefa_europa_league.txt', './data/uefa_super.txt']:
    with open(filepath, 'r') as f:
        soup = BeautifulSoup(f)

        for th in soup.findAll('th'):
            span = th.find('span', {'class':'flagicon'})
            if span:
                a = span.find('a', {'title':'England'})
                if a:
                    epl_teams.append(th.text.strip())

Step 4: Filter each title list down to just the English teams

With the names of the English teams, I can filter my dataframes down accordingly:

df_champions = df_champions[df_champions.Club.isin(epl_teams)]
df_europa = df_europa[df_europa.Club.isin(epl_teams)]
df_super = df_super[df_super.Club.isin(epl_teams)]

Step 5: Merge the dataframes together

I need to merge my three dataframes into a single one. The pandas merge function did the trick:

df_combo = df_champions[['Club','title']].merge(df_europa[['Club','title']], on='Club', how='outer')
df_combo = df_combo.merge(df_super[['Club','title']], on='Club', how='outer')

And the results:

Merging dataframes with columns of the same name forces pandas to add suffices to those column names

Step 6: Combine the “title” columns together into one

The final step–to just get this public data into a shape to replicate my original problem–is to merge the three “title” columns into a single one. Two lines do the deed:

df_combo['title'] = df_combo.apply(lambda r: [t for t in [r.title_x, r.title_y, r.title] if t is not np.nan], axis=1)
df_combo = df_combo[['Club', 'title']]
A dataframe with a list of values in the “title” column

Phew. That was a fair amount of work just to pull together some public data to replicate one of my work datasets. And the actual code I wrote to analyze a dataframe containing lists in a column? Well, that will have to wait for a future post.

« Older posts Newer posts »

© 2024 DadOverflow.com

Theme by Anders NorenUp ↑