Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

Tag: jupyter_notebook (Page 14 of 17)

Downloading files in bulk

With the recent changes in Firefox, several of my favorite plugins no longer work. That’s really frustrating. One of those plugins is DownloadThemAll.

Suppose you navigate to a webpage that contains links to several files you wish to download. You could: 1) right-click each link, one by one, select “Save As” from the context menu, and download each file or 2) open up the DownloadThemAll plugin window, check the files you want to download from a list DownloadThemAll auto-magically populates, and click the “download” button to download all those files together.

Or at least you could go with Option #2 until Firefox Quantum came along and made the plugin incompatible. So, what’s one to do? Well, here’s one approach I tried recently in Python, although it could be a little gnarly for the non-technical:

Step 1: Load a few helpful Python packages

Requests, BeautifulSoup, and lxml will serve you well!


import requests
from bs4 import BeautifulSoup
import lxml

Step 2: Load up and parse the webpage

Use the requests package to grab the webpage you want to work with and then use BeautifulSoup to parse it so that it’s easier to find the links you want:


url = 'https://ia801501.us.archive.org/zipview.php?zip=/12/items/NSAsecurityPosters1950s60s/NSAsecurityPosters_1950s-60s_jp2.zip'
result = requests.get(url)
soup = BeautifulSoup(result.content, "lxml")

Step 3: Download those files!

Well, it’s a little more complicated than that. First, I had to take a look at the HTML source code of the webpage I wanted to work with and then find the HTML elements containing the download links. In this case, the challenge was relatively easy: I just needed to find all the elements with an “id” attribute equal to “jpg”. With BeautifulSoup, I could easily find all those elements, loop through them, and pull out the data I needed, including the download link. With that download link, I could use requests again to pull down the content and save it to disk:


for jpg_cell in soup.find_all(id="jpg"):
    link = 'https:' + jpg_cell.find('a').attrs['href']
    # Part of the download URL contains the percent-encoded sequence '%2F'. Swap it for a
    # forward slash so the file name can be split off the end of the link
    file_name = link.replace('%2F', '/').split('/')[-1]
    print(link + '  ' + file_name)  # just to visually validate I parsed the link and filename correctly
    r = requests.get(link)
    # assumes the ./data directory already exists
    with open("./data/" + file_name, 'wb') as f:
        f.write(r.content)
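
If you would rather not hand-roll the URL handling, Python's standard library can do the same two jobs. The sketch below is an alternative to the string concatenation and replace call above, not what my notebook actually does: urljoin resolves a protocol-relative href against the page URL, and unquote decodes every percent-escape (not only %2F) before the file name is split off. The helper names and the example.org URLs are made up for illustration.

```python
from urllib.parse import unquote, urljoin


def resolve_link(page_url, href):
    """Turn an href such as '//host/path?x=y' into a full URL, borrowing the page's scheme."""
    return urljoin(page_url, href)


def file_name_from_link(link):
    """Decode percent-escapes (e.g. %2F -> /) and keep whatever follows the last slash."""
    return unquote(link).split('/')[-1]


# Hypothetical values, shaped like the links described in this post
page_url = 'https://example.org/zipview.php?zip=/items/posters_jp2.zip'
href = '//example.org/view.php?zip=/items/posters_jp2.zip&file=posters_jp2%2Fposter_0001.jpg'

link = resolve_link(page_url, href)
print(link + '  ' + file_name_from_link(link))
```

The download itself stays the same: pass the resolved link to requests.get and write r.content to disk under the decoded file name.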

Check out the complete source code on my GitHub page. Also, check out this fantastic article from DataCamp.com that goes into even greater detail to explain web scraping in Python.

Not so fast!

Ok, so you can go through that somewhat cumbersome process for all the download jobs you might have, but in the future, I think I’m just going to pop over to Chrome and use the Chrono Download Manager extension.

Rick Rolling with matplotlib

I know this meme is 100 Internet years old, but when I read this awesome post from Little Miss Data, I knew what I had to do:

Ok, ok.  So there are a few problems with the GIF:

  1. Rick’s head isn’t exactly centered and doesn’t pivot in a fluid way.
  2. It would probably be nicer if the labels were over their respective wedges.

In Little Miss Data’s post, she uses R code to first generate her charts, then save them to PNG files, then load those PNG files as background images to pre-created, animated GIFs.  Unfortunately, I could find no way in Python to replicate that behavior.  So, I cheated a little: in a loop, I created my chart, dropped the Rickster on top of it, and saved the image to disk.  With every loop iteration, I slightly rotated Rick.  Outside of my loop, I used the imageio package to load all the saved images into one animated GIF.  Here’s a look at my code:


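# imports inferred from the names used below
import glob
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from imageio import imread, mimsave
from scipy import ndimage

data = [['Give you up', 30],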
        ['Let you down', 20],
        ['Run around \nand desert you', 20],
        ['Make you cry', 15],
        ['Say goodbye', 10],
        ['Tell a lie \nand \nhurt you', 5]]

headers = ['thing', 'percentage']

df = pd.DataFrame(data, columns=headers)

for f in glob.glob('fig_*.png'):  # remove any images from previous runs
    os.remove(f)
pngs = []

for i in np.arange(0, 10):
    fig, ax = plt.subplots()
    img = plt.imread('rr2.gif')
    img_h, img_w, _ = img.shape  # array shape is (rows, columns, channels), i.e. height first
    df.plot.pie(y='percentage', labels=df.thing, figsize=(9, 9), legend=False,
                title='Things Rick Astley would never do', ax=ax)
    ax.set_ylabel('')
    fig_w, fig_h = fig.get_size_inches()*fig.dpi
    rr = ndimage.rotate(img, i*36)
    _ = ax.figure.figimage(rr, fig_w/2 - img_w/2, fig_h/2 - img_h/2, zorder=1)
    fig.savefig('fig_{0}.png'.format(i))
    pngs.append('fig_{0}.png'.format(i))
    plt.close(fig)  # release the figure so ten of them don't pile up in memory

images = []
for png in pngs:
    img = imread(png)
    images.append(img)
mimsave('rr_final.gif', images)

The full code is available on my GitHub page.

As for fixing the problems I identified earlier, I’m sure that can be done with some smarter calculations in the figimage line and the labels should be able to be adjusted through appropriate calls to the ax object, but I’ll leave that to someone with more time on his hands…or not, as this is probably a dumb thing to spend your time on.
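
For what it's worth, here is a rough sketch of what that "smarter calculation" might look like. It rests on an assumption I haven't verified against the GIF: that part of the wobble comes from ndimage.rotate growing the array as it turns (its default reshape=True enlarges the output to hold the whole rotated image), while the offsets are computed from the unrotated size. The centered_offsets helper is hypothetical; the idea is simply to recompute the offsets from the rotated array's own shape on every iteration.

```python
def centered_offsets(fig_w, fig_h, img_shape):
    """Return (xo, yo) pixel offsets that center an image array on a figure.

    img_shape is a NumPy-style shape tuple: (rows, columns[, channels]),
    so the height comes first and the width second.
    """
    img_h, img_w = img_shape[:2]
    return (fig_w - img_w) / 2, (fig_h - img_h) / 2


# A 900x900 pixel figure and a rotated image that is 200 rows by 100 columns
xo, yo = centered_offsets(900, 900, (200, 100, 4))
print(xo, yo)
```

Inside the loop, that would mean calling centered_offsets(fig_w, fig_h, rr.shape) after the rotation and handing xo and yo to figimage. Passing reshape=False to ndimage.rotate is the other route: the array keeps its size, at the cost of clipping Rick's corners.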

Anyway, there you go: rick rolling the matplotlib way!

Roots Magic and Jupyter Notebook: like peas and carrots


Roots Magic seems to be a popular tool among genealogists. At least, the Genealogy Guys certainly recommend it.

I’ve been using the tool for about a year now. My previous genealogy database tool seemed to lock away my data in its own proprietary database, confining me to the queries and views exposed in its user interface. Before I switched away, though, I wanted to make sure my next tool would give me a little more latitude with regard to accessing my data. At a genealogy conference in 2017, I actually had a short conversation with the head honcho himself, Bruce Buzbee, and voiced this concern. Bruce briefly mentioned Roots Magic’s connection with sqlite. Interest piqued, I bought his software and made the switch.

It seems like most of the work exploring the sqlite foundation of Roots Magic has been captured at the site SQLiteToolsForRootsMagic. [Side note: Wikispaces, the platform on which SQLiteToolsForRootsMagic is built, is going away, so by the time you read this, the SQLiteTools site might be no more. Fortunately, the site owners are developing migration plans, so stay tuned.] These folks tend to use clients like SQLiteSpy to run their queries. Nothing wrong with that, but since my favorite development canvas lately is Jupyter Notebook, I asked myself, “Self, could you query a Roots Magic database in Jupyter Notebook?” The answer is: absolutely!  Here are some of the steps I took to query a Roots Magic database in Jupyter Notebook.

Step 1: Import the necessary packages

Load all my go-to packages including pandas and matplotlib as well as sqlite3 and Pivotal’s SQL Magic (SQL Magic isn’t necessary, but it makes writing SQL queries a little nicer):


import sqlite3
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.style as style

%load_ext sql_magic
%matplotlib inline
plt.style.use('fivethirtyeight')

Step 2: Connect to my Roots Magic file

For demonstration purposes, I grabbed a copy of George Washington’s family tree and saved it off in Roots Magic.  Connecting to the file is pretty darn easy:


# Note that sqlite seems to require a full path to the file you wish to load
conn = sqlite3.connect("C:\\data_files\\qsync_laptop\\jupyter_notebooks\\query_gen_dbs\\GeorgeWashingtonFamilyBig.rmgc")
%config SQL.conn_name='conn'  # useful when using sql_magic
cur = conn.cursor()
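
Since this is the only copy of the family tree you probably care about, one optional precaution is to open the file read-only so a stray UPDATE can't touch it. This is a sketch of that idea using sqlite3's URI support, not something my notebook does; connect_read_only is a hypothetical helper name.

```python
import sqlite3
from pathlib import Path


def connect_read_only(db_path):
    """Open an existing SQLite file in read-only mode via a file: URI."""
    # as_uri() needs an absolute path, hence resolve(); mode=ro is SQLite's read-only flag
    uri = Path(db_path).resolve().as_uri() + '?mode=ro'
    return sqlite3.connect(uri, uri=True)
```

Any attempt to write through a connection opened this way raises sqlite3.OperationalError, while SELECT queries behave as usual.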

Step 3: Go to town!

At this point, the only real challenge is dealing with the COLLATE NOCASE issue, but that’s just a minor inconvenience.  After that, it’s just spending time understanding the Roots Magic database schema and the relationships between tables.  SQLiteToolsForRootsMagic has really blazed the trail here, so I encourage you to spend some time on the site looking over the hundreds of posted queries to get a better understanding of the database schema.
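
My understanding, which you should treat as an assumption rather than something this post establishes, is that the COLLATE NOCASE issue comes from Roots Magic declaring its text columns with a custom collation named RMNOCASE that stock SQLite doesn't know about. Besides overriding the collation in each query, another option is to register a stand-in collation on the connection. The sketch below does that with a simple case-insensitive comparison and demonstrates it on a made-up, in-memory table; it does not claim to match Roots Magic's own sort order.

```python
import sqlite3


def rmnocase(a, b):
    """Case-insensitive stand-in comparison: returns -1, 0, or 1 as SQLite expects."""
    a, b = a.casefold(), b.casefold()
    return (a > b) - (a < b)


conn = sqlite3.connect(':memory:')            # toy database, not a real Roots Magic file
conn.create_collation('RMNOCASE', rmnocase)   # register before touching RMNOCASE columns
conn.execute('CREATE TABLE NameTable (Surname TEXT COLLATE RMNOCASE)')
conn.executemany('INSERT INTO NameTable VALUES (?)',
                 [('washington',), ('BALL',), ('Custis',)])

surnames = [row[0] for row in conn.execute('SELECT Surname FROM NameTable ORDER BY Surname')]
print(surnames)
```

With a real file, the create_collation call would go right after sqlite3.connect and before the first query.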

With direct access to the database, you can print out facts about your data that may not be exposed in the Roots Magic user interface:


cur.execute("SELECT OwnerID FROM NameTable")
nbr_of_people = len(cur.fetchall())
cur.execute("SELECT FamilyID FROM FamilyTable")
nbr_of_families = len(cur.fetchall())
cur.execute("SELECT FamilyID FROM EventTable")
nbr_of_events = len(cur.fetchall())

print('This database contains {0} individuals.'.format(nbr_of_people))
print('It includes {0} families.'.format(nbr_of_families))
print('It includes {0} events.'.format(nbr_of_events))

Output:

This database contains 529 individuals.
It includes 114 families.
It includes 2679 events.
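
A small variation on the counting code, offered as a sketch rather than a correction: instead of fetching every row and measuring the list, you can let SQLite do the counting with COUNT(*). The count_rows helper and its allow-list are my own invention here. Table names can't be passed as bound parameters, so the helper checks them against a fixed set before building the SQL string.

```python
import sqlite3

ALLOWED_TABLES = {'NameTable', 'FamilyTable', 'EventTable'}


def count_rows(conn, table):
    """Return the number of rows in one of the known tables."""
    if table not in ALLOWED_TABLES:
        raise ValueError('unexpected table name: {0}'.format(table))
    return conn.execute('SELECT COUNT(*) FROM {0}'.format(table)).fetchone()[0]
```

Usage would look like nbr_of_people = count_rows(conn, 'NameTable'), and likewise for the other two tables.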

Or this:


%%read_sql df_ages -d
SELECT OwnerID
    ,Surname COLLATE NOCASE AS Surname
    ,Given COLLATE NOCASE AS Given
    ,BirthYear AS BirthYear
    ,DeathYear AS DeathYear
    ,(DeathYear - BirthYear) AS age
FROM NameTable n
WHERE COALESCE(BirthYear, 0) > 0 AND COALESCE(DeathYear, 0) > 0
    AND (age BETWEEN 0 AND 110) --remove anyone over 110 years of age or under 0 as that's a likely error

oldest = df_ages.sort_values('age', ascending=False).head(1)
youngest = df_ages.sort_values('age').head(1)
print('This family tree contains {0} individuals with recorded birth and death dates.'.format(df_ages.shape[0]))
print('The oldest person in the tree is {0} {1} at age {2}.'.format(oldest.Given.values[0], oldest.Surname.values[0],
                                                                   oldest.age.values[0]))
print('The youngest person in the tree is {0} {1} at age {2}.'.format(youngest.Given.values[0], youngest.Surname.values[0],
                                                                     youngest.age.values[0]))
print('The average age for family members in this tree is {0:.1f} years.'.format(df_ages.age.mean()))
print('The median age for family members in this tree is {0:.1f} years.'.format(df_ages.age.median()))

Output:

This family tree contains 192 individuals with recorded birth and death dates.
The oldest person in the tree is John WASHINGTON at age 99.
The youngest person in the tree is Mildred WASHINGTON at age 1.
The average age for family members in this tree is 52.9 years.
The median age for family members in this tree is 52.0 years.
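
If you don't have sql_magic installed, the same filter can be run through a plain sqlite3 connection and summarized with the standard library. This is only a sketch of that alternative: age_summary is a hypothetical helper, and the query repeats the age expression in the WHERE clause instead of relying on the alias. The column names are the ones used in the query above.

```python
import sqlite3
import statistics

AGE_SQL = """
SELECT Given COLLATE NOCASE AS Given
    ,Surname COLLATE NOCASE AS Surname
    ,(DeathYear - BirthYear) AS age
FROM NameTable
WHERE COALESCE(BirthYear, 0) > 0 AND COALESCE(DeathYear, 0) > 0
    AND (DeathYear - BirthYear) BETWEEN 0 AND 110
ORDER BY age
"""


def age_summary(conn):
    """Summarize ages for people with usable birth and death years, or None if nobody qualifies."""
    rows = conn.execute(AGE_SQL).fetchall()
    if not rows:
        return None
    ages = [row[2] for row in rows]
    return {
        'count': len(ages),
        'youngest': rows[0],   # rows are sorted by age, so the first is the youngest
        'oldest': rows[-1],
        'mean': statistics.mean(ages),
        'median': statistics.median(ages),
    }
```

Calling age_summary(conn) on the connection from Step 2 should give the same kind of numbers as the pandas version, just as a dictionary instead of a DataFrame.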

How about some charts?

You can find this work and more on my GitHub page.


© 2024 DadOverflow.com
