With the recent changes to Firefox, several of my favorite plugins no longer work. That’s really frustrating. One of those plugins is DownloadThemAll.

Suppose you navigate to a webpage that contains links to several files you wish to download. You could: 1) right-click each link, one by one, select “Save As” from the context menu, and download each file or 2) open up the DownloadThemAll plugin window, check the files you want to download from a list DownloadThemAll auto-magically populates, and click the “download” button to download all those files together.

Or at least you could go with Option #2 until Firefox Quantum came along and made the plugin incompatible. So, what’s one to do? Well, here’s one approach I tried recently in Python, although it could be a little gnarly for the non-technical:

Step 1: Load a few helpful Python packages

Requests, BeautifulSoup, and lxml will serve you well!


import requests
from bs4 import BeautifulSoup
import lxml

Step 2: Load up and parse the webpage

Use the requests package to grab the webpage you want to work with and then use BeautifulSoup to parse it so that it’s easier to find the links you want:


url = 'https://ia801501.us.archive.org/zipview.php?zip=/12/items/NSAsecurityPosters1950s60s/NSAsecurityPosters_1950s-60s_jp2.zip'
result = requests.get(url)
soup = BeautifulSoup(result.content, "lxml")
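One gotcha worth mentioning: requests won’t raise an error on a 404 or 500; it just hands you the error page, and BeautifulSoup will cheerfully parse it. Calling raise_for_status() right after the get catches that. Here’s a small sketch; to keep it runnable without a network connection, it builds a Response object by hand instead of hitting the archive.org URL:

```python
import requests

# raise_for_status() turns HTTP error codes into exceptions instead of
# letting you silently parse an error page. Demonstrated here on a
# hand-built Response (no network needed):
bad = requests.models.Response()
bad.status_code = 404
try:
    bad.raise_for_status()
except requests.HTTPError as e:
    print('caught before parsing:', e)
```

In the real script, you’d just add `result.raise_for_status()` between the get and the BeautifulSoup call.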

Step 3: Download those files!

Well, it’s a little more complicated than that. First, I had to take a look at the HTML source code of the webpage I wanted to work with and find the HTML elements containing the download links. In this case, the task was relatively easy: I just needed to find all the elements with an “id” attribute equal to “jpg”. With BeautifulSoup, I could easily find all those elements, loop through them, and pull out the data I needed, including the download link. With each download link, I could use requests again to pull down the content and save it to disk:


import os

os.makedirs('./data', exist_ok=True)  # make sure the destination folder exists

for jpg_cell in soup.find_all(id="jpg"):
    link = 'https:' + jpg_cell.find('a').attrs['href']
    # Part of the download URL contains the percent-encoding '%2F'.  Replace it
    # with a forward slash so I can split out the file name
    file_name = link.replace('%2F', '/').split('/')[-1]
    print(link + '  ' + file_name)  # just to visually validate I parsed the link and filename correctly
    r = requests.get(link)
    with open('./data/' + file_name, 'wb') as f:  # 'with' closes the file even if the write fails
        f.write(r.content)
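One small improvement worth noting: instead of hand-replacing ‘%2F’, the standard library’s urllib.parse.unquote will decode any percent-encoded characters in a URL. A quick sketch with a made-up link shaped like the archive.org ones:

```python
from urllib.parse import unquote

# Made-up link for illustration, shaped like the archive.org download URLs
link = ('https://ia801501.us.archive.org/BookReader/BookReaderImage.php'
        '?file=NSAsecurityPosters_1950s-60s_jp2%2Fposter_0001.jp2')
decoded = unquote(link)           # decodes '%2F' -> '/', '%20' -> ' ', etc.
file_name = decoded.split('/')[-1]
print(file_name)                  # -> poster_0001.jp2
```

That way the script keeps working even if a page encodes other characters besides the forward slash.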

Check out the complete source code on my GitHub page. Also, check out this fantastic article from DataCamp.com that explains web scraping in Python in even greater detail.

Not so fast!

Ok, so you could go through that somewhat cumbersome process for every download job you might have, but in the future, I think I’m just going to pop over to Chrome and use the Chrono Download Manager extension.