DadOverflow.com

Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Downloading files in bulk

With the recent changes in Firefox, several of my favorite plugins no longer work. That’s really frustrating. One of those plugins is DownThemAll.

Suppose you navigate to a webpage that contains links to several files you wish to download. You could: 1) right-click each link, one by one, select “Save As” from the context menu, and download each file or 2) open up the DownThemAll plugin window, check the files you want to download from a list DownThemAll auto-magically populates, and click the “download” button to download all those files together.

Or at least you could go with Option #2 until Firefox Quantum came along and made the plugin incompatible. So, what’s one to do? Well, here’s one approach I tried recently in Python, although it could be a little gnarly for the non-technical:

Step 1: Load a few helpful Python packages

Requests, BeautifulSoup, and lxml will serve you well!


import requests
from bs4 import BeautifulSoup
import lxml  # not called directly; BeautifulSoup uses it as the "lxml" parser below

Step 2: Load up and parse the webpage

Use the requests package to grab the webpage you want to work with and then use BeautifulSoup to parse it so that it’s easier to find the links you want:


url = 'https://ia801501.us.archive.org/zipview.php?zip=/12/items/NSAsecurityPosters1950s60s/NSAsecurityPosters_1950s-60s_jp2.zip'
result = requests.get(url)
soup = BeautifulSoup(result.content, "lxml")

Step 3: Download those files!

Well, it’s a little more complicated than that. First, I had to look at the HTML source of the webpage and find the elements containing the download links. In this case, the challenge was relatively easy: I just needed to find all the elements with an “id” attribute equal to “jpg”. With BeautifulSoup, I could easily find all those elements, loop through them, and pull out the data I needed, including the download link. With that download link, I could use requests again to pull down the content and save it to disk:


for jpg_cell in soup.find_all(id="jpg"):
    # the href on this page lacks the scheme, so prepend it
    link = 'https:' + jpg_cell.find('a').attrs['href']
    # Part of the download url contains the URL (percent) encoding '%2F'.  Replace it with a
    # forward slash so the file name can be split off the end of the link
    file_name = link.replace('%2F', '/').split('/')[-1]
    print(link + '  ' + file_name)  # just to visually validate I parsed the link and filename correctly
    r = requests.get(link)
    r.raise_for_status()  # stop here rather than save an error page as an image
    # note: the ./data/ directory must already exist
    with open("./data/" + file_name, 'wb') as f:
        f.write(r.content)
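
The file name step in the loop above only deals with the '%2F' escape. As a minimal sketch, assuming your links might carry other percent-escapes too (a space shows up as '%20', for example), the standard library's urllib.parse.unquote can decode them all in one go. The helper name file_name_from_link and the sample link are made up for illustration; they are not from the page I scraped:

```python
from urllib.parse import unquote


def file_name_from_link(link):
    """Return the text after the last slash of a percent-decoded link."""
    # unquote() decodes every percent-escape ('%2F' -> '/', '%20' -> ' ', ...)
    decoded = unquote(link)
    # keep only what follows the final forward slash
    return decoded.rsplit('/', 1)[-1]


# Hypothetical link, shaped roughly like the ones described above
sample_link = 'https://example.org/view_archive.php?file=posters_jp2%2Fposter%200001.jpg'
print(file_name_from_link(sample_link))
```

Like the original line, this keeps everything after the last slash, so any extra query parameters trailing the file name would come along for the ride.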

Check out the complete source code on my GitHub page. Also, check out this fantastic article from DataCamp.com that goes into even greater detail to explain web scraping in Python.

Not so fast!

Ok, so you can go through that somewhat cumbersome process for all the download jobs you might have, but in the future, I think I’m just going to pop over to Chrome and use the Chrono Download Manager extension.

Music to drive by, Part 3

In a couple of previous posts, I proposed ways to easily copy all or a portion of your music library to a thumb drive for playing in your USB-enabled automobile.  You can check out my solution at my GitHub page.

Recently, though, I encountered yet another frustration: my “copy” script iterates through a JSON file of the inventory of my music.  In the file, my music is listed in alphabetical order by the artist.  So, I copy my music to my thumb drive in alphabetical order.  Which means I’ll get Aerosmith on my thumb drive, but will likely never get ZZ Top.  Bummer!  So, I came up with a great solution: Get-Random!

All I need to do is alter one line in my “copy” script:


# apply my selection criteria and get a list of the songs to copy over to the flashdrive
$mp3s_to_write_to_drive = $mp3_col | where {$genres_i_want -contains $_.genre} | where {$bands_to_skip -notcontains $_.artist} | sort {Get-Random}

I just need to pipe the filtered results from my $mp3_col collection to “sort {Get-Random}”.  This sorts the mp3 files by a random key, so they’ll be copied to my thumb drive in a random order.  Cool!

I’ll probably not update my script in GitHub with this minor change, but just tack this little command onto the end of the line in your downloaded copy of the script and you’ll be set.


© 2025 DadOverflow.com
