Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


OCRing images in Windows

I recently visited a facility that displayed framed “wall art” of funny quotes from famous people. I found the quotes amusing, so I took pictures of all the wall hangings. The problem is, I don’t want to spend the time typing up all those quotes by hand (of course, I’ve probably spent much more time programming an alternative).  Anyway, OCR to the rescue!

Windows seems to have a variety of options for OCR, but they all seem largely GUI-driven.  I’d rather have a command line solution.  Enter Tesseract-OCR.

Tesseract is a command line tool used for parsing text from image files.  Like most cool tools of its ilk, it works best in Linux.  Am I sunk, then, as my main environment is Windows?  Nope.  I can install tesseract in my Linux sub-system and access it from Windows.  Here’s how I solved this problem:

Step 1: Use wsl.exe to run tesseract

I actually ran all my work from a Jupyter Notebook using its shell command feature.


image_file = '/mnt/c/myfilepath/nb-miscellany/IMG_20180801_150228139.jpg'
! wsl tesseract {image_file} {image_file}
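If you’d rather not lean on Jupyter’s ! shell syntax, the same call can be made from plain Python with subprocess. This is just a sketch — build_ocr_command and ocr_image are my own helper names, and it assumes tesseract is reachable either natively or through wsl:

```python
import subprocess


def build_ocr_command(image_file, use_wsl=False):
    """Build the tesseract invocation.  Tesseract takes the image path and
    an output base name (it appends '.txt' to that base itself)."""
    cmd = ["tesseract", image_file, image_file]
    return ["wsl"] + cmd if use_wsl else cmd


def ocr_image(image_file, use_wsl=False):
    """Run tesseract and return the path of the text file it produces."""
    subprocess.run(build_ocr_command(image_file, use_wsl), check=True)
    return image_file + ".txt"
```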

Step 2: Open the results

Tesseract automatically appends “.txt” to the output base name you supply it.  Since I supplied it my image filepath, it created a new file, IMG_20180801_150228139.jpg.txt, containing the text it parsed.  I can just run “cat” to see the results:


out_file = image_file + '.txt'

! wsl cat {out_file}
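The cat step could likewise stay in Python. A small helper (my own name, relying on the “.txt” naming convention above) reads the parsed text back:

```python
from pathlib import Path


def read_ocr_text(image_file):
    """Return the text tesseract produced for image_file.

    Tesseract names its output by appending '.txt' to the base name you
    pass it, so the OCR results for IMG.jpg land in IMG.jpg.txt.
    """
    return Path(image_file + ".txt").read_text(encoding="utf-8")
```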

And here are my results:

Everyone needs to believe in something. I believe I'll
have another drink.

W.C. Fields

The only reason people get lost in thought is because
it's unfamiliar territory.

Unknown

I want a man who's kind and understanding. Is that
too much to ask of a millionaire?
Zsa Zsa Gabor

By the time a man is wise enough To watch his step,
he's too old to go anywhere.
Billy Crystal

There are two Types of people in +his world, good and
bad. The good sleep better, but the bad seem To
enjoy the waking hours much more.

Woody Allen

r never forget o face, but in your case I'll be glad to
make an exception.
Groucho Marx

The secret of staying young is to live honestly, eat
slowly and lie about your age.
Lucille Ball

Not too shabby!

Run Linux apps in Windows

Windows 10 now includes the ability to run a Linux shell within it.  That alone is pretty awesome.  What’s even awesome…er…is that you can easily access that sub-system from Windows with the wsl.exe utility.  Try this out:

Step 1: Launch your Linux subsystem

On my Windows laptop, I installed an instance of Ubuntu.  From my home directory, I simply list the directory contents:


brad@brad-laptop:~$ ll
total 8
drwxr-xr-x 1 brad brad 4096 Aug 26 13:57 ./
drwxr-xr-x 1 root root 4096 Aug 25 21:08 ../
-rw-r--r-- 1 brad brad  220 Aug 25 21:08 .bash_logout
-rw-r--r-- 1 brad brad 3771 Aug 25 21:08 .bashrc
-rw-r--r-- 1 brad brad  807 Aug 25 21:08 .profile
-rw-r--r-- 1 brad brad    0 Aug 25 21:11 .sudo_as_admin_successful
-rw-rw-rw- 1 brad brad    0 Aug 26 11:04 test.txt

Step 2: Open up the Windows command shell

Now, open up a Windows command shell.  Using wsl.exe, list the contents of your home directory.  Interestingly, while my Ubuntu instance knows the “ll” alias, wsl does not.  Nevertheless, I can run the ls -l command and see the contents of my home directory.
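A likely explanation for the missing “ll”: wsl.exe hands the command to bash non-interactively, and bash only expands aliases (such as the ll alias defined in ~/.bashrc) in interactive shells. You can reproduce the behavior in any bash, no WSL required:

```shell
# Aliases defined in a non-interactive shell are not expanded, which is
# why `wsl ll` fails while `wsl ls -l` works just fine.
bash -c 'alias ll="ls -l"; ll' 2>/dev/null || echo "alias not expanded"
```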

What if you have multiple Linux sub-systems installed?

Initially, I installed multiple Linux sub-systems on my Windows machine but could find no way to get wsl to target a specific one.  There may well be an option; I just haven’t been able to find it yet.  Regardless, this addition from Microsoft opens up so many more possibilities: there are a variety of wonderful tools in Linux that either can’t be installed in Windows or can’t easily be installed.  Now you don’t have to: just install those tools in your Linux sub-system and run them there, or from Windows via wsl.exe.
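For what it’s worth, more recent builds of wsl.exe (which may postdate this post) do support targeting a specific distribution with the -d/--distribution flag; substitute whatever name --list reports for your install:

```
:: list the installed distributions
wsl --list --verbose
:: run a command in a specific distribution
wsl -d Ubuntu ls -l
```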

Downloading files in bulk

With the recent changes in Firefox, several of my favorite plugins no longer work. That’s really frustrating. One of those plugins is DownloadThemAll.

Suppose you navigate to a webpage that contains links to several files you wish to download. You could: 1) right-click each link, one by one, select “Save As” from the context menu, and download each file or 2) open up the DownloadThemAll plugin window, check the files you want to download from a list DownloadThemAll auto-magically populates, and click the “download” button to download all those files together.

Or at least you could go with Option #2 until Firefox Quantum came along and made the plugin incompatible. So, what’s one to do? Well, here’s one approach I tried recently in Python, although it could be a little gnarly for the non-technical:

Step 1: Load a few helpful Python packages

Requests, BeautifulSoup, and lxml will serve you well!


import requests
from bs4 import BeautifulSoup
import lxml

Step 2: Load up and parse the webpage

Use the requests package to grab the webpage you want to work with and then use BeautifulSoup to parse it so that it’s easier to find the links you want:


url = 'https://ia801501.us.archive.org/zipview.php?zip=/12/items/NSAsecurityPosters1950s60s/NSAsecurityPosters_1950s-60s_jp2.zip'
result = requests.get(url)
soup = BeautifulSoup(result.content, "lxml")

Step 3: Download those files!

Well, it’s a little more complicated than that. First, I had to take a look at the HTML source code of the webpage I wanted to work with and then find the HTML elements containing the download links. In this case, the challenge was relatively easy: I just needed to find all the elements with an “id” attribute equal to “jpg”. With BeautifulSoup, I could easily find all those elements, loop through them, and pull out the data I needed, including the download link. With that download link, I can use requests again to pull down the content and easily save it to disk:


for jpg_cell in soup.find_all(id="jpg"):
    link = 'https:' + jpg_cell.find('a').attrs['href']
    # part of the download url contains the URL-encoded '%2F'; replace it with
    # a forward slash to get a valid link I can use to download
    file_name = link.replace('%2F', '/').split('/')[-1]
    print(link + '  ' + file_name)  # visually validate the parsed link and filename
    r = requests.get(link)
    with open("./data/" + file_name, 'wb') as f:
        f.write(r.content)

Check out the complete source code on my GitHub page. Also, check out this fantastic article from DataCamp.com that explains web scraping in Python in even greater detail.

Not so fast!

Ok, so you can go through that somewhat cumbersome process for all the download jobs you might have, but in the future, I think I’m just going to pop over to Chrome and use the Chrono Download Manager extension.


© 2024 DadOverflow.com
