DadOverflow.com

Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Parsing PDFs in Python

(Apologies for the awful alliteration…boo-ya!)

A few weeks ago, my child competed in the 2018 Queen City Classic Chess Tournament. Close to 700 students–from kindergarten to high school age–participated. Despite a long day (nearly twelve hours of travel and tournament play), my boy and his team did well and maintained good spirits.

Since the organizers are courteous enough to publish the match results, I thought this might be a good opportunity (read: nerd moment) to download the data and analyze it a little to see what sort of patterns I might find.

Well, this turned out to be easier said than done. The organizers published the results in 28 PDF files. In my experience, PDF files are no friend to the data analyst. So, my first challenge became downloading the PDF files and transforming them into objects useful for data analysis.

In this post, I will highlight a few of the steps I implemented to acquire the data and transform it into something useful. In a future post, I will discuss some of the analysis I performed on the data. The complete data acquisition code I wrote is available as a Jupyter Notebook on my GitHub page.

Step 1: Getting the data locations

All the PDF files are exposed as download links on the results page, so I decided the simplest approach would be to just scrape the page and use BeautifulSoup to parse out the download links I needed:


# note: the "lxml" parser used below requires the lxml package to be installed
import requests
from bs4 import BeautifulSoup

# get the download page
result = requests.get("https://ccpf.proscan.com/programs/queen-city-classic-chess-tournament/final-results/")
soup = BeautifulSoup(result.content, "lxml")

# build a list of the download links
pdf_list = []
results_section = soup.find("section", class_="entry-content cf")
pdf_links = results_section.find_all("a")
for pdf_link in pdf_links:
    title = pdf_link.text
    link = pdf_link.attrs['href']
    pdf_list.append([title, link])
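
One caveat with the scrape above: find_all("a") returns every anchor in the section, so any non-PDF or relative links in that section would end up in pdf_list too. Whether the results page actually contains such links isn't verified here, so treat this as a defensive sketch rather than something the notebook does. The helper name keep_pdf_links and the example.com URLs are made up for illustration; only the standard library is used.

```python
from urllib.parse import urljoin, urlparse


def keep_pdf_links(pairs, base_url):
    """Return [title, absolute_url] pairs whose URL path ends in .pdf."""
    kept = []
    for title, link in pairs:
        # urljoin resolves relative hrefs and leaves absolute ones alone
        absolute = urljoin(base_url, link)
        if urlparse(absolute).path.lower().endswith(".pdf"):
            kept.append([title.strip(), absolute])
    return kept


# Made-up sample in the same [title, link] shape as pdf_list
sample = [
    ["K-1 Team", "/files/k1team.pdf"],
    ["Contact us", "https://example.com/contact/"],
    [" Non-rated K-3 ", "https://example.com/files/nk3.PDF"],
]
cleaned = keep_pdf_links(sample, "https://example.com/results/")
```

Running pdf_list through a filter like this before downloading would keep stray links out of the later steps.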

Step 2: Download the PDFs

With a list of the download links, I then just iterated through the list to download the files:


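for p in pdf_list:
    # save pdf to disk, naming the file after the last segment of its URL
    r = requests.get(p[1])
    pdf_file = p[1].split('/')[-1]
    # the with-block closes the file handle as soon as the write finishes
    with open("./data/" + pdf_file, 'wb') as f:
        f.write(r.content)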
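
The expression p[1].split('/')[-1] shows up again in Steps 3 and 4 to rebuild the local file name. As a sketch of one way to centralize that, here is a small helper that derives both the PDF path and the matching text path from a URL. The name local_paths and the sample URL are hypothetical, and the handling of query strings and percent-encoding goes beyond what the original code does.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def local_paths(url, data_dir="./data"):
    """Return (pdf_path, txt_path) for a download URL."""
    # urlparse(...).path excludes any ?query or #fragment on the URL
    name = unquote(PurePosixPath(urlparse(url).path).name)
    pdf_path = Path(data_dir) / name
    # with_suffix only swaps the final extension, unlike str.replace
    return pdf_path, pdf_path.with_suffix(".txt")


# Made-up URL for illustration
pdf_path, txt_path = local_paths("https://example.com/files/k1%20team.pdf?ver=2")
```

The design choice is simply to compute the name in one place, so the download, conversion, and parsing steps cannot drift apart.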

Step 3: Read the PDFs

Here’s where I hit my challenge: my go-to solution for PDFs, PyPDF2, just didn’t work on these files. So, I had to find another option. My searching revealed the suite of utilities Xpdf tools. Even then, it took some playing with these tools to find a path forward. In the end, I was able to use the tool pdftotext to at least extract the results from the PDF files to simple text files that would be tremendously easier to work with.

In Jupyter Notebook, I looped through my PDF list and used the ! shell escape to run pdftotext:


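# pdftotext and pl_pdf_list are not defined in this post; they are presumably set up in the complete notebook
for p in (pdf_list + pl_pdf_list):
    pdf_path = './data/' + p[1].split('/')[-1]
    txt_path = pdf_path.replace('.pdf', '.txt')
    # {name} interpolates a Python variable into the shell command
    ! {pdftotext} -table {pdf_path} {txt_path}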
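
The ! syntax only works inside IPython or Jupyter. For a plain Python script, a rough equivalent is subprocess.run. This is a sketch, not the notebook's code: the function names are made up, and the default exe="pdftotext" assumes the Xpdf binary is on your PATH.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, exe="pdftotext"):
    """Build the argument list for the same call the notebook makes with `!`."""
    return [exe, "-table", str(pdf_path), str(txt_path)]


def convert(pdf_path, txt_path, exe="pdftotext"):
    """Run pdftotext; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path, exe), check=True)


# Made-up paths for illustration; nothing is executed here
cmd = build_pdftotext_cmd("./data/k1 team.pdf", "./data/k1 team.txt")
```

Passing the arguments as a list means a file name containing spaces reaches pdftotext as a single argument, with no shell quoting to worry about.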

Step 4: Parse the resulting text

With the results safely in easy-to-read text files, the hard part was over, right? Well, unfortunately, no delimiters came with the data extract, so I had to get creative again. Enter regular expressions (regex).

Even without delimiters, the extracted data seemed to follow some basic patterns, so I devised three different regular expressions–one for each file type as the results in the PDFs followed one of three schemas–to match the data elements. Initially, I tried numpy’s fromregex hoping for a quick, one-liner win. The function worked well for the most part, but it still stumbled on a few lines inexplicably. So, I just resorted to conventional Python regex:


import re

re_ind_nr = r'\s+([0-9]+)\s+(.+?(?=\s{2}))\s{2,}(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+)'
re_ind_r = r'\s+([0-9]+)\s+(.+?(?=\s{2}))\s{2,}(.+?(?=\s{2}))\s+([0-9 ]+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+)'
re_team = r'([0-9]+)\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+)'
# the regex still lets some garbage rows come through, like headers and footers.  use this list to weed the garbage out
elems_of_rows_to_remove = ['Score', 'Rnd1', 'Code', 'TBrk1']
ind_nr_list = []
ind_r_list = []
team_list = []

# iterate through the list of result files I downloaded.  The PDFs fall into one of three categories: team results,
# ranked player results, or non-ranked player results.  The file names follow a loose convention: if "team" or "tm"
# is in the file name, that file is a list of team results.  If a file name starts with "n", that file represents
# results of non-ranked players.  All the rest are results of ranked players.
for p in pdf_list:
    title = p[0]
    txt_file = './data/{0}'.format(p[1].split('/')[-1].replace('.pdf', '.txt'))
    with open(txt_file, 'r') as f:
        t = f.read()
        if 'team' in title.lower() or 'tm' in title.lower():
            l = re.findall(re_team, t)
            l = [[title] + list(r) for r in l if not any(i in r for i in elems_of_rows_to_remove)]
            team_list.extend(l)
        elif title.lower().startswith('n'):
            l = re.findall(re_ind_nr, t)
            l = [[title] + list(r) for r in l if not any(i in r for i in elems_of_rows_to_remove)]
            ind_nr_list.extend(l)
        else:
            l = re.findall(re_ind_r, t)
            l = [[title] + list(r) for r in l if not any(i in r for i in elems_of_rows_to_remove)]
            ind_r_list.extend(l)
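
The long patterns above are mostly one building block repeated: (.+?(?=\s{2})), a lazy capture that stops as soon as the next two characters are whitespace. That is what lets a field contain single spaces (as in a name) while still ending at the gap between columns. Here is a shortened sketch of the idea on a made-up row; the sample line and the four-column pattern are illustrative only and are not taken from the tournament files.

```python
import re

# Made-up row for illustration; not taken from the tournament files
sample = "   12  Jane Q. Doe      Example Elementary   3.5"

# The building block used throughout the patterns above: a lazy capture
# that stops as soon as the next two characters are whitespace
field = r'(.+?(?=\s{2}))'
pattern = r'\s+([0-9]+)\s+' + field + r'\s{2,}' + field + r'\s+(.+)'

match = re.match(pattern, sample)
columns = list(match.groups())

# Alternative (not what the post does): split on runs of two or more spaces
split_columns = re.split(r'\s{2,}', sample.strip())
```

Both approaches produce the same four columns for this row. The split version is shorter, but it shifts the columns if a cell is empty or if two columns are separated by a single space, which may be why spelling out each field in a full pattern is worth the extra length.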

Step 5: Call it a day

Finally, I had the data in three different lists I could work with, but I’ll save that part for another day. Again, my complete code is on my GitHub page. Hopefully soon, I’ll find some extra time to do the analysis I originally intended.

Tips for improving your vocabulary

My oldest child has engaged in the college quest: meditating on what profession she might want to pursue then reverse-engineering that to an associated major and ideal college to support that vision, visiting schools, and, most importantly, studying for and taking the standardized tests–ACT and SAT.

On more than one occasion, she’s complained about the English and/or writing portions of the tests, bemoaning the fact that these sections use more advanced vocabulary than she’s used to. For many years, I’ve tried to impress on her the importance of expanding her vocabulary; yet, she continues to ignore my appeals (as seems to be our standard father/daughter dynamic). If she would ever listen to me, here are ten practical tips I would encourage her to employ to increase her command of the English language.

1. Go looking for great words on the Internet

As you’d expect, the Internet is a great resource for improving your vocabulary. There are word-of-the-day sites that you might visit daily for new material, but there are also plenty of “themed” lists to work your way through, as well. Here are a few that I’ve found educational:

2. Install a word-of-the-day application on your phone

Why go to the words when they can come to you? There are a number of free word-of-the-day mobile applications out there. Currently, I’m using Dictionary.com’s app. One nice feature of this app is its notifications: at 8:00 a.m. every day, the app sends me a notification with the new word. If I like the new word (or any other word I might look up in the app), I can add it to my “favorites,” so I always have a list of my favorite words handy.

3. Get a word-of-the-day calendar

There are a variety of calendar and planner-type products out there aiming to help grow your vocabulary! If you prefer a more traditional interface to learn from, this just might be your ticket.

4. Get a dictionary and/or thesaurus

Maybe this is my pre-Internet brain talking, but a dictionary and thesaurus should definitely be part of your library. Probably your kids’, too!

5. Read challenging books

Words only work when they’re uttered in proper context–and reveal your ignorance when used otherwise. What better way to learn a new word than through the pen of the professionals? Read the likes of Umberto Eco, Gore Vidal, and David Stockman, among others, to deepen your communication options.

6. Listen to challenging podcasts

There are podcasts, like the Grammar Girl podcast, dedicated to improving your communication skills. Beyond those, merely listening to intelligent people discussing challenging topics can be quite beneficial. For example, just the other day, Tom Woods reintroduced me to the wonderful word “vicissitudes.” Podcasts, then, can be an excellent way to both learn more about a particular topic and extend your vocabulary.

7. Listen to word-of-the-day apps on your Amazon Echo or Google Home devices

I’ve added the Peppercorn Media word-of-the-day skill to my Amazon daily briefing. Every school morning, just before venturing out to the bus stop with one of my children, we listen to the daily briefing and acquire a new word of the day. Thus, we get our word of the day in a quick and entertaining way.

8. Watch challenging movies or television

Personally, I find movies and television predominantly a waste of time, but if you must imbibe, try to make it media that positively augments your intellect. I find science fiction and historical works occasionally useful for this purpose. Star Trek, The Martian, and Amistad are a few creations that seem to work in this regard. However, I did learn the word “flibbertigibbet” from the highly underrated Joe Versus the Volcano.

9. Force yourself to use your new words in conversation

Stephen Covey wrote, “to learn and not to do is really not to learn. To know and not to do is really not to know.” It’s not enough to learn new and interesting words; you have to actually incorporate them into your regular dialog. Similar to martial arts, where you repeat a punch or kick hundreds of times until it becomes part of your muscle memory, you must also invoke your new words multiple times so they become easy go-to options in your conversations and writings.

10. Write more, forcing yourself to use your new words

As with enhancing your conversations, littering your writing with your new words can help ingrain those new options into your writing toolbox. Also, look for additional writing opportunities like the school newspaper and yearbook (and even blogging) to help further hone your craft.

Learning on the go: podcast edition

I have a lengthy commute: sometimes an hour or more each way. Years ago, I would listen to the morning drive time radio. Then, I discovered podcasts and realized that I could make my commutes productive by actually learning something while I navigate my metal coffin to my cube dwelling for the day. Here are ten podcasts I’ve benefited from over the years:

1. .NET Rocks

Carl and Richard talk all things .NET and more (that is, various software development topics for those of less nerdy persuasion). The two also dive into more sciency topics with their periodic “geek out” sessions. .NET Rocks has to be one of the longest running podcasts around, having started in 2002, and they show no signs of quitting any time soon.

2. Contra Krugman

Economist Paul Krugman seems to have the ear of lots of media outlets. Unfortunately, he tends to play fast and loose with the “facts” he presents in these venues. While the media lets him get away with his embellishments, Tom Woods and Bob Murphy don’t: in every episode, they point out his mistakes and–dare I say?–potential lies and have a lot of fun in the process.

3. The Tom Woods Show

Not content with his weekly Contra Krugman podcast, Tom Woods also hosts The Tom Woods Show: easily digestible, daily podcast episodes covering a wide variety of topics from economics, to current events, to history, and much more. I highly recommend this one!

4. Hanselminutes

Technologist Scott Hanselman hosts a periodic conversation with other prominent technologists. He covers lots of software development topics but occasionally ventures into broader themes such as how to attract more women to STEM careers, technology in non-profits, tracking your own life and health metrics, etc.

5. Part of the Problem

Comedian Dave Smith discusses current events from a more libertarian perspective…and drops a joke or two!

6. The Sword and Laser

I love science fiction and fantasy books! In The Sword and Laser, Tom Merritt and Veronica Belmont discuss a wide variety of science fiction and fantasy books. They’ll often introduce me to authors and books I’ve never heard of, which can be frustrating since hurtling down the highway is no time to be writing down cool book recommendations!

7. Talk Python to Me

I’ve been teaching myself to code in Python for the last several years now, so I’m always eager to find resources to help me speed that process along. Enter Talk Python to Me. Here, Michael Kennedy interviews a variety of Python aficionados and discusses the many cool projects they’re working on. I particularly enjoy when he asks his guests to identify a couple of their favorite packages–I’ve found quite a few of their recommendations helpful to me in my work and personal projects.

8. The Genealogy Guys

I’ve listened to the Genealogy Guys for years now and even had the pleasure of attending a session taught by Drew Smith himself at the Ohio Genealogical Conference in 2016. In The Genealogy Guys, George and Drew discuss a wide variety of topics to help amateur and professional alike with their family history challenges.

9. The James Altucher Show

James Altucher marches to the beat of a different drummer. In this podcast, James interviews lots of popular and influential people from his unique perspective, trying to identify the patterns and practices that make them successful.

10. The Survival Podcast

Don’t let the name fool you: no one’s wearing a tinfoil hat here. Jack Spirko is passionate about helping people identify their single points of failure and helping them build backups and redundancies in these areas. At my work–and I’m sure nearly everyone else’s–there’s such a huge emphasis on disaster recovery planning. Every new software or system we put in place has to have a detailed plan on what to do if the system suddenly fails. We even have quarterly exercises where we pretend the systems have failed and walk through our recovery plans, step by step, to make sure they actually work. My thought is, if businesses place such importance on disaster planning and recovery, how much more important is it that we do the same things for our own families? If disaster strikes, to heck with work: I want to make sure my family makes it through unscathed. This is what The Survival Podcast is all about.


© 2025 DadOverflow.com
