Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Parsing PDFs in Python

(Apologies for the awful alliteration…boo-ya!)

A few weeks ago, my child competed in the 2018 Queen City Classic Chess Tournament. Close to 700 students–from Kindergarten to High School age–participated. Despite a long day (nearly twelve hours of travel and tournament play), my boy and his team did well and maintained good spirits.

Since the organizers are courteous enough to publish the match results, I thought this might be a good opportunity (read, nerd moment) to download the data and analyze it a little to see what sort of patterns I might find.

Well, this turned out to be easier said than done. The organizers published the results in 28 PDF files. In my experience, PDF files are no friend to the data analyst. So, my first challenge became downloading the PDF files and transforming them into objects useful for data analysis.

In this post, I will highlight a few of the steps I implemented to acquire the data and transform it into something useful. In a future post, I will discuss some of the analysis I performed on the data. The complete data acquisition code I wrote is available as a Jupyter Notebook on my GitHub page.

Step 1: Getting the data locations

All the PDF files are exposed as download links on the results page, so I decided the simplest approach would be to just scrape the page and use BeautifulSoup to parse out the download links I needed:


import requests
from bs4 import BeautifulSoup

# get the download page
result = requests.get("https://ccpf.proscan.com/programs/queen-city-classic-chess-tournament/final-results/")
soup = BeautifulSoup(result.content, "lxml")

# build a list of the download links
pdf_list = []
results_section = soup.find("section", class_="entry-content cf")
pdf_links = results_section.find_all("a")
for pdf_link in pdf_links:
    title = pdf_link.text
    link = pdf_link.attrs['href']
    pdf_list.append([title, link])

Step 2: Download the PDFs

With a list of the download links, I then just iterated through the list to download the files:


for p in pdf_list:
    # save the pdf to disk
    r = requests.get(p[1])
    pdf_file = p[1].split('/')[-1]
    with open("./data/" + pdf_file, 'wb') as f:
        f.write(r.content)

Step 3: Read the PDFs

Here’s where I hit my challenge: my go-to solution for PDFs, PyPDF2, just didn’t work on these files. So, I had to find another option. My searching led me to Xpdf, a suite of command line PDF utilities. Even then, it took some playing with these tools to find a path forward. In the end, I was able to use the pdftotext tool to extract the results from the PDF files into simple text files that would be tremendously easier to work with.

In Jupyter Notebook, I looped through my PDF list and used the "!" shell-escape syntax to run pdftotext:


# pdftotext holds the path to the pdftotext executable
for p in (pdf_list + pl_pdf_list):
    pdf_path = './data/' + p[1].split('/')[-1]
    txt_path = pdf_path.replace('.pdf', '.txt')
    ! {pdftotext} -table {pdf_path} {txt_path}
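Outside of Jupyter, the same conversion can be scripted with the standard subprocess module instead of the "!" shell escape. Here is a minimal sketch, assuming the pdftotext executable is on your PATH (the function names are my own, not from the original notebook):

```python
import subprocess
from pathlib import Path

def pdftotext_cmd(pdf_path):
    """Build the pdftotext command line; -table preserves the column layout."""
    txt_path = str(Path(pdf_path).with_suffix('.txt'))
    return ['pdftotext', '-table', pdf_path, txt_path]

def convert_pdf(pdf_path):
    """Run pdftotext and return the path of the text file it writes."""
    cmd = pdftotext_cmd(pdf_path)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if pdftotext fails
    return cmd[-1]
```

The check=True flag makes a failed conversion raise immediately instead of silently producing an empty text file.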

Step 4: Parse the resulting text

With the results safely in easy-to-read text files, the hard part was over, right? Well, unfortunately, no: the data extract came with no delimiters, so I had to get creative again. Enter regular expressions (regex).
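The core trick: in the extracted text, runs of two or more spaces act as the de facto column delimiter, while a single space can still appear inside a value such as a player's name. A minimal sketch of the idea, using a made-up line (the data here is hypothetical, not from the actual tournament files):

```python
import re

# hypothetical line in the style of the extracted results:
# rank, player name, rating, score -- columns separated by 2+ spaces
line = ' 1  Jane Doe  850  4.0'
fields = re.split(r'\s{2,}', line.strip())
print(fields)  # ['1', 'Jane Doe', '850', '4.0']
```

The real patterns below use lookaheads like (?=\s{2}) to apply the same "two or more spaces" rule inside a single findall call.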

Even without delimiters, the extracted data seemed to follow some basic patterns, so I devised three different regular expressions–one for each file type, as the results in the PDFs followed one of three schemas–to match the data elements. Initially, I tried numpy’s fromregex, hoping for a quick, one-liner win. The function worked well for the most part, but it inexplicably stumbled on a few lines. So, I resorted to conventional Python regex:


import re

re_ind_nr = r'\s+([0-9]+)\s+(.+?(?=\s{2}))\s{2,}(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+)'
re_ind_r = r'\s+([0-9]+)\s+(.+?(?=\s{2}))\s{2,}(.+?(?=\s{2}))\s+([0-9 ]+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+)'
re_team = r'([0-9]+)\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+)'
# the regexes still let some garbage rows through, like headers and footers; use this list to weed the garbage out
elems_of_rows_to_remove = ['Score', 'Rnd1', 'Code', 'TBrk1']
ind_nr_list = []
ind_r_list = []
team_list = []

# iterate through the list of result files I downloaded.  The PDFs fall into one of three categories: team results,
# ranked player results, or non-ranked player results.  The file names follow a loose convention: if "team" or "tm"
# is in the file name, that file is a list of team results.  If a file name starts with "n", that file represents
# results of non-ranked players.  All the rest are results of ranked players.
for p in pdf_list:
    title = p[0]
    txt_file = './data/{0}'.format(p[1].split('/')[-1].replace('.pdf', '.txt'))
    with open(txt_file, 'r') as f:
        t = f.read()
        if 'team' in title.lower() or 'tm' in title.lower():
            rows = re.findall(re_team, t)
            team_list.extend([title] + list(r) for r in rows
                             if not any(i in r for i in elems_of_rows_to_remove))
        elif title.lower().startswith('n'):
            rows = re.findall(re_ind_nr, t)
            ind_nr_list.extend([title] + list(r) for r in rows
                               if not any(i in r for i in elems_of_rows_to_remove))
        else:
            rows = re.findall(re_ind_r, t)
            ind_r_list.extend([title] + list(r) for r in rows
                              if not any(i in r for i in elems_of_rows_to_remove))

Step 5: Call it a day

Finally, I had the data in three different lists I could work with, but I’ll save that part for another day. Again, my complete code is on my GitHub page. Hopefully soon, I’ll find some extra time to do the analysis I originally intended.

Learning on the go: podcast edition

I have a lengthy commute: sometimes an hour or more each way. Years ago, I would listen to the morning drive time radio. Then, I discovered podcasts and realized that I could make my commutes productive by actually learning something while I navigate my metal coffin to my cube dwelling for the day. Here are ten podcasts I’ve benefited from over the years:

1. .NET Rocks

Carl and Richard talk all things .NET and more (that is, various software development topics for those of less nerdy persuasion). The two also dive into more sciency topics with their periodic “geek out” sessions. .NET Rocks has to be one of the longest-running podcasts around, having started in 2002, and they show no signs of quitting any time soon.

2. Contra Krugman

Economist Paul Krugman seems to have the ear of lots of media outlets. Unfortunately, he tends to play fast and loose with the “facts” he presents in these venues. While the media lets him get away with his embellishments, Tom Woods and Bob Murphy don’t: in every episode, they point out his mistakes and–dare I say?–potential lies and have a lot of fun in the process.

3. The Tom Woods Show

Not content with his weekly Contra Krugman podcast, Tom Woods also hosts The Tom Woods Show: easily digestible, daily podcast episodes covering a wide variety of topics from economics, to current events, to history, and much more. I highly recommend this one!

4. Hanselminutes

Technologist Scott Hanselman hosts a periodic conversation with other prominent technologists. He covers lots of software development topics but occasionally ventures into broader themes such as how to attract more women to STEM careers, technology in non-profits, tracking your own life and health metrics, etc.

5. Part of the Problem

Comedian Dave Smith discusses current events from a more libertarian perspective…and drops a joke or two!

6. The Sword and Laser

I love science fiction and fantasy books! In the Sword and Laser, Tom Merritt and Veronica Belmont discuss a wide variety of science fiction and fantasy books. They’ll often introduce me to authors and books I’ve never heard of, which can be frustrating since barreling down the highway is no time to be writing down cool book recommendations!

7. Talk Python to Me

I’ve been teaching myself to code in Python for the last several years now, so I’m always eager to find resources to help me speed that process along. Enter Talk Python to Me. Here, Michael Kennedy interviews a variety of Python aficionados and discusses the many cool projects they’re working on. I particularly enjoy when he asks his guests to identify a couple of their favorite packages–I’ve found quite a few of their recommendations helpful to me in my work and personal projects.

8. The Genealogy Guys

I’ve listened to the Genealogy Guys for years now and even had the pleasure of attending a session taught by Drew Smith himself at the Ohio Genealogical Conference in 2016. In The Genealogy Guys, George and Drew discuss a wide variety of topics to help amateur and professional alike with their family history challenges.

9. The James Altucher Show

James Altucher marches to the beat of a different drummer. In this podcast, James interviews lots of popular and influential people from his unique perspective, trying to identify the patterns and practices that make them successful.

10. The Survival Podcast

Don’t let the name fool you: no one’s wearing a tinfoil hat here. Jack Spirko is passionate about helping people identify their single points of failure and helping them build backups and redundancies in these areas. At my work–and I’m sure nearly everyone else’s–there’s such a huge emphasis on disaster recovery planning. Every new software or system we put in place has to have a detailed plan on what to do if the system suddenly fails. We even have quarterly exercises where we pretend the systems have failed and walk through our recovery plans, step by step, to make sure they actually work. My thought is, if businesses place such importance on disaster planning and recovery, how much more important is it that we do the same things for our own families? If disaster strikes, to heck with work: I want to make sure my family makes it through unscathed. This is what The Survival Podcast is all about.

10 of My Favorite Free Tools

My dad has countless shop tools to support his mechanical wizardry in the garage. Conversely, I wield a great many software tools to aid my work and interest in technology. Several tools I use are free and others cost some dough. Here’s a list of 10 free tools I find quite useful (in no particular order):

1. Jupyter Notebook

I’ve been teaching myself Python the past 2-3 years and have found Jupyter Notebook an indispensable tool in the endeavor. Jupyter Notebook is basically an integrated development environment (IDE) in your browser. You write your code in “cells”. Console output from a cell will be written right below the code. Even cooler, you can create chart and graph visuals right within your notebook. You can also intersperse your code cells with “markdown” cells. True, developers have a reputation for despising code documentation: nevertheless, interspersing your code cells with markdown cells providing some commentary on what you’re trying to do can make for a neat effect in your notebook–it can certainly remind you of what you were trying to do (if you had to walk away from your code for a while), but it can also turn your entire notebook into a report that you can hand in to management.

2. PowerShell

I’ve talked about PowerShell a few times in the past. I’ve heard people call PowerShell the old Windows DOS shell “on steroids” (what an overused suffix!). If you run Windows 7 or higher, you have PowerShell–so that’s at least 74% of Windows users. The primary lever in PowerShell is the cmdlet, and there are a ton of them. Plus, you can write your own. What I find even more compelling, though, is that PowerShell can tap into the entire .NET framework, which gives it tremendous capabilities for a scripting language. If you have some sort of operation you want to automate on your Windows system–from backing up files to pulling down stock closing prices–you can make that happen in PowerShell.

  • Check out some of my PowerShell work here

 

3. Logparser

Logparser is probably one of the more esoteric tools on this list and, frankly, I use it much more at work than I do at home. For the developers, DBAs, and IT operations people out there: imagine if you were able to write everyday SQL, not across a database table but rather across a large, delimited file? Imagine being able to find the five error messages in a log file of millions of lines, at the command line, in seconds? Yes, grep can do this, but we’re talking Windows. And that’s not all: just about all the standard SQL operations are available to Logparser including group, distinct, count, order by, etc. Some basic charting is available, too. I’ve found Logparser to actually parse large files faster than PowerShell, so I’ll often write PowerShell scripts that call Logparser and work with the results. Super combo!

 

4. Slickrun

Slickrun is a tool that probably won’t make a lot of sense until you start forcing yourself to use it. Then, before you know it, you won’t be able to live without it.

Slickrun is like the Windows Run command but on steroids (dangit!).  You start by creating Magic Words: short words or phrases you associate to some action. For instance, I’ve created a “jupyter_notebook” magic word that launches my Jupyter Notebook platform. That way, I don’t have to click the Start Window and scroll through various program files to find the Jupyter Notebook shortcut. Instead, I hit a key combination–in my case, Alt-Q–that opens up the small Slickrun window, then I just start typing “jupyter_notebook” and hit <Enter> to launch. Slickrun even auto-completes magic words, so I tend to be able to launch Jupyter Notebook in about four keyboard strikes–Alt-Q then “ju” as Slickrun will usually auto-complete that to “jupyter_notebook” then <Enter>. I can hit those four keys in a fraction of the time it would take me to navigate the Start Window and find the program. I tie magic words to applications, websites, and even PowerShell scripts and the like to do things like back up my files. Pretty slick, eh?

 

5. Password Safe

Security experts are always telling us to never use the same userid/password combination from one site to another. That way, when your online Dungeons & Dragons account gets pwned, your bank account doesn’t. But if you’re supposed to have a new userid/password combo for every account you create, how in the world do you keep track of all those credentials?

Well, Password Safe is one way to do it. With Password Safe, you create a strongly encrypted file–your safe. You then protect it with a strong password–the only one you really have to remember. Then, within the safe, you create as many credential entries as you need. You can also create folders within your safe to better organize your credential entries. Furthermore, Password Safe can create strong passwords for you. So, for that next account you create, you can click a “generate” button and have Password Safe create a strong password for you automatically.
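“Strong,” in this context, mostly means long and uniformly random. As a rough illustration of what a password generator does (a sketch in Python; this is not Password Safe’s actual algorithm, and the function name is my own), the standard secrets module can do the job in a few lines:

```python
import secrets
import string

def strong_password(length=16):
    """Pick each character uniformly at random from letters, digits, and punctuation."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return ''.join(secrets.choice(alphabet) for _ in range(length))
```

The secrets module uses the operating system’s cryptographically secure random source, which is why it, and not the random module, is the right choice for generating credentials.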

In the end, your safe is a file, so make sure you properly back it up.  Sharing your safe across devices can be a challenge, although I believe there are some techniques available to make that easier and there are other products out there like LastPass that focus on solving that problem.

 

6. Visual Studio Code

I still use Microsoft Visual Studio for a number of projects and love JetBrains’ PyCharm, but I’m really trying to embrace Visual Studio Code more and more.

Visual Studio Code is a free IDE developed by Microsoft, of all companies. It supports a great number of programming languages including C#, Python, Java, and even PowerShell. At first, I struggled to learn my way around the tool, but I’m starting to find an increasing number of tutorials and presentations that use it, so that’s helpful. All the cool kids seem to be using VS Code, so it’s probably a good thing to add to your toolbox.

 

7. Notepad++

Notepad++ has been around for a while, but I continue to use it every day. While I also use Microsoft Notepad for simple tasks, nothing beats text editing in Notepad++ with its multi-tab interface and plugin support. I find it very helpful for formatting XML and JSON files, using its regex find/replace features, and even using XPath query operations on occasion.

 

8. Paint.net

I’ve known about Paint.net for a while, but have only started using it recently to build logos and images like the one at the top of this page. I am absolutely no graphic artist and Paint.net’s interface can be quite intimidating, but it can help you craft some pretty nifty images. I’ve scoured YouTube for as many tutorials as I can find to try to shorten my learning curve with the product–I recommend this one in particular for making logos.

 

9. Git-Bash

Git is what all the cool kids do for software source control and, since all the cool kids work frequently from the command line, Git Bash is what you need for all your source code management operations. Aside from git utilities, you also get a fair amount of Bash utilities–like a two-for-one special! In the past, I used Cygwin to get a Bash experience on my Windows machines, but, going forward, I’m going to try to perform all the Bash-based work I need to do in Git Bash, instead, and see where that gets me.

 

10. Q-dir

Pop quiz: how many Windows Explorer instances do you have open right now? How many of them tend to stay open? Q-dir stands for “Quad Directory” or “Quad Explorer”. By default, Q-dir is a single window split into four sections. Each of the four sections hosts its own Windows Explorer instance. So, off the bat, you have four File Explorers in one window. Even cooler, each “mini explorer” can have tabs that point to other directory paths. Pretty awesome! So, for the most part, you don’t need to have five instances of Windows Explorer running and then rifle through each instance to find the one you’re looking for: instead, run one instance of Q-dir and find the Windows Explorer you need there.

 

So, that’s just a small taste of many of the free tools I use to try to make better use of my time. If you really want to geek out on tools, I highly recommend Scott Hanselman’s Ultimate Tools guide.


© 2025 DadOverflow.com
