
Dealing with Multi-line Strings in Python

Often, I’ll have to define a multi-line string in Python, such as a long SQL statement or a friendly message of some sort.  It would be easy to write one long string, but that wouldn’t be very readable or maintainable–you’d have to scroll horizontally to read it–nor would it be PEP 8 compliant.  So, what is a coder to do?  Well, fortunately, Python has a number of ways to skin this cat:

Option 1: Concatenate your strings with plus signs

Here, I just concatenate my individual lines together with the good ol’ plus sign:


qry1 = ('SELECT fname ' +
        ',lname ' +
        ',job_title ' +
        'FROM people')

Option 2: Concatenate your strings without the plus signs

Well, it turns out you don’t need the plus signs at all: Python automatically concatenates adjacent string literals for you:


qry2 = ('SELECT fname '
        ',lname '
        ',job_title '
        'FROM people')

Option 3: Use backslashes for line concatenation

Here, you don’t need to surround your strings with parentheses…just use the backslash to do the concatenation work:


qry3 = 'SELECT fname ' \
       ',lname ' \
       ',job_title ' \
       'FROM people'

Option 4: (my personal favorite) Surround your multi-line string with three double-quotes

This technique makes it very easy to read the content of your multi-line string:


qry4 = """
SELECT fname
    ,lname
    ,job_title
FROM people
"""

Bonus Option (Jupyter Notebook only): Use the sql_magic extension

If you have a long SQL statement and are working in Jupyter Notebook, consider using the sql_magic extension from Pivotal and its %%read_sql cell magic:


%%read_sql df_result -c conn
SELECT fname
    ,lname
    ,job_title
FROM people
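
Note that the cell magic assumes you’ve already loaded the sql_magic extension and created a database connection to pass to the -c argument. Setup looks roughly like this (the SQLAlchemy engine and connection string below are just placeholders for whatever database you’re using):

# install with: pip install sql_magic
%load_ext sql_magic

# sql_magic needs a connection object for the -c argument; a SQLAlchemy engine is one common option
# (the connection string here is only an example)
from sqlalchemy import create_engine
conn = create_engine('postgresql://user:password@localhost:5432/mydb')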

Check out these examples in action in my Jupyter Notebook.

 

Borrowed time

I’m a big fan of Jupyter notebooks and Anaconda.  The other day, I was reading a blog post from Continuum/Anaconda founder Travis Oliphant about his decision to leave the company.  I found this quote particularly stirring:

“As a founder over 40 with modest means, I had a family of 6 children who relied on me. That family had teenage children who needed my attention and pre-school and elementary-school children that I could not simply leave only in the hands of my wife. I look back and sometimes wonder how we pulled it off. The truth probably lies in the time we borrowed: time from exercise, time from sleep, time from vacations, and time from family.”

Here, Travis sums up a theme of this blog: my family and our creature comforts–our need for food, housing, clothing, etc.–have a claim on much of my time.  And, of course, my time is bound by my mortality.  I have to creatively borrow against those claims to occasionally pursue other subjects I find interesting.  While I’m no founder of a great company like Anaconda, I’d like to think that if Travis was able to build something as great as he did despite his immense time commitments, maybe I can, too.

Parsing PDFs in Python

(Apologies for the awful alliteration…boo-ya!)

A few weeks ago, my child competed in the 2018 Queen City Classic Chess Tournament.  Close to 700 students–from kindergarten to high school age–participated.  Despite a long day (nearly twelve hours of travel and tournament play), my boy and his team did well and maintained good spirits.

Since the organizers are courteous enough to publish the match results, I thought this might be a good opportunity (read: nerd moment) to download the data and analyze it a little to see what sort of patterns I might find.

Well, this turned out to be easier said than done. The organizers published the results in 28 PDF files. In my experience, PDF files are no friend to the data analyst. So, my first challenge became downloading the PDF files and transforming them into objects useful for data analysis.

In this post, I will highlight a few of the steps I implemented to acquire the data and transform it into something useful. In a future post, I will discuss some of the analysis I performed on the data. The complete data acquisition code I wrote is available as a Jupyter Notebook on my GitHub page.

Step 1: Getting the data locations

All the PDF files are exposed as download links on the results page, so I decided the simplest approach would be to just scrape the page and use BeautifulSoup to parse out the download links I needed:


import requests
from bs4 import BeautifulSoup

# get the download page
result = requests.get("https://ccpf.proscan.com/programs/queen-city-classic-chess-tournament/final-results/")
soup = BeautifulSoup(result.content, "lxml")

# build a list of the download links
pdf_list = []
results_section = soup.find("section", class_="entry-content cf")
pdf_links = results_section.find_all("a")
for pdf_link in pdf_links:
    title = pdf_link.text
    link = pdf_link.attrs['href']
    pdf_list.append([title, link])

Step 2: Download the PDFs

With a list of the download links, I then just iterated through the list to download the files:


for p in pdf_list:
    # download each pdf and save it to the local ./data folder
    r = requests.get(p[1])
    pdf_file = p[1].split('/')[-1]
    with open("./data/" + pdf_file, 'wb') as f:
        f.write(r.content)

Step 3: Read the PDFs

Here’s where I hit my challenge: my go-to solution for PDFs, PyPDF2, just didn’t work on these files. So, I had to find another option. My searching led me to Xpdf tools, a suite of PDF utilities. Even then, it took some playing with these tools to find a path forward. In the end, I was able to use the pdftotext tool to extract the results from the PDF files into simple text files that would be tremendously easier to work with.

In Jupyter Notebook, I looped through my PDF lists and used a shell escape to run pdftotext:


# pdftotext holds the path to the Xpdf pdftotext executable; pl_pdf_list is a second
# list of PDF links built elsewhere in the notebook
for p in (pdf_list + pl_pdf_list):
    pdf_path = './data/' + p[1].split('/')[-1]
    txt_path = pdf_path.replace('.pdf', '.txt')
    # the ! shell escape runs pdftotext; -table preserves the tabular layout of each page
    ! {pdftotext} -table {pdf_path} {txt_path}
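
If you’re not working in a notebook, the same conversion can be scripted with subprocess instead of the ! shell escape. Here, pdftotext_path is just a stand-in for wherever the Xpdf binary happens to be installed:

import subprocess

# hypothetical location of the Xpdf pdftotext binary -- adjust for your install
pdftotext_path = '/usr/local/bin/pdftotext'

for p in pdf_list:
    pdf_path = './data/' + p[1].split('/')[-1]
    txt_path = pdf_path.replace('.pdf', '.txt')
    # -table asks pdftotext to preserve the tabular layout of each page
    subprocess.run([pdftotext_path, '-table', pdf_path, txt_path], check=True)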

Step 4: Parse the resulting text

With the results safely in easy-to-read text files, the hard part was over, right? Well, unfortunately, no delimiters came with the extracted data, so I had to get creative again. Enter regular expressions (regex).

Even without delimiters, the extracted data seemed to follow some basic patterns, so I devised three different regular expressions–one for each file type, since the results in the PDFs followed one of three schemas–to match the data elements. Initially, I tried numpy’s fromregex, hoping for a quick, one-liner win. The function worked well for the most part, but it still stumbled inexplicably on a few lines. So, I resorted to conventional Python regex:


import re

re_ind_nr = r'\s+([0-9]+)\s+(.+?(?=\s{2}))\s{2,}(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+)'
re_ind_r = r'\s+([0-9]+)\s+(.+?(?=\s{2}))\s{2,}(.+?(?=\s{2}))\s+([0-9 ]+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+)'
re_team = r'([0-9]+)\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+?(?=\s{2}))\s+(.+)'

# the regexes still let some garbage rows through, like headers and footers.  use this list to weed the garbage out
elems_of_rows_to_remove = ['Score', 'Rnd1', 'Code', 'TBrk1']
ind_nr_list = []
ind_r_list = []
team_list = []

# iterate through the list of result files I downloaded.  The PDFs fall into one of three categories: team results,
# ranked player results, or non-ranked player results.  The file names follow a loose convention: if "team" or "tm"
# is in the file name, that file is a list of team results.  If a file name starts with "n", that file represents
# results of non-ranked players.  All the rest are results of ranked players.
for p in pdf_list:
    title = p[0]
    txt_file = './data/{0}'.format(p[1].split('/')[-1].replace('.pdf', '.txt'))
    with open(txt_file, 'r') as f:
        t = f.read()
        if 'team' in title.lower() or 'tm' in title.lower():
            rows = re.findall(re_team, t)
            team_list.extend([title] + list(r) for r in rows if not any(i in r for i in elems_of_rows_to_remove))
        elif title.lower().startswith('n'):
            rows = re.findall(re_ind_nr, t)
            ind_nr_list.extend([title] + list(r) for r in rows if not any(i in r for i in elems_of_rows_to_remove))
        else:
            rows = re.findall(re_ind_r, t)
            ind_r_list.extend([title] + list(r) for r in rows if not any(i in r for i in elems_of_rows_to_remove))
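
The patterns look intimidating, but the core trick is the same throughout: lazily capture a field up to the next run of two or more spaces. Here’s the idea on a made-up line (the real result files have more columns; this is only to illustrate the lookahead):

import re

# fabricated example row -- not actual tournament data
sample = '  12  Smith, John        Springfield Elem   3.5'

# each (.+?(?=\s{2})) captures as little as possible until it sees 2+ spaces ahead
pattern = r'\s+([0-9]+)\s+(.+?(?=\s{2}))\s{2,}(.+?(?=\s{2}))\s+(.+)'
print(re.findall(pattern, sample))
# [('12', 'Smith, John', 'Springfield Elem', '3.5')]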

Step 5: Call it a day

Finally, I had the data in three different lists I could work with, but I’ll save that part for another day. Again, my complete code is on my GitHub page. Hopefully soon, I’ll find some extra time to do the analysis I originally intended.
