Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Parsing Word documents with Python

If you've ever had a need to programmatically examine the text in a Microsoft Word document, you know that getting the text out in the first place can be challenging. Sure, you can manually save your document to a plain text file that's much easier to process, but if you have multiple documents to examine, that can be painful.

Recently, I had such a need and found this Towards Data Science article quite helpful. But let's take the challenge a little further: suppose you have a document with multiple sections and need to pull the text from specific sections only.

Page 1 has my table of contents
Page 2 contains a variety of sections

Let’s suppose I need to pull just the text from the “sub-sections”. In my example, I have three sub-sections: Sub-Section 1, Sub-Section 2, and Sub-Section 3. In my Word document, I’ve styled these headers as “Heading 2” text. Here’s how I went about pulling out the text for each of these sections.

Step 1: Import your packages

For my needs, I only need to import zipfile and ElementTree, which is nice as I didn’t need to install any third-party packages:

import zipfile
import xml.etree.ElementTree as ET

Step 2: Parse the document XML

# a .docx file is just a zip archive; the document content lives in word/document.xml
doc = zipfile.ZipFile('./data/test.docx').read('word/document.xml')
root = ET.fromstring(doc)
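Since a .docx file is just a zip archive with the body XML in word/document.xml, you can see these two lines at work without a real Word file. The sketch below builds a minimal, made-up docx-like archive in memory and parses it the same way:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
# minimal stand-in for word/document.xml (a real file contains far more markup)
xml = (f'<w:document xmlns:w="{W}"><w:body>'
       f'<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
       f'</w:body></w:document>')

buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('word/document.xml', xml)

# same two lines as above, just pointed at the in-memory archive
doc = zipfile.ZipFile(buf).read('word/document.xml')
root = ET.fromstring(doc)
print(root.tag)  # the root tag comes back namespace-qualified
```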

Step 3: Explore the XML for the sections and text you want

You’ll spend most of your time here, trying to figure out what elements hold the contents in which you are interested. The XML of Microsoft documents follows the WordprocessingML standard, which can be quite complicated. I spent a lot of time manually reviewing my XML looking for the elements I needed. You can write out the XML like so:

ET.tostring(root)
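Note that ET.tostring returns bytes by default; pass encoding='unicode' if you want a str you can read comfortably or write to a file. A quick sketch with a trivial stand-in document:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<doc><p>hi</p></doc>')      # stand-in for the parsed Word XML
xml_bytes = ET.tostring(root)                     # bytes by default
xml_str = ET.tostring(root, encoding='unicode')   # str, handy for eyeballing or saving
print(xml_bytes)  # b'<doc><p>hi</p></doc>'
```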

Step 4: Find all the paragraphs

To solve my problem, I first decided to pull together a collection of all the paragraphs in the document so that I could later iterate across them and make decisions. To make that work a little easier, I also declared a namespace object used by Microsoft’s WordprocessingML standard:

# Microsoft's XML makes heavy use of XML namespaces; thus, we'll need to reference that in our code
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
body = root.find('w:body', ns)  # find the XML "body" tag
p_sections = body.findall('w:p', ns)  # under the body tag, find all the paragraph sections

It can be helpful to actually see the text in each of these sections. Through researching Microsoft’s XML standard, I know that document text is usually contained in “t” elements. So, if I write an XPath query to find all the “t” elements within a given section, I can join the text of all those elements together to get the full text of the paragraph. This code does that:

for p in p_sections:
    text_elems = p.findall('.//w:t', ns)
    print(''.join([t.text for t in text_elems]))
    print()
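To see that join in action without a real document, here's the same pattern on a tiny, made-up body. Note how Word often splits a single paragraph's text across multiple runs, which is exactly why the join is needed:

```python
import xml.etree.ElementTree as ET

ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

# made-up two-paragraph body; the first paragraph's text is split across two runs
xml = ('<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
       '<w:body>'
       '<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
       '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
       '</w:body></w:document>')

root = ET.fromstring(xml)
body = root.find('w:body', ns)
p_sections = body.findall('w:p', ns)

# join the runs of each paragraph back into one string
texts = [''.join(t.text for t in p.findall('.//w:t', ns)) for p in p_sections]
print(texts)  # ['Hello, world', 'Second paragraph']
```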

Step 5: Find all the “Heading 2” sections

Now, let’s iterate through each paragraph section and see if we can figure out which sections have been styled with “Heading 2”. If we can find those Heading 2 sections, we’ll then know that the subsequent text is the text we need.

Through more research into the XML standard, I found that if I search for pStyle elements that contain the value “Heading2”, these will be the sections I’m after. To make my code a little cleaner, I wrote functions both to evaluate each section for the Heading 2 style and to extract the full text of the section:

def is_heading2_section(p):
    """Returns True if the given paragraph section has been styled as a Heading2"""
    return_val = False
    heading_style_elem = p.find(".//w:pStyle[@w:val='Heading2']", ns)
    if heading_style_elem is not None:
        return_val = True
    return return_val


def get_section_text(p):
    """Returns the joined text of the text elements under the given paragraph tag"""
    # findall returns a (possibly empty) list, never None, so we can join directly;
    # "or ''" guards against text elements with no text
    text_elems = p.findall('.//w:t', ns)
    return ''.join([t.text or '' for t in text_elems])


section_labels = [get_section_text(s) if is_heading2_section(s) else '' for s in p_sections]

Now, if I print out my section_labels list, I see this:

My section_labels list
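Here's the heading test in miniature, on a made-up two-paragraph body (one paragraph styled Heading 2, one plain):

```python
import xml.etree.ElementTree as ET

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
ns = {'w': W}

# made-up body: a Heading 2 paragraph followed by an ordinary one
xml = (f'<w:document xmlns:w="{W}"><w:body>'
       '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr>'
       '<w:r><w:t>Sub-Section 1</w:t></w:r></w:p>'
       '<w:p><w:r><w:t>Some body text</w:t></w:r></w:p>'
       '</w:body></w:document>')

def is_heading2_section(p):
    """Condensed version of the function above"""
    return p.find(".//w:pStyle[@w:val='Heading2']", ns) is not None

p_sections = ET.fromstring(xml).find('w:body', ns).findall('w:p', ns)
print([is_heading2_section(p) for p in p_sections])  # [True, False]
```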

Step 6: Finally, extract the Heading 2 headers and subsequent text

Now, I can use simple list comprehension to glue together both the section headers and associated text of the three sub-sections I’m after:

section_text = [{'title': t, 'text': get_section_text(p_sections[i+1])} for i, t in enumerate(section_labels) if len(t) > 0]

And that list looks like this:

My section_text list
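The pairing logic itself has nothing to do with XML. With made-up stand-ins for the paragraph texts and the labels from Step 5, you can watch the comprehension grab the paragraph that follows each non-empty label:

```python
# made-up stand-ins: paragraph texts and the section_labels built in Step 5
paragraph_texts = ['Sub-Section 1', 'Text for one.', 'Sub-Section 2', 'Text for two.']
section_labels  = ['Sub-Section 1', '', 'Sub-Section 2', '']

# for each heading label, take the text of the paragraph right after it
section_text = [{'title': t, 'text': paragraph_texts[i + 1]}
                for i, t in enumerate(section_labels) if len(t) > 0]
print(section_text)
# [{'title': 'Sub-Section 1', 'text': 'Text for one.'},
#  {'title': 'Sub-Section 2', 'text': 'Text for two.'}]
```

One caveat: the i + 1 lookup assumes every heading has a paragraph after it; a heading that is the document's very last paragraph would raise an IndexError.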

You can download my code here.

Parsing inconsistent file data with Python

I’ve posted a few of my data-file parsing challenges in the past (here and here, for example), and here’s a slight variation on those examples.

A friend of mine was challenged with parsing data from a file where not every field was guaranteed to be present. Take this for example:

A file with not-so-consistent data

In the above file, the “Nickname” field only appears in one record. Plus, the fields in the last record are in a different order than the others. All we know for sure is that a string (/****start record) indicates when a new record begins, and that field keys and values are separated by “: “. How would I go about parsing such a file?

Step 1: Read in the file with pandas

I could pull in the file with Python’s open and read capabilities, but it’s probably even easier to do it with the pandas read_csv function (even though this is not a true CSV file):

import pandas as pd

df_raw = pd.read_csv('./data/another_odd_format.txt', names=['col1'])

Now we have a “raw” dataframe that looks like so:

Step 2: Find where your records start and end

Finding out where your records start is easy enough: find all the rows with your start-record indicator (/****start record) and add 1. Finding where your records end is easy, too: take each “start record” row from the second one onward and subtract 2, which steps back over the marker row to the last line of the previous record (read_csv drops the blank lines between records by default, so they never appear in the dataframe). Then just tack the dataframe’s last index onto the end of the list. Here’s what I did:

new_rec_str = '/****start record'
start_rows = [i+1 for i in df_raw[df_raw.col1.str.startswith(new_rec_str)].index.tolist()]
end_rows = [s-2 for s in start_rows[1:]] + [df_raw.index[-1]]
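The same index arithmetic is easy to check by hand with plain Python lists (made-up lines standing in for df_raw, with the blank lines already dropped the way read_csv drops them):

```python
new_rec_str = '/****start record'
lines = [new_rec_str, 'Fname: al', 'LName: bundy',
         new_rec_str, 'Fname: marcy', 'LName: darcy']

marker_rows = [i for i, line in enumerate(lines) if line.startswith(new_rec_str)]
start_rows = [i + 1 for i in marker_rows]                      # data begins after each marker
end_rows = [s - 2 for s in start_rows[1:]] + [len(lines) - 1]  # step back over the next marker
print(list(zip(start_rows, end_rows)))  # [(1, 2), (4, 5)]
```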

Step 3: Loop through each start/end record pair and parse away

recs = []  # list to contain my parsed records
for start_row, end_row in zip(start_rows, end_rows):
    new_rec = {}
    for i in range(start_row, end_row + 1):
        line = df_raw.loc[i, 'col1']
        # split on the first ': ' only, in case a value itself contains ': '
        key, _, value = line.partition(': ')
        new_rec[key] = value
    recs.append(new_rec)

This gives you a nice list of dictionary objects:

[{'Fname': 'al', 'LName': 'bundy', 'Address': 'chicago'},
 {'Fname': 'marcy', 'LName': 'darcy', 'Address': 'chicago'},
 {'Fname': 'theodore',
  'LName': 'cleaver',
  'Address': 'mayfield',
  'Nickname': 'the beaver'},
 {'Address': 'Hill Valley', 'LName': 'mcfly', 'Fname': 'marty'}]
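As a self-contained check (plain lists instead of the dataframe), here's the same loop; I've used str.partition, which splits on the first “: ” only, so a value that itself contains “: ” can't break the parse:

```python
lines = ['/****start record', 'Fname: al', 'LName: bundy',
         '/****start record', 'Address: Hill Valley', 'Fname: marty']
start_rows, end_rows = [1, 4], [2, 5]  # computed as in Step 2

recs = []  # list to contain the parsed records
for start_row, end_row in zip(start_rows, end_rows):
    new_rec = {}
    for i in range(start_row, end_row + 1):
        key, _, value = lines[i].partition(': ')  # split on the first ': ' only
        new_rec[key] = value
    recs.append(new_rec)

print(recs)
# [{'Fname': 'al', 'LName': 'bundy'}, {'Address': 'Hill Valley', 'Fname': 'marty'}]
```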

Step 4: Load your list into a new dataframe

df_clean = pd.DataFrame(recs)

This gives you a neat and tidy dataframe:

If you don’t like the NaN in records missing a “Nickname” value, you can always add a fillna function to the end of the dataframe line and replace the NaN values with something like an empty string (''). So, that’s how I would tackle this sort of challenge.

Positioning Subplots

I’m certainly a big fan of visualizing data. Often, I like to present multiple types of visualizations together to offer a variety of perspectives on the data. For example, I might provide both a bar chart and scatter plot together to provide deeper insight than a single visual would:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

%matplotlib inline


# data from: https://www.kaggle.com/mysarahmadbhat/imdb-top-1000-movies
df = pd.read_csv('./data/regex_imdb.csv').fillna(0)

fig, ax = plt.subplots(1, 2, figsize=(10,6))

d1 = df[df.Year==2019][['Name', 'Gross']].sort_values('Gross').tail(10)
_ = ax[0].barh(d1.Name, d1.Gross)
_ = ax[0].set_xlabel('Gross Earnings')

d2 = df[df.Year==2019][['Run_time', 'Gross', 'Genre']].copy()
d2['Genre'] = d2.Genre.apply(lambda g: g.split(',')[0])

_ = sns.scatterplot(data=d2, x='Run_time', y='Gross', hue='Genre', ax=ax[1])
_ = ax[1].set_xlabel('Runtime (minutes)')

_ = fig.suptitle('Analysis of Movies from 2019')

In this sort of work, I will target specific axes to display specific charts. Thus, in my above example, I explicitly pushed a bar chart to ax[0] and a scatter plot to ax[1].

However, on occasion, circumstances demand that I write the same type of chart to multiple subplots where I change one variable for each. For example, suppose I want to get a quick view of the top 10 movies by gross earnings from 2010 to 2019:

I could write code to target each of these axes explicitly, but that would mean a lot of code and a lot of copy/paste. Instead, I’d rather just write a loop to iterate through the years and write the appropriate bar chart to the appropriate axis.

Looping and rendering the charts come relatively easily to me. What usually trips me up in these efforts is targeting the right row and column. I often spend most of my time trying to remember how I solved this problem in the past.

Well no more! Hopefully this post will serve as a reference any time I need to do this type of work in the future. Ultimately, my solution is just three lines of code:

nbr_of_rows = 5
nbr_of_cols = 2

coords = [(r, c) for r in range(nbr_of_rows) for c in range(nbr_of_cols)]

Here, I set the number of rows and columns I want in my visual and do some list comprehension to pair those coordinates together in a list. Now, I have a nice, pre-built list of coordinates to leverage in my loop:

fig, ax = plt.subplots(nbr_of_rows, nbr_of_cols, figsize=(12,12))

for i, yr in enumerate(range(2010, 2020)):
    r,c = coords[i]  # grab the pre-built coordinates
    d = df[df.Year==yr][['Name', 'Gross']].sort_values('Gross').tail(10)
    _ = ax[r][c].barh(d.Name, d.Gross)
    _ = ax[r][c].set_title('Top 10 grossing movies in {0}'.format(yr))
    
fig.tight_layout()
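Run in isolation, the pre-built list is easy to sanity-check. Incidentally, divmod computes the same (row, column) pair straight from the flat index; this is a variation, not what the code above uses:

```python
nbr_of_rows, nbr_of_cols = 5, 2

# row-major walk of the grid, as built in the list comprehension above
coords = [(r, c) for r in range(nbr_of_rows) for c in range(nbr_of_cols)]
print(coords[:4])  # [(0, 0), (0, 1), (1, 0), (1, 1)]

# divmod gives the same mapping without the pre-built list
positions = [divmod(i, nbr_of_cols) for i, yr in enumerate(range(2010, 2020))]
print(positions == coords)  # True
print(positions[7])  # (3, 1): the 2017 chart lands in row 3, column 1
```

If you'd rather skip two-dimensional indexing entirely, the axes array returned by plt.subplots also exposes a flat iterator (ax.flat) you can zip against your years.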

Awesome: one less thing I have to think about!


© 2024 DadOverflow.com
