
How do you transpose a Question/Answer dataset?

Recently, a friend came to me with an interesting challenge. He had a dataset of questions and answers where each record contained a single question and its answer. Arguably, this dataset was already in a tidy format, but my friend wanted to transpose the data such that each unique question became a column of its own with the answers as values.

Before I could come to his aid, my friend had already found a great answer on Medium.com using the pandas pivot_table function.

Here’s what he did:

Let’s suppose you have this table of question/answer, tab-delimited data:

person	question	answer
Sir Robin	What is your name?	Sir Robin of Camelot
Sir Robin	What is your quest?	To seek the Holy Grail
Sir Robin	What is the capital of Assyria?	I don't know that
Sir Lancelot	What is your name?	Sir Lancelot of Camelot
Sir Lancelot	What is your quest?	To seek the Holy Grail
Sir Lancelot	What is your favorite colour?	Blue
Sir Galahad	What is your name?	Sir Galahad of Camelot
Sir Galahad	What is your quest?	I seek the Grail
Sir Galahad	What is your favorite colour?	"Blue, no Yellow"
King Arthur	What is your name?	"Arthur, King of the Britons"
King Arthur	What is your quest?	I seek the Holy Grail
King Arthur	What is the air speed of an unladened swallow?	What do you mean?  An African or European swallow?

Step 1: Import pandas and read in your data

import pandas as pd

df = pd.read_csv('questions.csv', sep='\t')

Step 2: pivot_table

df_pivotted = df.pivot_table(index='person', values=['answer'], 
                             columns=['question'], aggfunc=lambda x: ' '.join(str(v) for v in x))
df_pivotted.head()
pivot_table does the job nicely

The trick here is the aggfunc argument. The aggfunc parameter is normally used to sum, average, or perform some other numeric operation on your values columns. Interestingly, though, you can supply your own custom function to this parameter instead. Here, the Medium.com author's lambda iterates over the values in each group and re-joins them with spaces; since each person/question pair occurs only once, each group holds a single answer, so the function effectively returns the original answer unchanged.
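Here's a quick sketch of what that lambda is actually doing. pivot_table hands it the group of values for each person/question cell, so with one answer per pair the join is a no-op:

```python
# aggfunc receives the group of values for one (index, column) cell.
# Each person/question pair appears once, so the group holds a single
# answer, and joining it with spaces just returns that answer unchanged.
agg = lambda x: ' '.join(str(v) for v in x)

print(agg(['To seek the Holy Grail']))  # single value: returned as-is
print(agg(['Blue', 'no Yellow']))       # multiple values would get joined
```

So the custom function isn't operating letter by letter; it only matters when a cell would hold more than one value.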

That seems pretty complicated

The use of pivot_table certainly works in this example, and it's pretty sweet that you can pass your own custom function to it. However, pandas also has a simpler pivot function. Could that have worked here?

The answer is: yes. When you google pandas pivot vs pivot_table, one of the top results is this Stackoverflow.com post, which suggests that pivot_table only allows numeric columns in the values parameter while pivot will take strings. I don't think that's quite true, since the above example passed a string column to the values parameter, but it does suggest that pivot might be better disposed to working with strings than pivot_table. Let's give it a try:
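One real difference worth noting (my observation, not from the Stackoverflow post): pivot requires every index/column pair to be unique, while pivot_table aggregates duplicates. A small sketch with a made-up duplicate row:

```python
import pandas as pd

df = pd.DataFrame({
    'person':   ['Sir Robin', 'Sir Robin'],
    'question': ['What is your name?', 'What is your name?'],  # duplicate pair
    'answer':   ['Sir Robin', 'Sir Robin of Camelot'],
})

# pivot_table quietly aggregates the duplicate answers into one cell...
joined = df.pivot_table(index='person', columns='question', values='answer',
                        aggfunc=lambda x: ' | '.join(x))
print(joined.loc['Sir Robin', 'What is your name?'])

# ...while pivot refuses duplicate index/column pairs outright
try:
    df.pivot(index='person', columns='question', values='answer')
except ValueError as e:
    print('pivot raised:', e)
```

So pivot is the simpler tool when each index/column pair maps to exactly one value, as in the question/answer data here.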

df.pivot(index='person', values='answer', columns='question')
Whaddya know?! Pivot can do the job, too!

Not only can pivot do the transformation, it certainly seems less complicated. Check out my full code here.

How do you hide secrets in Jupyter Notebooks?

Often in my notebooks, I will connect to a relational database or other data store, query the system for data, and then do all sorts of amazing operations with said data. Many times, these data stores are restricted to select users and I must authenticate myself to the system–usually with an id and password. One might be inclined to code such connection strings inline in his Jupyter Notebook. However, I usually check my notebooks in to source control and/or hand them in to management as reports or documentation. Thus, any number of people might see my notebooks, potentially compromising my personal id and password were I to code the credentials inline.

So, how can I hide my secrets–my connection strings and other sensitive information–so I can still safely share the good work I do in my notebooks? The way I do it is by moving my connection strings to configuration files. Allow me to demonstrate:

Step 1: Import my packages

from sqlalchemy import create_engine
import pandas as pd
from configparser import ConfigParser

I import the usual suspects–SQLAlchemy for database management and pandas for my dataframe work–but I’m also loading in configparser. It’s this last package that will help me pull out my secret stuff to a separate file that I can protect.

Step 2: Create my configuration file

Now, I need to create that separate configuration file. In the same directory as my notebook, I’ll create a text file. I usually name my file nb.cfg–as in, notebook config. For my example, storing the connection string to my SQLite database, my configuration file looks like so:

[my_db]
conn_string: sqlite:///mwc.db

Although SQLite databases don’t have authentication requirements, you can imagine, say, a connection string to a PostgreSQL database that would contain an id and password.
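For instance, a hypothetical PostgreSQL entry might look like this (the host, id, and password here are made up for illustration):

```ini
[my_db]
conn_string: postgresql://my_id:my_password@dbserver01:5432/mydb
```

That id and password are exactly the kind of secret you want kept out of the notebook itself.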

Step 3: Load the configuration file

Back in your notebook, load your configuration file:

parser = ConfigParser()
_ = parser.read('nb.cfg')

Step 4: Access the secrets in your configuration file

Now you’re ready to access those secrets! In this example, I’ll pass my secret connection string to my database engine object:

engine = create_engine(parser.get('my_db', 'conn_string'))
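Putting steps 2 through 4 together, here's a self-contained sketch (it writes the config file from Python purely for demonstration; normally you'd create nb.cfg by hand):

```python
import os
import tempfile
from configparser import ConfigParser

# write a throwaway config file just for this demo
cfg_path = os.path.join(tempfile.mkdtemp(), 'nb.cfg')
with open(cfg_path, 'w') as f:
    f.write('[my_db]\nconn_string: sqlite:///mwc.db\n')

# load it and pull out the secret, exactly as in the steps above
parser = ConfigParser()
_ = parser.read(cfg_path)
conn_string = parser.get('my_db', 'conn_string')
print(conn_string)  # sqlite:///mwc.db
```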

Step 5: Profit!

That’s basically it. In my example, I can now use my database engine object to query a table in my database and load the results into a dataframe:

qry = """
SELECT *
FROM people
"""

df_mwc_people = pd.read_sql(qry, engine)

Check out the complete code example here.

Postscript

You might ask yourself, “self, do I need to do anything else to protect my config file from getting into the hands of my enemies?” Well, since I often use Git for source control, I do want to make sure I don’t accidentally check my configuration file into my source code repository. To avoid that problem, I create a .gitignore file and add the name of my configuration file to it. Then, every time I commit a change, Git will simply ignore committing my configuration file.
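For example, a minimal .gitignore for this setup needs only one line:

```
nb.cfg
```

You can confirm Git is honoring the rule by running git check-ignore nb.cfg; it prints the file name if the file is being ignored.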

Scraping the PyOhio Schedule

The twelfth annual PyOhio conference was held on July 27-28 and…it. was. awesome!

Now, when it comes to planning for a conference, I must admit that I’m a bit “old school.” A day or two before the gathering, I like to print out the schedule and carefully research each session so that I can choose the ones that best meet my work and personal objectives. Often, a conference will let you download a printable schedule; however, I didn’t find any such file on the PyOhio website. No matter, I can write some Python to scrape the schedule from the website and create my own CSV for printing. Here’s what I did:

Step 1: Import the requisite packages

import requests
from bs4 import BeautifulSoup
import csv

Step 2: Grab the schedule page

result = requests.get('https://www.pyohio.org/2019/events/schedule/')
soup = BeautifulSoup(result.content, 'lxml')

Step 3: Parse out the sessions

Unfortunately, I can only attend Saturday, so my code just focuses on pulling the Saturday sessions:

day_2_list = [['start_end', 'slot1', 'slot2', 'slot3', 'slot4']]
day_2 = soup.select('div.day')[1]  # get just Saturday
talks_section = day_2.find('h3', string='Keynotes, Talks, & Tutorials').parent

# iterate across each time block
for time_block in talks_section.select('div.time-block'):
    start_end = time_block.find('div', {'class': 'time-wrapper'}).get_text().replace('to', ' - ')
    time_rec = [start_end, '', '', '', '']
    # now, iterate across each slot within a time block.  a time block can have 1-4 time slots
    for slot in time_block.select('div.time-block-slots'):
        for i, card in enumerate(slot.select('div.schedule-item')):
            class_title = card.select_one('h3').get_text()
            presenter = (card.select('p')[0]).get_text()
            location = (card.select('p')[1]).get_text()
            time_rec[i+1] = '{0}\n{1}\n{2}'.format(class_title, presenter, location)
    day_2_list.append(time_rec)  # after grabbing each slot, write the block to my "day 2" list

Step 4: Write the scraped results to a CSV

csv_file = 'pyohio_20190727_schedule.csv'

with open(csv_file, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(day_2_list)
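One design point worth noting: the time_rec cells contain embedded newlines, and the csv module handles those by quoting the field, so they survive a round trip intact (which is also why the file is opened with newline=''). A self-contained sketch with made-up schedule rows:

```python
import csv
import io

# a couple of rows shaped like the scraped schedule (made-up values)
rows = [['start_end', 'slot1', 'slot2', 'slot3', 'slot4'],
        ['9:00 - 9:45', 'Talk A\nPresenter A\nRoom 1', '', '', '']]

buf = io.StringIO()
csv.writer(buf).writerows(rows)

# embedded newlines survive because csv.writer quoted the multi-line field
read_back = list(csv.reader(io.StringIO(buf.getvalue())))
print(read_back[1][1])
```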

Sweet! Now I can choose just the right sessions to attend. Get my complete code here.


© 2024 DadOverflow.com
