Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


How do you transpose a Question/Answer dataset?

Recently, a friend came to me with an interesting challenge. He had a dataset of questions and answers where each record contained a single question and the answer to the question. Arguably, this dataset was already in a tidy format, but my friend wanted to transpose the data such that each unique question became a column of its own with the answers as values.

Before I could come to his aid, my friend had already found a great answer at Medium.com using the pandas pivot_table function.

Here’s what he did:

Let’s suppose you have this table of tab-delimited question/answer data:

person	question	answer
Sir Robin	What is your name?	Sir Robin of Camelot
Sir Robin	What is your quest?	To seek the Holy Grail
Sir Robin	What is the capital of Assyria?	I don't know that
Sir Lancelot	What is your name?	Sir Lancelot of Camelot
Sir Lancelot	What is your quest?	To seek the Holy Grail
Sir Lancelot	What is your favorite colour?	Blue
Sir Galahad	What is your name?	Sir Galahad of Camelot
Sir Galahad	What is your quest?	I seek the Grail
Sir Galahad	What is your favorite colour?	"Blue, no Yellow"
King Arthur	What is your name?	"Arthur, King of the Britons"
King Arthur	What is your quest?	I seek the Holy Grail
King Arthur	What is the air speed of an unladened swallow?	What do you mean?  An African or European swallow?

Step 1: Import pandas and read in your data

import pandas as pd

df = pd.read_csv('questions.csv', sep='\t')
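
If you don't have a questions.csv file handy, one way to follow along is to read the same kind of data from an in-memory string. This is only a sketch: the `lines` and `raw` variables are my own stand-ins, and I've copied just three of the rows above rather than the full file.

```python
import io

import pandas as pd

# Stand-in for questions.csv: a few of the rows above, joined with real tab characters
lines = [
    'person\tquestion\tanswer',
    'Sir Lancelot\tWhat is your favorite colour?\tBlue',
    'Sir Galahad\tWhat is your favorite colour?\t"Blue, no Yellow"',
    'King Arthur\tWhat is your name?\t"Arthur, King of the Britons"',
]
raw = '\n'.join(lines)

# sep='\t' splits on tabs only, so commas inside an answer are left alone;
# the surrounding double quotes are stripped by read_csv's default quoting rules
df = pd.read_csv(io.StringIO(raw), sep='\t')
print(df)
```

The quoted answers come through as plain strings, which is why the commas in them never cause trouble later on.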

Step 2: pivot_table

df_pivotted = df.pivot_table(index='person', values=['answer'], 
                             columns=['question'], aggfunc=lambda x: ' '.join(str(v) for v in x))
df_pivotted.head()
pivot_table does the job nicely

The trick here is the aggfunc operation. The aggfunc parameter is normally used to sum, average, or perform some other type of numeric operation on your values columns. Interestingly, though, you can supply your own custom function to this parameter instead. Here, the Medium.com author found that he could simply loop through every answer in each person/question group and join them with spaces. Since each group holds exactly one answer, the function effectively returns the original answer.
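
To see what the custom function actually receives, here's a minimal sketch on a tiny in-memory sample. The `join_answers` name is mine, and the second quest answer for Sir Robin is an invented duplicate that I added purely to show what the join does when a group holds more than one answer.

```python
import pandas as pd

rows = [
    ('Sir Robin', 'What is your name?', 'Sir Robin of Camelot'),
    ('Sir Robin', 'What is your quest?', 'To seek the Holy Grail'),
    ('Sir Robin', 'What is your quest?', 'To find a shrubbery'),  # invented duplicate
    ('Sir Lancelot', 'What is your name?', 'Sir Lancelot of Camelot'),
    ('Sir Lancelot', 'What is your quest?', 'To seek the Holy Grail'),
]
df = pd.DataFrame(rows, columns=['person', 'question', 'answer'])


def join_answers(answers):
    # answers is a pandas Series holding every answer for one person/question pair
    return ' '.join(str(v) for v in answers)


wide = df.pivot_table(index='person', columns='question', values='answer',
                      aggfunc=join_answers)
print(wide)
```

Where a person answered a question once, the cell is just that answer. Where there are two answers, they get glued together with a space.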

That seems pretty complicated

The use of pivot_table certainly works in this example and it’s pretty sweet to see that you can pass your own custom function to it. However, pandas also has a more generic pivot function. Could that have worked here?

The answer is: yes. When you google pandas pivot vs pivot_table, one of the top responses is this Stackoverflow.com post that suggests pivot_table only allows numerically-typed columns in the values parameter while pivot will take strings. I don’t think this is quite true, since the above example passed a string column to the values parameter. What is true is that pivot_table always aggregates, and its default aggfunc is mean, which only makes sense for numbers; pivot does no aggregation at all, so it is naturally more disposed to working with strings. Let’s give it a try:

df.pivot(index='person', values='answer', columns='question')
Whaddya know?! Pivot can do the job, too!

Not only can pivot do the transformation, it certainly seems less complicated. Check out my full code here.
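
One difference is worth knowing before you swap one function for the other: because pivot does no aggregation, it raises a ValueError if any person/question pair appears more than once, whereas pivot_table can fall back on an aggregation such as 'first'. Here's a small sketch of that behavior. The three sample rows, the repeated row, and the variable names are my own, not part of the original dataset.

```python
import pandas as pd

rows = [
    ('Sir Robin', 'What is your name?', 'Sir Robin of Camelot'),
    ('Sir Robin', 'What is your quest?', 'To seek the Holy Grail'),
    ('Sir Lancelot', 'What is your name?', 'Sir Lancelot of Camelot'),
]
df = pd.DataFrame(rows, columns=['person', 'question', 'answer'])

# With unique person/question pairs, pivot reshapes the strings directly.
# Sir Lancelot has no quest row here, so that cell comes back as NaN.
wide = df.pivot(index='person', columns='question', values='answer')

# Repeat the first row to create a duplicate person/question pair
dupes = pd.concat([df, df.iloc[[0]]], ignore_index=True)

try:
    dupes.pivot(index='person', columns='question', values='answer')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table copes with the duplicate by keeping the first answer it sees
fallback = dupes.pivot_table(index='person', columns='question',
                             values='answer', aggfunc='first')
print(pivot_failed)
print(fallback)
```

So pivot is the simpler tool when every person answers each question at most once, and pivot_table is the one to reach for when that might not hold.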

Scraping the PyOhio Schedule

The twelfth annual PyOhio conference was held on July 27-28 and…it. was. awesome!

Now, when it comes to planning for a conference, I must admit that I’m a bit “old school.” A day or two before the gathering, I like to print out the schedule and carefully research each session so that I can choose the ones that best meet my work and personal objectives. Often, a conference will let you download a printable schedule; however, I didn’t find any such file on the PyOhio website. No matter, I can write some Python to scrape the schedule from the website and create my own CSV for printing. Here’s what I did:

Step 1: Import the requisite packages

import requests
from bs4 import BeautifulSoup
import csv

Step 2: Grab the schedule page

result = requests.get('https://www.pyohio.org/2019/events/schedule/')
soup = BeautifulSoup(result.content, 'lxml')

Step 3: Parse out the sessions

Unfortunately, I can only attend Saturday, so my code just focuses on pulling the Saturday sessions:

day_2_list = [['start_end', 'slot1', 'slot2', 'slot3', 'slot4']]
day_2 = soup.select('div.day')[1]  # get just Saturday
talks_section = day_2.find('h3', string='Keynotes, Talks, & Tutorials').parent

# iterate across each time block
for time_block in talks_section.select('div.time-block'):
    start_end = time_block.find('div', {'class': 'time-wrapper'}).get_text().replace('to', ' - ')
    time_rec = [start_end, '', '', '', '']
    # now, iterate across each slot within a time block.  a time block can have 1-4 time slots
    for slot in time_block.select('div.time-block-slots'):
        for i, card in enumerate(slot.select('div.schedule-item')):
            class_title = card.select_one('h3').get_text()
            presenter = (card.select('p')[0]).get_text()
            location = (card.select('p')[1]).get_text()
            time_rec[i+1] = '{0}\n{1}\n{2}'.format(class_title, presenter, location)
    day_2_list.append(time_rec)  # after grabbing each slot, write the block to my "day 2" list
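
If you'd like to try the parsing logic without hitting the live site, here's a minimal sketch that runs the same kind of loop over a hand-written HTML snippet. The snippet is my guess at the page structure based on the selectors above, the talk names are made up, and I've used Python's built-in 'html.parser' so that lxml isn't required.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the class names the scraper looks for
html = """
<div class="time-block">
  <div class="time-wrapper">9:00 to 9:45</div>
  <div class="time-block-slots">
    <div class="schedule-item"><h3>Talk A</h3><p>Speaker A</p><p>Room 1</p></div>
    <div class="schedule-item"><h3>Talk B</h3><p>Speaker B</p><p>Room 2</p></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []

for time_block in soup.select('div.time-block'):
    # one row per time block: the time range first, then up to four slots
    start_end = time_block.select_one('div.time-wrapper').get_text(strip=True)
    time_rec = [start_end.replace(' to ', ' - '), '', '', '', '']
    for i, card in enumerate(time_block.select('div.schedule-item')):
        title = card.select_one('h3').get_text(strip=True)
        presenter, location = [p.get_text(strip=True) for p in card.select('p')[:2]]
        time_rec[i + 1] = '{0}\n{1}\n{2}'.format(title, presenter, location)
    rows.append(time_rec)

print(rows)
```

Each time block becomes one list with the unused slots left as empty strings, which is the shape the CSV writer expects.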

Step 4: Write the scraped results to a CSV

csv_file = 'pyohio_20190727_schedule.csv'

with open(csv_file, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(day_2_list)
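
Since every session cell contains line breaks, it's fair to wonder whether the CSV survives the trip. Here's a quick sketch that writes a couple of made-up rows to an in-memory buffer and reads them back. The sample rows and variable names are mine.

```python
import csv
import io

rows = [
    ['start_end', 'slot1', 'slot2'],
    ['9:00 - 9:45', 'Talk A\nSpeaker A\nRoom 1', ''],
]

# newline='' leaves line endings to the csv module, just like in open() above
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()

# cells with embedded newlines get wrapped in double quotes automatically
print(text)

read_back = list(csv.reader(io.StringIO(text, newline='')))
print(read_back == rows)
```

The csv module quotes any cell with an embedded newline, so a spreadsheet program should show each session as a single multi-line cell.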

Sweet! Now I can choose just the right sessions to attend. Get my complete code here.

Ten things I like to do in Jupyter Markdown

One of the great things about Jupyter Notebook is that you can intersperse your code cells with markdown cells to add comments or simply more context around your code. Here are ten ways I like to use markdown in my Jupyter Notebooks.

1. Use hashes for easy titles

In your markdown cell, enter a line like this:

# This becomes an H1 header/title

That line will render as a header (h1 element).

2. Use asterisks and hyphens for bullet points

Try these lines in your markdown cell:

* this is one way to do a bullet point
- this is another way to do a bullet point

Both render as bullet point lists.

3. Use asterisks and underscores for emphasis

Next, try this:

*these words become italicized*
__these words become bold__
Wait…that didn’t render quite as expected

The phrase I wanted to italicize italicized and the phrase I wanted to bold went bold, but both phrases rendered on the same line. What gives? Markdown treats a single newline inside a paragraph as a space, so consecutive lines run together. Here’s a simple solution: add a <br> (HTML for line break) at the end of each line where you want a line break. So, write this in your markdown cell:

*these words become italicized*<br>
__these words become bold__

4. Center my headers with some HTML

Instead of using the hash shortcut, code your header elements directly and style them to center:

<h1 style="text-align: center">This header is centered</h1>

Interestingly, I’ve noticed that my centering works in Jupyter Notebook, but not in Jupyter Lab.

5. Create thick dividing lines with HTML

My notebooks that do a lot of exploratory data analysis before jumping into data modeling can get quite lengthy. I find that a nice, thick dividing line between sections can be a great visual indicator of the changing focus of my notebook. In a markdown cell, give this a try:

<hr style="border-top: 5px solid purple; margin-top: 1px; margin-bottom: 1px">

6. Write mathematical formulas

I’m more coder than math guy, but a formula or two can sometimes be helpful in explaining your solution to a problem. Jupyter markdown cells support LaTeX, so give this a whirl:

linear regression: $y = ax + b$
two dimensions: $y = a_{1}x_{1} + a_{2}x_{2} + b$
Ridge Regression: standard OLS loss function + $\alpha \times \sum_{i=1}^{n} a_{i}^{2}$
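
For completeness, here's how the whole Ridge loss might be written as one display formula by wrapping it in double dollar signs. The OLS part is left as words in the line above, so the symbols $m$ (number of observations), $y_j$ (actual value), and $\hat{y}_j$ (predicted value) are my own notation.

```latex
$$\text{Ridge loss} = \sum_{j=1}^{m} (y_j - \hat{y}_j)^{2} + \alpha \times \sum_{i=1}^{n} a_{i}^{2}$$
```

Double dollar signs center the formula on its own line, while single dollar signs keep it inline with the text.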

7. Create hyperlinks

Hyperlinks are easy in markdown:

[Google](https://google.com)

8. Drop in images with HTML

A picture is worth a thousand words:

<img src="mind_blown.gif" style="max-width:50%; max-height:50%">

9. Create nice tables

Use pipes and dashes to create a table in your markdown:

|| sepal length (cm) | sepal width (cm) |
|----|----|----|
|0|5.1|3.5|
|1|4.9|3.0|
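
Typing pipes by hand gets old quickly, so here's a small sketch that builds the same kind of table from Python lists. The `to_markdown_table` helper is a hypothetical function of my own, not something Jupyter or pandas provides.

```python
def to_markdown_table(headers, rows):
    """Build a pipe-and-dash markdown table from a header list and row lists."""
    lines = ['| ' + ' | '.join(headers) + ' |']
    # one dash group per column separates the header from the body
    lines.append('|' + '|'.join(['----'] * len(headers)) + '|')
    for row in rows:
        lines.append('| ' + ' | '.join(str(cell) for cell in row) + ' |')
    return '\n'.join(lines)


table = to_markdown_table(
    ['', 'sepal length (cm)', 'sepal width (cm)'],
    [[0, 5.1, 3.5], [1, 4.9, 3.0]],
)
print(table)
```

Paste the printed text into a markdown cell and it should render like the hand-written table.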

10. Escape text with three back-ticks

Occasionally, I’ll want to show a code snippet or some other kind of escaped text in my markdown. You can do that by surrounding your snippet with three back-tick characters:

```
sample code goes here
```

Bonus: change the background color of your markdown cells

It never occurred to me until recently, but Notebooks bring with them a variety of style classes that you can leverage in your own markdown. Here are four examples (note: this is yet another markdown trick that works in Jupyter Notebook, but not in Jupyter Lab…at least the version I’m presently running):

<div class="alert alert-block alert-info">
This is a blue background
</div>
<div class="alert alert-block alert-warning">
This is a yellow background
</div>
<div class="alert alert-block alert-success">
This is a green background
</div>
<div class="alert alert-block alert-danger">
This is a red background
</div>

For all of this code, check out my notebook here. Also, here are two other great posts on more markdown tips and tricks.


© 2024 DadOverflow.com
