How do you transpose a Question/Answer dataset?

Recently, a friend came to me with an interesting challenge. He had a dataset of questions and answers where each record contained a single question and the answer to the question. Arguably, this dataset was already in a tidy format, but my friend wanted to transpose the data such that each unique question became a column of its own with the answers as values.

Before I could come to his aid, my friend already found a great answer at Medium.com using the pandas function: pivot_table.

Here’s what he did:

Let’s suppose you have this table of question/answer, tab-delimited data:

person	question	answer
Sir Robin	What is your name?	Sir Robin of Camelot
Sir Robin	What is your quest?	To seek the Holy Grail
Sir Robin	What is the capital of Assyria?	I don't know that
Sir Lancelot	What is your name?	Sir Lancelot of Camelot
Sir Lancelot	What is your quest?	To seek the Holy Grail
Sir Lancelot	What is your favorite colour?	Blue
Sir Galahad	What is your name?	Sir Galahad of Camelot
Sir Galahad	What is your quest?	I seek the Grail
Sir Galahad	What is your favorite colour?	"Blue, no Yellow"
King Arthur	What is your name?	"Arthur, King of the Britons"
King Arthur	What is your quest?	I seek the Holy Grail
King Arthur	What is the air speed of an unladened swallow?	What do you mean?  An African or European swallow?

Step 1: Import pandas and read in your data

import pandas as pd

df = pd.read_csv('questions.csv', sep='\t')

Step 2: pivot_table

df_pivotted = df.pivot_table(index='person', values=['answer'], 
                             columns=['question'], aggfunc=lambda x: ' '.join(str(v) for v in x))
df_pivotted.head()

The trick here is the aggfunc operation. The aggfunc parameter is normally used to sum, average, or perform some other type of numeric operation on your values columns. Interestingly, though, you can apparently supply your own custom function to this parameter instead. Here, the Medium.com author found that he could simply loop through every letter of the answer and re-join them with spaces, effectively return the original answer.

That seems pretty complicated

The use of pivot_table certainly works in this example and it’s pretty sweet to see that you can pass your own custom function to it. However, pandas also has a more generic, pivot function. Could that have worked here?

The answer is: yes. When you google pandas pivot vs pivot_table, one of the top responses is this Stackoverflow.com post that suggests pivot_table only allows numerically-typed columns in the values parameter while pivot will take strings. I don’t think this is quite true, since the above example passed a string column to the values parameter, but it does suggest that pivot might be more disposed to working with strings than pivot_table. Let’s give it a try:

df.pivot(index='person', values='answer', columns='question')

Whaddya know?! Pivot can do the job, too!

Not only can pivot do the transformation, it certainly seems less complicated. Check out my full code here.

How do you transpose a Question/Answer dataset?

Step 1: Import pandas and read in your data

Step 2: pivot_table

That seems pretty complicated

Recent Posts

Recent Comments

Archives

Meta

How do you transpose a Question/Answer dataset?

Step 1: Import pandas and read in your data

Step 2: pivot_table

That seems pretty complicated

Recent Posts

Recent Comments

Archives

Tags

Meta