Python makes it pretty darn easy to replace text in a string:
s = 'The quick brown studebaker jumps over the lazy dog'
print(s.replace('studebaker', 'fox'))
However, when I’m cleaning larger data sets, I find myself performing multiple replacement operations. You can chain replace operations together:
s = 'The quick brown studebaker jumps over the lazy chupacabra'
print(s.replace('studebaker', 'fox').replace('chupacabra', 'dog'))
…but if you have a lot of text to replace, chaining replace operations together can go sideways pretty fast. Here are two techniques I’ve found for replacing a large number of strings in a cleaner way.
Using the Zip Function
The zip function is a neat way to pair up the items of two tuples for easy iteration. In the case of replacing text, we have one tuple of the text that needs to be replaced and a second tuple, in the same order, of the substitutes:
s = 'The quick blue jackalope jumps under the lazy chupacabra'
old_words = ('blue', 'jackalope', 'under', 'chupacabra')
new_words = ('brown', 'fox', 'over', 'dog')
for check, rep in zip(old_words, new_words):
    s = s.replace(check, rep)
print(s)
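If you do this a lot, it can help to see exactly what zip hands to the loop and to wrap the loop in a function. Here's a minimal sketch of that idea; replace_all is a hypothetical helper name I made up for illustration, not something from the standard library:

```python
def replace_all(text, old_words, new_words):
    """Apply each (old, new) replacement to text, one after another."""
    # zip pairs the tuples by position: first with first, second with second...
    for old, new in zip(old_words, new_words):
        text = text.replace(old, new)
    return text


old_words = ('blue', 'jackalope', 'under', 'chupacabra')
new_words = ('brown', 'fox', 'over', 'dog')

# Materialize the pairs so we can look at what the loop iterates over
pairs = list(zip(old_words, new_words))
print(pairs)

result = replace_all(
    'The quick blue jackalope jumps under the lazy chupacabra',
    old_words,
    new_words,
)
print(result)
```

Two things to keep in mind: zip silently stops at the end of the shorter tuple, so a missing substitute means a word quietly goes unreplaced (on Python 3.10 and later, zip(old_words, new_words, strict=True) raises a ValueError instead). Also, the replacements run in order, so a later replacement can match text that an earlier one produced.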
Using Replace in a Pandas Dataframe
Often, I’ll have text in a pandas dataframe that I need to replace. For such circumstances, pandas provides a variety of solutions. I’ve found using a dictionary can be a clean way to solve this problem:
import pandas as pd
s = 'The quick blue jackalope jumps under the lazy chupacabra'
df = pd.DataFrame(s.split(' '), columns=['word'])
print(df) # print the error-laden dataframe
replacements = {'blue': 'brown', 'jackalope': 'fox', 'under': 'over', 'chupacabra': 'dog'}
df['word'] = df.word.replace(replacements)
print(df) # now print the cleaned up dataframe
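One caveat: with a plain dictionary, replace only swaps cells whose entire value equals a key. That works above because each row holds a single word. If your cells hold longer text, passing regex=True should make pandas match the keys inside each cell instead. Here's a rough sketch of the difference, assuming a made-up dataframe with a hypothetical 'sentence' column:

```python
import pandas as pd

# Hypothetical data: each cell holds a phrase rather than a single word
df = pd.DataFrame({'sentence': ['The quick blue jackalope',
                                'jumps under the lazy chupacabra']})
replacements = {'blue': 'brown', 'jackalope': 'fox',
                'under': 'over', 'chupacabra': 'dog'}

# Whole-cell matching: no cell is exactly equal to a key, so nothing changes
unchanged = df['sentence'].replace(replacements)

# regex=True treats each key as a regular expression searched within the cell
cleaned = df['sentence'].replace(replacements, regex=True)

print(unchanged.tolist())
print(cleaned.tolist())
```

Because the keys are treated as regular expressions here, any key containing special characters such as '.' or '(' would need escaping first, for example with re.escape.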
Happy replacing!