Python: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
I was recently doing some text scrubbing and had difficulty working out how to remove the '†' character from strings.
For example, I had a string like this:
>>> u'foo †'
u'foo \u2020'
I wanted to get rid of the '†' character and then strip any trailing spaces so I’d end up with the string 'foo'. I tried to do this in one call to 'replace':
>>> u'foo †'.replace(" †", "")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
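It took me a while to work out that " †" was a byte string, which Python was implicitly decoding as ASCII rather than UTF-8 before doing the replacement. Let’s fix that by passing a unicode literal instead: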
>>> u'foo †'.replace(u' †', "")
u'foo'
I think the following call to unicode, which I’ve written about before, is equivalent:
>>> u'foo †'.replace(unicode(' †', "utf-8"), "")
u'foo'
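To make the implicit step visible, here is a minimal sketch that performs the decode by hand. It assumes the terminal was sending UTF-8, so " †" arrived as the bytes 0x20 0xe2 0x80 0xa0; the variable names (raw, needle, cleaned, error) are made up for illustration. It should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# The UTF-8 bytes of " †": a space followed by the three-byte encoding of U+2020.
# (Assumption: the terminal encoding is UTF-8.)
raw = b" \xe2\x80\xa0"

# This is the decode Python 2 attempted implicitly when a byte string
# was passed to a method on a unicode string.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The failing byte is 0xe2, at index 1 (index 0 is the space).
print(error)

# Decoding with the codec that matches the bytes gives the text we meant,
# and the replacement then works as expected.
needle = raw.decode("utf-8")
cleaned = u"foo \u2020".replace(needle, u"")
print(repr(cleaned))
```

Decoding explicitly like this is the same idea as the unicode(' †', "utf-8") call: both turn the bytes into text with a named codec instead of leaving Python to fall back to ASCII.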
Now back to the scrubbing!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.