15 Jul 2015

Python: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

I was recently doing some text scrubbing and had difficulty working out how to remove the '†' character from strings.

For example, I had a string like this:

>>> u'foo †'
u'foo \u2020'

I wanted to get rid of the '†' character and then strip any trailing spaces so I’d end up with the string 'foo'. I tried to do this in one call to 'replace':

>>> u'foo †'.replace(" †", "")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

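It took me a while to work out that " †" was a plain byte string, which Python 2 implicitly decodes as ASCII rather than UTF-8 when it is combined with a unicode string. Let’s fix that by passing a unicode literal instead: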

>>> u'foo †'.replace(u' †', "")
u'foo'
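
To see where the error comes from, it can help to perform the implicit conversion by hand. The sketch below assumes a UTF-8 terminal, so that " †" arrives as the bytes 0x20 0xe2 0x80 0xa0; the variable names are only illustrative. It is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Bytes that a UTF-8 terminal would produce for the literal " †" (assumption).
raw = u' \u2020'.encode('utf-8')

# Python 2 decodes byte strings with the ASCII codec when they are mixed with
# unicode strings. Doing that decode explicitly reproduces the error.
try:
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as error:
    message = str(error)

print(message)

# Decoding with the codec the bytes were actually written in succeeds.
decoded = raw.decode('utf-8')
print(decoded == u' \u2020')
```

The first byte is an ordinary space, which is why the traceback reports 0xe2 at position 1 rather than position 0.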

I think the following call to unicode, which I’ve written about before, is equivalent, provided the terminal passes the literal in as UTF-8 bytes:

>>> u'foo †'.replace(unicode(' †', "utf-8"), "")
u'foo'
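
For the scrubbing itself, one option is to wrap the replacement in a small helper that stays in unicode throughout. This is only a sketch: `scrub` and `DAGGER` are hypothetical names, and it removes the dagger wherever it appears and then strips trailing whitespace, which is slightly broader than the single `replace` call above.

```python
# -*- coding: utf-8 -*-
DAGGER = u'\u2020'  # the '†' character, written as an escape to avoid encoding issues


def scrub(text):
    """Remove every dagger from a unicode string, then strip trailing whitespace."""
    return text.replace(DAGGER, u'').rstrip()


print(scrub(u'foo \u2020') == u'foo')
```

Writing the character as `u'\u2020'` keeps the source file pure ASCII, so the result does not depend on the terminal or file encoding.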

Now back to the scrubbing!

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.