Python: UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)
I’ve been trying to write some Python code to extract the players, and the teams they represented, in the Bayern Munich/Barcelona match into a CSV file, and had much more difficulty than I expected.
I have some scraping code (beyond the scope of this article) that gives me a list of (player, team) pairs that I want to write to disk. The contents of the list are as follows:
$ python extract_players.py
(u'Sergio Busquets', u'Barcelona')
(u'Javier Mascherano', u'Barcelona')
(u'Jordi Alba', u'Barcelona')
(u'Bastian Schweinsteiger', u'FC Bayern M\xfcnchen')
(u'Dani Alves', u'Barcelona')
I started with the following script:
import csv

# players is the list of (player, team) tuples produced by the scraping code
with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team])
And if I run that I’ll see this error:
$ python extract_players.py
...
Bastian Schweinsteiger FC Bayern München <type 'unicode'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 67, in <module>
    writer.writerow([player, team])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)
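The failure can be reproduced without the csv module. The sketch below is an illustration rather than part of the original script, and its variable names are made up for the example. It encodes the team name with the ASCII codec, which appears to be what Python 2's csv writer does implicitly with unicode values, and then with UTF-8 for comparison.

```python
# Illustrative only: reproduce the ASCII encode failure in isolation.
team = u"FC Bayern M\xfcnchen"  # u'\xfc' is the character 'ü' (code point 252)

try:
    team.encode("ascii")  # ASCII only covers code points 0-127
    error = None
except UnicodeEncodeError as e:
    error = e

# error.start is the index of the offending character,
# the same "position 11" reported in the traceback.
failing_char = team[error.start]

# Encoding explicitly as UTF-8 succeeds; 'ü' becomes the two bytes 0xc3 0xbc.
encoded = team.encode("utf-8")
```

The `u` prefix keeps the snippet valid on both Python 2.7 and Python 3.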
So it looks like the 'ü' in 'FC Bayern München' is causing us issues: the value is being converted with the ASCII codec, which cannot represent it. Let’s try encoding the teams as UTF-8 ourselves to avoid this:
with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team.encode("utf-8")])
$ python extract_players.py
...
Thomas Müller FC Bayern München <type 'unicode'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 70, in <module>
    writer.writerow([player, team.encode("utf-8")])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 8: ordinal not in range(128)
Now we’ve got the same issue with the 'ü' in 'Müller', so let’s encode the players too:
with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player.encode("utf-8"), team.encode("utf-8")])
$ python extract_players.py
...
Gerard Piqué Barcelona <type 'str'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 70, in <module>
    writer.writerow([player.encode("utf-8"), team.encode("utf-8")])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
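This time the error is a decode error, even though the code only calls encode. In Python 2, calling encode on a byte string (type str) first decodes it implicitly with the default ASCII codec, and that hidden step is what fails. The sketch below makes the step explicit. It is illustrative only, and the byte values are an assumption inferred from the 0xc3 at position 11 in the traceback.

```python
# Illustrative only: the implicit ASCII decode, written out explicitly.
raw = b"Gerard Piqu\xc3\xa9"  # UTF-8 bytes; 0xc3 0xa9 together encode 'é'

try:
    raw.decode("ascii")  # what Python 2 attempts before str.encode() can run
    error = None
except UnicodeDecodeError as e:
    error = e

# Slicing (rather than indexing) yields a byte string on Python 2 and 3 alike.
bad_byte = raw[error.start:error.start + 1]

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode("utf-8")
```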
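Now we’ve got a problem with Gerard Piqué because that value has type str (a byte string) rather than unicode, so calling encode on it triggers an implicit ASCII decode first. Let’s fix that by decoding such values as UTF-8 before encoding them: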
with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        # Some players arrive as UTF-8 byte strings; turn them into unicode first
        if isinstance(player, str):
            player = unicode(player, "utf-8")

        print player, team, type(player), type(team)
        writer.writerow([player.encode("utf-8"), team.encode("utf-8")])
Et voilà! All the players are now successfully written to the file.
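The decode-then-encode logic can also be pulled into a small helper so that it applies to both columns. The following is a minimal sketch, not the script's actual code: `to_utf8_bytes` is a hypothetical name, and it assumes any byte strings in the input are already UTF-8.

```python
def to_utf8_bytes(value, encoding="utf-8"):
    """Return value as UTF-8 bytes, whether it arrives as bytes or as text."""
    if isinstance(value, bytes):  # on Python 2, bytes is an alias for str
        value = value.decode(encoding)  # assumes the incoming bytes are UTF-8
    return value.encode("utf-8")


# Sample rows mixing text and byte strings, as in the scraped data.
players = [
    (u"Thomas M\xfcller", u"FC Bayern M\xfcnchen"),
    (b"Gerard Piqu\xc3\xa9", u"Barcelona"),
]

rows = [[to_utf8_bytes(player), to_utf8_bytes(team)] for player, team in players]
```

On Python 2 each entry in `rows` could be passed straight to `writer.writerow`. The helper treats the team column the same way as the player column, so a byte-string team name would not reintroduce the error.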
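An alternative approach is to change the interpreter’s default encoding to UTF-8, so that the implicit conversions use UTF-8 instead of ASCII. Note that this affects every implicit conversion in the process, not just this script’s CSV writing: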
# encoding=utf8
import csv
import sys

# setdefaultencoding is removed from sys at startup; reload(sys) brings it back
reload(sys)
sys.setdefaultencoding('utf8')

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team])
It took me a while to figure it out but finally the players are ready to go!
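For comparison, here is a rough sketch of the same task on Python 3, which the scripts above do not target. There, strings are Unicode by default and the encoding is chosen when the file is opened, so no per-value encode calls are needed. The `write_players` function, the sample rows and the temporary path are assumptions made for the example.

```python
import csv
import os
import tempfile


def write_players(players, path):
    # The encoding is declared once on the file. newline="" is what the csv
    # documentation recommends for files handed to csv.writer.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter=",")
        writer.writerow(["player", "team"])
        writer.writerows(players)


# Sample data for illustration; on Python 3 these are all Unicode strings.
players = [
    ("Thomas M\xfcller", "FC Bayern M\xfcnchen"),
    ("Gerard Piqu\xe9", "Barcelona"),
]

path = os.path.join(tempfile.mkdtemp(), "players.csv")
write_players(players, path)

# Read the file back to confirm the non-ASCII characters round-trip.
with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
```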
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.