This is a follow-up to the previous Python Mystery of the Day regarding a “TypeError: decoding unicode is not supported” exception.
Ok, so what if you receive a unicode string that has clearly been converted to unicode using the wrong encoding? In my case, the string was originally encoded as UTF-8. The source did not specify an encoding, so the encoding applied was ISO-8859-1, also known as Latin-1.
By the way, there are some really great Python unicode tutorials. As of yesterday, I could barely tell you the difference between Latin-1 and ISO-8859-1. Today, I can tell you that they are the same thing. Thanks, tutorials! Seriously, the tutorial linked at the beginning of this paragraph is awesome if you’re just starting to bash your head against unicode in Python.
With that, here is a way to fix a string that was interpreted as “Latin-1” (that is, ISO-8859-1) when it really should have been interpreted as “UTF-8”.
We’ll use the example string “Nuñoz”. If the original is encoded as UTF-8, it will be represented with these bytes (written here as a Python literal):
'\x4E\x75\xC3\xB1\x6F\x7A'
All of the bytes that are less than 128 are regular ASCII characters, as ASCII is a subset of UTF-8. The 0xC3 and 0xB1 bytes together represent the accented “n” (ñ) in UTF-8.
Let’s see how this works in Python:
>>> rawstring = '\x4E\x75\xC3\xB1\x6F\x7A'
>>> rawstring
'Nu\xc3\xb1oz'
>>> print rawstring
Nu??oz
Note that the rawstring contains a string of bytes, but when we try to print it, the terminal window will attempt to display the non-ASCII bytes. How they are displayed depends on the computer. On my Windows machine, it displays as garbled mousetext.
Now, if our software knew the proper encoding, it could do the right thing and convert this raw byte string into Python’s internal Unicode representation like this:
>>> utf8string = unicode(rawstring, 'utf-8')
>>> print utf8string
Nuñoz
>>> utf8string
u'Nu\xf1oz'
Now we see that when we print the string, it shows the proper accented “n” character. When we look at the internal representation of utf8string, we see that the non-ASCII character has been converted to 0xF1. The two-byte UTF-8 sequence is represented by the single value 0x00F1 in Python’s internal unicode format (which is usually a form of UTF-16. Theoretically we’re not supposed to worry about Python’s internal format too much, but it becomes important when encodings start going haywire).
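To make the step from 0xC3 0xB1 to 0xF1 less magical, here is a small sketch that decodes the two bytes by hand. It assumes Python 3 syntax (the sessions in this post are Python 2), and the variable names are made up for illustration.

```python
# Decode the two-byte UTF-8 sequence 0xC3 0xB1 by hand.
lead, cont = 0xC3, 0xB1

# A two-byte lead byte has the bit pattern 110xxxxx,
# and a continuation byte has the bit pattern 10xxxxxx.
assert lead >> 5 == 0b110
assert cont >> 6 == 0b10

# Keep the 5 payload bits of the lead byte and the 6 payload bits
# of the continuation byte, then glue them together.
code_point = ((lead & 0b00011111) << 6) | (cont & 0b00111111)

print(hex(code_point))  # 0xf1, the code point of the accented "n"
```

The built-in codec agrees: decoding those same two bytes as UTF-8 yields the single character whose code point is 0xF1.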
Now, what happens if our software is wrong, and it thinks our rawstring was originally encoded using ISO-8859-1? Let’s see:
>>> rawstring
'Nu\xc3\xb1oz'
>>> iso8859string = unicode(rawstring, 'iso-8859-1')
>>> iso8859string
u'Nu\xc3\xb1oz'
>>> print iso8859string
Traceback (most recent call last):
  File "", line 1, in ?
  File "C:\Python24\lib\encodings\cp437.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xc3' in position 2: character maps to undefined
Oh no! The iso-8859-1 codec took the UTF-8 byte string exactly as-is and stored each byte as its own character in Python’s internal format without changing anything. This is what we told it to do: Unicode contains the entire iso-8859-1 character set as its first 256 code points, so each byte is just zero-extended to 16 bits. But this is a bad situation, because 0xC3 and 0xB1 are now two characters that can’t be printed to my terminal, when they originally represented a single character.
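The byte-for-byte behaviour is easy to see side by side. This is a sketch in Python 3 syntax (an assumption, since the sessions in this post are Python 2), where `bytes` plays the role of the raw string and `str` plays the role of `unicode`. The names `raw`, `wrong` and `right` are illustrative only.

```python
raw = b"Nu\xc3\xb1oz"             # the UTF-8 bytes from the example

wrong = raw.decode("iso-8859-1")  # the mistaken interpretation
right = raw.decode("utf-8")       # the intended interpretation

# iso-8859-1 maps every byte to the code point with the same number,
# so the wrong result has exactly one character per input byte.
assert [ord(ch) for ch in wrong] == list(raw)

print(len(raw), len(wrong), len(right))  # 6 6 5
print(ascii(wrong))                      # 'Nu\xc3\xb1oz'
print(ascii(right))                      # 'Nu\xf1oz'
```

The mis-decoded string is one character longer than the correct one, which is the visible symptom of the two UTF-8 bytes having been split into two separate characters.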
This gives us a hint that we may be able to detect strings that were improperly decoded as iso-8859-1 when they should have been decoded as UTF-8. This confusion is very common, as both of these encodings are used frequently on websites, and the source will quite often not tell you which encoding is used. If the string is examined byte-by-byte and some two- or three-byte sequences exist with each byte at 128 or above, we can check to see if those sequences are valid UTF-8 sequences. If so, there’s a good chance that the string was mis-decoded. I haven’t tried this yet, so we’ll defer this possibility until a future posting.
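As a rough idea of what such a check could look like, here is an untested-in-production sketch in Python 3 syntax (an assumption, as the rest of this post uses Python 2). Rather than walking the bytes by hand, it leans on the UTF-8 codec to validate the sequences. The function name `looks_like_misdecoded_utf8` is hypothetical, and the result is only a guess: a genuine Latin-1 string can occasionally happen to form valid UTF-8.

```python
def looks_like_misdecoded_utf8(text):
    """Guess whether `text` is UTF-8 data that was decoded as iso-8859-1."""
    try:
        # Recover the original bytes. This only works if every
        # character is below U+0100, as iso-8859-1 output always is.
        raw = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    if all(byte < 0x80 for byte in raw):
        # Pure ASCII reads the same in both encodings: nothing to detect.
        return False
    try:
        # Every run of high bytes must form valid UTF-8 sequences.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_misdecoded_utf8("Nu\xc3\xb1oz"))  # True
print(looks_like_misdecoded_utf8("Nu\xf1oz"))      # False
```

The second call returns False because a lone 0xF1 byte followed by ASCII is not a valid UTF-8 sequence, so the string is most likely genuine Latin-1 text.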
The easiest situation is when you know for sure that the original was improperly encoded as iso-8859-1 and it should have been encoded as utf-8. You can convert it back to a “raw” string, then re-convert it to unicode using the proper utf-8 encoding, like this:
>>> iso8859string
u'Nu\xc3\xb1oz'
>>> rawfromiso = iso8859string.encode('iso-8859-1')
>>> rawfromiso
'Nu\xc3\xb1oz'
>>> properUTF8string = unicode(rawfromiso, 'utf-8')
>>> properUTF8string
u'Nu\xf1oz'
>>> print properUTF8string
Nuñoz
Starting with the incorrectly-decoded string “iso8859string”, we convert it back to a raw byte string by using ‘encode’, passing in the incorrect encoding that we want to strip. We then take that “rawfromiso” raw byte string and decode it using utf-8. Note that after decoding, the utf-8 two-byte sequence is converted into a single proper unicode character.
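For anyone on Python 3, the same round trip can be wrapped in a small helper. This is a sketch under that assumption (the sessions above are Python 2), and the function name `redecode_latin1_as_utf8` is made up for illustration. It raises an exception if the input was not really mis-decoded UTF-8, so only call it when you are sure.

```python
def redecode_latin1_as_utf8(text):
    """Undo an iso-8859-1 decode and redo it as UTF-8."""
    raw = text.encode("iso-8859-1")  # strip the wrong encoding
    return raw.decode("utf-8")       # apply the right one


# Reproduce the broken string from the example, then repair it.
garbled = b"Nu\xc3\xb1oz".decode("iso-8859-1")
fixed = redecode_latin1_as_utf8(garbled)

print(ascii(garbled))  # 'Nu\xc3\xb1oz'
print(ascii(fixed))    # 'Nu\xf1oz'
```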
With unicode, the wrong thing seems to happen more often than one would normally expect. This is mostly an issue of getting used to the idea of strings of text no longer being well defined by one simple standard. Once various encodings come into play, it’s no longer just text - it is a string of 8-bit, 16-bit or 32-bit integers, possibly little endian or big-endian, plus encoding meta-data describing what all of it is supposed to mean. If that encoding meta-data is missing, wrong, or not what is expected, garbled text and exceptions are the result.
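To see how much the meta-data matters, here is a short sketch (Python 3 syntax, an assumption, with illustrative variable names) that writes the single character from our example under several encodings. The bytes differ every time, and nothing in the bytes themselves says which encoding produced them.

```python
ch = "\xf1"  # the accented "n", code point U+00F1

# The same character as 8-bit, 16-bit and 32-bit units,
# in little-endian and big-endian byte order.
encoded = {}
for name in ("iso-8859-1", "utf-8", "utf-16-le", "utf-16-be", "utf-32-le"):
    encoded[name] = ch.encode(name).hex()
    print(name, encoded[name])

# iso-8859-1 f1
# utf-8 c3b1
# utf-16-le f100
# utf-16-be 00f1
# utf-32-le f1000000
```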