The following Python exception has been driving me nuts:
“TypeError: decoding Unicode is not supported”
This happens when trying to do something like this:
encodedstring = unicode(normalstring, 'utf-8')
If “normalstring” is a regular string (type ‘str’) that contains raw utf-8 data (for example, read from a file that you know was encoded in utf-8), the above will decode those bytes from utf-8 into the internal unicode representation that your Python interpreter is using. The result will be an object of type ‘unicode’. (Despite the variable name, this step is a decode, not an encode.)
However, if “normalstring” is already a unicode object, you will get the not-so-obvious TypeError exception “decoding Unicode is not supported”. It’s not obvious, because if “normalstring” is coming to you from some other library, you might not know whether that string was already decoded to unicode or not.
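One way to sidestep the exception while you track down where the string came from is to check the type before decoding. This is only a minimal sketch: the helper name to_unicode and the text_type alias are my own inventions, not part of any library, and the alias is there only so the snippet also runs on Python 3, where the text type is called str.

```python
import sys

# Hypothetical alias: the text type is `unicode` on Python 2 and `str` on Python 3.
text_type = unicode if sys.version_info[0] < 3 else str


def to_unicode(value, encoding='utf-8'):
    """Return value as text, decoding only if it is still a byte string."""
    if isinstance(value, text_type):
        # Already decoded: calling unicode(value, encoding) here is what
        # raises "decoding Unicode is not supported".
        return value
    return value.decode(encoding)


from_file = b'caf\xc3\xa9'         # raw utf-8 bytes
from_library = u'caf\xe9'          # already a unicode object

print(repr(to_unicode(from_file)))
print(repr(to_unicode(from_library)))
```

Both calls return the same text object, so the caller no longer needs to know which kind of string it was handed.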
As an example of how this could happen, let’s say some other library passed off a string to you, and you noticed when printing it that some characters were garbled. Upon further inspection, you realize that each garbled sequence represents a single utf-8 encoded character. So, you tell Python to decode the string as utf-8. If the string is already a unicode object, the above exception fires. This situation can easily happen if whatever processed the string in the first place applied the wrong encoding (for example, if iso-8859-1 was incorrectly applied to a utf-8 stream).
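Here is a small sketch of that scenario, using a made-up sample value (“café”) rather than anything from a real library. It shows how decoding utf-8 bytes as iso-8859-1 produces a garbled unicode object, and how the damage can be reversed by going back through the wrong codec. Treat the round trip as a diagnostic, not as a fix.

```python
raw = b'caf\xc3\xa9'                 # utf-8 bytes for "café"

# What the upstream code did by mistake: each byte became one character,
# so the two-byte sequence for "é" turned into two garbled characters.
garbled = raw.decode('iso-8859-1')

# At this point, unicode(garbled, 'utf-8') raises the TypeError on Python 2,
# because garbled is already a unicode object.

# Undo the wrong decode to get the original bytes back, then decode properly.
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(repr(garbled))
print(repr(repaired))
```

The round trip works here because iso-8859-1 maps every byte to exactly one character, so no information was lost by the wrong decode.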
The best thing to do in this case is figure out why the stream was originally decoded to unicode with the wrong encoding. Once that’s fixed, the re-conversion attempt that throws the mysterious exception is no longer needed.