Archive for July, 2006

Big Box Economics - My Plan To Milk Wal-Mart

Yesterday, the Chicago City Council passed the so-called “Big Box” wage law by a vote of 35 to 14, enough votes to override a veto by Mayor Richard Daley. The law requires that employees at large retailers be paid a minimum of $9.25 an hour, with an additional $1.50 per hour in fringe benefits such as health care coverage. Those rates will rise to $10 and $3 by 2010, with automatic increases thereafter.

Proponents of the law argue that any job should pay enough to cover the bare necessities of life, such as rent and food, and that large retailers can afford to pay more. Opponents argue that an artificial floor on the hourly wage will decrease the number of jobs, and that large retailers will choose to locate outside of the city.

Who is right? I think a more interesting question is, how do you determine that $9.25 is the perfect hourly wage to put into the law? Among the supporters of the law, this precise amount must have been the topic of much debate.

There were no doubt the scientists who calculated a rent payment, plus cost of food and other necessities for survival, and divided it by an average number of expected work hours to come up with an acceptable pay rate. Perhaps there were others who argued for just a dollar more, pointing to the huge revenues of these retailers as evidence of their ability to pay it. No doubt comparisons were made to jobs where the rates are heavily influenced by unions.

It is undeniably difficult to decide the correct hourly rate to set into law. But I would like to help, in case this debate comes up again in another city.

I have come up with a diabolical plan to enforce the perfect hourly rate, almost down to the penny. It is an approach that the supporters of the big-box wage law would like, because it is designed to extract the most possible jobs at the highest possible hourly rate from those huge, wealthy corporations.

It works like this. I start by pretending to be hands-off, letting the retailer come into town, and letting them set their own hourly rate for all of the jobs they would like to offer. I make sure that they tell the potential employees in the area how much they intend to pay, and that they are crystal clear about their benefits package, or lack thereof. Luckily, they will do this part of the job for me, since they have big human resource departments who are trained to do this kind of thing.

Now, I have them right where I want them. If they have set the compensation way too low, nobody will even apply for the jobs, and they will be FORCED to raise their hourly rates. Of course, they are cunning, and would never make this mistake.

They might set the hourly rate ALMOST high enough but not quite, such that they get enough people hired to do the work. But my plan works here, too - if the hourly rate is just not quite high enough, these new employees will stop showing up to work, or they’ll quit, or they might even go get a job at a higher paying big-box competitor down the street. This causes turnover, which is expensive, and we all know that the big-box retailers hate spending money, so they will be forced to raise their hourly rate in this case as well.

Now, this is where my plan shines. Once they have raised their hourly rate high enough to not only attract employees but keep them coming day after day to do their jobs, I have succeeded in milking these greedy corporations for as much as I possibly can. The highest number of people will be employed, and they will each be getting paid the perfect hourly rate, with the perfect amount of benefits required to keep them each coming in to work at the big-box retailer every day.

With this plan, I’ve got the retailers locked in forever - if they ever slip up and lower their pay too much, the employees will stop showing up to work, or start working somewhere else, forcing the big-box to up the pay rate again.

And the best part? My plan can’t be overturned for being unconstitutional.

Proposal for Local Commit in Subversion

I propose a “local commit”, such as “svn commit --local”. Performing a local commit would save the current modified version in the local .svn directory, along with a local revision number, without needing to connect to the repository. I could do this as many times as I like, and each local commit would exist as another copy (or just a diff) with another local revision number (such as r541.0, r541.1, r541.2, etc.). Each local commit would also have a comment, just like a real commit.

I would be able to use Subversion to compare these local versions, so “svn diff” would show me the difference between the latest locally committed version and my working copy. I could specify a specific local revision to “svn diff” and get the differences with that revision, all the way back to the head revision that I started with when I disconnected in the first place. For example, “svn diff myfile.c -r541.23” would compare my working copy of myfile.c with the 24th locally committed version of myfile.c. Existing commands would still work normally, for example “svn diff myfile.c -r541” would compare myfile.c with version 541, the latest version I got from the actual repository. However, “svn diff myfile.c” could automatically compare with my latest locally committed revision, or alternatively, a “--local” flag could be added in cases like this to indicate when we want to pay attention to local commits.

Even if I’m not disconnected from the repository, these local commits allow me to work incrementally and implement huge, earth-shattering changes, one small step at a time, without bothering anybody else who is using the main repository until I’m done. While working on these large scale changes, I still get the advantage of versioning in small increments by doing local commits.

Once I am ready to commit these local changes to the repository, I would do a “svn commit” like normal. Subversion would automatically do a sequence of commits, starting from my first local revision and continuing through each locally committed revision up to and including my local working copy, applying the commit comments that I had supplied with each local commit. My local incremental changes and comments would then be committed to the main repository and be given real revision numbers by the repository - the local “541.23” version numbers would go away, but the sequence of revisions would still be preserved. After the commit to the repository, the versions that had been locally committed now appear as individual revisions in the central repository, as if I had been committing directly to the repository all along. All of these commits would be atomic. Multiple local commits on multiple files could be committed in one atomic operation.
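To make the proposed replay a little more concrete, here is a toy model in Python. Nothing in it is real Subversion code: ToyWorkingCopy, LocalCommit, commit_local and commit are hypothetical names invented for this sketch, and the model ignores conflicts, multiple files and the final commit of the working copy itself. It only shows local revision numbers like r541.0 being handed out offline and then replaced by real, consecutive revision numbers.

```python
from dataclasses import dataclass, field


@dataclass
class LocalCommit:
    local_rev: str   # e.g. "r541.0" (hypothetical numbering from the proposal)
    message: str     # the comment supplied with the local commit
    snapshot: str    # file contents at the time of the local commit


@dataclass
class ToyWorkingCopy:
    base_rev: int                                  # last revision fetched from the repository
    local_commits: list = field(default_factory=list)

    def commit_local(self, snapshot, message):
        """Record a snapshot without contacting the repository."""
        local_rev = f"r{self.base_rev}.{len(self.local_commits)}"
        self.local_commits.append(LocalCommit(local_rev, message, snapshot))
        return local_rev

    def commit(self, repo_head):
        """Replay local commits, in order, as consecutive repository revisions."""
        published = []
        for offset, lc in enumerate(self.local_commits, start=1):
            published.append((repo_head + offset, lc.message))
        if published:
            self.base_rev = published[-1][0]
        self.local_commits = []   # the local numbers go away after the real commit
        return published


wc = ToyWorkingCopy(base_rev=541)
first = wc.commit_local("int x;", "declare x")
second = wc.commit_local("int x = 1;", "initialise x")
published = wc.commit(repo_head=541)
print(first, second, published)
```

In this toy run the two local commits are numbered r541.0 and r541.1, and the replay turns them into revisions 542 and 543 with their original comments attached.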

I have only been heavily using Subversion for a short time, but I believe adding “local commit” would fit very nicely into the way Subversion currently works. Whether it would be technically challenging to add it, I don’t know, but that is less important than the way the idea of local commits would interact with today’s Subversion. I think local commits could be added in a way that would not change the way Subversion is used at all, and it may even be possible to add it in a way that works well with existing tools that interact with Subversion.

With “local commit” support, Subversion would go a long way towards solving the types of problems that projects like Bazaar-NG are trying to solve. There are a handful of “distributed” version control solutions in the works, including Bazaar-NG, that are trying to create solutions where there is no central repository. Allowing full-featured disconnected work is a major advantage of such systems. However, the advantage of having a central repository can’t be ignored - it defines a central place for all revisions to be integrated. With distributed systems, it is easy to imagine small clusters of developers in sync with each other, but wildly out of sync from cluster to cluster. Subversion plus local commit allows full incremental commits while disconnected, and loses nothing when the actual commit to the central repository is done.

So… will it happen? I suppose it’s my itch, so in the tradition of open source, I should scratch it. But if I’m lucky, some hotshot with lots of Subversion development experience will have the same itch after reading this and implement it in 10 minutes… if you do, let me know :)

Using Subversion Disconnected

When choosing version control, I look for reliability (do things right or don’t do them at all), workflow integration (full IDE integration so I don’t need to think about version control as a separate problem when working), and the ability to work disconnected from the server.

Subversion has been reliable for me so far, and part of the reason why I chose it over some other interesting possibilities (like bzr) is that it is being used by very many other people on large scale projects much larger than my own projects. If problems arise, they will get fixed. I have had no problems with it so far that didn’t stem from not reading the documentation, so Subversion gets high scores for reliability, as expected.

IDE integration has been great, since I’ve been using Eclipse and the Subclipse plugin works nicely. I have only seen one major shortcoming in Subclipse so far - it won’t let you assign Eclipse shortcut keys to Subclipse functions. This is a peculiar omission, and I wouldn’t be surprised if it is possible but I just haven’t figured out how to do it yet. Otherwise, the integration with Eclipse is fantastic. Synchronizing with the repository is actually fun.

Working disconnected with Subversion is much more functional than I would have thought at first. With SourceSafe and Perforce, I am used to doing a “Checkout” or an “Open for Edit” on files before editing them. In fact, with SourceSafe and Perforce, if you start editing a file without first asking the server if you can edit, you enter into an undefined bizarro world where nothing will work as intended. When discussing Subversion with a friend who uses only Subversion, I inquired about the ability to have a local repository so that I could check out files when not connected to the Internet. He didn’t understand the question at all - with Subversion, you can just go ahead and start editing, and the changes (including conflicting changes) are worked out later when you are reconnected.

If you get on an airplane with an up to date working copy of a repository, you can modify any file in that repository without connecting to the server, and when you reconnect, things will work as expected. Again, for Subversion users, this is obvious. For SourceSafe and Perforce users like me, this is a miracle.

Subversion also keeps a copy of the latest version from the repository, so while you are disconnected, you can still do a diff against the latest repository version. This is fantastic, as it is the most common thing I do with files under version control - oh no, I broke everything. How? Show me the version that worked, and what I changed. “svn diff” can do this without connecting to the repository, and thus can be done while disconnected.

Similarly, Subversion can tell you all of the files that have been modified locally, again without hitting the actual repository. It just checks every local file against its local head revision copy and reports back a list of the ones that have changed.
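The idea of comparing each working file against a locally stored pristine copy can be illustrated with a toy sketch in Python. This is not how Subversion lays out or reads its metadata: the two directories, the modified_files function and the use of SHA-1 digests are all assumptions made for the illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def _digest(path):
    """Hash a file's bytes so two files can be compared cheaply."""
    return hashlib.sha1(path.read_bytes()).hexdigest()


def modified_files(working_dir, pristine_dir):
    """Return names of files whose contents differ from the pristine copy."""
    changed = []
    for pristine in sorted(Path(pristine_dir).iterdir()):
        working = Path(working_dir) / pristine.name
        if not working.exists():
            continue  # a real client would report this separately as missing
        if _digest(working) != _digest(pristine):
            changed.append(pristine.name)
    return changed


# Build a tiny fake working copy and pristine store, then compare them offline.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "work"
    pristine = Path(tmp) / "pristine"
    work.mkdir()
    pristine.mkdir()
    (pristine / "a.c").write_text("int a;\n")
    (pristine / "b.c").write_text("int b;\n")
    (work / "a.c").write_text("int a;\n")       # untouched
    (work / "b.c").write_text("int b = 2;\n")   # edited locally
    changed = modified_files(work, pristine)

print(changed)
```

No network access is involved at any point, which is the property that makes this kind of status check usable while disconnected.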

To work this magic, Subversion needs to keep a bunch of metadata on the local machine. This is a bit strange at first, coming from SourceSafe and Perforce. With those tools, there is little if any local metadata, so when you are disconnected from the server, the local client has no idea what is under version control and what is not.

With Subversion, detailed information is kept locally, telling Subversion what files are under version control, when the latest version was retrieved from the repository, a copy of that latest version - basically everything needed to browse and examine the status of the entire repository, without connecting to the server. Working disconnected is easy - the local client has all the information it needs about the structure of the repository, so it can tell you what you’ve done since you were last connected.

You can’t commit changes to the repository while disconnected (“commit” in Subversion is like submitting a changelist in Perforce or doing a Check In in SourceSafe). However, I think with a little bit of work, it would be possible to do this. First, you need to understand why I think this would be a convenient thing to do.

Version control is used to synchronize files among many different users. At the same time, it is also used to track small changes from version to version, so that when something goes wrong, you can look back in the history and narrow down the change that caused the problem, or look in the commit comments to find out why a change was made in the first place. Viewing the historical changes to a file can be life-saving on a multi-year project, and let’s face it, every 2-week project has secret ambitions to be a 10 year project.

If commits are made very frequently, then viewing differences from one revision to the next is useful. It is very easy to see what changed and why from one revision to the next. And, if a new change breaks something, there is not much to lose by reverting to the most recent revision.

On the other hand, if commits are far apart, viewing differences between two versions is not useful - there may be 100 differences, all completely unrelated. It is not obvious which changes can be reverted and which cannot, or which ones might be the cause of a problem. If a local modification is very different from the head revision from the repository, and the local version is completely broken, reverting to the repository means losing large amounts of work, so it just won’t be done.

So, to make revision control useful in this sense, it is important to be able to frequently commit changes. When working disconnected, even if I’m only disconnected for four hours, I may make 20 incremental changes and choose to commit each change once I have tested it.

Even when I am connected, I may want to “commit” small changes so that I can view the differences from one version to the next, but I may not yet be ready to expose those changes to everyone else who uses the repository. I may make a small change in preparation for a larger change, a small change that I am not ready to “commit” to and may choose to revert later, but I want to put a stake in the ground after that small change, before going on to make further changes, so that I can locally revert and compare revisions before committing to the repository.

Read more in “Proposal for Local Commit in Subversion”

Hacking Subversion “entries” file

Using Subversion for the first time brings with it some surprises. Matching-but-not-identical repository URLs can cause hard to understand problems with baffling error messages. And, not understanding the importance of Subversion’s local meta-data can get things off to a bad start. The hacks described below can help you diagnose and fix some of these problems. The discussion below describes these problems on a Windows machine, but the problems are very similar on unix-like systems as well.

I was new to Subversion. I created a repository, then created a “projects” subdirectory in the repository via the TortoiseSVN repo-browser. This was a mistake, but I didn’t know it at the time - it was a mistake because Subversion didn’t get the opportunity to write its hidden meta-data files to my local “projects” subdirectory, but I had no idea. I then added four projects to Subversion one by one. The directory structure on disk matched the structure in Subversion, so I figured everything was ok. However, as far as Subversion was concerned, my local projects were all totally independent. It didn’t know what to do with the parent directory, since that directory had never been added directly, and trying to synchronize all of the projects from the parent directory resulted in a “not versioned” error. I had created a mess.

Subversion keeps meta-data on the local file system in a hidden directory called “.svn”. This meta-data tells Subversion whether or not the files in that directory really are under control of Subversion. If the meta-data doesn’t contain information about the local files, Subversion won’t let you update or commit those files, even if an exact match already exists in the repository. Since I had created my “projects” directory in the repository, and not “added” the directory from my local filesystem, the local meta-data for the “projects” directory was not there.

I attempted to add the “projects” parent directory non-recursively to Subversion, thinking that would solve the problem. However, doing an update from the parent directory now resulted in a “file already exists” error for every project subdirectory, even though those projects were versioned in Subversion and matched the Subversion repository.

Some quick searching on this problem turned up some suggested solutions that were usually a combination of stating the obvious plus an IT-related insult, like “Just delete everything and update from Subversion. If it doesn’t work, you do have backups, don’t you????” However, when an entire subdirectory and everything below it causes the “file already exists” error, it’s not so easy to just blow everything away. There are likely hundreds of files in those subdirectories that aren’t in the repository, such as compiler output (that may have taken a long time to generate), notes that are in text files that I don’t want to delete yet, and perhaps graphics or source code that is in a half-baked state that should probably be cleaned out but that I don’t feel comfortable cleaning out just yet. And if that subdirectory contains a few gigabytes of output data, backing it up and restoring it isn’t fun when I’m just trying to get Subversion to work.

Luckily, there is a way to make the project parent directory know about the subprojects. Note that this is a hack, and I don’t recommend it, though it worked like a charm. I was unable to find a useful “correct” way to fix this problem from the svn command line, so we have to manually make things right.

If we want to tell a parent directory about sub-projects that it should know about, but doesn’t due to my incompetence in setting up the projects originally and Subversion’s reluctance to fix my mistakes for me, we do a simple text-editing hack. The hack involves editing a file in the parent directory’s “.svn” subdirectory. The file is called “entries” and it tells Subversion what it knows about that directory.

If you happen to have a local project checked out from Subversion, go look for the hidden “.svn” directory. You should find the “entries” file there. Look inside it. It is an easy to read XML file and it can be edited with any text editor (on Windows, the command-line “edit” works great and handles unix end-of-lines nicely). Saving the file with “Windows” style end-of-lines didn’t seem to cause any trouble, so don’t worry about that.

In the “entries” file, there are “entry” tags for each file and subdirectory that is under version control. Many of those entry tags refer to the individual files in the directory. We can’t easily recreate those entries, because they include a lot of information about the file, like dates, checksum, and more. Don’t touch those entries. For individual files that give you a “file already exists” error, the best approach really is to delete the local file and then update the local file from the repository, assuming again that the file in the repository is up to date in the first place. The “entry” will then be created for the file, including the proper dates and checksum.

For directories, we’re in luck. The entry for a directory looks like this:

<entry name="directoryname" kind="dir" />

So, my situation was this. I had a project directory called “projects”, and four subprojects called “potatogun”, “plasticwrap”, “sinker” and “oatmeal”. My Subversion repository had the same structure - all four projects were in a “projects” folder in the repository. The local Subversion “entries” file for the “projects” directory just didn’t know it.

By editing the “entries” file in the “projects/.svn” directory and adding four “entry” lines exactly like the one above, but with “potatogun”, “plasticwrap”, “sinker” and “oatmeal” in place of “directoryname”, I magically had everything back to normal. I could update the subprojects with no problem. All was well.
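If there are more than a handful of subprojects, the lines can be generated rather than typed. The following Python sketch only prints the text to paste into the “entries” file; it does not touch any file. The function name dir_entry_line is made up for this example, and the output format is simply the directory entry line shown earlier in this post.

```python
from xml.sax.saxutils import quoteattr


def dir_entry_line(name):
    """Build one directory entry line in the format shown in the post."""
    # quoteattr wraps the name in quotes and escapes characters unsafe in XML.
    return f'<entry name={quoteattr(name)} kind="dir" />'


subprojects = ["potatogun", "plasticwrap", "sinker", "oatmeal"]
lines = [dir_entry_line(name) for name in subprojects]
print("\n".join(lines))
```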

Except that I left out one major detail. In two of the subprojects, I somehow told Subversion that my repository was called “file:///C:/svn”. In the other two projects, I had called the repository “file:///c:/svn” - note the lower-case “c” in the second version. Of course, on a Windows machine, these two are the same thing. As far as Subversion is concerned, however, these two things are different.

Subversion this time tells us that “svn: ‘file:///c:/svn/projects/potatogun’ is not the same repository as ‘file:///C:/svn’” - again, note the case difference in the drive letter.

So the parent project thinks the repository is located at C:/svn while the subproject thinks the repository is located at c:/svn. Subversion believes these two are different, and I have no problem with that. I just need a way to fix it. There are some explanations out there in the Googleverse about why the problem happens, but no good solutions.

The answer again lies in the “entries” file. We know perfectly well that our projects are all coming from the same repository. Subversion does a direct case-sensitive string compare when determining whether two repositories are the same, so two URLs that mean the same thing will generate the “not the same repository” error if the URLs are entered in slightly different ways. There are many examples of this, not just case-sensitive problems. The “entries” file stores the URL of the repository that was specified when the project was first checked out of Subversion, and all we have to do is edit all of our “entries” files and make them all match.
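A short Python sketch shows why a plain string compare trips over the drive letter, and what “making them match” means in practice. The helper normalize_drive_letter is a hypothetical name, and it only handles the file:/// drive-letter case from this post, not the other kinds of equivalent-but-different URLs.

```python
import re


def normalize_drive_letter(url):
    """Upper-case the drive letter of a file:///x:/ URL; leave other URLs alone."""
    return re.sub(
        r"^(file:///)([a-zA-Z])(:)",
        lambda m: m.group(1) + m.group(2).upper() + m.group(3),
        url,
    )


parent = "file:///C:/svn"
child = "file:///c:/svn"

same_as_strings = parent == child
same_after_fix = normalize_drive_letter(parent) == normalize_drive_letter(child)
print(same_as_strings, same_after_fix)  # False True
```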

The first “entry” tag in the entries file contains two items that need to be changed. One is the “url=” line, and the other is the “repos=” line. The “url=” line contains the URL that points to the Subversion subfolder that contains the file. The “repos=” line points to the parent repository.

So, in my example, in my “projects/.svn/entries” file, the lines look like this:

url="file:///C:/svn/projects"

repos="file:///C:/svn"

The “projects/potatogun/.svn/entries” file has lines that look like this:

url="file:///c:/svn/projects/potatogun"

repos="file:///c:/svn"

Again, note the lower-case drive letter in the potatogun entries. Subversion thinks these two projects came from two different repositories, so we need to make them match. The convention is to use upper-case drive letters. You can choose to use lower-case if you like, Windows understands both, but if you ask Windows what the drive letter is, it will give you an upper-case response.

We now edit “projects/potatogun/.svn/entries” with a text editor so that the lines look like this:

url="file:///C:/svn/projects/potatogun"

repos="file:///C:/svn"

It is important to change BOTH the “url” and “repos” entries so that the repository paths match. If there is an inconsistency between those two lines, Subversion will think it didn’t read the “entries” file correctly and it will get really confused.

We also have to do this for any subdirectories in project potatogun, and for any other subprojects that might have mismatched repository URLs. If you have the means to do a case-sensitive, recursive search-and-replace on all “entries” files, do that. Of course, if the source repositories REALLY WERE different in the first place, this hack isn’t going to help, but if you’re still reading at this point, you know that already.
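For anyone without a handy recursive search-and-replace tool, here is a minimal Python sketch of one. The function name fix_entries and the “.bak” backup convention are my own choices for this example, not anything Subversion provides. It assumes the old-style text “entries” files described in this post, and if a file is marked read-only the write will fail until the permission is changed. Try it on a copy of the working copy first.

```python
import os
import tempfile


def fix_entries(root, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix in every .svn/entries file under root.

    The replacement is case-sensitive. Returns the list of files changed.
    """
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) != ".svn" or "entries" not in filenames:
            continue
        path = os.path.join(dirpath, "entries")
        # newline="" keeps whatever end-of-line style the file already uses.
        with open(path, "r", encoding="utf-8", newline="") as f:
            text = f.read()
        fixed = text.replace(old_prefix, new_prefix)
        if fixed != text:
            with open(path + ".bak", "w", encoding="utf-8", newline="") as f:
                f.write(text)    # keep the original next to the edited file
            with open(path, "w", encoding="utf-8", newline="") as f:
                f.write(fixed)
            changed.append(path)
    return changed


# Demonstration on a throwaway directory tree, not a real working copy.
with tempfile.TemporaryDirectory() as root:
    meta = os.path.join(root, "projects", "potatogun", ".svn")
    os.makedirs(meta)
    entries = os.path.join(meta, "entries")
    with open(entries, "w", encoding="utf-8") as f:
        f.write('url="file:///c:/svn/projects/potatogun"\nrepos="file:///c:/svn"\n')
    changed = fix_entries(root, "file:///c:/svn", "file:///C:/svn")
    with open(entries, encoding="utf-8") as f:
        result = f.read()

print(result)
```

Because both the “url” and “repos” values start with the same repository prefix, a single replacement of that prefix updates both lines together.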

Now we can go to the “projects” directory and type “svn up” - instead of getting the “not the same repository” error, we get a nice “At revision 492.” message like we should.

Now that my projects are all nicely set up, I know enough about Subversion to do it right the first time. But for Subversion newbies, equivalent-but-not-equal repository URLs can be very frustrating. It’s also surprising when Subversion can’t figure out that the local file structure exactly matches the repository. Then, when you try to let Subversion know that the local files match the repository, the “file already exists” errors are even more frustrating. But with a little hacking of the “entries” file, we can fix our mistakes and get back to work.

Kevin Smith, John Lydon and Inverse-Schadenfreude

I’m now in my mid-30s, so this is the perfect time in my life to start noticing successful, famous people in their mid-30s and wonder why I’m not as successful and famous as they are. This is an unhealthy habit that I seem to have developed in the last year or so as I entered the mid-portion of my life, but I’m sure it will pass, once I become successful and famous myself.

My most recent bout of inverse-schadenfreude happened this weekend. There was an article in the Chicago Tribune about Clerks II, and the article contained a nice little interview with Kevin Smith, the man who created the classic Clerks. The main character in the sequel is deciding whether to keep his low-paying job as a clerk and stay true to his friends, or go live with his rich uncle. Clerks II is clearly about Smith’s recent past and future in moviemaking, his own sort of mid-life crisis, and the difficulty of leaving the “View Askewniverse” behind, versus the opportunity to “grow” or move on to other movie styles, characters, and stories.

The interview nicely captures Smith’s non-desire to grow and move on. Thankfully, the original Clerks crew is back to do Clerks II, and they all know the responsibility they bear, so it is unlikely that Clerks II will be in the same category of sequel-disaster as Caddyshack II or Meatballs II.

Now, back to the inverse-schadenfreude. Regular schadenfreude is taking pleasure in someone else’s misery. My inverse-schadenfreude results in misery for me when I learn of other people’s success. Like I said, not healthy. But if you’re my age, and if you have ever listened to music, tell me you don’t feel it after reading this…

During the interview, Smith mentions that he recently held a fundraiser at his house for his daughter’s school. That’s nice, I would like to be able to raise some funds for my daughter’s school too. No problem so far.

Some famous people attended. Eddie Izzard, I don’t care, I barely know who he is. Kathy Bates, interesting, don’t care. Eva Longoria, total babe, I’m sure a very nice lady, don’t care. John Lydon, -

Wait. John Lydon? No, it can’t be! Yes, it is. John Lydon, a.k.a. Johnny Rotten of the Sex Pistols, and if you didn’t need me to explain who he was, you may be starting to feel some inverse-schadenfreude here.

So yeah, Kevin Smith is 35. He’s made a bunch of movies, he’s famous, and whatever career high points and low points he may experience in the future, I’m sure he’ll continue to create great things.

But to have John Lydon over to do some readings to benefit his daughter’s school? And to have him insult the politics of everyone in attendance? And to then ask Kevin afterwards if it was OK, because he didn’t really mean to make anyone mad??? That is just way too cool, and I know I’ll never experience this by the time I’m 35 because… well because I’m already 36.

So, true to the nature of my recent habit of being miserable when learning about the success of others, I mentioned this situation to my wife.

Me: “Kevin Smith, you know, the guy who made Clerks?”

Wife: “Sure” (she wasn’t just being polite, either, she rattled off Mallrats and Chasing Amy to make it clear that she actually knew who I was talking about).

Me: “He’s 35.”

Wife: “Oh, wow” (she knew that some complaint about his success was coming next)

Me: “He recently had a fundraiser at his house, for his daughter’s school, with some famous people.”

Wife: “Uh huh…”

Me: “And John Lydon was there! You know, Johnny Rotten, Sex Pistols, Public Image Limited! If I had him over to my house I’d be a hero! I mean, everyone I know would think I was the coolest person on the planet! It would be the biggest thing ever, people would be talking about it for the rest of my life! And Kevin Smith just has him over for a fundraiser!” (she probably didn’t need the detailed explanation of who John Lydon is either, but I had to make it clear how big a deal this was).

Wife: “Wait, you say it was a fundraiser?”

Me: “Yeah.”

Wife: “So…”

Me: “So what…?”

Wife: “So, really, John Lydon actually PAID to go to Kevin Smith’s house.”

Me: “AAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHH!!!!!”

svn: Out of date in transaction

On one project I’m using the Eclipse IDE with the Subclipse plugin to enable Subversion directly in the IDE. This setup is new to me within the last year, and I’m still learning about it.

I like the fact that my files get a little gold medal icon on them in the Eclipse Navigator when they are in source control and match the repository version - at a glance, I know what files have been modified and what changes need to be committed. For a nicely organized project, where all files in a subdirectory are either in version control or have been ignored via svn:ignore, the folder itself gets a little gold medal when everything in that folder is committed and matches the repository. Thus, the status of an entire project can be seen by glancing at its root project folder.

If a file has been modified locally, it gets a little brown asterisk. It is nowhere near as pleasing as a gold medal, and it encourages me to commit changes when I’m ready to do so. Today, however, some subfolders in my project still had the little brown asterisk even though everything in that folder was in sync with the repository.

So, I right-clicked on the folder itself and selected Team->Commit…

This action is a bit strange to me, because I’m not completely sure what it means to commit a directory. With Subversion, I believe it means that I am committing any information relative to that directory, such as what files in that directory should be ignored. In Subclipse, the commit operation simply shows the path to the folder with “Property Status” listed as “modified”.

The commit causes an error. In the Subversion dialog box, it looks particularly nasty, something like this:

org.tigris.subversion.javahl.ClientException: Transaction is out of date

svn: Commit failed (details follow):

svn: Out of date: '/path/to/folder' in transaction 'xxx-x'

where xxx-x is the transaction number.

I happened to run across this discussion of the problem that contained these useful tidbits of information:

a) “The directory is not at the latest revision, hence you cannot commit changes to it. Simply ‘svn update’.”

b) “the error message is not as helpful as it might be. Once you know what it means then of course it is very obvious, but the first time evidently it is not.”

The solution to this error message is to simply right-click on the folder, choose “Team->Update”, then right-click again and do “Team->Commit” and it will then work.
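Outside the IDE, the same two steps can be scripted. The sketch below drives the standard “svn update” and “svn commit -m” commands from Python. The function names are hypothetical, and it assumes the svn command-line client is installed and on the PATH. Building the command lists is kept separate from running them so the sequence can be inspected first.

```python
import subprocess


def update_then_commit_commands(folder, message):
    """Return the two svn commands, in the order they need to run."""
    return [
        ["svn", "update", folder],                 # bring the folder up to date first
        ["svn", "commit", folder, "-m", message],  # then the commit can succeed
    ]


def update_then_commit(folder, message):
    """Run update, then commit; stop with an error if either command fails."""
    for cmd in update_then_commit_commands(folder, message):
        subprocess.run(cmd, check=True)


commands = update_then_commit_commands("path/to/folder", "Update folder properties")
print(commands)
```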

Python Unicode - Fixing UTF-8 encoded as Latin-1 / ISO-8859-1

This is a follow-up to the previous Python Mystery of the Day regarding a “TypeError: decoding unicode is not supported” exception.

Ok, so what if you receive a unicode string that has clearly been converted to unicode using the wrong encoding? In my case, the string was originally encoded as UTF-8. The source did not specify an encoding, so the encoding applied was ISO-8859-1, also known as Latin-1.

By the way, there are some really great Python unicode tutorials. As of yesterday, I could barely tell you the difference between Latin-1 and ISO-8859-1. Today, I can tell you that they are the same thing. Thanks, tutorials! Seriously, the tutorial linked at the beginning of this paragraph is awesome if you’re just starting to bash your head against unicode in Python.

With that, here is a way to fix a string that was encoded as “Latin-1” (that is, ISO-8859-1), when it really should have been encoded as “UTF-8”.

We’ll use the example string “Nuñoz”. If the original is encoded as UTF-8, it will be represented with these bytes (written here as a python literal):

'\x4E\x75\xC3\xB1\x6F\x7A'

All of the bytes that are less than 128 are regular ASCII characters, as ASCII is a subset of UTF-8. The 0xC3 and 0xB1 bytes together represent the accented “n” in UTF-8.
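To double-check that byte sequence, here is a minimal sketch. It is written in Python 3 syntax, which is an assumption on my part (the sessions in this post are Python 2), and the variable names are just for illustration.

```python
# Python 3 sketch: list the UTF-8 bytes of the example string.
text = "Nu\u00f1oz"  # "Nuñoz", with the accented n written as an escape

utf8_bytes = text.encode("utf-8")

# Each ASCII character becomes one byte; the accented n becomes two bytes.
hex_bytes = [format(b, "02X") for b in utf8_bytes]
print(hex_bytes)  # ['4E', '75', 'C3', 'B1', '6F', '7A']
```

Five characters go in and six bytes come out, because only the accented character needs more than one byte.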

Let’s see how this works in Python:

>>> rawstring = '\x4E\x75\xC3\xB1\x6F\x7A'
>>> rawstring
'Nu\xc3\xb1oz'
>>> print rawstring
Nu??oz

Note that the rawstring contains a string of bytes, but when we try to print it, the terminal window will attempt to display the non-ASCII bytes. How they are displayed depends on the computer. On my Windows machine, it displays as garbled mousetext.

Now, if our software knew the proper encoding, it could do the right thing and convert this raw byte string into Python’s internal Unicode representation like this:

>>> utf8string = unicode(rawstring, 'utf-8')
>>> print utf8string
Nuñoz
>>> utf8string
u'Nu\xf1oz'

Now we see that when we print the string, it shows the proper accented “n” character. When we look at the internal representation of utf8string, we see that the non-ASCII character has been converted to 0xF1. The two-byte UTF-8 sequence is represented by 0x00F1 in Python’s internal unicode format (which is usually a form of UTF-16. Theoretically we’re not supposed to worry about Python’s internal format too much, but it becomes important when encodings start going haywire).

Now, what happens if our software is wrong, and it thinks our rawstring was originally encoded using ISO-8859-1? Let’s see:

>>> rawstring
'Nu\xc3\xb1oz'
>>> iso8859string = unicode(rawstring, 'iso-8859-1')
>>> iso8859string
u'Nu\xc3\xb1oz'
>>> print iso8859string
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\encodings\cp437.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xc3' in position 2: character maps to <undefined>

Oh no! The iso-8859-1 encoding took the UTF-8 byte string exactly as-is and stored the bytes in Python’s internal format without changing anything. This is what we told it to do - Python’s internal UTF-16 contains the entire iso-8859-1 character set as-is, just zero-extended to 16 bits. But this is a bad situation, because 0xC3 and 0xB1 are now two characters that can’t be printed to my terminal, when they originally represented a single character.
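The byte-for-byte behavior can be checked directly. This is a small sketch in Python 3 syntax (an assumption, since the sessions above are Python 2), where the equivalent of unicode(rawstring, 'iso-8859-1') is rawstring.decode('iso-8859-1'). The variable names are made up for the example.

```python
# Python 3 sketch: ISO-8859-1 maps every byte to the code point with the
# same numeric value, so nothing is merged or changed during decoding.
raw = b"Nu\xc3\xb1oz"

wrong = raw.decode("iso-8859-1")  # one character per byte
right = raw.decode("utf-8")       # C3 B1 collapses into one character

# Compare the code points of the wrongly decoded text with the raw bytes.
same_values = [ord(ch) for ch in wrong] == list(raw)

print(len(raw), len(wrong), len(right))  # 6 6 5
print(same_values)                       # True
```

The wrong decoding keeps six characters where the correct one has five, which is exactly the symptom described above.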

This gives us a hint that we may be able to detect strings that were improperly encoded with iso-8859-1 when they should have been encoded as UTF-8. This confusion is very common, as both of these encodings are used frequently on websites, and the source will quite often not tell you which encoding is used. If the string is examined byte-by-byte and some two- or three-byte sequences exist with each byte 128 or greater, we can check to see if those sequences are valid UTF-8 sequences. If so, there’s a good chance that the string was mis-encoded. I haven’t tried this yet, so we’ll defer this possibility until a future posting.
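For what it’s worth, here is a rough, untested-in-the-wild sketch of that detection idea, in Python 3 syntax (an assumption, since this post uses Python 2). Instead of walking the bytes by hand, it lets the UTF-8 codec do the validation. The function name looks_like_misdecoded_utf8 is hypothetical, and this is only a heuristic, not a guaranteed detector.

```python
def looks_like_misdecoded_utf8(text):
    """Guess whether `text` is UTF-8 data that was decoded as ISO-8859-1.

    Heuristic sketch only: it can give false positives for genuine
    Latin-1 text that happens to form valid UTF-8 sequences.
    """
    try:
        # Undo the suspected ISO-8859-1 decoding to get the raw bytes back.
        raw = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Characters above U+00FF cannot have come from ISO-8859-1.
        return False

    if all(b < 0x80 for b in raw):
        # Pure ASCII reads the same either way, so there is nothing to fix.
        return False

    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # The high bytes are not valid UTF-8, so this is probably real Latin-1.
        return False
    return True


print(looks_like_misdecoded_utf8("Nu\u00c3\u00b1oz"))  # True
print(looks_like_misdecoded_utf8("Nu\u00f1oz"))        # False
```

The second call returns False because a lone 0xF1 byte followed by an ASCII letter is not a valid UTF-8 sequence.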

The easiest situation is when you know for sure that the original was improperly encoded as iso-8859-1 and it should have been encoded as utf-8. You can convert it back to a “raw” string, then re-convert it to unicode using the proper utf-8 encoding, like this:

>>> iso8859string
u'Nu\xc3\xb1oz'
>>> rawfromiso = iso8859string.encode('iso-8859-1')
>>> rawfromiso
'Nu\xc3\xb1oz'
>>> properUTF8string = unicode(rawfromiso, 'utf-8')
>>> properUTF8string
u'Nu\xf1oz'
>>> print properUTF8string
Nuñoz

Starting with the incorrectly-encoded string “iso8859string”, we convert it back to a raw byte string by using ‘encode’, passing in the incorrect encoding that we want to strip. We then take that “rawfromiso” raw byte string and decode it using utf-8. Note that after decoding, the utf-8 two-byte sequence is converted into a proper UTF-16 character.
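The same round trip can be wrapped in a small helper. This is a sketch in Python 3 syntax (an assumption, since the session above is Python 2), and the name fix_latin1_mojibake is made up for this example.

```python
def fix_latin1_mojibake(text):
    """Repair text that was UTF-8 but got decoded as ISO-8859-1.

    Only call this when you are sure about the mix-up. It raises
    UnicodeEncodeError or UnicodeDecodeError if the assumption is wrong.
    """
    # Step 1: strip the wrong encoding to recover the original bytes.
    raw = text.encode("iso-8859-1")
    # Step 2: decode those bytes with the encoding they really use.
    return raw.decode("utf-8")


broken = "Nu\u00c3\u00b1oz"          # what the wrong decoding produced
fixed = fix_latin1_mojibake(broken)
print(ascii(fixed))                  # 'Nu\xf1oz'
```

Letting the exceptions propagate is a deliberate choice in this sketch: a failure means the string was not mis-decoded in the way we assumed, and silently returning garbage would hide that.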

With unicode, the wrong thing seems to happen more often than one would normally expect. This is mostly an issue of getting used to the idea of strings of text no longer being well defined by one simple standard. Once various encodings come into play, it’s no longer just text - it is a string of 8-bit, 16-bit or 32-bit integers, possibly little-endian or big-endian, plus encoding meta-data describing what all of it is supposed to mean. If that encoding meta-data is missing, wrong, or not what is expected, garbled text and exceptions are the result.

Python Mystery of the Day

The following Python exception has been driving me nuts:

TypeError: decoding Unicode is not supported

This happens when trying to do something like this:

encodedstring = unicode(normalstring, 'utf-8')

If “normalstring” is a regular string (type ‘str’), and it happens to contain raw utf-8 data (for example, read from a file that you know was encoded in utf-8), the above will convert the string from utf-8 to the default unicode encoding that your Python interpreter is using. The result will be an object of type ‘unicode’.

However, if “normalstring” is already a unicode object, you will get the not-so-obvious TypeError exception “decoding Unicode is not supported”. It’s not obvious, because if “normalstring” is coming to you from some other library, you might not know whether that string was already encoded as unicode or not.

As an example of how this could happen, let’s say some other library passed off a string to you, and you noticed when printing it that some characters were garbled. Upon further inspection, you realize that the garbled characters represent a single utf-8 encoded character. So, you tell Python to decode the string as utf-8. If the string is already represented as unicode, the above exception fires. This situation can easily happen if whatever processed the string in the first place applied the wrong encoding (for example, if iso-8859-1 was incorrectly applied to a utf-8 stream).

The best thing to do in this case is figure out why the stream was originally unicode-encoded with the wrong encoding. Once that’s fixed, the re-conversion attempt that throws the mysterious exception is no longer needed.
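While tracking down the real cause, a type check can at least keep the exception from firing. Here is a minimal sketch in Python 3 syntax (an assumption, since this post uses Python 2). In Python 3 the byte type is bytes and the text type is str, and calling str(some_text, 'utf-8') on text raises a similar TypeError. The helper name to_text is hypothetical.

```python
def to_text(value, encoding="utf-8"):
    """Return text, decoding only when `value` is still raw bytes."""
    if isinstance(value, bytes):
        # Raw bytes: decoding is both supported and needed.
        return value.decode(encoding)
    # Already text: decoding again would raise TypeError, so pass it through.
    return value


from_file = to_text(b"Nu\xc3\xb1oz")    # bytes in, text out
from_library = to_text("Nu\u00f1oz")    # text in, unchanged text out
print(from_file == from_library)         # True
```

This only avoids the exception. If the library already applied the wrong encoding, the text coming through will still be garbled and needs the real fix described above.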

Through The Looking Glass - The Lucidizer Works

About a month ago, I wrote an application for my Treo called The Lucidizer. Periodically, during daylight hours, The Lucidizer sets off a vibrate alarm and displays a dialog box that says “You Are Awake.”

The point of this is to remind me to check whether or not I am awake. Once this becomes a habit, I am more likely to check whether I’m awake when I am actually dreaming.

A few days ago, the morning of July 1st, it worked. I was dreaming about playing golf. Since I’m not a golfer, this dream wasn’t as enjoyable as you might otherwise think.

It was a classic dream - I was about to tee off, I grabbed my driver, and as I walked toward the tee, I noticed that my driver was cracked in half, totally useless. I knew I had a 3 wood in my bag that I could use instead. I walked back to my bag, looked in it, and it was empty.

So far, the dream had all the characteristics of an “oh no, I’m not wearing pants” kind of dream. But, thanks to The Lucidizer, something else happened. I was, of course, shocked that my golf bag was empty. But then I realized that it is exactly this type of thing that happens all the time in dreams - things change from moment to moment in surprising ways. As I stood there staring at my empty bag, I decided I was probably dreaming.

I still wasn’t quite sure I was dreaming, and I didn’t yet have any control over what was happening. But then I tried something. I decided I shouldn’t do anything too crazy or embarrassing, like summon a flying dragon, or leap into a canyon or anything, just in case I really was at a real golf course. So I chose to run.

It worked. I ran, and suddenly I had full awareness of my entire surroundings. It felt like breaking out of a shell. And, as it turns out, running while dreaming is just as fun as flying or anything else. It felt really, really cool. I can still remember what it felt like, a kind of total weightlessness, no heaviness of a physical entity, yet still able to perform physical feats like running. It was pure brain-stuff - interacting with my brain’s model of the world, the way it understands movement, separate from any sensorial feedback.

At that point in the dream, some other parts of my brain must have started to notice that something strange was going on, and I started to wake up. I tried spinning my arms around in big windmill circles, and that helped a little bit (and was pretty fun to do), but that was it - I woke up.

The next morning, it wasn’t until I was in the shower that I remembered the whole thing, and then I could remember the sensation of running in my dream. It reminded me of the importance of being able to remember dreams - after all, even if I end up summoning a dragon or jumping into a canyon in a dream, I won’t know about it unless I remember it the next day.

It’s been almost a week since this first success, but The Lucidizer is working. I also have some new ideas for the interface - instead of simply saying “you are awake”, it can do a random exercise, such as show a number of objects, blank the screen, then show that SAME number of objects. Or, show a word, blank the screen, then show the same word. In a dream, the second item shown would most likely be different. If I can dream about The Lucidizer going off in my pocket, I’ll be back in control - then maybe I can work on my golf game while I sleep.