Archive for the 'Eclectic Tech' Category

The Future Triumph of Video over IP

This should be categorized as “Stating the Obvious.”

At Costco today, I was struck by two things. The first was how cheap the huge, flat-panel HD TVs have become (a gigantic Mitsubishi LCD was $2000). The second was how horrible the picture quality was.

The incumbent video delivery industries are focused on moving to “HD” standards. In the short term, this means “highly compressed with lots of artifacts” - see, for example, Comcast HD. In the very long term, the compression problems may go away as equipment gets better and as more bandwidth is dedicated to HD services. But “HD” standard screen resolutions are pretty well set in stone and aren’t going to change for a long, long time.

On the other hand, in the very short term, display technology will allow much higher resolution displays at ever decreasing cost. Similarly, video recording and digital editing equipment will allow higher and higher resolutions at consumer-level costs. As Internet bandwidth increases and drops in price, it can easily address the video quality problem on all fronts - less compression needed as a full connection is dedicated to one or two video signals, quicker adoption of new, better compression technology as software upgrades are immediately available and expected, and higher resolutions to match cheap, high-quality displays (by leveraging increased bandwidth and improved compression methods). By comparison, “standard” equipment with set resolutions and set video codecs will age very quickly and not very gracefully.

Those who are used to standards staying the same for decades will stick with “HD” and think it’s great. Meanwhile, a disruption should occur - consumer-grade video quality will outstrip the capabilities of “pro” equipment, and that high quality video will only be viewable on “non-standard” systems - ultra-high resolution monitors connected to the Internet, instead of a standard “high def” TV connected to cable or satellite.

Yes, this means that in the future we’ll have ultra-high-visual-quality videos with bad lighting and horrible content à la YouTube. But it also means that the cable TV and satellite TV services need to move quickly to video over IP to avoid disruption. These days, a data service that is dedicated to providing only video makes about as much sense as running another cable to your house to provide MapQuest service.

I’m stating the obvious - the cable and satellite companies surely see video over IP on the horizon - but hey, stating the obvious is what blogs are for. I’m just doing my part.

Software Schedule Estimation Via Bullet Points

I don’t know the secret to accurate software schedule estimation, but I have a pretty good technique that is simple enough for anybody to do. With this technique, you will be able to estimate your software project schedules with the same accuracy as major software development houses, large corporations, and experienced consulting companies.

The process starts with a list of bullet points. This is similar to a “requirements list” but less formal: you can do it as bullets in Word or even as individual lines of text in a simple text editor. Each bullet point should be 75 characters or less (a few characters will be added later, as you will see).

No matter where you are in the project (early conceptual stages, initial planning stages, or even if development is fully ramped up and in progress) sit down with a blank list and start making single line entries, 75 characters or less, that describe the things that need to happen for the project to be completed.

Don’t limit it to product features; include mundane things like “buy new computer to replace the one that exploded” and “hire new developers and get cubicles allocated”. As you can see, some of the items will be very specific (“buy new computer”) and some will be very broad (“hire new developers”). Try to keep each line focused on one thing - split “hire new developers and get cubicles allocated” into two items, “hire new developers” and “get cubicles allocated”.

After a while you’ll run out of things to put on the bullet list, or the list will get so long that you’ll begin to shake and cry uncontrollably. When either of these things happens, it’s time to stop.

Now, count the number of bullets. This is the number of days you’ll need until the next phase. When translating this to calendar days, make sure you leave out weekends. You will need the weekends anyway, but don’t include them in this number.

Let’s say you had 30 bullet points. This translates to exactly six weeks. For the next six weeks, work on the items on the bullet point list. When you complete one, put a “+” to the left of it. If you decide not to do one, put a “-” next to it. For all of the other bullet points, leave them as they are - those are the ones that need to be worked on next. Note that even though we allocated one day per bullet point, this does not mean that each bullet point can or will be done in one day. The actual amount of time it takes to complete each bullet point depends entirely on the resources you have at your disposal. Do the best you can. Some bullet points won’t get done. Don’t worry, we’ve got that covered, as we’ll see in a bit.

Often, you will come across a bullet point that is really more than one task. Let’s say you had a bullet point that originally said “Implement user interface”. When you come upon that one, you won’t be able to just do it as-is. You’ll want to expand it out into some more bullet points. For example, “Implement user interface” might expand out to 50 more bullet points, itemizing the widgets that are needed, the various screens and menus that need to be designed, the visual styles that need to be determined, etc.

When you reach the end of the first phase, you will still have your bullet point list, but it will have been transformed. You need to clean it up at the end of each phase. Go through the list and remove all of the lines with “+” or “-” at the beginning of the line. Those have been addressed and you don’t need to ever think about them again until testing time. There will be quite a few bullet points from the start of Phase 1 that have not yet been addressed or thought about at all. That is ok, leave those on the list.

After you have cleaned out all of the addressed items, scan through the list including the newly expanded bullet points. No doubt some things will occur to you that weren’t on the first bullet point list, and are still not there. Add those to the list. Also, you may see some bullet points that can now be expanded out to more detail. Go ahead and expand those out now. As with Phase 1, keep doing this until you run out of ideas to add to the list, or until you begin to shake and cry uncontrollably.

Now it is time to start Phase 2. Count the number of bullet points that are on your cleaned up list and allocate one day per bullet point again. Let’s say you have 120 items on your list now. That is 24 weeks. Start into the new bullet point list exactly as you did with Phase 1. When an item is completed, mark it with a “+”. If you decide an item is not needed or will not be done, mark it with a “-”. If you find a bullet point that needs added detail, expand it into a new list of more detailed bullet points.
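For anyone who keeps the list in a plain text file, the counting and the cleanup are easy to script. Here is a minimal sketch, assuming one bullet per line, with finished items starting with “+” and dropped items starting with “-”. The function names and the file name bullets.txt are made up for illustration; the arithmetic is the same one-day-per-bullet, five-days-per-week rule used for the 30 and 120 bullet examples.

```shell
#!/bin/sh
# Bullet-point scheduling helpers (illustrative names, not a real tool).

open_bullets() {
    # $1: path to the bullet list.
    # Counts non-blank lines that do not start with "+" (done) or "-" (dropped).
    awk 'NF && !/^[+-]/ { n++ } END { print n + 0 }' "$1"
}

phase_weeks() {
    # $1: number of open bullets.
    # One working day per bullet, five working days per week, rounded up.
    echo $(( ($1 + 4) / 5 ))
}

clean_list() {
    # $1: path to the bullet list.
    # Prints the list with all "+" and "-" lines removed (the end-of-phase cleanup).
    grep -v '^[+-]' "$1"
}

# Example:
#   phase_weeks "$(open_bullets bullets.txt)"
#   clean_list bullets.txt > next-phase.txt
```

The rounding only matters when the bullet count is not a multiple of five; 30 bullets still come out to six weeks and 120 to 24 weeks.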

As you can probably tell, Phase 2 follows the exact same methodology as Phase 1. In fact, every phase follows this same approach. That is the beauty of this technique. Once you know the basic procedure, you know the whole methodology, and no special software is needed. In fact, this entire methodology can be implemented with a pad of paper and a pencil.

Eventually, the product will need to ship. The forces that determine the ship date of a product are typically beyond anybody’s control, no matter how powerful any individual person may be. You will typically have some early warning of when this date is, and you will need to allocate some time to fix any problems that exist with the software. Do the best you can. No doubt you’ll still be trying to implement items off your bullet point list during this time, so technically any testing you do will be invalid. Plan for some sort of automatic online updating system for the software after it ships.

And that’s about it! Unfortunately, this method doesn’t tell you the full schedule of the project on day 1. All you can say on Day 1 is when Phase 2 will start, and due to the unpredictability of each bullet point (some happen quickly, but most take longer than expected, and almost all of them expand into more bullet points), knowing when Phase 2 starts isn’t really much help. But it’s the best you can do, and it’s about as good as the best professionals can do. In fact, it’s better than a schedule that sets very specific dates in stone, because those kinds of schedules just make everyone shake and cry uncontrollably when they realize there is no way those dates can be met. With the bullet point approach, the shake-and-cry periods are planned out ahead of time and are expected, and so they are easier to deal with.

Am I serious? I started writing this partially to be funny, and partially to put down in words the actual approach that I use for software development. There are projects that I can schedule very well, but those projects are basically repeats of other projects I have done, using well understood technology and doing the exact same tasks as a very similar project that has already been successfully completed.

As the world of software matures, the number of repeat projects increases - this is the software industry’s version of building strip malls. If you’ve been doing it for a while, you know how long it takes to lay the grade, roll the pavement, and slap together the strip of shops. But often in software, there is some innovation required, so the project is more like building a spaceship that can also drive at highway speeds and travel waterways like a motorboat, with an outer skin that can display high resolution full color images and a life support system that includes the capability to manufacture cotton candy from sugar synthesized from molecules formed by shooting lasers at particles being accelerated in a 5-mile radius loop. These types of projects tend to be difficult to schedule, but the bullet-point list will get you as close as you can get to something reliable.

Oh, and we’re gonna need a killer stereo system in that thing. Add it to the list.

The True Meaning of Windows XP?

I couldn’t sleep, so I broke out my laptop to work on some code that is rattling around in my brain. I had to cold boot Windows XP, since I had shut it completely down earlier to install some Windows updates.

It’s dark, it’s quiet, the black Windows XP loading screen glows dimly on my screen…

“XP” … it sounds so natural now. I run “XP”. Most of my machines have “XP”, a few have “2000”. These are brands, everybody knows them. But why XP?

It rolls off the tongue easily enough, it sounds professional, but as trademarks go, it’s not so great. A two-letter non-word isn’t the most defensible trademark. That doesn’t matter so much, obviously the XP brand has power…

Why those letters?

The loading bar zooms left to right, left to right, never giving off a hint of how much is left to go, just letting me know it is still working on it…

Did it really stand for “Windows Experience?” The 2001 Microsoft press release claims this is so, but that sounds pretty weak, way too meaningless… almost like a convenient answer to satiate those who are curious enough to ask…

What else… around 2001, “Extreme Programming” had quite a buzz going in the programming world, and that was always referred to as “XP”. Perhaps some engineers at Microsoft liked what XP stood for and thought that it represented what Windows XP had under the hood?

But that doesn’t make much sense either. Extreme Programming was something happening outside of Microsoft, and it seems unlikely that they would take on the nickname of someone else’s methodology as their product name…

The loading bar continues.. left to right… left to right… “I’m loading… don’t you worry about how much is left to go, I’m on it…” XP… XP… what could it be…

Or, as on the loading screen, lowercase xp… xp… p is the Greek letter “rho” … hm… oh yeah and of course, x is chi… “c.r.” does that stand for anything? hmm… chi, rho… chi, rho…

CAIRO!!!!

Of course, Cairo!! A famous Microsoft codename originally used for Windows NT 4.0, but also used to describe a set of technologies that didn’t all make it into 4.0. Cairo was envisioned as the ultimate operating system, and an ambitious feature set was announced, much of which took longer than expected to complete. The remaining technologies that were originally targeted for Cairo were eventually added bit by bit to all of the versions of Windows between Windows NT 4.0 and Windows XP… xp being the real Cairo.

All of those technologies are now in the ultimate operating system, XP… except, of course, for the object-based file system WinFS, originally announced for Cairo, and recently cancelled yet again for Vista. But that’s ok, I prefer plain old file systems that store and retrieve hunks of bytes; that has always worked well for me. Perhaps whoever chose the name XP decided that Windows XP was finally close enough to the Cairo vision to be called Chi-rho.

Ah, my machine has booted finally… my Windows chi-rho, cairo, xp machine… now, time to get to work. But, on second thought, I guess I have time to type up a blog entry before really digging in…

perl Out of memory! (with solution)

On occasion, I will run a perl script and be shocked when I get an “Out of memory!” error back from perl. The shock is always a result of the fact that I’m doing something relatively benign, running a small script that generates some data but certainly not enough to overrun the massive capabilities of the box I’m running the script on.

It happened to me again today, and I thought I’d put a little post here about it to perhaps save somebody some frustration.

Today’s perl Out of memory error happened while I was upgrading SpamAssassin on the Red Mercury mail server. You wouldn’t think that something as simple as upgrading a few perl modules would be enough to run out of memory, particularly since this machine had nothing else running on it at the time, and the machine has a gig or so of RAM backed by a couple gigs of swap space.

The answer, in my case anyway, always turns out to be “ulimit”. What is “ulimit”? If you live inside a unix world, you probably know all about ulimit. For mere mortals like myself, it’s something that I only care about when it prevents me from doing what I need to do.

“ulimit” is short for “user limit” - it limits the amount of resources that a particular user can consume. This is of course very useful as it prevents any particular user from hogging all of the memory or CPU time on a system. If you run a process from the command line that goes absolutely crazy, ulimit can be the sanity check that keeps the rest of the box humming along nicely.

“ulimit” is usually a built-in shell command, so it varies from shell to shell. I happen to be using zshell on an OpenBSD box, but even if what I describe here doesn’t match exactly what you have on your machine, it should get you headed in the right direction (in fact, just saying “ulimit” may be enough to get you headed in the right direction). It could be that the stock ulimit settings on OpenBSD are smaller than on other default installs, but when I first had this problem, I saw a LOT of people asking this question.

So first, I type “ulimit” to see what’s up:

[ultrabox]: ulimit
unlimited

Ok no problem, right? But as it turns out, with no command line arguments, my ulimit will default to showing me my limit on file size, and that is unlimited. Let’s find out what all of the limits are:

[ultrabox]: ulimit -a
cpu time (seconds) unlimited
file size (blocks) unlimited
data seg size (kbytes) 65536
stack size (kbytes) 4096
resident set size (kbytes) 1446704
…(and some others)…

Now we’re seeing some interesting things. The stack size is 4MB. That might be small. The “resident set size” is set to almost the size of the RAM in the machine (well over a gigabyte) and it is maxed out and can’t go any higher. The “data seg size” is 64MB (65536 kbytes above) - that should be enough for anything, right?

It has been my experience that, when working with perl, a 64MB data seg size is too small. I have to make it larger before running most of my reasonably-complex data processing scripts. And, today, I had to make it larger before doing something simple like upgrading SpamAssassin via CPAN. Even though I ran the upgrade with “sudo”, I still had to raise the data seg size ulimit in my shell, since it was limiting the data seg size for everything run from that shell.

So, Greg Graffin, I HAVE THE ANSWER!!!

Simply boost the size of your data seg and make that stupid perl Out of Memory! error go away:

[ultrabox]: ulimit -d 200000

[ultrabox]: perl irsProcessAllUSTaxReturns.pl

Processing… 130,728,360 tax returns… Finished!
[ultrabox]:

It worked!

For scripts that I run often, or from a cron job, I make a little shell script that first sets the “ulimit -d 200000” and then runs the script. Other possibilities would be to set the needed ulimit settings in a login script, or to bump up the default system-wide (though the method for doing this system-wide varies by system and you might not have the required permissions to do it anyway).
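A wrapper along those lines might look like the sketch below. The function name and the example script name are placeholders, and it assumes a shell whose “ulimit” built-in accepts “-d” with a value in kbytes, which is what the “ulimit -a” output above reports. The limit is set inside a subshell, so the shell you call it from keeps its original limits.

```shell
#!/bin/sh
# Run a command with a raised data segment limit (illustrative helper).

run_with_data_limit() {
    # $1: data segment limit in kbytes
    # remaining arguments: the command to run, with its own arguments
    limit_kb=$1
    shift
    (
        # Only affects this subshell and the processes started from it.
        ulimit -d "$limit_kb" || exit 1
        "$@"
    )
}

# Example (hypothetical script name):
#   run_with_data_limit 200000 perl nightlyReport.pl
```

If the requested value is above the hard limit for your account, the “ulimit” call fails and the command is not run; raising the hard limit itself needs the appropriate privileges.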

Please note the number of zeros there in the ulimit setting - five zeros - this sets the ulimit to close to 200MB up from the 64MB it was set at previously. You may need more. You may need less! On the less side it doesn’t matter so much - this is just a top limit, it doesn’t mean everything will use 200MB of data segment space, it just allows things to use that much space.

Now, as for the bigger question of WHY perl needs over 64MB of data segment space to do something like install SpamAssassin… well, in the words of Gary Sommer, a guy I used to work with, “that’s beyond my attention span.”

Proposal for Local Commit in Subversion

I propose a “local commit”, such as “svn commit --local”. Performing a local commit would save the current modified version in the local .svn directory, along with a local revision number, without needing to connect to the repository. I could do this as many times as I like, and each local commit would exist as another copy (or just a diff) with another local revision number (such as r541.0, r541.1, r541.2, etc.). Each local commit would also have a comment, just like a real commit.

I would be able to use Subversion to compare these local versions, so “svn diff” would show me the difference between the latest locally committed version and my working copy. I could specify a specific local revision to “svn diff” and get the differences with that revision, all the way back to the head revision that I started with when I disconnected in the first place. For example, “svn diff myfile.c -r541.23” would compare my working copy of myfile.c with the 24th locally committed version of myfile.c. Existing commands would still work normally, for example “svn diff myfile.c -r541” would compare myfile.c with version 541, the latest version I got from the actual repository. However, “svn diff myfile.c” could automatically compare with my latest locally committed revision, or alternatively, a “--local” flag could be added in cases like this to indicate when we want to pay attention to local commits.
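To make the idea concrete, here is a rough emulation of the proposed behavior using nothing but file copies. It is not a Subversion feature and not a design for one: the function names, the .localcommits directory, and the log format are all invented for illustration. It handles one file at a time and keys snapshots by file name only, so two files with the same name in different directories would collide.

```shell
#!/bin/sh
# Hypothetical emulation of "local commits" as numbered snapshot copies.
SNAP_DIR=${SNAP_DIR:-.localcommits}

local_commit() {
    # $1: file to snapshot
    # $2: repository revision the work started from (e.g. 541)
    # $3: commit comment
    file=$1
    base=$2
    comment=$3
    name=$(basename "$file")
    mkdir -p "$SNAP_DIR"
    # Find the next unused local revision number: 541.0, 541.1, ...
    n=0
    while [ -e "$SNAP_DIR/$name.r$base.$n" ]; do
        n=$((n + 1))
    done
    cp "$file" "$SNAP_DIR/$name.r$base.$n"
    printf 'r%s.%s\t%s\t%s\n' "$base" "$n" "$file" "$comment" >> "$SNAP_DIR/log"
    echo "r$base.$n"
}

local_diff() {
    # $1: file in the working directory
    # $2: local revision to compare against, such as 541.1
    diff -u "$SNAP_DIR/$(basename "$1").r$2" "$1"
}

# Example:
#   local_commit myfile.c 541 "rename helper functions"   # prints r541.0
#   local_diff myfile.c 541.0
```

Replaying these snapshots to the real repository as a sequence of commits is the part that would need actual support inside Subversion.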

Even if I’m not disconnected from the repository, these local commits allow me to work incrementally and implement huge, earth-shattering changes, one small step at a time, without bothering anybody else who is using the main repository until I’m done. While working on these large scale changes, I still get the advantage of versioning in small increments by doing local commits.

Once I am ready to commit these local changes to the repository, I would do a “svn commit” like normal. Subversion would automatically do a sequence of commits, starting from my first local revision and continuing through each locally committed revision up to and including my local working copy, applying the commit comments that I had supplied with each local commit. My local incremental changes and comments would then be committed to the main repository and be given real revision numbers by the repository - the local “541.23” version numbers would go away, but the sequence of revisions would still be preserved. After the commit to the repository, the versions that had been locally committed now appear as individual revisions in the central repository, as if I had been committing directly to the repository all along. All of these commits would be atomic. Multiple local commits on multiple files could be committed in one atomic operation.

I have only been heavily using Subversion for a short time, but I believe adding “local commit” would fit very nicely into the way Subversion currently works. Whether it would be technically challenging to add it, I don’t know, but that is less important than the way the idea of local commits would interact with today’s Subversion. I think local commits could be added in a way that would not change the way Subversion is used at all, and it may even be possible to add it in a way that works well with existing tools that interact with Subversion.

With “local commit” support, Subversion would go a long way towards solving the types of problems that projects like Bazaar-NG are trying to solve. There are a handful of “distributed” version control solutions in the works, including Bazaar-NG, that are trying to create solutions where there is no central repository. Allowing full-featured disconnected work is a major advantage of such systems. However, the advantage of having a central repository can’t be ignored - it defines a central place for all revisions to be integrated. With distributed systems, it is easy to imagine small clusters of developers in sync with each other, but wildly out of sync from cluster to cluster. Subversion plus local commit allows full incremental commits while disconnected, and loses nothing when the actual commit to the central repository is done.

So… will it happen? I suppose it’s my itch, so in the tradition of open source, I should scratch it. But if I’m lucky, some hotshot with lots of Subversion development experience will have the same itch after reading this and implement it in 10 minutes… if you do, let me know :)

Using Subversion Disconnected

When choosing version control, I look for reliability (do things right or don’t do them at all), workflow integration (full IDE integration so I don’t need to think about version control as a separate problem when working), and the ability to work disconnected from the server.

Subversion has been reliable for me so far, and part of the reason why I chose it over some other interesting possibilities (like bzr) is that it is being used by very many other people on large scale projects much larger than my own projects. If problems arise, they will get fixed. I have had no problems with it so far that didn’t stem from not reading the documentation, so Subversion gets high scores for reliability, as expected.

IDE integration has been great, since I’ve been using Eclipse and the Subclipse plugin works nicely. I have only seen one major shortcoming in Subclipse so far - it won’t let you assign Eclipse shortcut keys to Subclipse functions. This is a peculiar omission, and I wouldn’t be surprised if it is possible but I just haven’t figured out how to do it yet. Otherwise, the integration with Eclipse is fantastic. Synchronizing with the repository is actually fun.

Working disconnected with Subversion is much more functional than I would have thought at first. With SourceSafe and Perforce, I am used to doing a “Checkout” or an “Open for Edit” on files before editing them. In fact, with SourceSafe and Perforce, if you start editing a file without first asking the server if you can edit, you enter into an undefined bizarro world where nothing will work as intended. When discussing Subversion with a friend who uses only Subversion, I inquired about the ability to have a local repository so that I could check out files when not connected to the Internet. He didn’t understand the question at all - with Subversion, you can just go ahead and start editing, and the changes (including conflicting changes) are worked out later when you are reconnected.

If you get on an airplane with an up to date working copy of a repository, you can modify any file in that repository without connecting to the server, and when you reconnect, things will work as expected. Again, for Subversion users, this is obvious. For SourceSafe and Perforce users like me, this is a miracle.

Subversion also keeps a copy of the latest version from the repository, so while you are disconnected, you can still do a diff against the latest repository version. This is fantastic, as it is the most common thing I do with files under version control - oh no, I broke everything. How? Show me the version that worked, and what I changed. “svn diff” can do this without connecting to the repository, and thus can be done while disconnected.

Similarly, Subversion can tell you all of the files that have been modified locally, again without hitting the actual repository. It just checks every local file against its local head revision copy and reports back a list of the ones that have changed.
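On the command line, both of these checks are plain “svn status” and “svn diff” with no revision arguments. A small sketch, with a made-up function name, assuming the svn client is installed and the path is inside a working copy:

```shell
#!/bin/sh
# Review local work without contacting the repository (illustrative helper).

offline_review() {
    # $1: path inside a Subversion working copy (defaults to the current directory)
    target=${1:-.}
    # Lists locally modified, added, deleted, and unversioned files.
    svn status "$target"
    # Shows local changes against the pristine copies kept in .svn.
    svn diff "$target"
}

# Example:
#   offline_review src/
```

Note that adding “-u” to “svn status” asks the server what has changed there, so that variant does need a connection.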

To work this magic, Subversion needs to keep a bunch of metadata on the local machine. This is a bit strange at first, coming from SourceSafe and Perforce. With those tools, there is little if any local metadata, so when you are disconnected from the server, the local client has no idea what is under version control and what is not.

With Subversion, detailed information is kept locally, telling Subversion what files are under version control, when the latest version was retrieved from the repository, a copy of that latest version - basically everything needed to browse and examine the status of the entire repository, without connecting to the server. Working disconnected is easy - the local client has all the information it needs about the structure of the repository, so it can tell you what you’ve done since you were last connected.

You can’t commit changes to the repository while disconnected (“commit” in Subversion is like submitting a changelist in Perforce or doing a Check In in SourceSafe). However, I think with a little bit of work, it would be possible to do this. First, you need to understand why I think this would be a convenient thing to do.

Version control is used to synchronize files among many different users. At the same time, it is also used to track small changes from version to version, so that when something goes wrong, you can look back in the history and narrow down the change that caused the problem, or look in the commit comments to find out why a change was made in the first place. Viewing the historical changes to a file can be life-saving on a multi-year project, and let’s face it, every 2-week project has secret ambitions to be a 10 year project.

If commits are made very frequently, then viewing differences from one revision to the next is useful. It is very easy to see what changed and why from one revision to the next. And, if a new change breaks something, there is not much to lose by reverting to the most recent revision.

On the other hand, if commits are far apart, viewing differences between two versions is not useful - there may be 100 differences, all completely unrelated. It is not obvious which changes can be reverted and which cannot, or which ones might be the cause of a problem. If a local modification is very different from the head revision from the repository, and the local version is completely broken, reverting to the repository means losing large amounts of work, so it just won’t be done.

So, to make revision control useful in this sense, it is important to be able to frequently commit changes. When working disconnected, even if I’m only disconnected for four hours, I may make 20 incremental changes and choose to commit each change once I have tested it.

Even when I am connected, I may want to “commit” small changes so that I can view the differences from one version to the next, but I may not yet be ready to expose those changes to everyone else who uses the repository. I may make a small change in preparation for a larger change, a small change that I am not ready to “commit” to and may choose to revert later, but I want to put a stake in the ground after that small change, before going on to make further changes, so that I can locally revert and compare revisions before committing to the repository.

Read more in “Proposal for Local Commit in Subversion”.

Hacking Subversion “entries” file

Using Subversion for the first time brings with it some surprises. Matching-but-not-identical repository URLs can cause hard to understand problems with baffling error messages. And, not understanding the importance of Subversion’s local meta-data can get things off to a bad start. The hacks described below can help you diagnose and fix some of these problems. The discussion below describes these problems on a Windows machine, but the problems are very similar on unix-like systems as well.

I was new to Subversion. I created a repository, then created a “projects” subdirectory in the repository via the TortoiseSVN repo-browser. This was a mistake, but I didn’t know it at the time - it was a mistake because Subversion didn’t get the opportunity to write its hidden meta-data files to my local “projects” subdirectory, but I had no idea. I then added four projects to Subversion one by one. The directory structure on disk matched the structure in Subversion, so I figured everything was ok. However, as far as Subversion was concerned, my local projects were all totally independent, it didn’t know what to do with the parent directory since it had never been added directly, and trying to synchronize all of the projects from the parent directory resulted in a “not versioned” error. I had created a mess.

Subversion keeps meta-data on the local file system in a hidden directory called “.svn”. This meta-data tells Subversion whether or not the files in that directory really are under control of Subversion. If the meta-data doesn’t contain information about the local files, Subversion won’t let you update or commit those files, even if an exact match already exists in the repository. Since I had created my “projects” directory in the repository, and not “added” the directory from my local filesystem, the local meta-data for the “projects” directory was not there.

I attempted to add the “projects” parent directory non-recursively to Subversion, thinking that would solve the problem. However, doing an update from the parent directory now resulted in a “file already exists” error for every project subdirectory, even though those projects were versioned in Subversion and matched the Subversion repository.

Some quick searching on this problem turned up some suggested solutions that were usually a combination of stating the obvious plus an IT-related insult, like “Just delete everything and update from Subversion. If it doesn’t work, you do have backups, don’t you????” However, when an entire subdirectory and everything below it causes the “file already exists” error, it’s not so easy to just blow everything away. There are likely hundreds of files in those subdirectories that aren’t in the repository, such as compiler output (that may have taken a long time to generate), notes that are in text files that I don’t want to delete yet, and perhaps graphics or source code that is in a half-baked state that should probably be cleaned out but that I don’t feel comfortable cleaning out just yet. And if that subdirectory contains a few gigabytes of output data, backing it up and restoring it isn’t fun when I’m just trying to get Subversion to work.

Luckily, there is a way to make the project parent directory know about the subprojects. Note that this is a hack, and I don’t recommend it, though it worked like a charm. I was unable to find a useful “correct” way to fix this problem from the svn command line, so we have to manually make things right.

If we want to tell a parent directory about sub-projects that it should know about, but doesn’t due to my incompetence in setting up the projects originally and Subversion’s reluctance to fix my mistakes for me, we do a simple text-editing hack. The hack involves editing a file in the parent directory’s “.svn” subdirectory. The file is called “entries” and it tells Subversion what it knows about that directory.

If you happen to have a local project checked out from Subversion, go look for the hidden “.svn” directory. You should find the “entries” file there. Look inside it. It is an easy-to-read XML file and it can be edited with any text editor (on Windows, the command-line “edit” works great and handles Unix end-of-lines nicely). Saving the file with “Windows” style end-of-lines didn’t seem to cause any trouble, so don’t worry about that.

In the “entries” file, there are “entry” tags for each file and subdirectory that is under version control. Many of those entry tags refer to the individual files in the directory. We can’t easily recreate those entries, because they include a lot of information about the file, like dates, checksum, and more. Don’t touch those entries. For individual files that give you a “file already exists” error, the best approach really is to delete the local file and then update the local file from the repository, assuming again that the file in the repository is up to date in the first place. The “entry” will then be created for the file, including the proper dates and checksum.

For directories, we’re in luck. The entry for a directory looks like this:

<entry name="directoryname" kind="dir" />

So, my situation was this. I had a project directory called “projects”, and four subprojects called “potatogun”, “plasticwrap”, “sinker” and “oatmeal”. My subversion repository had the same structure - all four projects were in a “projects” folder in the repository. The local subversion “entries” file for the “projects” directory just didn’t know it.

By editing the “entries” file in the “projects/.svn” directory and adding four “entry” lines exactly like the one above, but with “potatogun”, “plasticwrap”, “sinker” and “oatmeal” in place of “directoryname”, I magically had everything back to normal. I could update the subprojects with no problem. All was well.
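The same edit can be scripted. What follows is only a sketch in Python 3: the helper name add_dir_entries is made up for illustration, and it assumes the “entries” file is the XML variant described here and ends with a single closing root tag. It works on the file's text rather than on the disk, so the result can be inspected before anything is saved.

```python
from xml.sax.saxutils import quoteattr


def add_dir_entries(entries_xml, names):
    """Return entries_xml with a directory entry added for each name.

    New entries are placed just before the final closing tag. Names that
    already have an entry are skipped. (Hypothetical helper; the layout
    of the file is an assumption, not something checked against svn.)
    """
    close_at = entries_xml.rfind("</")
    if close_at == -1:
        raise ValueError("no closing tag found; is this an XML entries file?")

    new_lines = []
    for name in names:
        attr = "name=%s" % quoteattr(name)  # quoteattr adds the quotes
        if attr in entries_xml:
            continue  # already listed, leave it alone
        new_lines.append('<entry %s kind="dir" />\n' % attr)

    return entries_xml[:close_at] + "".join(new_lines) + entries_xml[close_at:]
```

To use it, read “projects/.svn/entries” into a string, pass it in along with the four project names, and write the result back only after keeping a copy of the original file.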

Except that I left out one major detail. In two of the subprojects, I somehow told Subversion that my repository was called “file:///C:/svn”. In the other two projects, I had called the repository “file:///c:/svn” - note the lower-case “c” in the second version. Of course, on a Windows machine, these two are the same thing. As far as Subversion is concerned, however, these two things are different.

Subversion this time tells us that “svn: ‘file:///c:/svn/projects/potatogun’ is not the same repository as ‘file:///C:/svn’” - again, note the case difference in the drive letter.

So the parent project thinks the repository is located at C:/svn while the subproject thinks the repository is located at c:/svn. Subversion believes these two are different, and I have no problem with that. I just need a way to fix it. There are some explanations out there in the Googleverse about why the problem happens, but no good solutions.

The answer again lies in the “entries” file. We know perfectly well that our projects are all coming from the same repository. Subversion does a direct case-sensitive string compare when determining whether two repositories are the same, so two URLs that mean the same thing will generate the “not the same repository” error if the URLs are entered in slightly different ways. There are many examples of this, not just case-sensitive problems. The “entries” file stores the URL of the repository that was specified when the project was first checked out of Subversion, and all we have to do is edit all of our “entries” files and make them all match.

The first “entry” tag in the entries file contains two items that need to be changed. One is the “url=” line, and the other is the “repos=” line. The “url=” line contains the URL that points to the subversion subfolder that contains the file. The “repos=” line points to the parent repository.

So, in my example, in my “projects/.svn/entries” file, the lines look like this:

url="file:///C:/svn/projects"
repos="file:///C:/svn"

The “projects/potatogun/.svn/entries” file has lines that look like this:

url="file:///c:/svn/projects/potatogun"
repos="file:///c:/svn"

Again, note the lower-case drive letter in the potatogun entries. Subversion thinks these two projects came from two different repositories, so we need to make them match. The convention is to use upper-case drive letters. You can choose lower-case if you like, since Windows understands both, but if you ask Windows what the drive letter is, it will give you an upper-case response.

We now edit “projects/potatogun/.svn/entries” with a text editor so that the lines look like this:

url="file:///C:/svn/projects/potatogun"
repos="file:///C:/svn"

It is important to change BOTH the “url” and “repos” entries so that the repository paths match. If there is an inconsistency between those two lines, Subversion will think it didn’t read the “entries” file correctly and it will get really confused.

We also have to do this for any subdirectories in project potatogun, and for any other subprojects that might have mismatched repository URLs. If you have the means to do a case-sensitive, recursive search-and-replace on all “entries” files, do that. Of course, if the source repositories REALLY WERE different in the first place, this hack isn’t going to help, but if you’re still reading at this point, you know that already.
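If no search-and-replace tool is handy, a few lines of Python can do the job. This is a rough sketch in Python 3, not a tested recovery tool: the function name rewrite_repo_urls is invented here, and it assumes every “entries” file lives directly inside a “.svn” directory and stores the repository URL as plain text. Back up the working copy first.

```python
import os


def rewrite_repo_urls(root, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix in every .svn/entries file under root.

    The match is case-sensitive. Returns the paths of the files that changed.
    (Hypothetical helper written for this example.)
    """
    old_bytes = old_prefix.encode("utf-8")
    new_bytes = new_prefix.encode("utf-8")
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Only touch files named "entries" that sit inside a ".svn" directory.
        if os.path.basename(dirpath) != ".svn" or "entries" not in filenames:
            continue
        path = os.path.join(dirpath, "entries")
        # Work in binary so the file's line endings are left exactly as found.
        with open(path, "rb") as f:
            data = f.read()
        fixed = data.replace(old_bytes, new_bytes)
        if fixed != data:
            with open(path, "wb") as f:
                f.write(fixed)
            changed.append(path)
    return changed


# Example call for the situation described in this post:
# rewrite_repo_urls("projects", "file:///c:/svn", "file:///C:/svn")
```

Because the replacement covers the whole prefix, the “url” and “repos” values in each file are changed together, which keeps them consistent with each other.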

Now we can go to the “projects” directory and type “svn up” - instead of getting the “not the same repository” error, we get a nice “At revision 492.” message like we should.

Now that my projects are all nicely set up, I know enough about Subversion to do it right the first time. But for Subversion newbies, equivalent-but-not-equal repository URLs can be very frustrating. It’s also surprising when Subversion can’t figure out that the local file structure exactly matches the repository. Then, when you try to let Subversion know that the local files match the repository, the “file already exists” errors are even more frustrating. But with a little hacking of the “entries” file, we can fix our mistakes and get back to work.

svn: Out of date in transaction

On one project I’m using the Eclipse IDE with the subclipse plugin to enable subversion directly in the IDE. This setup is new to me within the last year, and I’m still learning about it.

I like the fact that my files get a little gold medal icon on them in the Eclipse Navigator when they are in source control and match the repository version - at a glance, I know what files have been modified and what changes need to be committed. For a nicely organized project, where all files in a subdirectory are either in version control or have been ignored via svn:ignore, the folder itself gets a little gold medal when everything in that folder is committed and matches the repository. Thus, the status of an entire project can be seen by glancing at its root project folder.

If a file has been modified locally, it gets a little brown asterisk. It is nowhere near as pleasing as a gold medal, and it encourages me to commit changes when I’m ready to do so. Today, however, some subfolders in my project still had the little brown asterisk even though everything in that folder was in sync with the repository.

So, I right-clicked on the folder itself and selected Team->Commit…

This action is a bit strange to me, because I’m not completely sure what it means to commit a directory. With Subversion, I believe it means that I am committing any information relative to that directory, such as what files in that directory should be ignored. In Subclipse, the commit operation simply shows the path to the folder with “Property Status” listed as “modified”.

The commit causes an error. In the Subversion dialog box, it looks particularly nasty, something like this:

org.tigris.subversion.javahl.ClientException: Transaction is out of date

svn: Commit failed (details follow):

svn: Out of date: '/path/to/folder' in transaction 'xxx-x'

where xxx-x is the transaction number.

I happened to run across this discussion of the problem that contained these useful tidbits of information:

a) “The directory is not at the latest revision, hence you cannot commit changes to it. Simply ‘svn update’”

b) “the error message is not as helpful as it might be. Once you know what it means then of course it is very obvious, but the first time evidently it is not.”

The solution to this error message is to simply right-click on the folder, choose “Team->Update”, then right-click again and do “Team->Commit” and it will then work.

Python Unicode - Fixing UTF-8 encoded as Latin-1 / ISO-8859-1

This is a follow-up to the previous Python Mystery of the Day regarding a “TypeError: decoding Unicode is not supported” exception.
Ok, so what if you receive a unicode string that has clearly been converted to unicode using the wrong encoding? In my case, the string was originally encoded as UTF-8. The source did not specify an encoding, so the encoding applied was ISO-8859-1, also known as Latin-1.

By the way, there are some really great Python unicode tutorials. As of yesterday, I could barely tell you the difference between Latin-1 and ISO-8859-1. Today, I can tell you that they are the same thing. Thanks, tutorials! Seriously, the tutorial linked at the beginning of this paragraph is awesome if you’re just starting to bash your head against unicode in Python.

With that, here is a way to fix a string that was encoded as “Latin-1” (that is, ISO-8859-1), when it really should have been encoded as “UTF-8”.

We’ll use the example string “Nuñoz”. If the original is encoded as UTF-8, it will be represented with these bytes (written here as a python literal):

'\x4E\x75\xC3\xB1\x6F\x7A'

All of the bytes that are less than 128 are regular ASCII characters, as ASCII is a subset of UTF-8. The 0xC3 and 0xB1 bytes together represent the accented “n” in UTF-8.

Let’s see how this works in Python:

>>> rawstring = '\x4E\x75\xC3\xB1\x6F\x7A'
>>> rawstring
'Nu\xc3\xb1oz'
>>> print rawstring
Nu??oz

Note that the rawstring contains a string of bytes, but when we try to print it, the terminal window will attempt to display the non-ASCII bytes. How they are displayed depends on the computer. On my Windows machine, it displays as garbled mousetext.

Now, if our software knew the proper encoding, it could do the right thing and convert this raw byte string into Python’s internal Unicode representation like this:

>>> utf8string = unicode(rawstring, 'utf-8')
>>> print utf8string
Nuñoz
>>> utf8string
u'Nu\xf1oz'

Now we see that when we print the string, it shows the proper accented “n” character. When we look at the internal representation of utf8string, we see that the non-ascii character has been converted to 0xF1. The two-byte UTF-8 sequence is represented by 0x00F1 in Python’s internal unicode format (which is usually a form of UTF-16. Theoretically we’re not supposed to worry about Python’s internal format too much, but it becomes important when encodings start going haywire).

Now, what happens if our software is wrong, and it thinks our rawstring was originally encoded using ISO-8859-1? Let’s see:

>>> rawstring
'Nu\xc3\xb1oz'
>>> iso8859string = unicode(rawstring, 'iso-8859-1')
>>> iso8859string
u'Nu\xc3\xb1oz'
>>> print iso8859string
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\encodings\cp437.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xc3' in position 2: character maps to <undefined>

Oh no! The iso-8859-1 encoding took the UTF-8 byte string exactly as-is and stored the bytes in Python’s internal format without changing anything. This is what we told it to do - Python’s internal UTF-16 contains the entire iso-8859-1 character set as-is, just zero-extended to 16 bits. But this is a bad situation, because 0xC3 and 0xB1 are now two characters that can’t be printed to my terminal, when they originally represented a single character.

This gives us a hint that we may be able to detect strings that were improperly encoded with iso-8859-1 when they should have been encoded as UTF-8. This confusion is very common, as both of these encodings are used frequently on websites, and the source will quite often not tell you which encoding is used. If the string is examined byte-by-byte and some two or three-byte sequences exist with each byte 128 or greater, we can check to see if those sequences are valid UTF-8 sequences. If so, there’s a good chance that the string was mis-encoded. I haven’t tried this yet, so we’ll defer this possibility until a future posting.
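For the curious, here is one way that check might look. It is an untested-in-the-wild sketch, written in Python 3 syntax (where “bytes” and “str” take the place of Python 2’s “str” and “unicode”), and the function name is invented for this example. Instead of scanning byte sequences by hand, it lets Python’s own UTF-8 decoder decide whether the sequences are valid.

```python
def looks_like_misdecoded_utf8(text):
    """Guess whether text is UTF-8 data that was decoded as ISO-8859-1.

    This is a heuristic: a True result means "probably", not "certainly".
    (Hypothetical helper written for this example.)
    """
    try:
        # Undo the suspected ISO-8859-1 decoding to get the original bytes.
        raw = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Characters above 0xFF cannot have come from an ISO-8859-1 decode.
        return False
    if all(b < 0x80 for b in raw):
        # Pure ASCII reads the same in both encodings, nothing to detect.
        return False
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # The high bytes do not form valid UTF-8 sequences.
        return False
    return True
```

Genuine ISO-8859-1 text with accented characters rarely happens to form valid UTF-8 sequences, which is why a successful decode is a reasonable hint. Short strings can still fool it.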

The easiest situation is when you know for sure that the original was improperly encoded as iso-8859-1 and it should have been encoded as utf-8. You can convert it back to a “raw” string, then re-convert it to unicode using the proper utf-8 encoding, like this:

>>> iso8859string
u'Nu\xc3\xb1oz'
>>> rawfromiso = iso8859string.encode('iso-8859-1')
>>> rawfromiso
'Nu\xc3\xb1oz'
>>> properUTF8string = unicode(rawfromiso, 'utf-8')
>>> properUTF8string
u'Nu\xf1oz'
>>> print properUTF8string
Nuñoz

Starting with the incorrectly-encoded string “iso8859string”, we convert it back to a raw byte string by using ‘encode’, passing in the incorrect encoding that we want to strip. We then take that “rawfromiso” raw byte string and decode it as utf-8. Note that after decoding, the utf-8 two-byte sequence is converted into a single proper Unicode character.
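The same round trip can be wrapped in a small function. The sketch below uses Python 3 syntax, where the “unicode” type is simply called “str” and raw byte strings are “bytes”. The function name is made up for this example, and it assumes you already know for sure that the wrong encoding was iso-8859-1.

```python
def fix_latin1_mojibake(text):
    """Repair text that was decoded as ISO-8859-1 but was really UTF-8.

    Raises UnicodeEncodeError or UnicodeDecodeError if the text could not
    have been damaged in that way. (Hypothetical helper for illustration.)
    """
    # Step 1: strip the wrong encoding to recover the original bytes.
    raw = text.encode("iso-8859-1")
    # Step 2: decode those bytes with the encoding they were written in.
    return raw.decode("utf-8")


# Reproduce the damage from this post, then undo it.
damaged = "Nuñoz".encode("utf-8").decode("iso-8859-1")
repaired = fix_latin1_mojibake(damaged)
```

Letting the exceptions escape is deliberate: if the string was never damaged in this way, it is better to find out than to quietly return something else.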

With unicode, the wrong thing seems to happen more often than one would normally expect. This is mostly an issue of getting used to the idea of strings of text no longer being well defined by one simple standard. Once various encodings come into play, it’s no longer just text - it is a string of 8-bit, 16-bit or 32-bit integers, possibly little endian or big-endian, plus encoding meta-data describing what all of it is supposed to mean. If that encoding meta-data is missing, wrong, or not what is expected, garbled text and exceptions are the result.

Python Mystery of the Day

The following Python exception has been driving me nuts:

TypeError: decoding Unicode is not supported

This happens when trying to do something like this:

encodedstring = unicode(normalstring, 'utf-8')

If “normalstring” is a regular string (type ‘str’), and it happens to contain raw utf-8 data (for example, read from a file that you know was encoded in utf-8), the above will convert the string from utf-8 to the default unicode encoding that your Python interpreter is using. The result will be an object of type ‘unicode’.

However, if “normalstring” is already a unicode object, you will get the not-so-obvious TypeError exception “decoding Unicode is not supported”. It’s not obvious, because if “normalstring” is coming to you from some other library, you might not know whether that string was already encoded as unicode or not.

As an example of how this could happen, let’s say some other library passed off a string to you, and you noticed when printing it that some characters were garbled. Upon further inspection, you realize that the garbled characters represent a single utf-8 encoded character. So, you tell Python to decode the string as utf-8. If the string is already represented as unicode, the above exception fires. This situation can easily happen if whatever processed the string in the first place applied the wrong encoding (for example, if iso-8859-1 was incorrectly applied to a utf-8 stream).

The best thing to do in this case is figure out why the stream was originally unicode-encoded with the wrong encoding. Once that’s fixed, the re-conversion attempt that throws the mysterious exception is no longer needed.
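When the source can’t be fixed right away, a defensive check at least avoids the exception. This is a minimal sketch in Python 3 syntax, where the same mistake (calling “str” with an encoding on something that is already text) raises a similar TypeError. The helper name to_text is invented for this example.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only if it is still a byte string.

    (Hypothetical helper: it avoids the TypeError, but it cannot tell
    whether text that is already decoded was decoded correctly.)
    """
    if isinstance(value, bytes):
        # Raw bytes: decode them with the encoding we believe is right.
        return value.decode(encoding)
    if isinstance(value, str):
        # Already text: decoding again is what triggers the TypeError.
        return value
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)
```

Calling it on raw utf-8 bytes decodes them, and calling it on text returns the text unchanged, so code that isn’t sure what a library handed over can use it in both cases.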