ATBG: Why UTF-16?

NOTE: This post was originally a link to my post on the Coverity blog, which has been taken down. An archive of the original article is here.


Today on Ask The Bug Guys we have a language design question from reader Filipe, who asks:

Why does C# use UTF-16 as the default encoding for strings instead of the more compact UTF-8 or the fixed-width UTF-32?

Good question. First off I need to make sure that all readers understand what these different string formats are. Start by reading Joel’s article about character sets if you’re not clear on why there are different string encodings in the first place. I’ll wait.

.
.
.
.

Welcome back.

Now you have some context to understand Filipe’s question. Some Unicode formats are very compact: UTF-8 has one byte per character for the sorts of strings you run into in American programs, and most strings are pretty short even if they contain characters more commonly seen in European or Asian locales. However, the downside is that it is difficult to index into a string to find an individual character, because the character width is not a fixed number of bytes. Some formats waste a lot of space: UTF-32 uses four bytes per character regardless; a UTF-32 string can be four times larger than the equivalent UTF-8 string, but the character width is constant.
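
To make the size trade-off concrete, here is a quick C# sketch that measures the same text in each encoding using the standard System.Text.Encoding classes:

```csharp
using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        string english = "Hello, world";   // ASCII-range characters only
        string russian = "Привет, мир";    // Cyrillic: two bytes per letter in UTF-8

        foreach (var (name, s) in new[] { ("English", english), ("Russian", russian) })
        {
            Console.WriteLine($"{name}: UTF-8 = {Encoding.UTF8.GetByteCount(s)} bytes, " +
                              $"UTF-16 = {Encoding.Unicode.GetByteCount(s)} bytes, " +
                              $"UTF-32 = {Encoding.UTF32.GetByteCount(s)} bytes");
        }
        // English: UTF-8 = 12, UTF-16 = 24, UTF-32 = 48
        // Russian: UTF-8 = 20, UTF-16 = 22, UTF-32 = 44
    }
}
```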

UTF-16, which is the string format that C# uses, appears to be the worst of both worlds. It is not fixed-width: the “surrogate pair” characters require two 16 bit words for one character, while most characters require a single 16 bit word. But neither is it compact: a typical UTF-16 string is twice the size of a typical UTF-8 string. Why does C# use this format?
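
You can see the surrogate pair mechanism directly from C#; a sketch of the idea is below. A character outside the Basic Multilingual Plane occupies two char values in a string:

```csharp
using System;

class SurrogatePairs
{
    static void Main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so UTF-16 needs
        // a surrogate pair (two 16-bit words) to represent this one character.
        string clef = char.ConvertFromUtf32(0x1D11E);

        Console.WriteLine(clef.Length);                    // 2 -- two 16-bit words
        Console.WriteLine(char.IsHighSurrogate(clef[0]));  // True
        Console.WriteLine(char.IsLowSurrogate(clef[1]));   // True

        // Recombine the pair into the original code point.
        Console.WriteLine(char.ConvertToUtf32(clef[0], clef[1]).ToString("X"));  // 1D11E
    }
}
```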

Let’s go back to 1993, when I started at Microsoft as an intern on the Visual Basic team. Windows 95 was still called Chicago. This was well before the Windows operating system had a lot of Unicode support built in, and there were still different versions of Windows for every locale. My job, amongst other things, was to keep the Korean and Japanese Windows machines in the build lab running so that we could test Visual Basic on them.

Speaking of which: the first product at Microsoft that was fully Unicode internally, so that the same code could run on any localized operating system, was Visual Basic; this effort was well underway when I arrived. The program manager for this effort had a sign on his door that said ENGLISH IS JUST ANOTHER LANGUAGE. That is of course a commonplace attitude now but for Microsoft in the early 1990s this was cutting edge. No one at Microsoft had ever attempted to write a single massive executable that worked everywhere in the world. (UPDATE: Long time Microsoftie Larry Osterman has pointed out to me that NT supported UCS-2 in 1991, so I might be misremembering whether or not VB was the first Microsoft product to ship the same executable worldwide. It was certainly among the first.)

The Visual Basic team created a string format called BSTR, for “Basic String”. A BSTR is a length-prefixed UCS-2 string allocated by the BSTR allocator. The decision was that it was better to waste the space and have the fixed width than to use UTF-8, which is more compact but is hard to index into. Compatibility with the aforementioned version of NT was likely also a factor. As the intern who, among other things, was given the vexing task of fixing the bugs in the Windows 3.1 non-Unicode-based DBCS Far East string libraries used by Visual Basic, I heartily approved of this choice.

Wait a minute, what on earth is UCS-2? It is a Unicode string consisting of 16 bit words, but without surrogate pairs. UCS-2 is fixed width; there are no characters that consist of two 16 bit words, as there are in UTF-16.

But… how on earth did that work? There are more than two to the sixteen Unicode characters! Well, it was 1993! UTF-16 was not invented until 1996.

So Visual Basic used UCS-2. OLE Automation, the COM technology that lets VB talk to components, of course also used the BSTR format.

Then UTF-16 was invented. Because UTF-16 is backward-compatible with UCS-2, VB and OLE Automation got upgraded to UTF-16 “for free” a few years later.

When the .NET runtime was invented a few years after that, it of course used length-prefixed UTF-16 strings, to be compatible with all the existing COM / Automation / VB code out there.

C# is of course compatible with the .NET runtime.

So there you go: C# uses length-prefixed UTF-16 strings in 2014 because Visual Basic used length-prefixed UCS-2 BSTRs in 1993. Obviously!

So how then does C# deal with the fact that there are strings where some characters take a single 16 bit word and some take two?

It doesn’t. It ignores the problem. Just as it also ignores the problem that it is legal in UTF-16 to have a character and its combining accent marks in two adjacent 16 bit words. And in fact, that’s true in UTF-32 too; you can have UTF-32 characters that take up two 32-bit words because the accent is in one word and the character is in the other; the idea that UTF-32 is fixed-width in general is actually rather suspect.
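
A quick sketch of that point: “é” written as a base letter plus a combining accent is two code points, so even UTF-32 spends two 32-bit words on it, though normalization can often recombine the pair:

```csharp
using System;
using System.Text;

class CombiningMarks
{
    static void Main()
    {
        string composed   = "\u00E9";   // é as one precomposed code point
        string decomposed = "e\u0301";  // e followed by U+0301 COMBINING ACUTE ACCENT

        Console.WriteLine(Encoding.UTF32.GetByteCount(composed));    // 4: one 32-bit word
        Console.WriteLine(Encoding.UTF32.GetByteCount(decomposed));  // 8: two 32-bit words

        // The two strings render identically, and normalization can fold the
        // decomposed form back into the single precomposed code point.
        Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC) == composed);  // True
    }
}
```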

Strings with surrogate pairs are rare in the line-of-business programs that C# developers typically write, as are combining mark characters. If you have a string that is full of surrogate pairs or combining marks or any other such thing, C# doesn’t care one bit. If you ask for the length of the string you get the number of 16 bit words in the string, not the number of logical characters. If you need to deal with strings measured in terms of logical characters, not 16 bit words, you’ll have to call methods specifically designed to take these cases into account.
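
For example, System.Globalization.StringInfo counts text elements (logical characters, including a base character together with its combining marks) rather than 16 bit words:

```csharp
using System;
using System.Globalization;

class LogicalLength
{
    static void Main()
    {
        // "naïve" written with a combining diaeresis, plus one non-BMP character.
        string s = "nai\u0308ve " + char.ConvertFromUtf32(0x1F600);

        Console.WriteLine(s.Length);                                // 9: counts 16 bit words
        Console.WriteLine(new StringInfo(s).LengthInTextElements);  // 7: counts logical characters
    }
}
```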

By ignoring the problem, C# gets back into the best of both worlds: the string is still reasonably compact at 16 bits per character, and for practical purposes the width is fixed. The price you pay is of course that if you care about the problems that are ignored, the CLR and C# work against you to some extent.

19 thoughts on “ATBG: Why UTF-16?”

    • A 65,504-glyph set would probably be sufficient, if one were to assume languages such as non-simplified Chinese would be handled using different fonts and context-sensitive representations. Given that it’s not generally possible to handle multi-language text properly without using different fonts for different languages (problems arise even when using nothing but Western scripts which would easily fit in a 65,520-glyph set), the assumption that 65,504 glyphs would be sufficient when not mixing fonts was not unreasonable.

      • UTF-16 isn’t restricted to 16 bits, and UTF-8 isn’t restricted to 8 bits.

        Both are fully capable of representing the entire Unicode code space.

        UTF employs a scheme where it uses bits in each ‘word’ (8-bit word, or 16-bit word, etc) to indicate how many continuation words there are.
        In UTF-8, 0bxxxx xxx0 means the code point was encoded using a single word, and thus we’re able to encode characters with code point values less than 2^7 in a single word. If the code point is greater than or equal to 2^7, I’d have to pack the bits into two words, marking that there is a continuation byte.

        0bxxxx xx11 0bxxxx xx10 – the first word says ’11’ meaning one continuation byte; the second word says ’10’ meaning it is a continuation byte.

        If I need even more bits: 0bxxxx x111 0bxxxx xx10 0bxxxx xx10.

        UTF-16 does the same thing, but instead, words are always 16-bits, eg:

        0bxxxx xxxx xxxx xxx0 = no continuation word

        0bxxxx xxxx xxxx xx11 0bxxxx xxxx xxxx xx10 = one continuation word.

        0bxxxx xxxx xxxx x111 0bxxxx xxxx xxxx xx10 0bxxxx xxxx xxxx xx10 = two continuation words.

        The problem with UTF-16 is that if the text being represented typically has low-value codepoints, you end up not using most bits.

        The problem with UTF-8 is that if the text being represented typically has high-value codepoints, you end up having to use a lot of continuation words, and more bits are used to store signaling than used to store code point bits.

        UTF-8 spends ~1/4 of bits on signaling. UTF-16 spends ~1/8 of bits.

        Technically, UTF-8 can encode fewer total codepoints than UTF-16, since you run out of bits in the first word to indicate how many continuation words follow. However, this is not currently a problem for Unicode, since the Unicode codespace doesn’t define codepoints high enough to exhaust UTF-8.

        If Unicode were to eventually define codepoints past what UTF-8 can encode, then some, perhaps rare, texts could not be encoded in UTF-8 but would be encoded in UTF-16; UTF-8 could still be used for everything else though.

        This is the reason why C# uses UTF-16: the developer can’t choose which internal encoding to use; it has to be hardwired into the implementation, so C# has no choice but to use UTF-16 to remain future-proof.
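
For anyone who wants to check the byte counts being discussed, a quick C# sketch using the standard Encoding classes shows how many bytes each encoding actually spends as code points get larger (in standard UTF-8 the marker bits sit at the high end of each byte, but the sizes are the point here):

```csharp
using System;
using System.Text;

class BytesPerCodePoint
{
    static void Main()
    {
        // One sample code point from each UTF-8 length class.
        int[] codePoints = { 0x41, 0x3B1, 0x4E2D, 0x1F600 };  // 'A', 'α', '中', an emoji

        foreach (int cp in codePoints)
        {
            string s = char.ConvertFromUtf32(cp);
            Console.WriteLine($"U+{cp:X4}: UTF-8 = {Encoding.UTF8.GetByteCount(s)} bytes, " +
                              $"UTF-16 = {Encoding.Unicode.GetByteCount(s)} bytes");
        }
        // U+0041: UTF-8 = 1, UTF-16 = 2
        // U+03B1: UTF-8 = 2, UTF-16 = 2
        // U+4E2D: UTF-8 = 3, UTF-16 = 2
        // U+1F600: UTF-8 = 4, UTF-16 = 4
    }
}
```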


  1. Do you know to what extent the decision to specify that all instances of `String` will use the same representation was motivated by a desire to simply get stuff out the door (not always a bad thing) versus any sort of performance-tradeoff analysis? Would having the internal content of a `String` generally be unspecified (from the standpoint of external code), but including methods to test whether a range of characters was pure ASCII, copy a range of characters to a specified region of a byte array (if possible), and copy a range of characters to a specified region of a char array, have caused problems? It would be difficult to know, without measuring how strings are used, what string formats would offer sufficient benefits to be worth the trouble of implementing them, but if code outside `String` limited itself to defined interfaces, future implementations of `String` would have been free to use alternate storage methods. Do you know if such an approach was ever considered, or what factors counted for or against it?

    Also, it’s interesting that you regard UTF-16 as the best of both worlds, given that it imposes 100% overhead on the majority of strings that are generated by machines for machines (e.g. base64, xml, etc.), but won’t allow code to correctly accept arbitrary text without having to know about variable-length characters. The decision to use UTF-16 may have been reasonable when it was made, but I would consider it unfortunate in retrospect.

    • Interestingly, Python made a similar decision to use UTF-16 for its strings in Python 3, but recently switched to a flexible internal representation while still keeping them UTF-16 from the programmer’s perspective. PEP 393 (http://www.python.org/dev/peps/pep-0393/) has the gory details, but the upshot is that there are significant memory savings in ASCII-heavy apps at the cost of a little bit of performance.

      • I wonder how performance of UTF-8 strings would compare with having each individual string either be an array of bytes or shorts which uses one, two, or three bytes per character (so a sixteen-character string with one or more non-BMP characters would be stored using 48 bytes) or an array of arrays, the first of which would be reserved for an int[] with the character offsets associated with the rest (each of which would be a byte[], short[], or long[])? Concatenation of two three-item composite strings (a+b+c) and (d+e+f) would, depending upon the lengths of c and d, either produce a six-item composite string (a+b+c+d+e+f) or a new string cd [with the contents of c and d] and a five-item composite string a+b+cd+e+f. Individual-character lookups would be slightly more expensive than with a single linear array, but large, nearly-identical strings which share much of their content could be accessed faster than disjoint large strings because of caching.

        Each type of string would need to include copy-range-to-array-range and compare-range-to-array-range methods for every array format, and e.g. comparing two three-item composite strings would likely require performing five separate compare-to-array-range operations for the different overlapped regions, but I would expect that if one avoided having sub-arrays of less than e.g. 256 characters except as the first or last component in a string, overhead should be pretty minimal–generally less than that of having to examine each character individually to ascertain its length.

        • That reminds me of the ‘rope’ class that was included in the original SGI implementation of the STL back in pre-standard C++ days. std::rope, unfortunately, didn’t make the cut for standardization, but it was a very interesting class.
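
For the curious, a toy version of the rope idea in C#: concatenation builds a small tree of nodes and indexing walks it instead of copying character data (a sketch only; a real rope would also rebalance and cache lengths):

```csharp
using System;

// A toy rope: concatenation creates a tree node instead of copying characters.
abstract class Rope
{
    public abstract int Length { get; }
    public abstract char this[int index] { get; }

    public static Rope operator +(Rope left, Rope right) => new Concat(left, right);

    public sealed class Leaf : Rope
    {
        readonly string text;
        public Leaf(string text) { this.text = text; }
        public override int Length => text.Length;
        public override char this[int index] => text[index];
    }

    sealed class Concat : Rope
    {
        readonly Rope left, right;
        public Concat(Rope left, Rope right) { this.left = left; this.right = right; }
        public override int Length => left.Length + right.Length;
        public override char this[int index] =>
            index < left.Length ? left[index] : right[index - left.Length];
    }
}

class RopeDemo
{
    static void Main()
    {
        Rope r = new Rope.Leaf("Hello, ") + new Rope.Leaf("rope ") + new Rope.Leaf("world");
        Console.WriteLine($"{r.Length} chars, r[7] = '{r[7]}'");  // 17 chars, r[7] = 'r'
    }
}
```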

  2. I’m going to get slightly cynical and say, no, no it doesn’t, it supports UCS-2. A string class whose methods make clear that you are talking in logical characters would be necessary to say that UTF-16 is supported.

    Ideally, the class could encode the string in UTF-8 or -16 internally as is most efficient for the particular string. Writing such a thing out to a file or stream would involve the stream extracting characters out of the string in the encoding it expects.

    With the class we have now, it’s easy to ignore multiword characters, so in general code can’t be trusted with multiword characters.

  3. Now why would anyone want to support the ‘full-on’ UTF16, let alone UTF32, in a *programming language*? The only practical reason I can think of is a search application that searches through some hypothetical huge library that stores ancient manuscripts in dead languages, converted into electronic representation. Seriously, how many programmers are out there who would put comments in their C# code in, say, ancient Egyptian hieroglyphs, or Sumerian cuneiform, or Maya pictograms?
    Speaking of which. I’m not sure whether it was Visual Studio 2010 or MonoDevelop, but one of them slowed down to a *halt* on me when I loaded a C# program with Cyrillic comments into it. Any ideas, anyone…? 🙂

      • From an even more basic perspective, if one says a dialog box only supports characters chosen from a relatively small list, users may not like it, but most will at least understand what’s allowed and what isn’t. If one says it supports characters within the Basic Multilingual Plane, most users will have no idea what that means, and even those that do know what it means will seldom know whether to expect a particular string to work. Add in the fact that many controls will accept strings outside the BMP, but code which uses them may behave strangely, and users will have an even harder time knowing what they can and cannot expect to do.

    • Now why would anyone want to support the ‘full-on’ UTF16, let alone UTF32, in a *programming language*?

      Well, the meaning of “support” that is implied by the post and other comments is “can correctly handle text encoded in”, not so much “can write program code in”.

      But, either way, anyone who wants their application or compiler to just work in an international environment will find UTF-* support a necessary but not sufficient condition.

      Personally, I hate solutions that let you say, “yeah, sure, it’s internationalised, but, um, not for you”. I also write dead languages from time to time, and use software for same.

  4. I’ve always questioned the decision to use UTF-16 _on the basis of “easy to index into”_. IT IS NOT!

    You see, every time you “easily” index into it, you are basically doing the same thing as you would if you “easily” indexed into a UTF-8 encoded byte array. You might just end up in the middle of a codepoint!

    It is just as much a bug to go “myString.Substring(5)” as it is to cut off a UTF-8 byte array at the sixth byte. The only difference is that everyone knows about and expects the UTF-8 situation, and those who don’t find out soon, when a Russian or Chinese speaker tries to use their program.

    But if you go “myString.Substring(5)”, you will likely never know of the bug, because you only broke the characters outside of the BMP. BMP includes basically every spoken language out there. So you end up thinking you support all of Unicode, when you actually don’t.

    Moral: UTF-16 is evil and gives you a false sense of doing the right thing!

    Alternative moral: “supports Unicode” actually means “supports the BMP” and nobody cares about the truly fancy Unicode characters.
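
A short C# sketch of that pitfall, and of counting in text elements instead; an index that is perfectly valid for Substring can still land between the two halves of a surrogate pair:

```csharp
using System;
using System.Globalization;

class SubstringPitfall
{
    static void Main()
    {
        // "Hello" followed by one non-BMP character (a surrogate pair) and "world".
        string s = "Hello" + char.ConvertFromUtf32(0x1F600) + "world";

        // Index 6 counts 16-bit words, so it lands on the low surrogate: the
        // result starts with half a character and is no longer valid UTF-16.
        string broken = s.Substring(6);
        Console.WriteLine(char.IsLowSurrogate(broken[0]));   // True

        // Counting in text elements (logical characters) avoids the problem.
        Console.WriteLine(new StringInfo(s).SubstringByTextElements(6));  // "world"
    }
}
```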
