Programming language designers and users talk a lot about the “height” of language features; some languages are considered to be very “high level” and some are considered to be very “low level”. A “high level” language is, generally speaking, one that emphasizes the business concerns of the program, and a “low level” language is one that emphasizes the mechanisms of the underlying hardware. As two extreme examples, here’s a program fragment in my favourite high-level language, Inform 7:
Random programming
FYI, my friend Bob has started a programming blog; as he is an expert on code quality, effective use of template libraries, programming language design, and many other interesting aspects of programming, I’m looking forward to checking it out. It’s at randomprogramming.com.
Volcanoes and fried foods (rerun)
Today on Fun For Friday FAIC, a classic episode first posted in 2009. Enjoy!
I’ve returned from a brief vacation, visiting friends on the island of Maui. I’d never been to that part of the world before. Turns out, it’s a small island in the middle of the Pacific Ocean, entirely made out of volcanoes. Weird! But delightful.
ATBG: Warnings vs errors
NOTE FROM 2021: This article was originally posted on the Coverity Development Testing blog, as part of the “Ask The Bug Guys” series. Those blogs have been taken down; I’ve copied the original content here.
Heartbleed and static analysis
In the wake of the security disaster that is the Heartbleed vulnerability, a number of people have asked me if Coverity’s static analyzer detects defects like this. It does not yet, but you’d better believe our security team is hard at work figuring out ways to detect and thereby prevent similar defects. (UPDATE: We have shipped a hotfix release that has a checker that will find defects like the HeartBleed defect. The security team works fast!)
I’ll post some links to some articles below, but they’re a bit jargonful, so I thought that a brief explanation of the jargon might be appropriate. The basic idea is as follows:
Standard and Daylight are different
A couple weeks ago I had an online meeting with some European colleagues; I showed up in the chat room at what I thought was the agreed-upon time and they did not, which was odd, but whatever, I waited ten minutes and then rescheduled the meeting. It turns out they did the same an hour later. I’m sure you can guess why.
If you have been sent a link to this page, it is to remind you that “Eastern Standard Time” is not defined as “whatever time it is in New York City right now”; it is defined as “Eastern Time not adjusted for Daylight Saving Time”. Parts of the world in the Eastern time zone that do not observe Daylight Saving Time — Panama, for instance — stay in Eastern Standard Time all year, so it is an error to assume that Eastern Standard Time and Eastern Time are the same time.
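To see the difference in code, here is a minimal sketch using the .NET TimeZoneInfo API; it is my illustration, not part of the original post. Note that the Windows time zone ID “Eastern Standard Time” confusingly names the Eastern time zone complete with its daylight rules, so the sketch also constructs a fixed UTC-5 zone to represent what “Eastern Standard Time” literally means.

using System;

static class EasternTimeDemo
{
    static void Main()
    {
        // The Windows zone ID "Eastern Standard Time" actually means "Eastern Time",
        // daylight saving rules included.
        TimeZoneInfo easternTime = TimeZoneInfo.FindSystemTimeZoneById("Eastern Standard Time");

        // A fixed UTC-5 zone: Eastern Standard Time taken literally, as described above.
        TimeZoneInfo easternStandard = TimeZoneInfo.CreateCustomTimeZone(
            "EST-fixed", TimeSpan.FromHours(-5), "Eastern Standard Time (fixed)", "EST");

        // A summer instant: noon UTC on the first of July.
        DateTime julyNoonUtc = new DateTime(2014, 7, 1, 12, 0, 0, DateTimeKind.Utc);

        // Eastern Time in July is daylight time, UTC-4: New York clocks read 8:00.
        Console.WriteLine(TimeZoneInfo.ConvertTimeFromUtc(julyNoonUtc, easternTime));

        // Eastern Standard Time is always UTC-5: the same instant is 7:00, one hour
        // earlier than the wall clocks read, which is exactly how a meeting gets missed.
        Console.WriteLine(TimeZoneInfo.ConvertTimeFromUtc(julyNoonUtc, easternStandard));
    }
}

A meeting announced for “10 AM Eastern Standard Time” in July therefore starts when New York clocks read 11 AM, an hour later than someone thinking in terms of New York local time expects.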
ATBG: Why UTF-16?
NOTE: This post was originally a link to my post on the Coverity blog, which has been taken down. An archive of the original article is here.
Today on Ask The Bug Guys we have a language design question from reader Filipe, who asks:
Why does C# use UTF-16 as the default encoding for strings instead of the more compact UTF-8 or the fixed-width UTF-32?
Good question. First off I need to make sure that all readers understand what these different string formats are. Start by reading Joel’s article about character sets if you’re not clear on why there are different string encodings in the first place. I’ll wait.
.
.
.
.
Welcome back.
Now you have some context to understand Filipe’s question. Some Unicode formats are very compact: UTF-8 uses one byte per character for the sorts of strings you run into in American programs, and most strings are pretty short even if they contain characters more commonly seen in European or Asian locales. However, the downside is that it is difficult to index into a string to find an individual character, because the character width is not a fixed number of bytes. Some formats waste a lot of space: UTF-32 uses four bytes per character regardless; a UTF-32 string can be four times larger than the equivalent UTF-8 string, but the character width is constant.
UTF-16, which is the string format that C# uses, appears to be the worst of both worlds. It is not fixed-width: the “surrogate pair” characters require two 16 bit words for one character, while most characters require a single 16 bit word. But neither is it compact; a typical UTF-16 string is twice the size of a typical UTF-8 string. Why does C# use this format?
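To put some rough numbers on that trade-off, here is a little sketch of mine (not part of the original post) that measures the same text in each encoding using the System.Text.Encoding classes:

using System;
using System.Text;

static class EncodingSizeDemo
{
    static void Main()
    {
        // An ASCII-only string, a string with accented European characters, and a
        // string containing a character outside the Basic Multilingual Plane.
        string[] samples = { "Hello, world", "Héllo, wörld", "Hello, \U0001D11E" };
        foreach (string s in samples)
        {
            Console.WriteLine("UTF-8: {0,2} bytes   UTF-16: {1,2} bytes   UTF-32: {2,2} bytes",
                Encoding.UTF8.GetByteCount(s),
                Encoding.Unicode.GetByteCount(s),   // UTF-16, the encoding C# strings use
                Encoding.UTF32.GetByteCount(s));
        }
    }
}

The ASCII-only string takes 12, 24 and 48 bytes respectively, the accented string takes 14, 24 and 48 bytes, and the supplementary character at the end of the third string takes four bytes in all three encodings.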
Let’s go back to 1993, when I started at Microsoft as an intern on the Visual Basic team. Windows 95 was still called Chicago. This was well before the Windows operating system had a lot of Unicode support built in, and there were still different versions of Windows for every locale. My job, amongst other things, was to keep the Korean and Japanese Windows machines in the build lab running so that we could test Visual Basic on them.
Speaking of which: the first product at Microsoft that was fully Unicode internally, so that the same code could run on any localized operating system, was Visual Basic; this effort was well underway when I arrived. The program manager for this effort had a sign on his door that said ENGLISH IS JUST ANOTHER LANGUAGE. That is of course a commonplace attitude now but for Microsoft in the early 1990s this was cutting edge. No one at Microsoft had ever attempted to write a single massive executable that worked everywhere in the world. (UPDATE: Long time Microsoftie Larry Osterman has pointed out to me that NT supported UCS-2 in 1991, so I might be misremembering whether or not VB was the first Microsoft product to ship the same executable worldwide. It was certainly among the first.)
The Visual Basic team created a string format called BSTR, for “Basic String”. A BSTR is a length-prefixed UCS-2 string allocated by the BSTR allocator. The decision was that it was better to waste the space and have the fixed width than to use UTF-8, which is more compact but is hard to index into. Compatibility with the aforementioned version of NT was likely also a factor. As the intern who, among other things, was given the vexing task of fixing the bugs in the Windows 3.1 non-Unicode-based DBCS Far East string libraries used by Visual Basic, I heartily approved of this choice.
Wait a minute, what on earth is UCS-2? It is a Unicode string consisting of 16 bit words, but without surrogate pairs. UCS-2 is fixed width; there are no characters that consist of two 16 bit words, as there are in UTF-16.
But… how on earth did that work? There are more than two to the sixteen Unicode characters! Well, it was 1993! UTF-16 was not invented until 1996.
So Visual Basic used UCS-2. OLE Automation, the COM technology that lets VB talk to components, of course also used the BSTR format.
Then UTF-16 was invented and is compatible with UCS-2, so “for free” VB and OLE Automation got upgraded to UTF-16 a few years later.
When the .NET runtime was invented a few years after that, it of course used length-prefixed UTF-16 strings, to be compatible with all the existing COM / Automation / VB code out there.
C# is of course compatible with the .NET runtime.
So there you go: C# uses length-prefixed UTF-16 strings in 2014 because Visual Basic used length-prefixed UCS-2 BSTRs in 1993. Obviously!
So how then does C# deal with the fact that there are strings where some characters take a single 16 bit word and some take two?
It doesn’t. It ignores the problem. Just as it also ignores the problem that it is legal in UTF-16 to have a character and its combining accent marks in two adjacent 16 bit words. And in fact, that’s true in UTF-32 too; you can have UTF-32 characters that take up two 32-bit words because the accent is in one word and the character is in the other; the idea that UTF-32 is fixed-width in general is actually rather suspect.
Strings with surrogate pairs are rare in the line-of-business programs that C# developers typically write, as are combining mark characters. If you have a string that is full of surrogate pairs or combining marks or any other such thing, C# doesn’t care one bit. If you ask for the length of the string you get the number of 16 bit words in the string, not the number of logical characters. If you need to deal with strings measured in terms of logical characters, not 16 bit words, you’ll have to call methods specifically designed to take these cases into account.
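Here is a short sketch of that distinction; the strings are my own illustrative examples. The Length property counts 16 bit words, char.IsSurrogatePair detects the two-word case, and System.Globalization.StringInfo counts logical characters, treating a surrogate pair, or a base character plus its combining marks, as a single text element.

using System;
using System.Globalization;

static class TextElementDemo
{
    static void Main()
    {
        string clef = "\U0001D11E";   // MUSICAL SYMBOL G CLEF, stored as a surrogate pair
        string accented = "e\u0301";  // 'e' followed by COMBINING ACUTE ACCENT

        Console.WriteLine(clef.Length);                                    // 2, two 16 bit words
        Console.WriteLine(char.IsSurrogatePair(clef[0], clef[1]));         // True
        Console.WriteLine(new StringInfo(clef).LengthInTextElements);      // 1 logical character

        Console.WriteLine(accented.Length);                                // 2
        Console.WriteLine(new StringInfo(accented).LengthInTextElements);  // 1, the single character é
    }
}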
By ignoring the problem, C# gets the best of both worlds back: the string is still reasonably compact at 16 bits per character, and for practical purposes the width is fixed. The price you pay is, of course, that if you care about the problems that are being ignored, the CLR and C# work against you to some extent.
Speaking in Los Angeles Monday
I’ll be the guest speaker the evening of this coming Monday, April 7th, at the Los Angeles .NET meetup; come on out and we’ll play a few rounds of everyone’s favourite game, Spot the Defect.
Hope to see you in LA!
C# and VB are open sourced
For literally years now the Roslyn team has been considering whether or not to release the C# and VB analyzers as open source projects, and so I was very happy, but not particularly surprised, to watch Anders announce on Channel 9 a few minutes ago that Roslyn is now available on CodePlex.
What astonished me was that it’s not just a “reference” license, but a full-on liberal Apache 2.0 license. And then to have Miguel announce that Xamarin had already got Roslyn working on Linux was gobsmacking.
Believe me, we cloned that repo immediately.
I’m still mulling over the consequences of this awesome announcement; I’m watching Soma discuss Roslyn on Channel 9 right now, and Anders is coming up again soon for a Q&A session. (At 12:10 Pacific Daylight Time, here.)
I am also on a personal level very excited and a little nervous to finally have a product that I spent years of my life working on widely available in source code form. Since I always knew that open sourcing was a possibility I tried to write my portions of it as cleanly and clearly as possible; hopefully I succeeded.
Congratulations to the whole Roslyn team, and thanks for taking this big bold step into the open source world.
Reordering optimizations
In a previous episode I discussed whether it is a good idea to remove a lock which protects an integer field. I said that without a lock, the read of a field can be moved arbitrarily far backwards in time on a thread, and that this can cause unexpected behaviour in a program. It is perhaps not clear why this can be a problem; it seems like a read shouldn’t have any side effects that can be noticed if it happens earlier. A number of readers asked whether making the read into a volatile read prevented the reordering problem.
The answer to the general question of whether a volatile read can be re-ordered is yes, yes it can. Though a volatile read in a loop cannot be cached once and elided, a volatile read can be moved backwards in time with respect to a volatile write.
The answer to the more specific question of whether making the field volatile makes the program I was given correct cannot really be answered because that was an oversimplified toy program. Deducing the correctness of a real program by analyzing a toy program is perhaps a bad idea.
Instead, I’ll give you an example that (1) can actually happen even on the more restrictive memory model imposed by x86 hardware, and (2) heck let’s make everything volatile while we’re at it; it won’t help! We’ll elide a bunch of locks and see what goes wrong. Check out this program fragment:
static volatile bool q = false;
static volatile bool r = false;
static volatile bool s = false;
static volatile bool t = false;
static object locker = new object();
static bool GetR() { return r; } // No lock!
static void SetR() { lock(locker) { r = true; } }
static void MethodOne()
{
    q = true;
    if (!GetR())
        s = true;
}

static void MethodTwo()
{
    SetR();
    if (!q)
        t = true;
}
The rest of the program, which I have not shown, behaves as follows. First, the static initializers run normally, so the four Booleans and the monitor are all created and assigned to their given values. Then the program creates two threads. Thread one runs MethodOne and thread two runs MethodTwo. Both threads run to completion and these are the only invocations of the two methods. The question is: can the original thread now observe s and t to both be true?
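For concreteness, here is a sketch of the kind of harness that description implies; the structure is my reconstruction, not code from the original program:

using System;
using System.Threading;

// ... the fields, GetR, SetR, MethodOne and MethodTwo shown above go here ...

static void Main()
{
    Thread threadOne = new Thread(MethodOne);
    Thread threadTwo = new Thread(MethodTwo);
    threadOne.Start();
    threadTwo.Start();
    threadOne.Join();
    threadTwo.Join();
    // The puzzle: can this line ever print "s = True, t = True"?
    Console.WriteLine("s = {0}, t = {1}", s, t);
}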
Give that some thought before you read on.
.
.
.
.
.
.
It would appear that the answer is no, by the following logic.
- If one thread runs q = true; before the other runs if (!q) then t remains false.
- Otherwise, one thread runs q = true; after the other runs if (!q). Clearly that implies that if (!GetR()) must have run after SetR();. That implies that !GetR() is false. Therefore s remains false.
- There are only these two possible cases and in each at least one of s and t remains false, therefore it is impossible that both become true.
This logic is wrong; specifically the second statement is completely bogus and therefore the conclusion does not follow. As is often the case with my puzzles, the portion of the analysis marked “clearly” is the flag that indicates the error.
The lack of a lock on the read of r means that the CPU may re-order the read by moving it arbitrarily far backwards in time even with respect to volatile writes, and an x86 will occasionally do so if the stars align correctly! An x86 will not reorder two reads with respect to each other and will not reorder two writes with respect to each other, but it has no problem reordering a read of one variable with respect to a write of another. The CPU is permitted to move the read backwards in time by pretending that you actually wrote:
static void MethodOne()
{
    bool temp = r;
    q = true;
    if (!temp)
        s = true;
}
Since this optimization is invisible to the code on the current thread, it is legal. But now there is an obvious interleaving in which both s and t become true. First we assign false to temp on one thread, then all of MethodTwo runs on the other thread, so t is true, and then the remainder of MethodOne sets s to true.
Now imagine that your program depends on s and t never being both true, and this situation happens, let’s say, one time out of every billion executions of this code. If it runs a thousand times a day on a thousand machines, that’s one failure every three years. How on earth would you debug it? You’d be likely to suspect hardware failure rather than a bug in the program, but there assuredly is a bug in the program. These are the nightmare scenarios you run into when you think you’re clever enough to elide locks. The moment you depart even slightly from one of the “blessed” low-lock patterns, you are off in the weeds.
The right thing to do here is, first, to avoid the shared-memory situation entirely if you can. If you cannot, lock everything, all the time. In this particular case, locking the read of r or the write of q would likely be sufficient to ensure that s and t are never both true, but I discourage the attitude that walking as close as possible to the edge of a cliff is a great way to be safe. Walk as far away from that cliff as you possibly can! If you’ve got shared memory then put a lock around all usages of it. Simply making a shared variable volatile is insufficient to ensure thread safety in all cases.
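To make “lock everything, all the time” concrete, here is a sketch of what that looks like for this toy program; it is my illustration rather than code from the post. Every access to the shared Booleans happens under the same monitor, so the two methods are serialized and whichever runs second must observe the first one’s write.

static object locker = new object();
static bool q = false, r = false, s = false, t = false;

static void MethodOne()
{
    lock (locker)    // take the same lock for every access to the shared state
    {
        q = true;
        if (!r)
            s = true;
    }
}

static void MethodTwo()
{
    lock (locker)
    {
        r = true;
        if (!q)
            t = true;
    }
}

With the coarse lock in place the interleaving from the previous section is impossible: s and t can never both end up true, and none of the fields need to be volatile.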
Thanks to threading expert Joe Duffy for bringing this scenario to my attention many years ago.