Monitor madness, part two

In the previous exciting episode I ended on a cliffhanger; why did I put a loop around each wait? In the consumer, for example, I said:

    while (myQueue.IsEmpty)
      Monitor.Wait(myLock); 

It seems like I could replace that “while” with an “if”. Let’s consider some scenarios. I’ll consider just the scenario for the loop in the consumer, but of course similar scenarios apply mutatis mutandis for the producer.

Scenario one: Everything is awesome, everything is cool when you’re part of a team. Suppose the consumer is moved from its wait state to the ready state because the producer has put something on the queue. Now the queue is definitely no longer empty, and we are ready to enter the monitor again. Suppose we fail to do so right away due to a race with the producer. The producer might enter the monitor again and put more stuff on the queue, but eventually the queue will fill up, the producer will put itself into the wait state, and the consumer is then the only thread left attempting to get into the monitor. Success is guaranteed, and there seems to be no need to check to see if the queue is empty; if we managed to re-enter the monitor it was because something was put on the queue. The loop is unnecessary.

Scenario two: Some other thread got ahold of myLock and for reasons of its own decided to pulse the monitor. That thread is not the producer, so it did not ensure that the queue was non-empty. The consumer must be defensive and say “re-entering the monitor is not a guarantee that my desired condition was met, therefore I must check again.” If it is by design that a third thread can pulse the monitor then there needs to be a loop; if it is not by design then the existence of such a third thread is a bug in the program. If we can assume that there is no such third thread then we don’t need a loop.
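Scenario two can be reproduced on demand, which makes the difference between “if” and “while” concrete. The following is a minimal sketch, not code from this series: the class `StrayPulseDemo`, the method `ConsumerSurvives`, and the use of `Queue<int>` with `Count == 0` in place of `IsEmpty` are all assumptions made for illustration. The calling thread plays the third thread and pulses the monitor without enqueueing anything.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical demo; none of these names come from the original post.
public static class StrayPulseDemo
{
    // Returns true if the consumer coped with a pulse that was not
    // accompanied by an enqueue, false if it dequeued from an empty queue.
    public static bool ConsumerSurvives(bool useLoop)
    {
        var myLock = new object();
        var myQueue = new Queue<int>();
        var consumerHoldsLock = new ManualResetEventSlim(false);
        bool survived = true;

        var consumer = new Thread(() =>
        {
            lock (myLock)
            {
                // Set while holding the lock: no other thread can enter the
                // monitor until Monitor.Wait releases it below.
                consumerHoldsLock.Set();
                if (useLoop)
                {
                    while (myQueue.Count == 0)
                        Monitor.Wait(myLock);
                }
                else
                {
                    if (myQueue.Count == 0)
                        Monitor.Wait(myLock);
                }
                try { myQueue.Dequeue(); }
                catch (InvalidOperationException) { survived = false; }
            }
        });
        consumer.Start();
        consumerHoldsLock.Wait();

        // The "third thread": pulses without putting anything on the queue.
        lock (myLock) { Monitor.PulseAll(myLock); }

        if (useLoop)
        {
            // The looping consumer went back to sleep (or is about to recheck);
            // hand it a real item so that it can finish.
            lock (myLock) { myQueue.Enqueue(42); Monitor.PulseAll(myLock); }
        }
        consumer.Join();
        return survived;
    }
}
```

Under these assumptions, `ConsumerSurvives(false)` reports that the consumer tried to dequeue from an empty queue, while `ConsumerSurvives(true)` reports that it rechecked the condition and waited for a real item.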

Scenario three: The producer genuinely did put something on the queue, and at some time after that, the consumer re-entered the monitor. But between those two events, a third thread won the race and correctly removed the item from the queue for reasons of its own. Again, if that’s a by-design scenario then the consumer has to be willing to check the condition again. If it’s not a by-design scenario then there’s no need for a loop.

So let’s suppose there are only two threads, guaranteed, producer and consumer, that access this lock object and party on this queue. Our second and third scenarios do not apply, so the loop is unnecessary, right? Unfortunately there is a fourth scenario:

Scenario four: Everything is terrible! One time in a hundred billion runs a waiting thread wakes up and goes to the ready state even if it was never pulsed. Suppose we have no loop, and this rare event happens. A possible ordering of events is:

  • The consumer enters the monitor, checks the queue, it is empty, it puts itself to bed.
  • While the producer is running around looking for work, not touching the queue, the consumer thread spuriously wakes up, re-enters the monitor, and without looping, continues running, assuming the queue is non-empty. The queue code produces an unhandled exception and the consumer thread dies a horrible death.

In a world where spurious wakeups are a possibility, you have to always check your conditions in a loop. See, the loop mitigates the terrible scenario; if a thread wakes up spuriously then it checks its condition again, and goes back to sleep if it is not met.
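For reference, here is one way the defensive pattern might look when both sides wait in a loop. This is a sketch under stated assumptions rather than the code from the previous episode: `BoundedQueue<T>`, its `capacity` parameter, and `BoundedQueueDemo.Transfer` are hypothetical names, and `PulseAll` is used because producers and consumers wait on the same lock object.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical bounded queue; names and the capacity parameter are illustrative.
public sealed class BoundedQueue<T>
{
    private readonly object myLock = new object();
    private readonly Queue<T> myQueue = new Queue<T>();
    private readonly int capacity;

    public BoundedQueue(int capacity) { this.capacity = capacity; }

    public void Enqueue(T item)
    {
        lock (myLock)
        {
            // Recheck after every wakeup: being awake proves nothing
            // about whether there is room on the queue.
            while (myQueue.Count >= capacity)
                Monitor.Wait(myLock);
            myQueue.Enqueue(item);
            Monitor.PulseAll(myLock); // wake threads waiting for "not empty"
        }
    }

    public T Dequeue()
    {
        lock (myLock)
        {
            // Same discipline on the consumer side.
            while (myQueue.Count == 0)
                Monitor.Wait(myLock);
            T item = myQueue.Dequeue();
            Monitor.PulseAll(myLock); // wake threads waiting for "not full"
            return item;
        }
    }
}

public static class BoundedQueueDemo
{
    // Moves the integers 0..count-1 from one producer thread to one
    // consumer thread and returns them in the order they were received.
    public static List<int> Transfer(int count, int capacity)
    {
        var queue = new BoundedQueue<int>(capacity);
        var received = new List<int>();
        var producer = new Thread(() =>
        {
            for (int i = 0; i < count; i++) queue.Enqueue(i);
        });
        var consumer = new Thread(() =>
        {
            for (int i = 0; i < count; i++) received.Add(queue.Dequeue());
        });
        producer.Start();
        consumer.Start();
        producer.Join();
        consumer.Join();
        return received;
    }
}
```

The cost of the loop is one extra condition check per wakeup. In exchange, a wakeup that arrives for any reason other than “the condition is now true” sends the thread back to sleep instead of letting it run on a false assumption.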

Are spurious wakeups a possibility in C#? This is a surprisingly hard question. Let me list some facts.

Fact one: Spurious wakeups are known to be a rare but observable possibility when using condition variables (a locking mechanism very similar to what we’ve been discussing in this series) on operating systems that use POSIX threads. In particular, on Linux when a process is signaled there is a race condition. The choices faced by the designers of Linux were, I gather, (1) allow the race to cause spurious wakeups, (2) allow the race to cause some wakeups to be lost; clearly unacceptable, the consumer would never come back and eventually the queue would fill up, or (3) create an implementation with unacceptably high performance costs.

Fact two: Spurious wakeups are similarly documented as being a problem with Windows condition variables. “Condition variables are subject to spurious wakeups […] you should recheck a predicate (typically in a while loop) after a sleep operation returns.”

Fact three: The Java documentation states

“A thread can also wake up without being notified, interrupted, or timing out, a so-called spurious wakeup. While this will rarely occur in practice, applications must guard against it […] waits should always occur in loops.”

Apparently the designers of Java explicitly endorse the theory that spurious wakeups are a real thing.

Fact four: Joe Duffy notes in “Concurrent Programming on Windows” that the claim that Windows suffers from spurious wakeups is somewhat histrionic:

“[…] threads must be resilient to something called spurious wake-ups […] This is not because the implementation will actually do such things […] but rather due to the fact that there is no guarantee around when a thread that has been awakened will become scheduled. Condition variables are not fair. It’s possible – and even likely – that another thread will acquire the associated lock and make the condition false again before the awakened thread has a chance to reacquire the lock and return to the critical region.”

Basically, Joe is saying here that in many situations our “scenario three” is likely.

Fact five: The documentation for Monitor.Wait() says nothing about spurious wakeups or always waiting in a loop.

Fact six: Apparently the CLR does not actually use condition variables as its mechanism for implementing monitors, and therefore reasoning from the shortcomings of condition variables to the shortcomings of C# locks is poor reasoning. We really ought to examine the mechanisms the CLR actually uses if we want to know if they are subject to this problem. No, I’m not going to; see Stephen Cleary’s comment below for some links.

Fact seven: Many expert C# programmers like Jon Skeet (UPDATE: see comments!) and Joseph Albahari recommend always waiting in a loop. And some static analyzers look for missing loops around waits and flag them as a bad code smell; using a loop is a cheap and safe way to make such analyzers stop complaining.

Spurious wakeups in C# seem to be somewhat mythical beasts; people are afraid of them without ever having encountered one in the wild.

So what would I do here?

Well, the first thing I would do is of course not write programs that share memory across threads! It’s a terrible thing to do! Look at me; I’m a pretty smart guy and I cannot tell you whether to write if or while without writing a seven-item list of pros and cons that thoroughly contradicts itself, makes false analogies and rests upon appeals to authority and the absence of warnings in documentation! This would be a pretty weak foundation upon which to base a coding decision that has real consequences.

If I had to write a program that shared memory across threads then I would use the highest level tool in my toolbox. I would use a thread safe collection written by experts in this case. (Of course that simply begs the question; the expert must know how to do so safely using lower-level mechanisms! I presume they know better than I do.)  If for some reason that was unavailable then I would use a higher-level construct for signaling, like an auto reset event, or a reader-writer lock, or whatever.
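As an illustration of the “thread safe collection written by experts” option, the sketch below uses `BlockingCollection<T>` from `System.Collections.Concurrent`. That choice is an assumption, since the post does not name a specific collection, and `BlockingCollectionDemo` and `Transfer` are hypothetical names. No explicit lock, wait, or pulse appears in the calling code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical demo class; BlockingCollection<T> itself is part of .NET.
public static class BlockingCollectionDemo
{
    // Moves the integers 0..count-1 from a producer task to a consumer task
    // through a bounded collection and returns them in the order received.
    public static List<int> Transfer(int count, int capacity)
    {
        var received = new List<int>();
        using (var queue = new BlockingCollection<int>(capacity))
        {
            var producer = Task.Run(() =>
            {
                for (int i = 0; i < count; i++)
                    queue.Add(i);       // blocks while the collection is full
                queue.CompleteAdding(); // tells the consumer no more items are coming
            });
            var consumer = Task.Run(() =>
            {
                // Blocks while the collection is empty; the enumeration ends
                // once adding is complete and the collection has drained.
                foreach (int item in queue.GetConsumingEnumerable())
                    received.Add(item);
            });
            Task.WaitAll(producer, consumer);
        }
        return received;
    }
}
```

With the default backing store (a concurrent queue) and a single producer and consumer, items arrive in the order they were added. The if-versus-while question does not go away; it simply becomes the library author’s problem.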

Were I forced to write code like this that uses monitors at a low level then I would grit my teeth, embrace cargo-cultism, put a banana in my ear, and write the loop even without being able to give a solid justification for why doing so keeps the alligators away.

24 thoughts on “Monitor madness, part two”


  2. My part of Fact 7 is almost certainly due to parroting other (more genuine) experts, with a similar banana in my ear and similar scepticism about it *really* being a problem.

    I used to think I was something of a threading expert, but I’m a lot more sceptical of that now, too… especially in the face of mind-melting-but-theoretically-valid JIT optimizations. But hey, maybe one day we’ll have an easy-to-understand memory model for C# 🙂

  3. “I would use a higher-level construct for signaling, like an auto reset event”

    This had me breathing a sigh of relief. I use auto reset events for thread synchronization in C#. Not because I believed they were better, but simply because those are what I know. 🙂

  4. Unmanaged threads are even worse: it turns out that just a Pulse is not safe (even with a loop) because threads can be borrowed by kernel-mode DPCs and thus miss the pulse (http://blogs.msdn.com/b/oldnewthing/archive/2005/01/05/346888.aspx). There’s a long history of incorrect Condition Variable implementations on Windows before they were added to the kernel (http://www.cs.wustl.edu/~schmidt/win32-cv-1.html , also tons of old Dr. Dobb’s articles 🙂 ).

    Monitor avoids this in .NET by managing its own wait queue. It “lifts” queue management into managed code (http://blog.stephencleary.com/2009/09/if-pusleevent-is-broken-what-about.html).

    Interestingly, I’ve seen a similar kind of “lift” in my AsyncEx library, which has a suite of asynchronous coordination primitives. I “lift” the queue management (of tasks, not threads) into managed code. And this has really interesting implications on things like lock hierarchies.


  6. I wouldn’t quite call it cargo cult programming; you know why it’s supposed to help, just not if it’s necessary. I also don’t think it’s quite the same thing as Ernie’s ear banana… One obvious problem with his idea is that he assumes the lack of alligators is due to the banana. That, I suppose, is similar to not knowing whether or not looping helps with spurious wake-ups. However, the banana is also silly because it wouldn’t help if it had the opportunity! Putting the Wait in a loop may be warding off a mythical creature, but at least we are reasonably sure (based on the makeup of the alleged creature) that it would help if we encountered it. So it seems that, should we end up using Wait instead of a higher-level tool, it’s not so silly to put it in a loop.

  7. Another fact: System requirements change, and such changes may make it necessary to post events which someone MIGHT be interested in without knowing if anyone actually IS.

    BTW, is there any means of writing a monitor-based queue such that it can be reliably shut down by code in a Constrained Execution Region? It would seem like it would not have been expensive to implement such a facility with a “Monitor.Invalidate” method which would–without having to acquire a monitor lock–set a flag in it so that current or future wait operations would throw an immediate exception, but I don’t think any such thing exists. Is there any way to achieve such semantics without it?

      • To allow finalizer-based cleanup of objects which encapsulate service-providing threads. The idea would be that a client-facing wrapper object holds a reference to the “main” object used by a service thread and a finalizable object whose finalizer should notify the service thread that nobody cares any more about the service it had been providing.

  8. I would code based on Scenario 3 even if I knew in this particular iteration of the program it couldn’t happen, because it’s low cost, unsurprising to folks used to reading synchronization code, and is robust to future design changes.

    • Alex has hit the reasons I would use the while loop.

      I would assume that it is likely that someone will decide at some point it is a good idea to have more than one consuming thread, so as to speed up the systems.

  9. I have an interesting story to share, about your advice of using AutoResetEvent instead of Monitor where applicable.

    In our application, we had to implement timed execution for a piece of functionality which must complete execution within the configured time or be aborted, and that used ARE to signal work threads when an item was available for processing. If the thread completed work within the allowed time everything was green. However, if the timeout was reached, we Thread.Abort()ed the worker thread and returned an error.

    It would happen every so often, and on only one machine, that the ARE backing this mechanism would end up in a corrupted state and the entire signaling mechanism would fall to pieces, requiring a process restart.

    Switching to Monitor.Wait( )/Monitor.Pulse implementation fixed that particular bug for us, using the example code given at http://www.albahari.com/threading/part4.aspx#_Signaling_with_Wait_and_Pulse

    We have not found any reason or explanation of why it was corrupting state on just one machine. I guess the moral of the story here is not to take advice religiously just because someone famous said it.

    • There’s another moral too: Abort is an atrocious abomination that should be dealt with with utmost caution and respect. I’ve tried to write abort-safe code a few times; it’s a great way into insanity, if you’re into that kind of thing. It doesn’t help that (understandably) almost nothing in the BCL, C# and runtime where the distinction makes sense is abort-safe (including ARE, lock and using). Abort is for unloading app domains; you get very few safety and consistency guarantees in any other scenario (and even when unloading an appdomain, you might have trouble when using any kind of unmanaged resource, including a file, socket or a database, for example). If you really need preëmptively abortable tasks, start a new process. Killing a process is much safer than aborting a thread.
