Reordering optimizations

In a previous episode I discussed whether it is a good idea to remove a lock which protects an integer field. I said that without a lock, the read of a field can be moved arbitrarily far backwards in time on a thread, and that this can cause unexpected behaviour in a program. It is perhaps not clear why this can be a problem; it seems like a read shouldn’t have any side effects that can be noticed if it happens earlier. A number of readers asked whether making the read into a volatile read prevented the reordering problem.

The answer to the general question of whether a volatile read can be re-ordered is yes, yes it can. Though a volatile read in a loop cannot be cached once and elided, a volatile read can be moved backwards in time with respect to a volatile write.
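
As an aside, the “cannot be cached once and elided” point is worth seeing in code. Here is a minimal sketch of the classic scenario, with a hypothetical field named stop; the volatile modifier is what forces the read to actually happen on every iteration:

static volatile bool stop = false;

static void Worker()
{
  // Without volatile, the jitter would be free to read stop once,
  // observe false, and compile this into an infinite loop. Because
  // the read is volatile it must be re-performed on each iteration,
  // so a write of true by another thread is eventually observed.
  while (!stop) { }
}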

The answer to the more specific question of whether making the field volatile makes the program I was given correct cannot really be answered because that was an oversimplified toy program. Deducing the correctness of a real program by analyzing a toy program is perhaps a bad idea.

Instead, I’ll give you an example that (1) can actually happen even on the more restrictive memory model imposed by x86 hardware, and (2) heck let’s make everything volatile while we’re at it; it won’t help! We’ll elide a bunch of locks and see what goes wrong. Check out this program fragment:

static volatile bool q = false;
static volatile bool r = false;
static volatile bool s = false;
static volatile bool t = false;
static object locker = new object();

static bool GetR() { return r; }  // No lock!
static void SetR() { lock(locker) { r = true; } }

static void MethodOne()
{
  q = true;
  if (!GetR())
    s = true;
}

static void MethodTwo()
{
  SetR();
  if (!q)
    t = true;
}

The rest of the program, which I have not shown, behaves as follows. First, the static initializers run normally, so the four Booleans and the monitor are all created and assigned to their given values. Then the program creates two threads. Thread one runs MethodOne and thread two runs MethodTwo. Both threads run to completion and these are the only invocations of the two methods. The question is: can the original thread now observe s and t to both be true?
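
For concreteness, a driver consistent with that description might look something like this sketch; Main here is my own reconstruction (assuming using System and using System.Threading), not the original program:

static void Main()
{
  Thread one = new Thread(MethodOne);
  Thread two = new Thread(MethodTwo);
  one.Start();
  two.Start();
  one.Join();
  two.Join();
  // The question: can this line ever print "True True"?
  Console.WriteLine("{0} {1}", s, t);
}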

Give that some thought before you read on.

.
.
.
.
.
.

It would appear that the answer is no, by the following logic.

  • If one thread runs q=true; before the other runs if(!q) then t remains false.
  • Otherwise, one thread runs q=true; after the other runs if(!q). Clearly that implies that if(!GetR()) must have run after SetR();. That implies that !GetR() is false. Therefore s remains false.
  • There are only these two possible cases and in each at least one of s and t remains false, therefore it is impossible that both become true.

This logic is wrong; specifically the second statement is completely bogus and therefore the conclusion does not follow. As is often the case with my puzzles, the portion of the analysis marked “clearly” is the flag that indicates the error.

The lack of a lock on the read of r means that the CPU may re-order the read by moving it arbitrarily far backwards in time even with respect to volatile writes, and an x86 will occasionally do so if the stars align correctly! An x86 will not reorder two reads with respect to each other and will not reorder two writes with respect to each other, but it has no problem reordering a read of one variable with respect to a write of another. The CPU is permitted to move the read backwards in time by pretending that you actually wrote:

static void MethodOne()
{
  bool temp = r;
  q = true;
  if (!temp)
    s = true;
}

Since this optimization is invisible to the code on the current thread, it is legal. But now there is an obvious interleaving in which both s and t become true. First we assign false to temp on one thread, then all of MethodTwo runs on the other thread, so t is true, and then the remainder of MethodOne sets s to true.
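
To see that interleaving in slow motion, here is the schedule written out sequentially for illustration; the method name is mine, and the lock inside SetR is elided since it changes nothing in this trace:

static void SimulateBadSchedule()
{
  bool temp = r;         // thread one: the hoisted read; temp is false
  r = true;              // thread two: SetR()
  if (!q) t = true;      // thread two: q is still false, so t becomes true
  q = true;              // thread one: the write of q finally happens
  if (!temp) s = true;   // thread one: temp is stale, so s becomes true
  // s and t are now both true.
}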

Now imagine that your program depends on s and t never being both true, and this situation happens let’s say one time out of every billion executions of this code. If it runs a thousand times a day on a thousand machines that’s one failure in every three years. How on earth would you debug it? You’d be likely to suspect hardware failure rather than a bug in the program, but there assuredly is a bug in the program. These are the nightmare scenarios you run into when you think you’re clever enough to elide locks. The moment you depart even slightly from one of the “blessed” low-lock patterns you are off in the weeds.

The right thing to do here is first, if you can avoid the shared memory situation in the first place, do so. If you cannot, lock everything all the time. In this particular case, locking the read of r or the write of q would likely be sufficient to ensure that s and t are never both true, but I discourage the attitude that walking as close as possible to the edge of a cliff is a great way to be safe. Walk as far away from that cliff as you possibly can! If you’ve got shared memory then put a lock around all usages of it. Simply making a shared variable volatile is insufficient to ensure thread safety in all cases.
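
For illustration only, here is one possible shape of the “lock everything” version; with every access to the shared Booleans under the same lock, the accesses are totally ordered and the bulleted analysis above becomes sound again (the extra accessor names are mine):

static bool GetQ() { lock(locker) { return q; } }
static void SetQ() { lock(locker) { q = true; } }
static bool GetR() { lock(locker) { return r; } }
static void SetR() { lock(locker) { r = true; } }

static void MethodOne()
{
  SetQ();
  if (!GetR())
    lock(locker) { s = true; }
}

static void MethodTwo()
{
  SetR();
  if (!GetQ())
    lock(locker) { t = true; }
}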


Thanks to threading expert Joe Duffy for bringing this scenario to my attention many years ago.

26 thoughts on “Reordering optimizations”

  1. I seem to remember that the Java memory model wouldn’t allow this re-ordering. Is C#’s notion of “volatile” closer to that of C than that of Java then?

    • No, the JMM behaves identically here.

      A volatile read has acquire semantics, and a volatile write has release semantics; basically this means that you can’t move anything to *before* a volatile read and can’t move anything to *past* a volatile write.

      But the converse is perfectly fine; what this means is that

      volatile write
      normal write
      normal read
      

      can be reordered to (assuming the single-threaded behavior would be identical)

      normal write
      normal read
      volatile write
      

      while

      normal write
      normal read
      volatile read
      

      can be changed to

      volatile read
      normal write
      normal read
      

      If you want stronger guarantees you’ll have to put bidirectional memory barriers in place yourself – or use a lock.
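
      In C#, the analogous full fence is System.Threading.Thread.MemoryBarrier(); a sketch against the program in the post, with the placement being my own suggestion:

      static void MethodOne()
      {
        q = true;
        Thread.MemoryBarrier(); // full fence: the read of r below can no
                                // longer be moved above the write of q
        if (!GetR())
          s = true;
      }

      With the fence in place, MethodOne can no longer perform its read of r before its write of q becomes visible.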

  2. You just made me curious enough to read through the JMM specification again (it’s been a long time) and try to prove my intuition. I believe the reordering would be invalid in Java by the following argument:

    Let us reduce the problematic program to four actions:
    w.q: Write to q
    r.r: Read from r
    w.r: Write to r
    r.q: Read from q

    All of these actions are synchronization actions according to 17.4.2, since they are volatile reads and writes. According to 17.4.4, the synchronization order of an execution is a total order over all synchronization actions that is consistent with the program order of each thread. That means w.q must come before r.r in the synchronization order, and w.r must come before r.q.

    However, that means all valid synchronization orders start with either w.r or w.q. And by 17.4.4,

    “A write to a volatile variable v synchronizes-with all subsequent reads of v by any thread (where “subsequent” is defined according to the synchronization order). ”

    Since the synchronizes-with relation implies the happens-before relation, and that in turn guarantees that the read will observe the write, we will always read either r or q as true.

    I think the effort needed already to reason about such a simple example proves Eric’s point though: Just use locks and be happy that you don’t have to think about these fine details. Locks already leave you with enough to worry about 🙂

    • Thanks for the analysis. FYI the C# specification says very clearly that an implementation is not required to make a consistent ordering of volatile accesses as observed from different threads. Interesting that Java requires a consistent total order.

      • Is the C# specification stated this way because of x86 memory ordering?

        I can’t think of a processor with memory ordering stronger than x86, so for Java to be stronger the compiler probably needs to emit a lot of mfences.

        Is it possible to do the “mfence” in C# without locks?

          • Is the decision of when to introduce memory fences a function of the architecture, or of the processor? From what I can tell, the “volatile read” and “volatile write” behaviors are designed to guard the object identified by a reference stored in a field, rather than the field itself; the fact that a variable is volatile does not expedite stores to the variable itself (unlike in C), but rather ensures that it is delayed until after all previous stores (volatile or not) have completed. Likewise, the fact that a variable is volatile does not ensure that a read of that variable will yield current data, but to the contrary ensures that it won’t see a value written after a write fence unless all subsequent reads are sure to see all data written before it. I wouldn’t think a processor design would require that particular usage pattern when different situations require different kinds of memory barriers.

          • The initial intention of the volatile specifier in C was to facilitate communicating with memory-mapped devices – you certainly don’t want the compiler changing the meaning of your code by, for example, writing to the “execute command” port before it finishes writing the command to be executed.

            Using volatile in C as an attempt to provide safe multithreading is, and has always been, flawed, since it does absolutely nothing to prevent the processor from reordering your accesses (it’s just assumed that the processor won’t sabotage itself in cases where you’re actually writing to and reading from a memory-mapped device).

            C# seems to inherit this definition, and hence using volatile in any case other than controlling a memory-mapped device means you’re probably doing it wrong. Considering how often you’d typically use managed code to do low-level talking to memory-mapped devices, I’m of the opinion that that was a mistake. At least in Java it’s actually useful for something, even if that something is also usually a bad idea.

          • No, C# volatile is very different from C volatile, in that C# volatile was designed with multi-threaded scenarios in mind. The semantics of volatile in C# are that reads and writes are special events that must be observed to be ordered in certain ways with respect to other special events such as: entering a lock, leaving a lock, throwing an exception, starting a thread, and so on. See the C# specification for details.

            In C, you’re correct, volatile means nothing with respect to threads unless your particular compiler vendor makes it mean something.

        • x86 actually gives stronger guarantees than the JMM demands in many regards, and is in general a horrible fit for the memory models of modern languages (the JMM and C++’s memory_order_seq_cst, at least); its near-total ordering is overkill. You can implement the JMM very efficiently on processors with weaker ordering guarantees (so basically everything; the new ARMv8 ISA was apparently explicitly designed to work well with the JMM and similar models, and is more similar to ia64 [itanium] than to x86).

          If you’re interested in the JMM and how that works together with many different ISAs I suggest http://g.oswego.edu/dl/jmm/cookbook.html

    • “All of these actions are synchronization actions according to 17.4.2, since they are volatile reads and writes. According to 17.4.4, the synchronization order of an execution is a total order over all synchronization actions”

      No, you’re mixing different things up here. We are only interested in synchronization actions and (important!) what they synchronize with! This is handled in 17.4.4:
      “A write to a volatile variable v (§8.3.1.4) synchronizes-with all subsequent reads *of v* (focus mine) by any thread”
      but as you see it does not synchronize with reads of other variables.

      17.4.5 goes on with happens-before relationships but that again only specifies:
      “A write to a volatile field (§8.3.1.4) happens-before every subsequent read of that field.”

      So my reading of the JMM says that there’s nothing stopping you from doing the reorderings here.

      • > but as you see it does not synchronize with reads of other variables.

        True, but I never used this statement with reads and writes of different variables, so the objection does not apply.

        > We are only interested in synchronization actions and (important!) what they synchronize with.

        We definitely are interested in that – it’s how we determine whether the writes are guaranteed to be visible to the reads. But the very rule you cited says that the synchronizes-with relation is influenced by the synchronization order:

        “A write to a volatile variable v (§8.3.1.4) synchronizes-with all subsequent reads of v by any thread (where “subsequent” is defined *according to the synchronization order*)” (emphasis mine this time)

        So the synchronization order determines whether a volatile write synchronizes-with a volatile read from the same variable. I reasoned about the synchronization order to show that for either r or q, the write must be ordered before the read, which (according to the rule above) induces the synchronizes-with relationship that we are interested in.

        If I am mixing things up, please tell me exactly where. I’m not an expert on the JMM so it’s very well possible that I misunderstood something, but if so I don’t see it at the moment.

        • True, but I never used this statement with reads and writes of different variables, so the objection does not apply.

          It looks to me like you were, here:


          That means w.q must come before r.r in the synchronization order, and w.r must come before r.q.

          My understanding of the Java memory model concurs with voo’s. But then I’ve never used volatile (in any language) because locks are so much easier to reason about, and therefore safer.

          • I see now where you are coming from. Of course you cannot apply the rule in question directly to my statement you quoted above, because it concerns orderings between reads and writes to *different* variables.

            But that’s not what I intended to say. I left out an important step in between because I thought it was too simple to spell out. Let’s do this in slow motion, starting from the statement you quoted:

            1. “That means w.q must come before r.r in the synchronization order, and w.r must come before r.q.”

            2. “However, that means all valid synchronization orders start with either w.r or w.q.” (They can’t start with r.r or r.q because that would conflict with 1)

            3. (missing from the original argument): If the synchronization order starts with w.r, then it orders w.r before r.r. If the order starts with w.q, then it orders w.q before r.q.

            4. (also missing): From 2 and 3 follows: The synchronization order either orders w.r before r.r, or it orders w.q before r.q.

            At this point it should be obvious how the rule in question applies.

        • Mhm, on a second reading of the spec I think you are right about *volatile* reads/writes. I.e., you can’t reorder volatile reads and writes, *but* you can do so for normal reads/writes.

          That means
          st/ld # can be reordered with respect to the next volatile load if non-volatile
          volatile ld
          st/ld # cannot be reordered
          respectively
          st/ld # can never be reordered with respect to the next volatile store
          volatile st
          st/ld # can be reordered if non-volatile

          • I’m glad we agree now on the volatile reorderings :). As for re-ordering non-volatile relative to volatile, I somehow don’t want to dive back into the specs right now to see what exactly is legal and what isn’t, but I agree that non-volatile reads/writes can be re-ordered past volatile ones in some situations.

        • What prevents the write to q from being moved below the read of r? What in the standard prevents this:

          static void MethodOne()
          {
            q = true;
            if (!GetR())
              s = true;
          }

          from being optimized to this?:

          static void MethodOne()
          {
            if (!GetR())
              s = true;
            q = true;
          }

  3. I wish there were a nice easy way for a program to specify that a group of threads should be confined to running in such fashion as to have a stronger memory model (e.g. on some machines, being limited to running on the two “halves” of a hyper-threaded core that share an L1 cache). The cost of the full memory barriers associated with locks is going to increase as machines add more and more cores, and it would seem that in an increasing number of cases, running a task on one core with no memory barriers could take less CPU time than dividing it among multiple cores (especially if the other cores have something else useful they can be doing).

    Otherwise, how much would efficiency be hurt if generated code barred volatile reads from being sequenced before volatile writes on the same thread (if nothing else, by adding a barrier in any case where one might not already exist)? I would think the cost would be less than having to add locks which would not otherwise be required.

    • I would think this would be a rare case, and that programmers should and do strive mightily to ensure that it is a rare case.

      Best practice is to cut down the amount of shared data between threads as much as possible, regardless of what synchronisation mechanism is used. Your desired feature would only see use when that wasn’t possible for some reason.

      In which case I personally would, yeah, just give up and write it in one thread for simplicity’s sake.

      • The C# version of “volatile” is designed around the idea that code will write many values (which need not be volatile), then write one “data ready” indication (perhaps a reference to the values just written). Other code will read the “data ready” indication (using a volatile read) and, if data is ready, read the data. This approach requires one write barrier when setting an arbitrary number of items, and one read barrier when examining them.
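
        A minimal sketch of that pattern, with hypothetical field and method names:

        static int payloadA, payloadB;      // ordinary fields
        static volatile bool ready = false; // the "data ready" indicator

        static void Publish()
        {
          payloadA = 123;
          payloadB = 456;
          ready = true;   // volatile write: the ordinary writes above cannot move below it
        }

        static void Consume()
        {
          if (ready)      // volatile read: the ordinary reads below cannot move above it
          {
            int a = payloadA; // guaranteed to be 123 once ready is observed true...
            int b = payloadB; // ...and likewise 456
          }
        }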

        An alternative approach, and the one used by some other systems, would be to include a write barrier after storing each item, and before reading each item. Only the last item written and first item read would need the barrier, but including the barrier (especially when writing) only on those items where it was needed could be tricky. The C# volatile model tries to avoid the need to make such determination. Unfortunately, it is ineffective in scenarios which don’t fit the “bunch of non-volatile variables guarded by volatile readiness indicator”.

        I’d suggest that the right approach for a programmer who wants to think about memory barriers is to not use volatile variables at all, but instead specify memory barriers everyplace a cause-and-effect relationship is required. If a language wants to provide a way of easing that burden, it should have a “volatile” keyword which can ensure that all memory barriers necessary to maintain cause and effect relationships among volatile variables will be included. In addition to the existing barriers, it should include a read barrier before any volatile read which could possibly follow a volatile write (as opposed to another read), and a write barrier after any volatile write which could be followed by a volatile read without another intervening volatile write. Such an option could I think be perfectly compatible with C# (nothing ever promises that memory barriers won’t be inserted), and would greatly reduce the likelihood of Heisenbugs associated with `volatile` variables.

        • The C# version of “volatile” is designed around the idea that code will write many values (which need not be volatile), then write one “data ready” indication (perhaps a reference to the values just written). Other code will read the “data ready” indication (using a volatile read) and, if data is ready, read the data. This approach requires one write barrier when setting an arbitrary number of items, and one read barrier when examining them.

          But the non-volatile reads and writes can be re-ordered around the volatile one, so this surely is not safe. The reader thread is not guaranteed to see all the written values.

          If I wished to pass many values from one thread to another, I would put them all in an immutable object, and lock the reference to that object.
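
          Something like this sketch, with hypothetical names:

          sealed class Snapshot
          {
            public readonly int A;
            public readonly int B;
            public Snapshot(int a, int b) { A = a; B = b; }
          }

          static object snapLock = new object();
          static Snapshot current;

          static void Write(int a, int b)
          {
            Snapshot snap = new Snapshot(a, b); // build the immutable object first
            lock(snapLock) { current = snap; }  // then publish the reference under the lock
          }

          static Snapshot Read()
          {
            lock(snapLock) { return current; }  // the snapshot itself needs no lock; it is immutable
          }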

          • Actually they do, IF
            – No writes (of regular variables) can be moved to after a volatile write
            – No reads (of regular variables) can be moved to before a volatile read

            –> This is the whole idea of a memory barrier. AFAIK volatile will add half-barriers like that (see the comments above; I haven’t read the C# spec for volatiles myself).

          • The .NET memory model has no such thing as volatile reads and volatile writes. Instead it has reads, writes, thread-global read barriers, thread-global write barriers, and thread-global read/write barriers. No distinction is made between reads and writes which should be affected by a memory barrier, and those which should not. A memory model in which “ordinary” reads and writes could not be affected by some form of memory barrier would compel the use of volatile reads and writes even for immutable objects, thus defeating one of the reasons for using immutable objects in the first place; memory models differ, though, in the kinds of ordering that can be imposed upon normal reads and writes.

  4. Mfences, the servants of the Incoherent Ones.

    Really, such articles make me recall Lovecraft’s stories. Because a man thinks of the world as regular and sensible, but then he looks through the borders of ordered space, and sees chaos and madness. And the unholy demon prince Arm-eleven-toth.

    And a comment above has this gem: “the right approach for a programmer who wants to think about memory barriers is…” Why would a C# programmer want to think about memory barriers? In fact, why would any high-level programming language user want to think about them? That stuff belongs to those who write optimizers, or automatic parallelization tools (in fact, our department has been working on such a project, http://ops.rsu.ru/en/about.shtml, for some time, but thankfully I am not a part of it), or something like this.
