ATBG: How does a lock work?

Today on the Coverity Development Testing Blog’s continuing series Ask The Bug Guys, I answer a question I’ve gotten a fair number of times recently: how is it possible to implement locking without already having locking?

The answer I give is chock full of lies, misinformation and oversimplification, but hopefully it will get across the basic concept that you can build all the threading mechanisms out of simple parts like atomic-test-and-set.
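
To give the flavour of that without spoiling the episode: here is a minimal sketch, mine rather than the article’s actual code, of a lock built from nothing but an atomic test-and-set, which in C# is spelled Interlocked.Exchange (the class and member names are invented for illustration):

```csharp
using System.Threading;

class TestAndSetSpinLock
{
    private int _taken; // 0 = free, 1 = held

    public void Enter()
    {
        // Atomically set _taken to 1 and test what it was before.
        // If it was already 1, another thread holds the lock, so spin.
        while (Interlocked.Exchange(ref _taken, 1) != 0)
            Thread.Yield(); // let other threads run while we wait
    }

    public void Exit()
    {
        // A volatile write releases the lock and publishes that fact.
        Volatile.Write(ref _taken, 0);
    }
}
```

The chicken-and-egg problem dissolves because the one indivisible step is supplied by the hardware, not by the software.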

In related news, here are a couple of other recent posts to ATBG. First, my colleague Tim explains that Java’s hashCode has many of the same design guidelines as the equivalent method in C# that I discussed here. And my colleague Jon discusses the perils of accidentally performing overlapping reads and writes to the same buffer in C. [1. My favourite quotation on the subject of poor handling for overlapping buffers has to be “there has yet to be an example of an application whose impressive performance can be credited to the absence of overlap-detection code in memcpy.” True that! I note also that the C# equivalent Array.Copy method is documented as dealing correctly with overlapping reads and writes to the same array.]
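
As a small aside, that documented Array.Copy behaviour is easy to see for yourself; this snippet (my own, not from the linked post) shifts a range one slot to the right within a single array:

```csharp
int[] a = { 1, 2, 3, 4, 5 };
// Source and destination ranges overlap; Array.Copy behaves as if the
// source range were buffered before any destination write happens.
Array.Copy(a, 0, a, 1, 4);
// a is now { 1, 1, 2, 3, 4 }; a naive forward element-by-element copy
// would have produced { 1, 1, 1, 1, 1 }.
```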

As always, if you have questions about a bug you’ve found in a C, C++, C# or Java program that you think would make a good episode of ATBG, please send your question along with a small reproducer of the problem to TheBugGuys@Coverity.com. We cannot promise to answer every question or solve every problem, but we’ll take a selection of the best questions that we can answer and address them on the dev testing blog every couple of weeks.

23 thoughts on “ATBG: How does a lock work?”

  1. I like the article linked in your footnote, though I’m not sure I quite agree with the quote itself. On many platforms, memcpy has historically been implemented as a compiler intrinsic, while memmove was a function call. On platforms which do a good job of optimizing memcpy operations, it may perform more than twice as fast as memmove – a difference which can be significant when it’s used within an inner loop.

    Still, the general point is a good one. A certain level of complexity will be required in any system; failing to include a small amount of complexity in the place where it could most easily be accommodated will often force other parts of the system to add much more complexity to deal with it. Part of the art of good system design is recognizing when a small amount of complexity in one case can eliminate a large amount elsewhere.

  2. A monitor in its simplest form (where there are no other waiters) simply executes a compare-and-set; only when there are multiple threads waiting on the monitor does it move to contention resolution, creating a monitor wait event that each waiter blocks on, decrementing the waiter count, signalling the next waiter, and so on (see the sketch after this comment).

    So there is a considerable chance that the simple monitor implementation might just be better in some cases, or the performance might even out, especially in short, fast code that is unlikely to see contention. But without wait-free guarantees or proper back-off, the lock-free monitor will not scale well compared to the standard monitor.

    Since the article is good at its core (if a bit misleading), wouldn’t it be better to expand it further and work out all of the details? This would give others a much better understanding of how to build a proper CAS lock or a proper Monitor implementation (in case something is missing in the current one).

    Thanks 🙂
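
    To make the two-path description above concrete, here is a rough sketch, heavily simplified and nothing like the real CLR implementation, of a lock with a CAS fast path that falls back to a kernel wait event under contention (TinyMonitor and its members are invented names):

    ```csharp
    using System.Threading;

    class TinyMonitor
    {
        private int _state; // 0 = free, 1 = held, 2 = held and contended
        private readonly AutoResetEvent _waiters = new AutoResetEvent(false);

        public void Enter()
        {
            // Fast path: an uncontended acquire is a single CAS.
            if (Interlocked.CompareExchange(ref _state, 1, 0) == 0)
                return;

            // Slow path: mark the lock contended, then block until the
            // exchange observes the lock free (old value 0).
            while (Interlocked.Exchange(ref _state, 2) != 0)
                _waiters.WaitOne();
        }

        public void Exit()
        {
            // Release; if the lock was marked contended, wake one waiter.
            if (Interlocked.Exchange(ref _state, 0) == 2)
                _waiters.Set();
        }
    }
    ```

    Real monitors add recursion, spinning before blocking, fairness, and Wait/Pulse support, which is where most of the complexity (and most of the performance tuning) lives.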

  3. Pingback: Dew Drop – February 13, 2014 (#1722) | Morning Dew

  4. I guess presenting Dekker’s algorithm as one possibility for locking that doesn’t have to rely on CAS (or ll/sc) would have been out of scope. (A sketch of it appears below.)

    I’m not sure how I feel about such overview articles though – they don’t help people write better software and can mislead people into a false sense of security (“oh that’s not so complicated after all”). For example, there is no mention of the ordering guarantees that locks provide – so better hope nobody uses a weak CAS to implement their own version.
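
    For concreteness, here is roughly what Dekker’s algorithm looks like transliterated into C# (my sketch, not anyone’s production code; note the explicit full fences, without which the write of one flag could be reordered past the read of the other on modern hardware):

    ```csharp
    using System.Threading;

    class DekkerLock // exactly two threads, ids 0 and 1
    {
        private readonly bool[] _wants = new bool[2];
        private int _turn;

        public void Enter(int id)
        {
            int other = 1 - id;
            Volatile.Write(ref _wants[id], true);
            Thread.MemoryBarrier(); // store-load fence, dearer than one CAS
            while (Volatile.Read(ref _wants[other]))
            {
                if (Volatile.Read(ref _turn) != id)
                {
                    // Not our turn: withdraw our claim until it is.
                    Volatile.Write(ref _wants[id], false);
                    while (Volatile.Read(ref _turn) != id) { /* spin */ }
                    Volatile.Write(ref _wants[id], true);
                    Thread.MemoryBarrier();
                }
            }
        }

        public void Exit(int id)
        {
            Volatile.Write(ref _turn, 1 - id);
            Volatile.Write(ref _wants[id], false);
        }
    }
    ```

    As the replies below observe, the fences this needs under modern memory models tend to cost more than the single interlocked operation they replace.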

    • It is impossible to implement locking via any algorithm without some sort of sequenced atomic primitives. On many historical single-processor machines, and even some multi-processor ones, the reads and writes of word-sized variables were sequenced and atomic (so that if one thread writes x and y, in that order, and another thread reads them in the opposite order, the latter thread can guarantee that if it sees the new x it will also see the new y, and if it saw the old y it will also have seen the old x). Even on a machine with no fancier atomic primitives, sequenced atomic reads and writes are sufficient to implement a mutex. Given that many machines which lacked fancier primitives still had sequenced atomic reads and writes, mutex implementations for such machines were useful.

      Today’s general-purpose processors invariably include atomic primitives which are more powerful than sequenced atomic reads and writes. Further, unlike historical machines where an atomic sequenced read or write would cost the same as any other “real” access to a variable, today’s multi-core machines generally have a heavy execution-time penalty for all forms of sequenced atomic access. If multiple atomic accesses can be replaced with a single compare-and-swap, that’s a major performance win. While Dekker’s algorithm was useful on hardware platforms where the best atomic algorithms were sequenced atomic loads and stores, it is not appropriate for use on modern machines.

      • I’m not sure where you disagree with anything I said? Obviously nobody would use Dekker’s algorithm to implement locks these days, but the only requirement it has is that you have memory orderings or guaranteed sequential consistency to begin with (both of which have nothing to do with atomic instructions themselves).

        And since you only have to read/write a single bit for Dekker it’s rather unimportant whether reads/writes of words or any other unit are atomic or not (you can’t read only half of a bit after all).

        So for me that counts as being free of atomic instructions, even though that’s a much stronger statement than “doesn’t need CAS or ll/sc, etc”.

      • I drifted a bit from what I’d intended to say, which was to agree that Dekker’s algorithm would have been out of scope; it’s an interesting historical curiosity, but doesn’t fit well with modern memory models. Dekker’s algorithm may only look like it uses three bits, but on some machines it will actually use six: processor A’s “flag0”, “flag1”, and “turn”, and processor B’s “flag0”, “flag1”, and “turn”. When processor A executes `flag0=true`, that statement will set processor A’s `flag0`, and it will set processor B’s `flag0`, but on many machines, other actions might occur between the writes to flag0. Whether one wishes to describe the statement as writing two separate variables, or to describe the sequential inconsistency in some other way, my point is that on many machines the only way to make Dekker’s algorithm work would be to impose memory barriers that would cost more than just using CompareExchange or load-linked-store-conditional.

    • I did mention ordering requirements; I called out specifically that they were out of scope for the purposes of that article.

      I also mentioned that implementing locks on architectures that lack an interlocked compare-and-swap is a tricky but solved problem.

      My intention in this article was neither to give a history of concurrency primitives nor to describe how to write a monitor with all the features you’d need for a real-world implementation. Rather, I called out that my intention was to show that it is possible to build a working lock out of parts that are far simpler than a lock. Many programmers who don’t have a background in low-level OS programming are curious as to how it’s possible to solve the apparent chicken-and-egg problem of building a lock without already having a lock.

    • I’m not sure how I feel about such overview articles though – they don’t help people write better software and can mislead people into a false sense of security (“oh that’s not so complicated after all”).

      I disagree, myself. Firstly, for someone who does want to get a good understanding of how real locks work, they have to start somewhere. The first introduction to a complex topic necessarily has to be an overview. One of my favourite computer science lecturers at university was blunt on that point – “Teaching is a process of lying to you.”

      Secondly, overview articles like this can help ordinary programmers who don’t want to specialise in the esoteric area. It can demystify the area and reveal that it’s all just code, or hardware in this case, at the end of the day and not magic. Some guy once wrote a blog post about the unhelpfulness of magic thinking in software.

      Lastly, I think this particular article was extremely clear that what was being presented is a simplification of the topic, so it’s unlikely to provoke an “oh that’s not so complicated after all” reaction from anyone.

  5. Article “Locks and exceptions don’t mix” says (about the pre-C# 4.0 lock statement) that

    “The problem here is that if the compiler generates a no-op instruction between the monitor enter and the try-protected region then it is possible for the runtime to throw a thread abort exception after the monitor enter but before the try. In that scenario, the finally never runs so the lock leaks, probably eventually deadlocking the program. It would be nice if this were impossible in unoptimized and optimized builds.”

    which naturally raises a question about thread aborts: does the runtime “inject” a thread abort exception into the running thread rather than using TerminateThread? (The two expansions being contrasted are sketched below.)
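
    For reference, the two expansions look roughly like this (sketched from the description in that article; obj stands in for the lock object and Body() for the guarded code):

    ```csharp
    // Pre-C# 4.0 expansion of lock (obj) { Body(); }: a thread abort
    // delivered between Enter and try leaks the lock.
    Monitor.Enter(obj);
    try
    {
        Body();
    }
    finally
    {
        Monitor.Exit(obj);
    }

    // C# 4.0+ expansion: Enter runs inside the protected region and
    // records in lockTaken whether the lock was actually acquired.
    bool lockTaken = false;
    try
    {
        Monitor.Enter(obj, ref lockTaken);
        Body();
    }
    finally
    {
        if (lockTaken)
            Monitor.Exit(obj);
    }
    ```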

  6. How complex is the “real” locking story? I thought that ATBG was very interesting but I’m also curious about how the real code works with all those bits you deemed out of scope (e.g. ordering, how it actually waits, etc.). Is it the sort of thing you could do an article on?

