NO! YOU CAN’T JUST USE BRUTE FORCE HERE! WE NEED TO USE SEGMENT TREES TO GET UPDATE TIME COMPLEXITY DOWN TO O(LOG N)! BREAK THE DATA INTO CHUNKS AT LEAST! OH THE INEFFICIENCY!!!

and the senior developer responds

Ha ha, nested for loops go brrrrrrrrrrrr…

OK, that’s silly and juvenile, but… oh no, I feel a flashback coming on.

…

…

…

It is 1994 and I am a second-year CS student at my first internship at Microsoft on the Visual Basic compiler team, reading the source code for InStr for the first time. InStr is the function in Visual Basic that takes two strings, call them **source** and **query**, and tells you the index at which **query** first appears as a substring of **source**, and the implementation is naive-brute-force.

I am shocked to learn this! Shocked, I tell you!

Let me digress slightly here and say what the naive brute force algorithm is for this problem.

Aside: To keep it simple we’ll ignore all the difficulties inherent in this problem entailed by the fact that VB was the first Microsoft product where one version worked everywhere in the world on every version of Windows no matter how Windows was localized; systems that used Chinese DBCS character encodings ran the same VB binary as systems that used European code pages, and we had to support all these encodings plus Unicode UTF-16. As you might imagine, the string code was a bit of a mess. (And cleaning it up in VBScript was one of my first jobs as an FTE in 1996!)

Today for simplicity we’ll just assume we have a flat, zero-terminated array of chars, one character per char as was originally intended.

The *extremely* naive algorithm for finding a string in another goes something like this pseudo-C algorithm:

bool starts(char *source, char *query)
{
  int i = 0;
  while (query[i] != '\0')
  {
    if (source[i] != query[i])
      return false;
    i = i + 1;
  }
  return true;
}

int find(char *source, char *query)
{
  int i = 0;
  while (source[i] != '\0')
  {
    if (starts(source + i, query))
      return i;
    i = i + 1;
  }
  return -1;
}

The attentive reader will note that this is the aforementioned **nested for loop**; I’ve just extracted the nested loop into its own helper method. The extremely attentive reader will have already noticed that I wrote a few bugs into the algorithm above; what are they?

Of course there are many nano-optimizations one can perform on this algorithm if you know a few C tips and tricks; again, we’ll ignore those. It’s the algorithmic complexity I’m interested in here.

The action of the algorithm is straightforward. If we want to know if query “banana” is inside source “apple banana orange” then we ask:

- does “apple banana orange” start with “banana”? No.
- does “pple banana orange” start with “banana”? No.
- does “ple banana orange” start with “banana”? No.
- …
- does “banana orange” start with “banana”? Yes! We’re done.

It might not be clear why the naive algorithm is bad. The key is to think about what the worst case is. The worst case would have to be one where there is no match, because that means we have to check the most possible substrings. Of the no-match cases, what are the worst ones? The ones where **starts** does the most work to return false. For example, suppose **source** is “aaaaaaaaaaaaaaaaaaaa” — twenty characters — and **query** is “aaaab”. What does the naive algorithm do?

- Does “aaaaaaaaaaaaaaaaaaaa” start with “aaaab”? No, but it takes five comparisons to determine that.
- Does “aaaaaaaaaaaaaaaaaaa” start with “aaaab”? No, but it takes five comparisons to determine that.
- … and so on.

In the majority of attempts it takes us the maximum number of comparisons to determine that the **source** substring does not start with the **query**. The naive algorithm’s worst case is O(n*m) where n is the length of **source** and m is the length of the **query**.
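To make that worst case concrete, here is a small sketch: a C# translation of the naive algorithm with a comparison counter added. The counter and the `Starts`/`Find` names are mine for illustration; they are not anything from the VB sources.

```csharp
// A C# translation of the naive algorithm above, instrumented to count
// character comparisons; purely illustrative.
static int comparisons = 0;

static bool Starts(string source, int offset, string query)
{
    for (int i = 0; i < query.Length; i = i + 1)
    {
        comparisons = comparisons + 1;
        // Running off the end of source counts as a mismatch, like hitting '\0'.
        if (offset + i >= source.Length || source[offset + i] != query[i])
            return false;
    }
    return true;
}

static int Find(string source, string query)
{
    for (int i = 0; i < source.Length; i = i + 1)
        if (Starts(source, i, query))
            return i;
    return -1;
}
```

Calling `Find(new string('a', 20), "aaaab")` reports no match after 94 character comparisons, close to the 20 * 5 = 100 that the O(n*m) bound predicts; doubling the length of the source roughly doubles the count again.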

There are a lot of obvious ways to make minor improvements to the extremely naive version above, and in fact the implementation in VB was slightly better. The implementation in VB was basically this:

char* skipto(char *source, char c)
{
  char *result = source;
  while (*result != '\0' && *result != c)
    result = result + 1;
  return result;
}

int find(char *source, char *query)
{
  char *current = skipto(source, query[0]);
  while (*current != '\0;)
  {
    if (starts(current, query))
      return current - source;
    current = skipto(current + 1, query[0]);
  }
  return -1;
}

(WOW, EVEN MORE BUGS! Can you spot them? It’s maybe easier this time.)

This is more complicated but not actually better algorithmically; all we’ve done is moved the initial check in **starts** that checks for equality of the first letters into its own helper method. In fact, what the heck, this code looks *worse*. It does *more work* and is *more complicated*. What’s going on here? We’ll come back to this.

As I said, I was a second year CS student and (no surprise) a bit of a keener; I had read ahead and knew that there were string finding algorithms that are considerably better than O(n*m). The basic strategy of these better algorithms is to do some preprocessing of the strings to look for interesting features that allow you to “skip over” regions of the source string that you know cannot possibly contain the query string.

This is a heavily-studied problem because, first off, obviously it is a “foundational” problem; finding substrings is useful in many other algorithms, and second, because we genuinely do have extremely difficult problems to solve in this space. “Find this DNA fragment inside this genome”, for example, involves strings that may be billions of characters long with lots of partial matches.

I’m not going to go into the various different algorithms that are available to solve this problem and their many pros and cons; you can read about them on Wikipedia if you’re interested.

Anyways, where was I, oh yes, **CS student summer intern vs Senior Developer**.

I read this code and was outraged that it was not the most asymptotically efficient possible code, so I got a meeting with Tim Paterson, who had written much of the string library and had the office next to me.

Let me repeat that for those youngsters in the audience here, **TIM FREAKIN’ PATERSON.** Tim “QDOS” Paterson, who one fine day wrote an operating system, sold it to BillG, and that became MS-DOS, *the most popular operating system in the world*. As I’ve mentioned before, Tim was very intimidating to young me and did not always suffer foolish questions gladly, but it turned out that in this case he was very patient with all-caps THIS IS INEFFICIENT Eric. More patient than I likely deserved.

As Tim explained to me, first off, the reason why VB does this seemingly bizarre “find the first character match, then check if **query** is a prefix of **source**” logic is because the **skipto** method is not written in the naive fashion that I showed here. **The skipto method is a single x86 machine instruction.** (**REPNE SCASB,** maybe? My x86 machine code knowledge was never very good. It was something in the **REP** family at least.) It is *blazingly* fast. It harnesses the power of purpose-built hardware to solve the problem of “where’s that first character at?”

That explains that; it genuinely is a big perf win to let the hardware do the heavy lifting here. But what about the asymptotic problem? Well, as Tim patiently explained to me, guess what? Most VB developers are NOT asking if “aaaab” can be found in “aaaaaaa…”. The vast majority of VB developers are asking is “London” anywhere in this address, or similar problems where the strings are normal human-language strings without a lot of repetitions, and both the source and query strings are *short*. Like, very short. Less than 100 characters short. Fits into a cache line short.

Think about it this way; most **source** strings that VB developers are searching have any given character in them maybe 2% of the time, and so for whatever the start character is of the **query** string, the **skipto** step is going to find those 2% of possible matches *very quickly*. And then the **starts** step is the vast majority of the time going to *very quickly* identify false matches. **In practice the naive brute force algorithm is almost always O(n + m).**

Moreover, Tim explained to me, any solution that involves allocating a table, preprocessing strings, and so on, is going to take longer to do all that stuff than the blazingly-fast-99.9999%-of-the-time brute force algorithm takes to just give you the answer. The additional complexity is simply not worth it in scenarios that are relevant to VB developers. VB developers are developing line-of-business solutions, and their line of business is not typically genomics; if it is, they have special-purpose libraries for those problems; they’re not using **InStr**.

…

…

…

And we’re back in 2020. I hope you enjoyed that trip down memory lane.

It turns out that yes, fresh grads and keener interns *do* complain to senior developers about asymptotic efficiency, and senior developers *do* say “but nested for loops go *brrrrrrr*” — yes, they go *brrrrrr* extremely quickly much of the time, and senior developers know that!

And now I am the senior developer, and I try to be patient with the fresh grads as my mentors were patient with me.

UPDATE: Welcome, Hacker News readers. I always know when I’m linked from Hacker News because of the huge but short-lived spike in traffic. The original author of the meme that inspired this post has weighed in. Thanks for inspiring this trip back to a simpler time!

I have been thinking many times these last four years, and much more these last few days about the late Italian economic historian Carlo Cipolla. Not because of his economic theories, of which I know very little, but rather because of his theory of stupidity. You can read the principles in brief for yourself at the link above, or the original paper here, but I can summarize thus:

- **Powerful smart people** take actions that benefit both themselves and others.
- **Victims** lack the power to protect themselves. They are unable to find actions that benefit themselves, and are victimized to the benefit of others.
- **Bandits** take actions that benefit themselves at the expense of victims.
- **Stupid idiots** take actions that benefit neither themselves nor others.

These are value-laden terms so let’s be clear here that neither I nor Cipolla are suggesting that victims, bandits or idiots are not *intelligent*:

- No matter how intelligent you are and how many precautions you take, you can be victimized by a bandit or an idiot. Victims are not to blame for their victimization. We’ll come back to this in a moment.
- Bandits are often very intelligent; they just use their skills to victimize others. Whether that’s because they are genuinely not intelligent enough to make a living helping others, or because they are that intelligent but psychologically enjoy being a bandit, or are bandits for other reasons, it doesn’t matter for our purposes. Assume that bandits are extremely intelligent and devious, but motivated by gain.
- Idiots, ironically, are often very intelligent; a great many idiots have fancy degrees from excellent colleges. As Cipolla points out in his paper, *there is no characteristic that identifies idiots* other than their inability to act in a way that benefits anyone including themselves. That includes intelligence or lack thereof.

Some key consequences of this model have been on my mind these last few days:

- Bandits, even the psychopaths, are motivated by self-interest and recognize actions that benefit themselves. You can *reason* with a bandit, but more importantly, you can reason *about* a bandit, and therefore **you can make use of a bandit.** You can make an offer to a powerful bandit and count on them to take it up if it maximizes their gain.
- **You cannot reason with an idiot.** You can’t negotiate with them to anyone’s advantage because they will take positions that harm themselves at the same time as they harm others. There are no “useful idiots”; any attempt to use an idiot to benefit yourself will backfire horribly as they manage to find a way for everyone to lose.
- When the idiots are in power, **there is no bright line separating the smart from the victims**; rather, there is just a spectrum of more or less power and privilege. Victims by definition lack the power to defend themselves, and the more privileged have no lever to pull to change the course of the idiot, who will act with such brazen disregard for the well-being of everyone including themself that it is hard to devise a strategy.

All this is by way of introduction to say: the position that I am seeing on Twitter and in the media that “soon” is a good time to “re-start the economy” is **without question the stupidest, most idiotic position I have ever heard of in my life** and that includes

I’ll leave you with how Cipolla finishes his essay, because it sums up exactly how I feel at this moment in history.

**In a country which is moving downhill […] one notices among those in power an alarming proliferation of the bandits with overtones of stupidity and among those not in power an equally alarming growth in the number of helpless individuals. Such change in the composition of the non-stupid population inevitably strengthens the destructive power of the stupid fraction and makes decline a certainty. And the country goes to Hell.**

I am posting today from my recently-transformed spare room which is now apparently my office. Scott Hanselman started a great twitter thread of techies showing off their home workspaces; here’s my humble contribution.

We have my work Mac hooked up to two medium-sized HP monitors, one of which cost me all of $20 at a tech thrift store. The Windows game machine is under the desk. You’ll note that I finally found a use for my VSTO 2007 book. The keyboard is the new edition of the Microsoft Natural; my original edition Natural is still on my desk at work and is not currently retrievable.

I am particularly pleased with how the desk came out. I made it myself out of 110 year old cedar fence boards; when I bought my house in 1997 the original fence was still in the back yard and falling down, so I disassembled it, removed the nails, let the boards dry out, planed them down, and figured I’d eventually do something with it. I’ve been building stuff out of it ever since, and this project finished off the last of that stock.

Here’s a better shot of the desk.

The design is my own but obviously it is just a simple mission-style desk. All the joints are dowel and glue; the only metal is the two screws that hold the two drawer knobs on. The finish is just Danish oil with a little extra linseed oil added.

To the right I have a small writing desk:

Which as you may have guessed doubles as my 1954 Kenmore Zigzag Automatic Sewing Machine:

I have not used it in a while; I used to make kites. I might start again.

The manual for this machine is unintentionally hilarious, but that’s a good topic for another day.

Finally, not shown, I’ve got a futon couch and a few plants to make it cosy.

Stay safe everyone, and hunker down.

UPDATE: Obviously I’ve been spending so much time in video chat from home, which is very unusual for me. Unfortunately my setup is such that there is a south-facing window right behind me that overpowers the built-in camera on my laptop even with the curtains drawn.

I took Scott’s advice and got an inexpensive 8 inch ring light, shown here with the room otherwise dark:

The ring light is dimmable LED and has three colour temperatures, so I can now control the specularity of the light directly on my face, and also do some light shaping with the little non-dimmable desk lamp should I wish to. On a bright day the window no longer washes out the webcam image.

Before:

After:

The window is still an almost total white-out, but at least I no longer look like a purple-faced Walking Dead extra.

And good heavens do I ever need a haircut. That’ll have to wait.


The original question was: suppose we have an asynchronous workflow where we need to get an integer to pass to another method. Which of these, if any, is the better way to express that workflow?

Task<int> ftask = FAsync();
int f = await ftask;
M(f);

or

int f = await FAsync();
M(f);

or

M(await FAsync());

?

The answer of course is that all of these are the same workflow; they differ only in the verbosity of the code. You might argue that the code is easier to debug if you have one operation per line. Or you might argue that efficient use of vertical screen space is important for readability and so the last version is better. There’s not a clear best practice here, so do whatever you think works well for your application.

(If it is not clear to you that these are all the same workflow, remember that “await” does not magically make a synchronous operation into an asynchronous one, any more than “if(M())” makes M() a “conditional operation”. The await operator is just that: an operator that operates on values; the value returned by a method call is a value like any other! I’ll say more about the true meaning of await at the end of this episode.)

But now suppose we make a small change to the problem. What if instead we have:

M(await FAsync(), await GAsync());

? This workflow is equivalent to:

Task<int> ftask = FAsync();
int f = await ftask;
Task<int> gtask = GAsync();
int g = await gtask;
M(f, g);

but that causes the start of the GAsync task to be delayed until after the FAsync task finishes! If the execution of GAsync does not depend on the completion of FAsync then we would be better off writing:

Task<int> ftask = FAsync();
Task<int> gtask = GAsync();
int f = await ftask;
int g = await gtask;
M(f, g);

Or equivalently

Task<int> ftask = FAsync();
Task<int> gtask = GAsync();
M(await ftask, await gtask);

and possibly get some additional efficiency in our workflow; if FAsync is for some reason delayed then we can still work on GAsync’s workflow.
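As an aside, when the two operations really are independent, the same shape is often written with `Task.WhenAll`; here is a sketch reusing the hypothetical `FAsync`, `GAsync` and `M` from the examples above:

```csharp
// Start both tasks first, then asynchronously wait for both to complete.
Task<int> ftask = FAsync();
Task<int> gtask = GAsync();
int[] results = await Task.WhenAll(ftask, gtask);
M(results[0], results[1]);
```

The workflow is the same as the previous fragment; `WhenAll` just makes the “start everything first, pause once” intent explicit.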

Always remember when designing asynchronous workflows: an await is by definition a position in the workflow where the workflow pauses (asynchronously!) until a task completes. If it is possible to delay those pauses until later in the workflow, you can sometimes gain very real efficiencies!

First off, what is the precedence of operators in C? For our purposes today we’ll consider just three operators: `&&`, `&` and `==`, which I have listed in order of increasing precedence.

What is the problem? Consider:

int x = 0, y = 1, z = 0;
int r = (x & y) == z; // 1
int s = x & (y == z); // 0
int t = x & y == z;   // ?

Remember that before 1999, C had no Boolean type and that the result of a comparison is either zero for false, or one for true.

Is `t` supposed to equal `r` or `s`?

Many people are surprised to find out that `t` is equal to `s`! Because `==` is higher precedence than `&`, the comparison result is an input to the `&`, rather than the `&` result being an input to the comparison.

Put another way: reasonable people think that

x & y == z

should be parsed the same as

x + y == z

but it is not.

What is the origin of this egregious error that has tripped up countless C programmers? Let’s go way back in time to the very early days of C. In those days there was no `&&` operator. Rather, if you wrote

if (x() == y & a() == b) consequence;

the compiler would generate code as though you had used the && operator; that is, this had the same semantics as

if (x() == y) if (a() == b) consequence;

so that `a()` is not called if the left hand side of the `&` is false. However, if you wrote

int z = q() & r();

then both sides of the `&` would be evaluated, and the results would be binary-anded together.

That is, the meaning of `&` was context sensitive; in the condition of an `if` or `while` it meant what we now call `&&`, the “lazy” form, and everywhere else it meant binary arithmetic, the “eager” form.

However, in either context the `&` operator was lower precedence than the `==` operator. We want

if(x() == y & a() == b())

to be

if((x() == y) & (a() == b))

and certainly not

if((x() == (y & a())) == b)

This context-sensitive design was quite rightly criticized as confusing, and so Dennis Ritchie, the designer of C, added the `&&` operator, so that there were now separate operators for bitwise-and and short-circuit-and.

The correct thing to do at this point from a pure language design perspective would have been to make the operator precedence ordering `&&`, `==`, `&`. This would mean that both

if(x() == y && a() == b())

and

if(x() & a() == y)

would mean exactly what users expected.

However, Ritchie pointed out that doing so would cause a potential breaking change. Any existing program that had the fragment `if(a == b & c == d)` would remain correct if the precedence order was `&&`, `&`, `==`, but would become an incorrect program if the operator precedence was changed without also updating it to use `&&`.

There were several hundred kilobytes of existing C source code in the world at the time. **SEVERAL HUNDRED KB**. What if you made this change to the compiler and failed to update one of the `&` to `&&`, and made an existing program wrong via a precedence error? That’s a potentially disastrous breaking change.

You might say “just search all the source code for that pattern”, but this was two years before grep was invented! Searching source code back then was about as primitive as it could be.

So Ritchie maintained backwards compatibility forever and made the precedence order `&&`, `&`, `==`, effectively adding a little bomb to C that goes off every time someone treats `&` as though it parses like `+`, in order to maintain backwards compatibility with a version of C that only a handful of people ever used.

**But wait, it gets worse.**

C++, Java, JavaScript, C#, PHP and who knows how many other languages largely copied the operator precedence rules of C, so *they all have this bomb in them too.* (Swift, Go, Ruby and Python get it right.) Fortunately it is mitigated somewhat in languages that impose type system constraints; in C# it is an error to treat an int as a bool, but still it is vexing to require parentheses where they ought not to be necessary were there justice in the world. (And the problem is also mitigated in more modern languages by providing richer abstractions that obviate the need for frequent bit-twiddling.)
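To see both the inherited precedence and the C# mitigation in one place, here is a small sketch; the variables are purely illustrative:

```csharp
bool a = false, b = true, c = false;
bool t = a & b == c;   // parsed as a & (b == c): false & false, so t is false
bool u = (a & b) == c; // (false & true) == false, so u is true

int x = 0, y = 1, z = 0;
// int v = x & y == z; // does not compile: '&' cannot be applied to
                       // operands of type 'int' and 'bool'
```

The `bool` case shows that `==` still binds tighter than `&` in C#, while the `int` case shows the type system catching the pattern that silently misbehaves in C.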

The moral of the story is: *The best time to make a breaking change that involves updating existing code is now, because the bad designs that result from maintaining backwards compat unnecessarily can have repercussions for decades, and the amount of code to update is only going to get larger.* It was a mistake to not take the breaking change when there were only a few tens of thousands of lines of C code in the world to update. It’s fifty years since this mistake was made, and since it has become embedded in popular successor languages we’ll be dealing with its repercussions for fifty more at least, I’d wager.

UPDATE: The most common feedback I’ve gotten from this article is *“you should always use parentheses when it is unclear”*. Well, obviously, yes. But that rather misses the point, which is that **there is no reason for the novice developer to suppose that the expression x & y == z is under-parenthesized when x + y == z works as expected.** The design of a language should lead us to naturally write correct code without having to think *“will I be punished for my arrogance in believing that code actually does what it looks like it ought to?”*

Twitter user Plazmaz brought a scam github repository and web site to my attention; see his thread on Twitter for details. It’s a pretty obviously fake site, and there is some evidence in the metadata Plazmaz uncovered that indicates it is a university cybersecurity student project — or, that the scammers want investigators to think that it is.

The reason it was brought to my attention is because the authors of the site used a photo from this blog as part of their scheme! The scammer blog post is here and my original is here.

If this is a university project: **please do not teach your students that it is acceptable to use other people’s work in your coursework without attribution or permission.** You would not tolerate students passing off someone else’s work as their own in other academic pursuits.

If this is a scam then the fact that they’re using a stolen photo — and one that is easily seen to be stolen! — as part of their scheme might seem like a flaw, but in fact it is a feature of the scam. The scammers are looking for unsophisticated and gullible people who will be easily fooled; making the deception easy to uncover is therefore a filter that excludes people of normal gullibility from the pool of possible victims. This great paper from Microsoft Research goes into the math.

Admiral Picard (retired) apparently has the same 1982 science fiction book club edition of The Complete Robot handy on his desk as I have on mine:

though frankly, his copy seems to be in better shape than mine.

Anyone know what the book below it is?

UPDATE: My friend Brian R has identified a likely candidate for the second book. It appears to be the Easton Press edition of The Three Musketeers:

UPDATE: Later episodes of the series confirm these hypotheses; apparently these were not so much Easter eggs as subtle foreshadowing.

There has been some discussion on tech twitter lately on the subject of whether it is possible to be “successful” in the programming business without working long hours. I won’t dignify the posts which started this conversation off — firmly in the “not possible” camp — with a link; you can find them easily enough I suspect.

My first thought upon seeing this discussion was “*well that’s just dumb*”. The whole thing struck me as basically illogical for two reasons. First, because it was vague; “success” is relative to goals, and everyone has different goals. Second, because any universal statement like *“the only way to achieve success in programming is by working long hours”* can be refuted by a single counterexample and I am one! My career has been a success so far; I’ve worked on interesting technology, mentored students, made friends along the way, and been well compensated. But I have worked long hours very rarely; only a handful of times in 23 years.

Someone said something dumb on the internet, the false universal statement was *directly refuted* by me in a *devastatingly logical* manner just now, and we can all move on, right?

Well, no.

My refutation — my personal, anecdotal refutation — answers in the affirmative the question *“Is it possible for any one computer programmer, anywhere in the world right now, to be successful without working long hours?”* but that is not an interesting or relevant question. *My first thought was also pretty dumb.*

Can we come up with some better questions? Let’s give it a shot. I’ll start with the personal and move to the general.

*We’ve seen that long hours were not a necessary precondition to my success. What were the sufficient preconditions?*

I was born into a middle-class, educated family in Canada. I had an excellent public education with teachers who were experts in their fields and genuinely cared about their students. I used family connections to get good high school jobs with strong mentors. Scholarships, internships, a supportive family and some talent for math allowed me to graduate from university with in-demand skills and no debt, with a *career* waiting for me, not just a *job.* I’ve been in good health my whole life. When I had problems I had access to professionals who helped me, and who were largely paid by insurance.

Did I *work* throughout all of that? Sure! Was it always *easy*? No! But **my privileged background enabled me to transform working reasonable hours at a desk into success.**

Now it is perhaps more clear why my “refutation” was so dumb, and that brings us to our next better question:

*If we subtract some of those privileges, does it become more and more likely that working long hours becomes a necessary precondition for success in our business?*

If you’re starting on a harder difficulty level — starting from poverty, without industry or academic connections, if you’re self-taught, if you’re facing the headwinds of discrimination, prejudice or harassment, if you have legal or medical or financial or family problems to solve on top of work problems — *there are not that many knobs you can turn* that increase your chance of success. It seems reasonable that “work more hours” is one of those knobs you can turn much more easily than “get more industry contacts”.

The original statement is maybe a little too strong, but what if we weaken it a bit? Maybe to something like “*working long hours is a good idea in this business because it greatly increases your chances of success, particularly if you’re facing a headwind.*” What if we charitably read the original statement more like that?

This is a statement that might be true or it might be false. We could do research to find out — and indeed, there is some research to suggest that there is **not** a clear causation between working more hours and being more successful. But the point here is that the weakened statement is at least *not immediately refutable*.

This then leads us from a question about how the world *is* to how it *ought* to be, but I’m going to come back to that one. Before that I want to dig in a bit more to the original statement, not from the point of view of *correctness*, or even *plausibility*, but from the point of view of *who benefits by making the statement*.

*Suppose we all take to heart the advice that we should be working longer to achieve success. Who benefits?*

I don’t know the people involved, and I don’t like to impute motives to people I don’t know. I encourage people to read charitably. **But I am having a hard time believing the apologia I outlined in the preceding section was intended**. The intended call to action here was not “*let’s all think about how structural issues in our economy and society incent workers from less privileged backgrounds to work longer hours for the same pay.*” *Should* we think about that? Yes. But that was not the *point*. The point being made was a *lot* simpler.

The undeniable subtext to *“you need to work crazy hours to succeed”* is *“anyone not achieving success has their laziness to blame; they should have worked harder, and you don’t want to be like them, do you?”*

**That is propaganda.** When you say the quiet part out loud, it sounds more like *“the income of the idle rich depends on capturing the value produced by the labours of everyone else, so make sure you are always producing value that they can capture. Maybe they will let you see some of that value, someday.”*

Why would anyone choose to produce value to be confiscated by billionaires? **Incentives matter and the powerful control the incentives.** Success is the *carrot*; poverty and/or crippling debt is the *stick*.

Those afforded less privilege get more and more of the stick. **If hard work and long hours could be consistently transformed into “success”, then my friends and family who are teachers, nurses, social workers and factory workers would be far more successful than I am.** They definitely work both longer and harder than I do, but they have far less ability to transform that work into success.

That to me is the real reason to push back on the notion that *long hours and hard work are a necessary precondition of success*: not because it is *false* but because **it is propaganda in service of weakening further the less privileged**. *“It is proper and desirable to weaken the already-weak in order to further strengthen the already-strong”* is as good a working definition of “evil” as you’re likely to find.

The original statement isn’t helpful advice. It isn’t a rueful commentary on disparity in the economy. **It’s a call to produce more profit now in return for little more than a vague promise of possible future success.**

*Should long hours be a precondition for success for anyone irrespective of their privileges?*

First off, I would like to see a world where everyone started with a stable home, food on the table, a high quality education, and so on, and I believe we should be working towards that end as a society, and as a profession.

We’re not there, and *I don’t know how to get there*. Worse, there are powerful forces that prefer increasing disparities rather than reducing them.

Software is in many ways unique. It’s the reification of algorithmic thought. It has effectively zero marginal costs. The industry is broad and affords contributions from people at many skill levels and often irrespective of location. The tools that we build amplify others’ abilities. And we build better tools for the world when the builders reflect the diversity of that world.

I would much rather see a world in which **anyone with the interest in this work could be as successful as I have been**, than this world where the majority have to sacrifice extra time and energy in the service of profits they don’t share in.

Achieving that will be hard, and like I said, I don’t know how to effect a structural change of this magnitude. But we can at least start by recognizing propaganda when we see it, and calling it out.

I hate to end the decade on my blog on such a down note, but 2020 is going to be hard for a lot of people, and we are all going to hear a *lot* of propaganda. Watch out for it, and don’t be fooled by it.

If you’re successful, that’s great; I am very much in favour of success. **See if you can use some of your success in 2020 to increase the chances for people who were not afforded all the privileges that turned your work into that success.**

Happy New Year all; I hope we all do good work that leads to success in the coming year. We’ll pick up with some more fabulous adventures in coding in 2020.

Thanks to my friend @editorlisaquinn for her invaluable assistance in helping me clarify my thoughts for this post.

There is a reason why I did that topic before “Fixing Random”, but sadly I never got to the connection between differential calculus and sampling from an arbitrary distribution. I thought I might spend a couple of episodes sketching out how it is useful to use automatic differentiation when trying to sample from a distribution. I’m not going to write the code; I’ll just give some of the “flavour” of the algorithms.

Before I get into it, a quick refresher on the Metropolis algorithm. The algorithm is:

- We have a probability distribution function that takes a point and returns a double that is the “probability weight” of that point. The PDF need not be “normalized”; that is, the area under the curve need not add up to 1.0. We wish to generate a series of samples that conforms to this distribution.
- We choose a random “initial point” as the *current* sample.
- We randomly choose a *candidate* sample from some distribution based solely on the current sample.
- If the candidate is higher weight than current, it becomes the new current and we yield the value as the sample.
- If it is lower weight, then we take the ratio of candidate weight to current weight, which will be between 0.0 and 1.0. We flip an unfair coin with that probability of getting heads. Heads, we accept the candidate, tails we reject it and try again.
- Repeat; choose a new candidate based on the new current. (A minimal code sketch of this loop follows the list.)
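Here is that loop as a minimal C# sketch. It is not the series’ actual implementation; it assumes a one-dimensional weight function and a simple normal proposal, and it re-yields the current sample on a rejection, which is the standard way to keep the output stream unbiased.

```csharp
// A minimal Metropolis sampler; weight is the non-normalized PDF.
static IEnumerable<double> Metropolis(
    Func<double, double> weight, double initial, double stdDev, Random random)
{
    double current = initial;
    while (true)
    {
        // Propose a candidate from a normal centered on the current sample.
        double candidate = current + stdDev * NextGaussian(random);
        double ratio = weight(candidate) / weight(current);
        // Accept a better candidate outright; accept a worse one with
        // probability equal to the weight ratio.
        if (ratio >= 1.0 || random.NextDouble() < ratio)
            current = candidate;
        yield return current;
    }
}

// Box-Muller transform: turn two uniform samples into one standard normal.
static double NextGaussian(Random random)
{
    double u1 = 1.0 - random.NextDouble();
    double u2 = random.NextDouble();
    return Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
}
```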

The Metropolis algorithm is straightforward and works, but it has a few problems.

**How do we choose the initial point?**

Since Metropolis is typically used to compute a posterior distribution after an observation, and we typically have the prior distribution in hand, we can use the prior distribution as our source of the initial point.

**What if the initial point is accidentally in a low-probability region?** We might produce a series of unlikely samples before we eventually get to a high-probability current point.

We can solve this by “burning” — discarding — some number of initial samples; we waste computation cycles so we would like the number of samples it takes to get to “convergence” to the true distribution to be small. As we’ll see, there are ways we can use automatic differentiation to help solve this problem.

**What distribution should we use to choose the next candidate given the current sample?**

This is a tricky problem. The examples I gave in this series were basically “choose a new point by sampling from a normal distribution where the mean is the current point”, which seems reasonable, but then you realize that the question has been begged. A normal distribution has two parameters: the mean and the standard deviation. The standard deviation corresponds to “how big a step should we typically try?” If the deviation is too large then we will step from high-probability regions to low-probability regions frequently, which means that we discard a lot of candidates, which wastes time. If it is too small then we get “stuck” in a high-probability region and produce a lot of samples close to each other, which is also bad.

Basically, we have a “tuning parameter” in the standard deviation and it is not obvious how to choose it to get the right combination of good performance and uncorrelated samples.

These last two problems lead us to ask an important question: **is there information we can obtain from the weight function that helps us choose a consistently better candidate?** That would lower the time to convergence and might also result in fewer rejections when we’ve gotten to a high-probability region.

I’m going to sketch out one such technique in this episode, and another in the next.

As I noted above, Metropolis is frequently used to sample points from a high-dimensional distribution; to make it easier to understand, I’ll stick to one-dimensional cases here, but imagine that instead of a simple curve for our PDF, we have a complex multidimensional surface.

Let’s use as our motivating example the mixture model from many episodes ago:

Of course we can sample from this distribution directly if we know that it is the sum of two normal distributions, but let’s suppose that we don’t know that. We just have a function which produces this weight. Let me annotate this function to say where we want to go next if the current sample is in a particular region.

I said that we could use the derivative to help us, but it is very unclear from this diagram how the derivative helps:

- The derivative is small and positive in the region marked “go hard right” and in the immediate vicinity of the two peaks and one valley.
- The derivative is large and positive in the “slight right” region and to the left of the tall peak.
- The derivative is large and negative in the “slight left” region and on the right of the small peak.
- The derivative is small and negative in the “hard left” region and in the immediate vicinity of the peaks and valley.

No particular value for the derivative clearly identifies a region of interest. It seems like we cannot use the derivative to help us out here; what we really want is to move away from small-area regions and towards large-area regions.

Here’s the trick.

Ready?

I’m going to graph the *log* of the weight function below the weight function:

*Now look at the slope of the log-weight*. It is very positive in the “move hard right” region, and becomes more and more positive the farther left we go! Similarly in the “move hard left” region; the slope of the log-weight is very negative, and becomes more negative to the right.

In the “slight right” and “slight left” regions, the slope becomes more moderate, and when we are in the “stay around here” region, the slope of the log-weight is close to zero. *This is what we want.*

(ASIDE: Moreover, this is even more what we want because in realistic applications we often *already* have the log-weight function in hand, not the weight function. Log weights are convenient because you can represent arbitrarily small probabilities with “normal sized” numbers.)

We can then use this to modify our candidate proposal distribution as follows: rather than using a normal distribution centered on the *current* point to propose a candidate, **we compute the derivative of the log of the weight function using dual numbers**, and *we use the size and sign of the slope to tweak the center of the proposal distribution*.

That is, if our current point is far to the left, we see that the slope of the log-weight is very positive, so we move our proposal distribution some amount to the right, and then we are more likely to get a candidate value that is in a higher-probability region. But if our current point is in the middle, the slope of the log-weight is close to zero so we make only a small adjustment to the proposal distribution.

(And again, I want to emphasize: realistically we would be doing this in a high-dimensional space, not a one-dimensional space. We would compute the *gradient* — the direction in which the slope increases the most — and head that direction.)

If you work out the math, which I will not do here, the difference is as follows. Suppose our non-normalized weight function is *p*.

- In the plain-vanilla proposal algorithm we would use as our candidate distribution a normal centered on *current* with standard deviation *s*.
- In our modified version we would use as our candidate distribution a normal centered on `current + (s / 2) * ∇log(p(current))`, with standard deviation *s*, as sketched in code below.
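Here is a sketch of that adjusted proposal, again one-dimensional, using a finite-difference slope where the series would use dual-number automatic differentiation, and reusing the `NextGaussian` helper from the sketch above:

```csharp
// Propose a candidate whose mean is nudged uphill along the log-weight.
static double ProposeLangevin(
    Func<double, double> logWeight, double current, double s, Random random)
{
    // Approximate the slope of the log-weight at the current point; the
    // series would compute this exactly with dual numbers.
    const double h = 1e-6;
    double slope = (logWeight(current + h) - logWeight(current - h)) / (2.0 * h);

    // Center the normal proposal on current + (s / 2) * slope, deviation s.
    double mean = current + (s / 2.0) * slope;
    return mean + s * NextGaussian(random);
}
```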

Even without the math to justify it, this should seem reasonable. The typical step in the vanilla algorithm is on the order of the standard deviation; we’re making an adjustment towards the higher-probability region of about half a step if the slope is moderate, and a small number of steps if the slope is severe; the areas where the slope is severe are the most unlikely areas so we need to get out of them quickly.

If we do this, we end up doing more math on each step (to compute the log if we do not have it already, and the gradient) but **we converge to the high-probability region much faster.**

If you’ve been following along closely you might have noticed two issues that I seem to be ignoring.

First, we have not eliminated the need for the user to choose the tuning parameter *s*. Indeed, this only addresses one of the problems I identified earlier.

Second, the Metropolis algorithm requires for its correctness that the proposal distribution *not ever be biased in one particular direction!* But the whole *point* of this improvement is to bias the proposal towards the high-probability regions. Have I pulled a fast one here?

I have, but we can fix it. I mentioned in the original series that I would be discussing the Metropolis algorithm, which is the oldest and simplest version of this algorithm. In practice we use a variation on it called Metropolis-Hastings which adds a correction factor to allow non-symmetric proposal distributions.

The mechanism I’ve sketched out today is called the Metropolis Adjusted Langevin Algorithm and it is quite interesting. It turns out that this technique of “walk in the direction of the gradient plus a random offset” is also how physicists model movements of particles in a viscous fluid where the particle is being jostled by random molecule-scale motions in the fluid. (That is, by Brownian motion.) It’s nice to see that there is a physical interpretation in what would otherwise be a very abstract algorithm to produce samples.

**Next time on FAIC:** The fact that we have a connection to a real-world physical process here is somewhat inspiring. In the next episode I’ll give a sketch of another technique that uses ideas from physics to improve the accuracy of a Metropolis process.


Welcome to this special bonus episode of *Fixing Random,* the immensely long blog series where I discuss ways to add probabilistic programming features into C#. I ran into an interesting problem at work that pertains to the techniques that we discussed in this series, so I thought I might discuss it a bit today.

Let’s suppose we have three forts, Fort Alpha, Fort Bravo and Fort Charlie at the base of a mountain. They are constantly passing messages back and forth by carrier pigeon. Alpha and Charlie are too far apart to fly a pigeon directly, so messages from Alpha to Charlie first go from Alpha to Bravo, and then on from Bravo to Charlie. (Similarly, messages from Charlie to Alpha go via Bravo, but we’ll not worry about that direction for the purposes of this discussion.)

Carrier pigeons are of course an unreliable mechanism for passing messages, so let’s model this as a Bernoulli process; every time we send a message from Alpha to Bravo, or Bravo to Charlie, we flip an unfair coin. Heads, the bird gets through, tails it gets lost along the way.

From this we can predict the reliability of passing a message from Alpha to Charlie via Bravo; the probability of failure is the probability that A-B fails or B-C fails (or both). Equivalently, the probability of success is the probability of A-B succeeding and B-C succeeding. This is just straightforward, basic probability; if the probability of success from A-B is, say, 95% and B-C is 96%, then the probability of success from A to C is their product, around 91%.

**Aside:** Note that I am assuming in this article that pigeons are passing from Bravo to Charlie even if a pigeon failed to arrive from Alpha; I’m *not* trying to model in this system constraints like “pigeons only fly from Bravo to Charlie when one arrived from Alpha”.

Now let’s add an extra bit of business.

We have an observer on the mountaintop at Fort Delta overlooking Alpha, Bravo and Charlie. Delta has some high-power binoculars and is recording carrier pigeon traffic from Alpha to Bravo and Bravo to Charlie. But here’s the thing: Delta is an unreliable observer, because observing carrier pigeons from a mountaintop is inherently error-prone; *sometimes Delta will fail to see a pigeon. *Let’s say that 98% of the time, Delta observes a pigeon that is there, and Delta never observes a pigeon that is not there.

Every so often, Delta issues a report: either “*the channel from Alpha to Charlie is healthy*” if Delta has observed a pigeon making it from Alpha to Bravo *and* also a pigeon making it from Bravo to Charlie. But if Delta has just failed to observe either a pigeon going from Alpha to Bravo, *or* a pigeon going from Bravo to Charlie, then Delta issues a report saying “*the channel from Alpha to Charlie is unhealthy*“.

The question now is: suppose Delta issues a report that the Alpha-Charlie channel is unhealthy. *What is the probability that a pigeon failed to get from Alpha to Bravo, and what is the probability that a pigeon failed to get from Bravo to Charlie?* Each is surely much higher than the 5-ish percent chance that is our prior.

We can use the gear we developed in the early part of my Fixing Random series to answer this question definitively, but before we do, **make a prediction.** If you recall episode 16, you’ll remember that you can have a 99% accurate test but the posterior probability of having the disease that the test diagnoses is only 50% when you test positive; *this is a variation on that scenario.*

Rather than defining multiple enumerated types as I did in earlier episodes, or even using bools, let’s just come up with a straightforward numeric encoding. We’ll say that **1** represents “a pigeon failed to make the journey”, and **0** means “a pigeon successfully made the journey” — if that seems backwards to you, I agree but it will make sense in a minute.

Similarly, we’ll say that **1** represents “Delta’s attempt to observe a pigeon has failed”, and **0** as success, and finally, that **1** represents Delta making the report “the channel is unhealthy” and **0** represents “the channel is healthy”.

The reason I’m using **1** in all these cases to mean “something failed” is because I want to use **OR** to get the final result. Let’s build our model:

var ab = Bernoulli.Distribution(95, 5);
var bc = Bernoulli.Distribution(96, 4);
var d = Bernoulli.Distribution(98, 2);

- 5% of the time, `ab` reports **1**: the pigeon failed to get through.
- 4% of the time, `bc` reports **1**: the pigeon failed to get through.
- 2% of the time, `d` reports **1**: it fails to see a pigeon that is there.

Now we can ask and answer our question about the posterior: what do we know about the posterior distribution of pigeons making it from Alpha to Bravo and Bravo to Charlie? We’ll sample from `ab` and `bc` once to find out if a pigeon failed to get through, and then ask whether Delta failed to observe the pigeons.

What is the condition that causes Delta to report that the channel is unhealthy?

- the pigeon from Alpha to Bravo failed, OR
- the pigeon from Bravo to Charlie failed, OR
- Delta failed to observe Alpha’s pigeon, OR
- Delta failed to observe Bravo’s pigeon.

We observe that Delta reports that the channel is unhealthy, so we’ll add a **where** clause to condition the result, and then print out the resulting posterior joint probability:

var result = from pab in ab
             from pbc in bc
             from oab in d
             from obc in d
             let report = pab | pbc | oab | obc
             where report == 1
             select (pab, pbc);
Console.WriteLine(result.ShowWeights());

This is one of several possible variations on the “**noisy OR distribution**” — that is, the distribution that we get when we **OR** together a bunch of random Boolean quantities, but where the **OR** operation itself has some probabilistic “noise” attached to it.

**Aside:** That’s why I wanted to express this in terms of OR-ing together quantities; of course we can always turn this into a “noisy AND” by doing the appropriate arithmetic, but typically this distribution is called “noisy OR”.

We get these results:

(0, 0):11286 -- about 29%
(0, 1):11875 -- about 30%
(1, 0):15000 -- about 39%
(1, 1):625 -- less than 2%

Remember that **(0, 0)** means that pigeons *did* make it from Alpha to Bravo and from Bravo to Charlie; in every other case we had at least one failure.

It should not be too surprising that **(1, 1)** — both the Alpha and Bravo pigeons failed simultaneously — is the rarest case because after all, that happens less than 9% of the cases overall, so it certainly should happen in some smaller percentage of the “unhealthy report” cases.

But the possibly surprising result is: when Delta reports a failure, there is a 29% chance that this is a false positive report of failure, and in fact it is almost as likely to be a false positive as it is to be a dropped packet, I mean *lost pigeon*, between Bravo and Charlie!

Put another way, if you get an “unhealthy” report, 3 times out of 10, the report is wrong and you’ll be chasing a wild goose. Just as we saw with false positives for tests for diseases, *if the test failure rate is close to the disease rate of the population, then false positives make up a huge percentage of all positives.*
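If you do not have the series’ library handy, here is a small self-contained sketch that reproduces the same posterior by exact enumeration rather than sampling; the variable names are mine, not the series’:

```csharp
// Failure probabilities: pigeon A->B, pigeon B->C, Delta misses A->B, Delta misses B->C.
double[] pFail = { 0.05, 0.04, 0.02, 0.02 };
var posterior = new Dictionary<(int, int), double>();
double total = 0.0;
for (int pab = 0; pab <= 1; pab += 1)
for (int pbc = 0; pbc <= 1; pbc += 1)
for (int oab = 0; oab <= 1; oab += 1)
for (int obc = 0; obc <= 1; obc += 1)
{
    // Joint probability of this particular combination of failures.
    double weight = (pab == 1 ? pFail[0] : 1 - pFail[0])
                  * (pbc == 1 ? pFail[1] : 1 - pFail[1])
                  * (oab == 1 ? pFail[2] : 1 - pFail[2])
                  * (obc == 1 ? pFail[3] : 1 - pFail[3]);
    int report = pab | pbc | oab | obc;  // the noisy OR
    if (report != 1)
        continue;                        // condition on an "unhealthy" report
    total += weight;
    posterior.TryGetValue((pab, pbc), out double sum);
    posterior[(pab, pbc)] = sum + weight;
}
foreach (var pair in posterior)
    Console.WriteLine($"{pair.Key}: {pair.Value / total:P1}");
```

This prints roughly 29.1%, 30.6%, 38.7% and 1.6% for (0, 0), (0, 1), (1, 0) and (1, 1) respectively, which matches the sampled weights above.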

My slip-up there of course illuminates what you figured out long ago; all of this whimsy about forts and pigeons and mountains is just a silly analogy. Of course what we really have is not three forts connected by two carrier pigeon routes, but ten thousand machines and hundreds of routers in a data center connected by network cabling in a known topology. Instead of pigeons we have trillions of packets. Instead of an observer in a fort on a mountaintop, we have special supervising software or hardware that is trying to detect failures in the network so that they can be diagnosed and fixed. Since the failure detection system is itself part of the network, it *also* is unreliable, which introduces “noise” into the system.

The real question at hand is: given prior probabilities on the reliability of each part of the system including the reporting system itself, *what are the most likely posterior probabilities that explain a report that some part of the network is unhealthy*?

This is a highly practical and interesting question to answer because it means that network engineers can quickly narrow down the list of possible faulty components given a failure report to the most likely culprits. **The power of probabilistic extensions in programming languages is that we now have the tools that we need to concisely express both those models and the observations that we need explanations for, and then generate the answers automatically.**

Of course I have just given some of the flavor of this problem space and I’m sure you can imagine a dozen ways to make the problem more interesting:

- What if instead of a coin flip that represents “dropped” or “delivered” on a packet, we had a more complex distribution on every edge of the network topology graph — say, representing average throughput?
- How does moving to a system with continuous probabilities change our analysis and the cost of producing that analysis?
- And so on.

We can use the more complex tools I developed in my series, like the Metropolis method, to solve these harder problems; however, I think I’ll leave it at that for now.
