Admiral Picard (retired) apparently has the same 1982 science fiction book club edition of The Complete Robot handy on his desk as I have on mine:

though frankly, his copy seems to be in better shape than mine.

Anyone know what the book below it is?

UPDATE: My friend Brian R has identified a likely candidate for the second book. It appears to be the Easton Press edition of The Three Musketeers:


There has been some discussion on tech twitter lately on the subject of whether it is possible to be “successful” in the programming business without working long hours. I won’t dignify the posts which started this conversation off — firmly in the “not possible” camp — with a link; you can find them easily enough I suspect.

My first thought upon seeing this discussion was “*well, that’s just dumb*”. The whole thing struck me as basically illogical for two reasons. First, because it was vague; “success” is relative to goals, and everyone has different goals. Second, because any universal statement like *“the only way to achieve success in programming is by working long hours”* can be refuted by a single counterexample, and I am one! My career has been a success so far; I’ve worked on interesting technology, mentored students, made friends along the way, and been well compensated. But I have worked long hours very rarely; only a handful of times in 23 years.

Someone said something dumb on the internet, the false universal statement was *directly refuted* by me in a *devastatingly logical* manner just now, and we can all move on, right?

Well, no.

My refutation — my personal, anecdotal refutation — answers in the affirmative the question *“Is it possible for any one computer programmer, anywhere in the world right now, to be successful without working long hours?”* but that is not an interesting or relevant question. *My first thought was also pretty dumb.*

Can we come up with some better questions? Let’s give it a shot. I’ll start with the personal and move to the general.

*We’ve seen that long hours were not a necessary precondition to my success. What were the sufficient preconditions?*

I was born into a middle-class, educated family in Canada. I had an excellent public education with teachers who were experts in their fields and genuinely cared about their students. I used family connections to get good high school jobs with strong mentors. Scholarships, internships, a supportive family and some talent for math allowed me to graduate from university with in-demand skills and no debt, with a *career* waiting for me, not just a *job.* I’ve been in good health my whole life. When I had problems I had access to professionals who helped me, and who were largely paid by insurance.

Did I *work* throughout all of that? Sure! Was it always *easy*? No! But **my privileged background enabled me to transform working reasonable hours at a desk into success.**

Now it is perhaps more clear why my “refutation” was so dumb, and that brings us to our next better question:

*If we subtract some of those privileges, does it become more and more likely that working long hours becomes a necessary precondition for success in our business?*

If you’re starting on a harder difficulty level — starting from poverty, without industry or academic connections, if you’re self-taught, if you’re facing the headwinds of discrimination, prejudice or harassment, if you have legal or medical or financial or family problems to solve on top of work problems — *there are not that many knobs you can turn* that increase your chance of success. It seems reasonable that “work more hours” is one of those knobs you can turn much more easily than “get more industry contacts”.

The original statement is maybe a little too strong, but what if we weaken it a bit? Maybe to something like “*working long hours is a good idea in this business because it greatly increases your chances of success, particularly if you’re facing a headwind.*” What if we charitably read the original statement more like that?

This is a statement that might be true or it might be false. We could do research to find out — and indeed, there is some research suggesting that there is **not** a clear causal link between working more hours and being more successful. But the point here is that the weakened statement is at least *not immediately refutable*.

This then leads us from a question about how the world *is* to how it *ought* to be, but I’m going to come back to that one. Before that I want to dig in a bit more to the original statement, not from the point of view of *correctness*, or even *plausibility*, but from the point of view of *who benefits by making the statement*.

*Suppose we all take to heart the advice that we should be working longer to achieve success. Who benefits?*

I don’t know the people involved, and I don’t like to impute motives to people I don’t know. I encourage people to read charitably. **But I am having a hard time believing the apologia I outlined in the preceding section was intended**. The intended call to action here was not “*let’s all think about how structural issues in our economy and society incent workers from less privileged backgrounds to work longer hours for the same pay.*” *Should* we think about that? Yes. But that was not the *point*. The point being made was a *lot* simpler.

The undeniable subtext to *“you need to work crazy hours to succeed”* is *“anyone not achieving success has their laziness to blame; they should have worked harder, and you don’t want to be like them, do you?”*

**That is propaganda.** When you say the quiet part out loud, it sounds more like *“the income of the idle rich depends on capturing the value produced by the labours of everyone else, so make sure you are always producing value that they can capture. Maybe they will let you see some of that value, someday.”*

Why would anyone choose to produce value to be confiscated by billionaires? **Incentives matter and the powerful control the incentives.** Success is the *carrot*; poverty and/or crippling debt is the *stick*.

Those afforded less privilege get more and more of the stick. **If hard work and long hours could be consistently transformed into “success”, then my friends and family who are teachers, nurses, social workers and factory workers would be far more successful than I am.** They definitely work both longer and harder than I do, but they have far less ability to transform that work into success.

That to me is the real reason to push back on the notion that *long hours and hard work are a necessary precondition of success*: not because it is *false* but because **it is propaganda in service of weakening further the less privileged**. *“It is proper and desirable to weaken the already-weak in order to further strengthen the already-strong”* is as good a working definition of “evil” as you’re likely to find.

The original statement isn’t helpful advice. It isn’t a rueful commentary on disparity in the economy. **It’s a call to produce more profit now in return for little more than a vague promise of possible future success.**

*Should long hours be a precondition for success for anyone irrespective of their privileges?*

First off, I would like to see a world where everyone started with a stable home, food on the table, a high quality education, and so on, and I believe we should be working towards that end as a society, and as a profession.

We’re not there, and *I don’t know how to get there*. Worse, there are powerful forces that prefer increasing disparities rather than reducing them.

Software is in many ways unique. It’s the reification of algorithmic thought. It has effectively zero marginal costs. The industry is broad and affords contributions from people at many skill levels and often irrespective of location. The tools that we build amplify others’ abilities. And we build better tools for the world when the builders reflect the diversity of that world.

I would much rather see a world in which **anyone with an interest in this work could be as successful as I have been**, than this world where the majority have to sacrifice extra time and energy in the service of profits they don’t share in.

Achieving that will be hard, and like I said, I don’t know how to effect a structural change of this magnitude. But we can at least start by recognizing propaganda when we see it, and calling it out.

I hate to end the decade on my blog on such a down note, but 2020 is going to be hard for a lot of people, and we are all going to hear a *lot* of propaganda. Watch out for it, and don’t be fooled by it.

If you’re successful, that’s great; I am very much in favour of success. **See if you can use some of your success in 2020 to increase the chances for people who were not afforded all the privileges that turned your work into that success.**

Happy New Year all; I hope we all do good work that leads to success in the coming year. We’ll pick up with some more fabulous adventures in coding in 2020.

Thanks to my friend @editorlisaquinn for her invaluable assistance in helping me clarify my thoughts for this post.

There is a reason why I did that topic before “Fixing Random”, but sadly I never got to the connection between differential calculus and sampling from an arbitrary distribution. I thought I might spend a couple of episodes sketching out how it is useful to use automatic differentiation when trying to sample from a distribution. I’m not going to write the code; I’ll just give some of the “flavour” of the algorithms.

Before I get into it, a quick refresher on the Metropolis algorithm. The algorithm is:

- We have a probability distribution function that takes a point and returns a double that is the “probability weight” of that point. The PDF need not be “normalized”; that is, the area under the curve need not add up to 1.0. We wish to generate a series of samples that conforms to this distribution.
- We choose a random “initial point” as the *current* sample.
- We randomly choose a *candidate* sample from some distribution based solely on the current sample.
- If the candidate is higher weight than the current sample, it becomes the new current and we yield the value as the sample.
- If it is lower weight, then we take the ratio of candidate weight to current weight, which will be between 0.0 and 1.0. We flip an unfair coin with that probability of getting heads. Heads, we accept the candidate, tails we reject it and try again.
- Repeat; choose a new candidate based on the new current.
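The series implements this in C#, which I won’t reproduce here, but the steps above can be sketched in a few lines of Python. This is a hypothetical sketch with names of my own invention; note that on rejection this version re-emits the current sample, which is the textbook behaviour and is equivalent in effect to “try again”.

```python
import math
import random

def metropolis(weight, initial, step_stddev, burn=0):
    """Yield a stream of samples approximately distributed according
    to the given non-normalized weight function."""
    current = initial
    count = 0
    while True:
        # Choose a candidate from a normal centered on the current
        # sample -- the proposal distribution discussed below.
        candidate = random.gauss(current, step_stddev)
        ratio = weight(candidate) / weight(current)
        # A higher-weight candidate is always accepted; a lower-weight
        # candidate is accepted with probability equal to the weight
        # ratio -- the "unfair coin" flip. On rejection, the current
        # sample is simply emitted again.
        if ratio >= 1.0 or random.random() < ratio:
            current = candidate
        count += 1
        if count > burn:  # "burn" (discard) the first few samples
            yield current

# Example: sample from an unnormalized standard normal curve.
gen = metropolis(lambda x: math.exp(-x * x / 2), 0.0, 1.0, burn=100)
samples = [next(gen) for _ in range(10000)]
```

With a weight function proportional to the standard normal curve, the sample mean and variance come out near 0 and 1, as expected.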

The Metropolis algorithm is straightforward and works, but it has a few problems.

**How do we choose the initial point?**

Since Metropolis is typically used to compute a posterior distribution after an observation, and we typically have the prior distribution in hand, we can use the prior distribution as our source of the initial point.

**What if the initial point is accidentally in a low-probability region?** We might produce a series of unlikely samples before we eventually get to a high-probability current point.

We can solve this by “burning” — discarding — some number of initial samples. This wastes computation cycles, so we would like the number of samples it takes to get to “convergence” to the true distribution to be small. As we’ll see, there are ways we can use automatic differentiation to help solve this problem.

**What distribution should we use to choose the next candidate given the current sample?**

This is a tricky problem. The examples I gave in this series were basically “choose a new point by sampling from a normal distribution where the mean is the current point”, which seems reasonable, but then you realize that the question has been begged. A normal distribution has two parameters: the mean and the standard deviation. The standard deviation corresponds to “how big a step should we typically try?” If the deviation is too large then we will step from high-probability regions to low-probability regions frequently, which means that we discard a lot of candidates, which wastes time. If it is too small then we get “stuck” in a high-probability region and produce a lot of samples close to each other, which is also bad.

Basically, we have a “tuning parameter” in the standard deviation and it is not obvious how to choose it to get the right combination of good performance and uncorrelated samples.

These last two problems lead us to ask an important question: **is there information we can obtain from the weight function that helps us choose a consistently better candidate?** That would lower the time to convergence and might also result in fewer rejections when we’ve gotten to a high-probability region.

I’m going to sketch out one such technique in this episode, and another in the next.

As I noted above, Metropolis is frequently used to sample points from a high-dimensional distribution; to make it easier to understand, I’ll stick to one-dimensional cases here, but imagine that instead of a simple curve for our PDF, we have a complex multidimensional surface.

Let’s use as our motivating example the mixture model from many episodes ago:

Of course we can sample from this distribution directly if we know that it is the sum of two normal distributions, but let’s suppose that we don’t know that. We just have a function which produces this weight. Let me annotate this function to say where we want to go next if the current sample is in a particular region.
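Since the original chart isn’t reproduced here, a made-up stand-in for such a two-peak weight function might look like this in Python; the specific means and deviations are my own invention for illustration, not the ones from the series.

```python
from math import exp, sqrt, pi

def normal_weight(x, mean, stddev):
    # The familiar bell curve; the normalization constant is
    # irrelevant for Metropolis, but harmless.
    return exp(-((x - mean) ** 2) / (2 * stddev ** 2)) / (stddev * sqrt(2 * pi))

def mixture_weight(x):
    # A tall peak on the left, a smaller peak on the right, and a
    # valley in between.
    return normal_weight(x, -1.0, 0.5) + 0.5 * normal_weight(x, 2.0, 0.8)
```

The weight is largest at the tall peak, smaller at the short peak, smaller still in the valley between them, and nearly zero far away from both.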

I said that we could use the derivative to help us, but it is very unclear from this diagram how the derivative helps:

- The derivative is small and positive in the region marked “go hard right” and in the immediate vicinity of the two peaks and one valley.
- The derivative is large and positive in the “slight right” region and to the left of the tall peak.
- The derivative is large and negative in the “slight left” region and on the right of the small peak.
- The derivative is small and negative in the “hard left” region and in the immediate vicinity of the peaks and valley.

No particular value for the derivative clearly identifies a region of interest. It seems like we cannot use the derivative to help us out here; what we really want is to move away from small-area regions and towards large-area regions.

Here’s the trick.

Ready?

I’m going to graph the *log* of the weight function below the weight function:

*Now look at the slope of the log-weight*. It is very positive in the “move hard right” region, and becomes more and more positive the farther left we go! Similarly in the “move hard left” region; the slope of the log-weight is very negative, and becomes more negative to the right.

In the “slight right” and “slight left” regions, the slope becomes more moderate, and when we are in the “stay around here” region, the slope of the log-weight is close to zero. *This is what we want.*

(ASIDE: Moreover, this is even more what we want because in realistic applications we often *already* have the log-weight function in hand, not the weight function. Log weights are convenient because you can represent arbitrarily small probabilities with “normal sized” numbers.)

We can then use this to modify our candidate proposal distribution as follows: rather than using a normal distribution centered on the *current* point to propose a candidate, **we compute the derivative of the log of the weight function using dual numbers**, and *we use the size and sign of the slope to tweak the center of the proposal distribution*.

That is, if our current point is far to the left, we see that the slope of the log-weight is very positive, so we move our proposal distribution some amount to the right, and then we are more likely to get a candidate value that is in a higher-probability region. But if our current point is in the middle, the slope of the log-weight is close to zero so we make only a small adjustment to the proposal distribution.

(And again, I want to emphasize: realistically we would be doing this in a high-dimensional space, not a one-dimensional space. We would compute the *gradient* — the direction in which the slope increases the most — and head that direction.)

If you work out the math, which I will not do here, the difference is as follows. Suppose our non-normalized weight function is *p*.

- In the plain-vanilla proposal algorithm we would use as our candidate distribution a normal centered on *current* with standard deviation *s*.
- In our modified version we would use as our candidate distribution a normal centered on *current + (s² / 2) · ∇log(p(current))*, with standard deviation *s*.

Even without the math to justify it, this should seem reasonable. The typical step in the vanilla algorithm is on the order of the standard deviation; we’re making an adjustment towards the higher-probability region of about half a step if the slope is moderate, and a small number of steps if the slope is severe; the areas where the slope is severe are the most unlikely areas so we need to get out of them quickly.

If we do this, we end up doing more math on each step (to compute the log if we do not have it already, and the gradient) but **we converge to the high-probability region much faster.**
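Here is a sketch of the adjusted proposal in Python, with a minimal dual-number type standing in for a real automatic-differentiation library. The names are my own, the *s²/2* scaling follows the standard Langevin discretization, and the accept/reject step with its Hastings correction is deliberately omitted.

```python
import random

class Dual:
    """A minimal dual number: a value plus a derivative, with just
    enough arithmetic to differentiate simple log-weight functions."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _wrap(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual._wrap(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        other = Dual._wrap(other)
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)
    __rmul__ = __mul__

    def __neg__(self):
        return Dual(-self.value, -self.deriv)

    def __sub__(self, other):
        return self + (-Dual._wrap(other))

def grad(log_weight, x):
    # Evaluate on the dual number x + 1ε; the epsilon part of the
    # result is the derivative of the log-weight at x.
    return log_weight(Dual(x, 1.0)).deriv

def langevin_proposal_mean(log_weight, current, s):
    # Nudge the proposal center along the slope of the log-weight;
    # s is the proposal standard deviation.
    return current + (s * s / 2.0) * grad(log_weight, current)

def langevin_step(log_weight, current, s):
    # Propose a candidate from a normal centered on the nudged mean.
    # (A full sampler would also apply the Hastings correction to the
    # accept/reject test; that is omitted in this sketch.)
    return random.gauss(langevin_proposal_mean(log_weight, current, s), s)
```

For a standard-normal log-weight, log p(x) = −x²/2, the slope at x = 3 is −3, so with s = 1 the proposal center moves from 3 to 1.5: towards the high-probability region, as desired.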

If you’ve been following along closely you might have noticed two issues that I seem to be ignoring.

First, we have not eliminated the need for the user to choose the tuning parameter *s*. Indeed, this only addresses one of the problems I identified earlier.

Second, the Metropolis algorithm requires for its correctness that the proposal distribution *not ever be biased in one particular direction!* But the whole *point* of this improvement is to bias the proposal towards the high-probability regions. Have I pulled a fast one here?

I have, but we can fix it. I mentioned in the original series that I would be discussing the Metropolis algorithm, which is the oldest and simplest version of this algorithm. In practice we use a variation on it called Metropolis-Hastings which adds a correction factor to allow non-symmetric proposal distributions.

The mechanism I’ve sketched out today is called the Metropolis-adjusted Langevin algorithm, and it is quite interesting. It turns out that this technique of “walk in the direction of the gradient plus a random offset” is also how physicists model the movement of particles in a viscous fluid where the particle is being jostled by random molecule-scale motions in the fluid. (That is, by Brownian motion.) It’s nice to see that there is a physical interpretation of what would otherwise be a very abstract algorithm to produce samples.

**Next time on FAIC:** The fact that we have a connection to a real-world physical process here is somewhat inspiring. In the next episode I’ll give a sketch of another technique that uses ideas from physics to improve the accuracy of a Metropolis process.


Welcome to this special bonus episode of *Fixing Random,* the immensely long blog series where I discuss ways to add probabilistic programming features into C#. I ran into an interesting problem at work that pertains to the techniques that we discussed in this series, so I thought I might discuss it a bit today.

Let’s suppose we have three forts, Fort Alpha, Fort Bravo and Fort Charlie at the base of a mountain. They are constantly passing messages back and forth by carrier pigeon. Alpha and Charlie are too far apart to fly a pigeon directly, so messages from Alpha to Charlie first go from Alpha to Bravo, and then on from Bravo to Charlie. (Similarly, messages from Charlie to Alpha go via Bravo, but we’ll not worry about that direction for the purposes of this discussion.)

Carrier pigeons are of course an unreliable mechanism for passing messages, so let’s model this as a Bernoulli process; every time we send a message from Alpha to Bravo, or Bravo to Charlie, we flip an unfair coin. Heads, the bird gets through, tails it gets lost along the way.

From this we can predict the reliability of passing a message from Alpha to Charlie via Bravo; the probability of failure is the probability that A-B fails or B-C fails (or both). Equivalently, the probability of success is the probability of A-B succeeding and B-C succeeding. This is just straightforward, basic probability; if the probability of success from A-B is, say, 95% and B-C is 96%, then the probability of success from A to C is their product, around 91%.
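That arithmetic, as a trivial Python check (the variable names are mine):

```python
p_ab = 0.95          # probability the Alpha-to-Bravo pigeon gets through
p_bc = 0.96          # probability the Bravo-to-Charlie pigeon gets through
p_ac = p_ab * p_bc   # both legs must succeed independently
print(round(p_ac, 3))  # → 0.912
```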

**Aside:** Note that I am assuming in this article that pigeons are passing from Bravo to Charlie even if a pigeon failed to arrive from Alpha; I’m *not* trying to model in this system constraints like “pigeons only fly from Bravo to Charlie when one arrived from Alpha”.

Now let’s add an extra bit of business.

We have an observer on the mountaintop at Fort Delta overlooking Alpha, Bravo and Charlie. Delta has some high-power binoculars and is recording carrier pigeon traffic from Alpha to Bravo and Bravo to Charlie. But here’s the thing: Delta is an unreliable observer, because observing carrier pigeons from a mountaintop is inherently error-prone; *sometimes Delta will fail to see a pigeon. *Let’s say that 98% of the time, Delta observes a pigeon that is there, and Delta never observes a pigeon that is not there.

Every so often, Delta issues a report: “*the channel from Alpha to Charlie is healthy*” if Delta has observed a pigeon making it from Alpha to Bravo *and* a pigeon making it from Bravo to Charlie; but if Delta has failed to observe either a pigeon going from Alpha to Bravo *or* a pigeon going from Bravo to Charlie, then Delta reports “*the channel from Alpha to Charlie is unhealthy*”.

The question now is: suppose Delta issues a report that the Alpha-Charlie channel is unhealthy. *What is the probability that a pigeon failed to get from Alpha to Bravo, and what is the probability that a pigeon failed to get from Bravo to Charlie?* Each is surely much higher than the 5-ish percent chance that is our prior.

We can use the gear we developed in the early part of my Fixing Random series to answer this question definitively, but before we do, **make a prediction.** If you recall episode 16, you’ll remember that you can have a 99% accurate test but the posterior probability of having the disease that the test diagnoses is only 50% when you test positive; *this is a variation on that scenario*.

Rather than defining multiple enumerated types as I did in earlier episodes, or even using bools, let’s just come up with a straightforward numeric encoding. We’ll say that **1** represents “a pigeon failed to make the journey”, and **0** means “a pigeon successfully made the journey” — if that seems backwards to you, I agree but it will make sense in a minute.

Similarly, we’ll say that **1** represents “Delta’s attempt to observe a pigeon has failed”, and **0** as success, and finally, that **1** represents Delta making the report “the channel is unhealthy” and **0** represents “the channel is healthy”.

The reason I’m using **1** in all these cases to mean “something failed” is because I want to use **OR** to get the final result. Let’s build our model:

    var ab = Bernoulli.Distribution(95, 5);
    var bc = Bernoulli.Distribution(96, 4);
    var d = Bernoulli.Distribution(98, 2);

- 5% of the time, `ab` reports **1**: the pigeon failed to get through.
- 4% of the time, `bc` reports **1**: the pigeon failed to get through.
- 2% of the time, `d` reports **1**: it fails to see a pigeon that is there.

Now we can ask and answer our question about the posterior: what do we know about the posterior distribution of pigeons making it from Alpha to Bravo and Bravo to Charlie? We’ll sample from `ab` and `bc` once to find out if a pigeon failed to get through, and then ask whether Delta failed to observe the pigeons.

What is the condition that causes Delta to report that the channel is unhealthy?

- the pigeon from Alpha to Bravo failed, OR
- the pigeon from Bravo to Charlie failed, OR
- Delta failed to observe Alpha’s pigeon, OR
- Delta failed to observe Bravo’s pigeon.

We observe that Delta reports that the channel is unhealthy, so we’ll add a **where** clause to condition the result, and then print out the resulting posterior joint probability:

    var result = from pab in ab
                 from pbc in bc
                 from oab in d
                 from obc in d
                 let report = pab | pbc | oab | obc
                 where report == 1
                 select (pab, pbc);
    Console.WriteLine(result.ShowWeights());

This is one of several possible variations on the “**noisy OR distribution**” — that is, the distribution that we get when we **OR** together a bunch of random Boolean quantities, but where the **OR** operation itself has some probabilistic “noise” attached to it.

**Aside:** That’s why I wanted to express this in terms of OR-ing together quantities; of course we can always turn this into a “noisy AND” by doing the appropriate arithmetic, but typically this distribution is called “noisy OR”.

We get these results:

    (0, 0): 11286 -- about 29%
    (0, 1): 11875 -- about 30%
    (1, 0): 15000 -- about 39%
    (1, 1):   625 -- less than 2%
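These weights can be cross-checked by brute-force enumeration. Here is a Python sketch (the series’ C# types aren’t reproduced here); each weight comes out as exactly 320 times the corresponding weight above, since this enumeration is over 100⁴ equally weighted worlds, so the ratios — and therefore the percentages — agree.

```python
from collections import Counter
from itertools import product

# Integer weights out of 100; index 0 = success, index 1 = failure.
ab = [95, 5]   # pigeon from Alpha to Bravo
bc = [96, 4]   # pigeon from Bravo to Charlie
d = [98, 2]    # Delta observing any one pigeon

posterior = Counter()
for pab, pbc, oab, obc in product([0, 1], repeat=4):
    weight = ab[pab] * bc[pbc] * d[oab] * d[obc]
    report = pab | pbc | oab | obc   # the noisy OR
    if report == 1:                  # condition on an "unhealthy" report
        posterior[(pab, pbc)] += weight

total = sum(posterior.values())
for outcome in sorted(posterior):
    print(outcome, posterior[outcome], round(100 * posterior[outcome] / total), "%")
```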

Remember that **(0, 0)** means that pigeons *did* make it from Alpha to Bravo and from Bravo to Charlie; in every other case we had at least one failure.

It should not be too surprising that **(1, 1)** — both the Alpha and Bravo pigeons failed simultaneously — is the rarest case; after all, the prior probability of both failing is only 5% × 4% = 0.2%, so even after conditioning on an “unhealthy” report it remains the least likely explanation.

But the possibly surprising result is: when Delta reports a failure, there is a 29% chance that this is a false positive report of failure, and in fact it is almost as likely to be a false positive as it is to be a dropped packet, I mean *lost pigeon*, between Bravo and Charlie!

Put another way, if you get an “unhealthy” report, 3 times out of 10, the report is wrong and you’ll be chasing a wild goose. Just as we saw with false positives for tests for diseases, *if the test failure rate is close to the disease rate of the population, then false positives make up a huge percentage of all positives.*

My slip-up there of course illuminates what you figured out long ago; all of this whimsy about forts and pigeons and mountains is just a silly analogy. Of course what we really have is not three forts connected by two carrier pigeon routes, but ten thousand machines and hundreds of routers in a data center connected by network cabling in a known topology. Instead of pigeons we have trillions of packets. Instead of an observer in a fort on a mountaintop, we have special supervising software or hardware that is trying to detect failures in the network so that they can be diagnosed and fixed. Since the failure detection system is itself part of the network, it *also* is unreliable, which introduces “noise” into the system.

The real question at hand is: given prior probabilities on the reliability of each part of the system including the reporting system itself, *what are the most likely posterior probabilities that explain a report that some part of the network is unhealthy*?

This is a highly practical and interesting question to answer because it means that network engineers can quickly narrow down the list of possible faulty components given a failure report to the most likely culprits. **The power of probabilistic extensions in programming languages is that we now have the tools that we need to concisely express both those models and the observations that we need explanations for, and then generate the answers automatically.**

Of course I have just given some of the flavor of this problem space and I’m sure you can imagine a dozen ways to make the problem more interesting:

- What if instead of a coin flip that represents “dropped” or “delivered” for a packet, we had a more complex distribution on every edge of the network topology graph — say, representing average throughput?
- How does moving to a system with continuous probabilities change our analysis and the cost of producing that analysis?
- And so on.

We can use the more complex tools I developed in my series, like the Metropolis method, to solve these harder problems; however, I think I’ll leave it at that for now.

Unlike our other local waterfowl that are willing to approach humans and dive from the water surface — cormorants, gulls, mallards, mergansers, loons and the like — kingfishers are skittish and dive at speed from the trees; they’re fast and hard to get in focus. My first several attempts ended up like this. (Click on photos for larger versions.)

and this

Not terrible for a first attempt, but I wanted to get a nice sharp closeup. I tried for two weeks and did not manage to get anything better, which was quite frustrating. So I decided that on my last day of vacation I would get up at 6 AM and take a kayak up the river just after sunrise. I figured if I was slow and careful I might be able to get closer.

Sure enough, I immediately saw a bird, but of course it saw me and took off upriver:

It stopped in a tree, again just far enough away that I could not get a good shot:

And then took off again when I got close:

This repeated several times, always going upriver. I got a lot of blurry photos of the back side of the damn bird.

Finally we got to a point where I could not go any further; there were several trees fallen entirely across the river, and the bird was perched on one of them *with a branch in the way and in a deep shadow*. (The overall brightness of this image is because I overexposed it to try to get the detail of a dark bird in a shadow.)

The sharp-eyed amongst you may have already noticed the larger problem here, but I did not. I very carefully and slowly paddled to where I could get an unobstructed view of the bird, and I finally got my close up…

…**OF A GREEN HERON. WHAT THE HECK.**

Now, don’t get me wrong; I am happy that there is also a family of green herons living at the river, but where did the switch happen? When I reviewed the several hundred shots I’d taken so far I discovered that in fact the bird I’d seen originally was this bird:

A green heron, and probably the same green heron.

I had been chasing at least two birds up the river. In fact, I suspect I was chasing two green herons and a male kingfisher, but I did not realize until much later that it was not all the same bird.

Since I could go no further I figured I would start over. I went back to the mouth of the river, where a bunch of kingfishers, males and females alike, were dive-bombing each other; maybe for fun, maybe to settle some territorial dispute, I don’t know. I watched that for a while and managed to get a few slightly better images:

I think I can do better if I get the chance next year, but that will have to do for this year.


UPDATE: Mystery solved! See below.

On August 4th at about 20 minutes past 10 PM Eastern Daylight Time I did this 30 second exposure. I am facing south. The bright object in the middle is Jupiter; the orange star below and to its right is Antares. What we have here is the International Space Station flying (from my perspective that night) through the “head” of the constellation of Scorpius from right to left. (Click on the images for a larger view.)

I’ve tweaked the levels in post slightly, for clarity, but basically this is the image I was hoping to get.

I then quickly shifted the camera over to point towards Sagittarius and did an identical 30 second exposure. Again, I’ve tweaked the levels:

And again the bright object is Jupiter. The triangle of stars in the middle of the very bottom of the image is the “stinger” of Scorpius. The M7 cluster to its left is slightly blocked by the tree, and you can see the “lid of the teapot” of Sagittarius following the line of the tree, with the Milky Way emerging as the steam from the teapot.

The ISS is still traveling right to left — west to east — and you can clearly see that the path is much shorter than the previous 30 second exposure because the left end of its travel is where it passed into the shadow of the earth; sunset comes later for the ISS because of its great altitude.

That is again exactly what I expected. The part that I am completely flummoxed by is: **what are the two parallel tracks to the left of the ISS going north/south?!?**

- I took a third image after this one of the same part of the sky and there is no streak on it of any kind.
- It could be a **camera malfunction** — but I have never seen such a malfunction. **UPDATE**: My friend Larry was taking a long exposure at the same time and also captured this exact same streak, so it is definitely not a camera malfunction.
- It could be an **atmospheric phenomenon**, like a jet contrail being lit up by something. But it does not look like any contrail or cloud I’ve ever seen, and it does not show up in the third image.
- It could be an **airplane**, but airplanes typically blink in long exposures, or can be seen to have both red and green lights. Also, if it were a single airplane then I would expect the parallel lines to start and end at the same place. And I would expect to see it in the third image.
- It could be a **pair of satellites in a polar orbit**, but I checked a satellite tracking app and it identified nothing in that neighbourhood except the ISS at that time. (However, I only checked the one app; probably I should check another.) And those “satellites” seem to be in very similar orbits, which seems unlikely. **UPDATE**: My friend Gord, son of the aforementioned Larry, suggests that it may have been satellites in the Starlink constellation, which travel in pairs. This is now my best hypothesis. I’ll see if I can get some data on Starlink orbits.
- It could be a **meteor that has split into two parts** that are traveling parallel, and just happened to be in my shot as the ISS entered the shadow of the Earth. Which seems like an extremely unlikely coincidence. Yes, there is a lot of meteor activity in early August, but I’m not buying it.

I have never seen anything like this before. We genuinely have an Unidentified Flying Object here, in that there is some object which is flying but not identified; I rather doubt it is aliens.

Does anyone with more experience than me in photographing satellites have any insight into what I’ve captured here?

Mystery solved by my friend Gord:

The recent Starlink launch put a constellation of 60 satellites into a low orbit, and they’re still all bunched up so it would be common to have two in frame at the same time. That orbit passed right over the Great Lakes region at 10:20 the night I took that exposure, and the direction corresponds as well. Thanks Gord!


I have some interesting news regarding my recently ended “Fixing Random” series, but before I get into that, I’ll spend a couple of episodes sharing some of my favourite shots from this year.

To start with, here’s a shot from last year; my eight-year-old friend Junior Naturalist Ada found a baby snapping turtle. (Click on images for larger versions.)

We looked all over for the mama snapping turtle but did not find her; I am pleased to report that this year we certainly did, just a couple bends up the river.

Isn’t she lovely? Let’s zoom in on that face.

You just want to snuggle her right up to the point where she bites your arm off, am I right?

**Coming soon on FAIC:** weird bugs, lovely birds, and some astronomical phenomena that I do not understand.


First, I want to **summarize**; second, I want to describe **a whole lot of interesting stuff that I did not get to**; and third, I want to give **a selection of papers** and further reading that inspired me in this series.

If you come away with nothing else from this series, the key message is: probabilistic programming is important, it is too hard today, and we can do a lot better than we are doing. We need to build better tools that leverage the type system and support line-of-business programmers who need to do probabilistic work, the same way that we built better tools for programmers who needed to use sequences, or databases, or asynchrony.

- We started this series with me complaining about `System.Random`, hence the name of the series. Even though some of the implementation details have finally improved after only some decades of misleading developers, we are still dealing with random numbers like it was 1972.
- The abstraction that we’re missing is to make “value drawn from a random distribution” a part of the type system, the same way that “function that returns a value”, or “sequence of values”, or “value that might be null” is part of the type system.
- The type we want is something like a sequence enumeration, but instead of a “move next” operation, we’ve got a “sample” operation.
- If we stick to simple distributions where the support is finite and small and the “weights” are integers, and restrict ourselves to pure functions, we can create new distributions from old ones using the same operations that we use to create new sequences from old ones: the LINQ operators `Select`, `SelectMany` and `Where`.
- Moreover, we can compute distributions exactly at runtime, without doing rejection sampling or other clever techniques.
- And we can even use query comprehension syntax, which makes me happy.
- From this we can see that probability distributions are monads; **P(A)** is just `IDistribution<A>`.
- We also see that conditional probabilities **P(B given A)** are just `Func<A, IDistribution<B>>` — they are likelihood functions.
- The `SelectMany` operation on the probability monad lets us combine a likelihood function with a prior probability. The `Where` operation lets us compute a posterior given a prior, a likelihood and an observation.
- This is an extremely useful result; even domain experts like doctors often do not have good intuitions about how an observation should change our opinions about the probability of an event having occurred. A positive result on a test of a rare disease may be only weak evidence if the test failure rate is close to the disease rate.
- Can we put these features in the language as well as the type system? Abusing LINQ is clever but maybe not the best from a usability perspective.
- We could in fact embed `sample` and `condition` operators in the C# language, just as we embedded `await`. We can then write an imperative workflow, and have the compiler generate a method that returns the correct probability distribution, just as `await` allows us to write an asynchronous workflow and the compiler generates a method that returns a task!
- Unfortunately, real-world probability problems are seldom discrete, small, and completely analyzable; they’re more often continuous and approximated.
- Implementing `Sample` efficiently on arbitrary distributions turns out to be a hard problem.
- We can use our tools to generate Markov chains.
- We can use Markov chains to implement `Sample` using the Metropolis algorithm.
- If we have a continuous prior (like a mint that produces coins of a certain quality with a certain probability) and a discrete likelihood (like the probability of a coin flip coming up heads), we can use this technique to compute a continuous posterior given an observation of one or more flips.
- This is a very useful technique in many real-world applications.
- Computing the expected value of a distribution given a function is a tricky problem in and of itself, but there are a variety of techniques we can use.
- And if we can do that, we can solve some high-dimensional integral calculus problems!

That was rather a lot, and we still did not get to everything I wanted to.

By far the biggest thing that I did not get to, and that I may return to in another series of posts, is: **the connection between await as an operator and sample as an operator is deeper than you might think.**

As I noted above, you can put `sample` and `condition` operations in a language and have the compiler build a method that, when run, generates a simple, discrete distribution. But it turns out **we can actually do a pretty good job of dealing with sample operators on non-discrete distributions as well,** by having the compiler be smart about using some of the techniques that we discussed in this series for continuous distributions.

What you really need is the ability to pick a sample, run a little bit of a routine, remember the result, back up a bit and try a different sample, and so on; from this **we can build a distribution of program traces**, and from that we can build an approximation of the distribution of the output of a probabilistic method!

This kind of control flow is tricky; it’s sort of a generalization of coroutines where you’re allowed to re-run code that you’ve run before, but with *different* values for the variables to see what happens.

Obviously it is crucially important that the methods be pure! It’s also crucially important that you spend most of your time exploring high-likelihood control flows, because the number of unlikely control flows is nigh infinite. If this sounds a bit like training up an ML model and then using that model in production later, that’s because it is basically the same thing, but applied to programs.
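
Since I never got to show this machinery in the series, here is a deliberately tiny Python sketch of the trace idea — all the names are my own, and a real implementation would use multi-shot continuations rather than re-running from the top. A probabilistic program is a pure function that asks for samples, and we compute its exact output distribution by repeatedly re-running it, each time backing up to a sample point and forcing a different choice:

```python
from fractions import Fraction

def run_with_tape(program, tape):
    """Run `program`, forcing its sample() calls to follow `tape`. Returns
    ('done', result, prob) if the tape covered every call, or
    ('choice', options) listing the outcomes at the first un-taped call."""
    i = 0
    prob = Fraction(1)

    class NeedChoice(Exception):
        pass

    def sample(dist):  # dist: dict mapping outcome -> probability
        nonlocal i, prob
        if i < len(tape):
            choice = tape[i]
            prob *= dist[choice]
            i += 1
            return choice
        raise NeedChoice(list(dist))

    try:
        return ('done', program(sample), prob)
    except NeedChoice as n:
        return ('choice', n.args[0])

def infer(program):
    """Exact inference by exhaustively re-running the program: every re-run
    backs up to an earlier sample() call and tries a different value.
    The program must be pure -- re-running it must have no side effects."""
    results = {}
    tapes = [()]
    while tapes:
        tape = tapes.pop()
        out = run_with_tape(program, tape)
        if out[0] == 'done':
            _, r, p = out
            results[r] = results.get(r, Fraction(0)) + p
        else:
            tapes.extend(tape + (c,) for c in out[1])
    return results

# Two fair coin flips; the result is the number of heads.
def two_flips(sample):
    coin = {'H': Fraction(1, 2), 'T': Fraction(1, 2)}
    return (sample(coin) == 'H') + (sample(coin) == 'H')

print(infer(two_flips))  # each head count with its exact probability
```

Re-running from the start is the simplest possible way to “back up a bit”; multi-shot continuations make the same exploration efficient by resuming from the choice point instead of replaying the whole prefix.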

I know what you’re thinking: *that sounds bonkers.* But — here is the thing that really got me motivated to write this series in the first place —** we actually did it.**

A while back my colleague Michael and I hacked together (ha ha) an implementation of multi-shot continuations for *Hack*, our PHP variant, and showed that we could in fact do probabilistic workflows where we have a distribution, and we sample from it some number of times, and trace out what happens in the program as we do so.

I then went on to work on other projects, but in the meanwhile **a team of people who understand statistics far, far better than I do actually built an industrial-strength probabilistic control flow language with a sample operator.**

You can read all about it in their paper HackPPL: A Universal Probabilistic Programming Language.

Another point that I very much wanted to get to in this series but did not is: **we can do the same thing in C#, and in fact we can do it today**.

The C# team added the ability to return types other than `Task<T>` from asynchronous workflows; and it turns out you only need to abuse that feature a small amount to convince C# to “go back a bit” in the workflow — back to the previous `await` operation, which becomes a stand-in for `sample` — and re-run portions of it with different sampled values. **The C# team could add probabilistic workflows to C# tomorrow.**

The C# team has historically done a great job of finding useful monads and embedding them into the control flow of the language; monadic probabilistic workflows with multi-shot continuations could be the next one. *How about it, team?*

Finally, here’s a very incomplete list of papers and web sites that were inspiring to me in writing this series. I learned a lot, and there is plenty more to learn; I’ve just scratched the surface here.

- The idea that we can treat probability distributions like LINQ queries was given to me by my friend and director Erik Meijer; his fun and accessible paper Making Money Using Math hits many of the same points I did in this series, but I did it with a lot more code.
- The design of coroutines in Kotlin was a great inspiration to me; they’ve done a great job of making features that you would normally think of as being part of the language proper, like `yield` and `await`, into library methods. The first thing I did in my multi-shot coroutine hack was verify that I could simulate those features. (I was also very pleased to discover that much of this work was implemented by my old colleague Nikov from the C# team!)
- An Introduction to Probabilistic Programming is a book-length work that uses a Lisp-like language with sample and observe primitives.
- Church is another Lisp-like language often used in academic studies of PPLs.
- The webppl.org web site has a Javascript-based implementation of a probabilistic programming language and a lot of great supporting information at dippl.org.

The next few papers are more technical.

- Build Your Own Probability Monads is a good overview for the Haskell programmers out there, as are Practical Probabilistic Programming With Monads and Stochastic Lambda Calculus and Monads of Probability Distributions.
- Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation is a good overview of how you can use MCMC techniques on program traces. A provably correct sampler for probabilistic programs goes into some of the correctness problems faced by PPL implementations, and Generating Efficient MCMC Kernels goes into some of the performance problems.
- Another fascinating area that I wanted to explore in this series is: we know it is bad enough to try to debug programs that have yields and awaits in them; how on earth do you debug programs that are actually running the same code paths maybe thousands of times when exploring the sample space of a workflow? Here’s an interesting paper on debugging probabilistic workflows.

And these are some good ones for the math:

- These MIT course notes on Bayesian updating of continuous priors were a good primer for my episodes on coin flipping. And the lecture notes MC Methods and Importance Sampling are what it says on the tin.
- Understanding the Metropolis-Hastings Algorithm gives a bunch of the underlying math.
- The enormous book Information Theory, Inference and Learning Algorithms was invaluable in getting me up to speed on math I had not done since my undergrad days.

There are many more that I am forgetting, and I’ll add to this list as I recall them.

All right, **that was super fun;** I am off on my annual vacation where I have no phone or internet, so I’m going to take a bit of a break from blogging; we’ll see you in a month or so for more fabulous adventures in coding!


Suppose we have a distribution of doubles, `p`, and a function `f` from double to double. We often want to answer the question *“what is the average value of f when it is given samples from p?”* This quantity is called the **expected value** of `f` with respect to `p`.

The obvious (or “naive”) way to do it is: take a bunch of samples, evaluate the function on those samples, take the average. Easy! However, this can lead to problems if there are “black swans”: values that are rarely sampled, but massively affect the value of the average when run through `f`. We would like to get a good estimate without having to massively increase the number of samples in our average.
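
To make the black-swan problem concrete, here is a small Python sketch; the distribution and function are invented for illustration. The true expected value is 1.0, but the naive estimator almost always reports 0.0 on a small sample, because it never sees the rare, enormous value:

```python
import random

# An invented distribution with a "black swan": almost every sample is 0.0,
# but one sample in ten thousand is 100.0.
def sample_p(rng):
    return 100.0 if rng.random() < 0.0001 else 0.0

def f(x):
    return x * x  # f(100.0) = 10000.0 dwarfs f(0.0) = 0.0

def naive_expected_value(rng, n):
    """The naive estimator: average f over n samples drawn from p."""
    return sum(f(sample_p(rng)) for _ in range(n)) / n

# True expected value: 0.0001 * 10000.0 = 1.0. A thousand samples will
# usually contain no swans at all, so the estimate is usually 0.0 -- and
# on the rare run that does see a swan, the estimate jumps to 10.0 or more.
print(naive_expected_value(random.Random(1), 1000))
```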

We developed two techniques to estimate the expected value:

First, abandon sampling entirely and do numerical integral calculus:

- Use quadrature to compute two areas: the area under `f(x)*p.Weight(x)` and the area under `p.Weight(x)` (which is the normalization constant of `p`).
- Their quotient is an extremely accurate estimate of the expected value.
- But we have to know what region to do quadrature over.
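
As a Python sketch of those bullet points (midpoint-rule quadrature; the example weight function is my own choice), note that dividing by the area under the weight function means `p` does not even need to be normalized:

```python
import math

def expected_value_by_quadrature(f, weight, lo, hi, n=1000):
    """Estimate the expected value of f under an (unnormalized) weight
    function by quadrature: the quotient of the area under f(x)*weight(x)
    and the area under weight(x), computed with the midpoint rule."""
    dx = (hi - lo) / n
    xs = [lo + (i + 0.5) * dx for i in range(n)]
    area_fw = sum(f(x) * weight(x) for x in xs) * dx
    area_w = sum(weight(x) for x in xs) * dx  # the normalization constant
    return area_fw / area_w

# Example: the expected value of x^2 under an unnormalized standard normal
# weight exp(-x^2/2); the exact answer is the variance, 1.0.
est = expected_value_by_quadrature(lambda x: x * x,
                                   lambda x: math.exp(-x * x / 2),
                                   -10.0, 10.0)
print(est)  # very close to 1.0
```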

Second, use importance sampling:

- Find a helper distribution `q` whose weight is large where `f(x)*p.Weight(x)` bounds a lot of area.
- Use the naive algorithm to estimate the expected value of `x=>f(x)*p.Weight(x)/q.Weight(x)` from samples of `q`.
- That is *proportional* to the expected value of `f` with respect to `p`.
- We gave a technique for estimating the proportionality constant by sampling from `q` also.
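
Here is the same recipe as a Python sketch, using an invented spiky `p` and a hand-picked helper `q` (a normal distribution centered over the spike); the two running sums estimate the numerator and the proportionality constant from the very same samples:

```python
import math
import random

SPIKE, WIDTH = 0.5, 0.01

def weight_p(x):
    """Unnormalized weight of p: a narrow spike at x = 0.5 (invented)."""
    return math.exp(-((x - SPIKE) ** 2) / (2 * WIDTH ** 2))

def weight_q(x):
    """Normalized PDF of the helper q: a wider normal over the spike."""
    s = 2 * WIDTH
    return math.exp(-((x - SPIKE) ** 2) / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

def importance_estimate(f, rng, n):
    """Self-normalized importance sampling: estimate the expected value of
    f with respect to p, using only samples drawn from q."""
    total_fw = 0.0  # estimates the area under f(x) * weight_p(x)
    total_w = 0.0   # estimates the normalization constant of p
    for _ in range(n):
        x = rng.gauss(SPIKE, 2 * WIDTH)       # sample from q
        ratio = weight_p(x) / weight_q(x)
        total_fw += f(x) * ratio
        total_w += ratio
    return total_fw / total_w

est = importance_estimate(lambda x: x, random.Random(42), 10000)
print(est)  # close to 0.5, the mean of the spike
```

Because nearly every sample from `q` lands where `f(x)*p.Weight(x)` has its area, the estimate converges far faster than naive sampling from a wide distribution would.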

The problem with importance sampling then is finding a good `q`. We discussed some techniques:

- If you know the range, just use a uniform distribution over that range.
- Stretch and shift `p` so that the transformed PDF doesn’t have a “black swan”, but the normalization constant is the same.
- Use the Metropolis algorithm to generate a helper PDF from `Abs(f*p)`, though in my experiments this worked poorly.
- If we know the range of interest, we can use the VEGAS algorithm. It makes cheap, naive estimates of the area of subranges, and then uses that information to gradually refine a piecewise-uniform helper PDF that targets spikes and avoids flat areas of `f*p`.
- However, the VEGAS algorithm is complex, and I did not attempt to implement it for this series.
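
For reference, the Metropolis technique mentioned above fits in a few lines; this Python sketch (a simple random-walk proposal, with parameters I invented for the example) generates a chain of samples given only an unnormalized weight function:

```python
import math
import random

def metropolis_samples(weight, start, step, n, rng):
    """Random-walk Metropolis: yields a Markov chain of n samples whose
    long-run distribution is proportional to the unnormalized weight."""
    x, w = start, weight(start)
    for _ in range(n):
        candidate = x + rng.gauss(0.0, step)
        cw = weight(candidate)
        # Accept with probability min(1, cw / w); otherwise stay put.
        if cw >= w or rng.random() < cw / w:
            x, w = candidate, cw
        yield x

# Sample a standard normal given only its unnormalized weight exp(-x^2/2).
rng = random.Random(123)
xs = list(metropolis_samples(lambda x: math.exp(-x * x / 2), 0.0, 1.0, 20000, rng))
mean = sum(xs) / len(xs)
variance = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, variance)  # near 0 and 1 for the standard normal
```

Note that the chain never needs the normalization constant; only ratios of weights appear in the acceptance test.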

The question you may have been asking yourself these past few episodes is:

**If quadrature is an accurate and cheap way to estimate the expected value of f over samples from p then why are we even considering doing sampling at all? Surely we typically know at least approximately the range over which f*p has some area. What’s the point of all this?**

Quadrature just splits up the range into some number — say, a thousand — equally-sized pieces, evaluates `f*p` on each of them, and takes the average. That sure seems cheaper and easier than all this mucking around with sampling. Have I just been wasting your time these past few episodes? And why has there been so much research and effort put into finding techniques for estimating expected value?

This series is called “Fixing Random” because the built-in base class library tools we have in C# for representing probabilities are weak. I’ve approached everything in this series from the perspective of *“I want to have an object that represents probabilities in my business domain, and I want to use that object to solve my business problems”.*

*“What is the expected value of this function given this distribution?”* is a very natural question to ask when solving business problems that involve probabilities, and as we’ve seen, you can answer that question by simulating integral calculus through quadrature.

But, as I keep on saying: **things equal to the same are equal to each other.** Flip the script. Suppose our business domain involves *solving integral calculus problems*. And suppose there is an integral calculus problem that we *cannot* efficiently solve with quadrature. What do we do?

- We can solve expected value problems with integral calculus techniques such as quadrature.
- We can solve expected value problems with sampling techniques.
- Things equal to the same are equal to each other.
- Therefore **we can solve integral calculus problems with sampling techniques.**

That is why there has been so much research into computing expected values: the expected value is the *area* under the function `f(x)*p.Weight(x)`, so **if we can compute the expected value by sampling, then we can compute that area** and solve the integral calculus problem *without* doing quadrature!

I said above *“if quadrature is accurate and cheap”*, but there are *many* scenarios in which quadrature is not a cheap way to compute an area.

What’s an example? Well, let’s generalize. So far in this series I’ve assumed that `f` is a `Func<double, double>`. What if `f` is a `Func<double, double, double>` — a function from pairs of doubles to double? That is, `f` is not a line in two dimensions, it is a surface in three.

Let’s suppose we have `f` being such a function, and we would like to solve a calculus problem: what is the volume under `f` on the range (0,0) to (1,1)?

We could do it by quadrature, but remember, in my example we split up the range 0-to-1 into a thousand points. If we do quadrature in two dimensions with the same granularity of 0.001, that’s a million points we have to evaluate and sum. If we only have computational resources to do a thousand points, then we have to have a granularity of around 0.03.

What if the function is zero at most of those points? We could then have a really crappy estimate of the total area because our granularity is so low.

We now reason as follows: take a two-dimensional probability distribution. Let’s say we have the *standard continuous uniform implementation* of `IWeightedDistribution<(double, double)>`.

All the techniques I have explored in this series work equally well in two dimensions as one! So we can use those techniques. Let’s do so:

- What is the expected value of `f` when applied to samples from this distribution?
- It is equal to the volume under `f(x,y)*p.Weight((x,y))`.
- But `p.Weight((x,y))` is always 1.0 on the region we care about; it’s the standard continuous uniform distribution, after all.
- Therefore **the estimated expected value of** `f` when evaluated on samples from `p` is an estimate of the volume we care about.
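
That chain of reasoning fits in a few lines of Python; `f` here is an invented toy surface whose volume over the unit square is exactly 1/4, and the uniform distribution’s weight of 1.0 makes the average of `f` at random points an estimate of that volume:

```python
import random

def f(x, y):
    """An invented toy surface; the volume under it over the unit square is
    the double integral of x*y, which is exactly 1/4."""
    return x * y

def volume_by_uniform_sampling(rng, n):
    """The standard continuous uniform distribution has weight 1.0
    everywhere on the unit square, so the average of f at uniform random
    points estimates the volume under f."""
    return sum(f(rng.random(), rng.random()) for _ in range(n)) / n

est = volume_by_uniform_sampling(random.Random(7), 10000)
print(est)  # close to 0.25
```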

How does that help?

It doesn’t.

If we’re taking a thousand points by quadrature or a thousand points by sampling from a uniform distribution over the same range, it doesn’t matter. We’re still computing a value at a thousand points and taking an average.

But now here’s the trick.

Suppose we can find a *helper* distribution `q` that is large where `f(x,y)` has a lot of volume and very small where it has little volume.

We can then use importance sampling to compute a more accurate estimate of the desired expected value, and therefore the desired volume, because most of the points we sample from `q` are in high-volume regions. Our thousand points from `q` will give us a better estimate!

Now, up the dimensionality further. Suppose we’ve got a function that takes *three* doubles and goes to double, and we wish to know its hypervolume over (0, 0, 0) to (1, 1, 1).

With quadrature, we’re either doing a billion computations at a granularity of 0.001, or, if we can only afford to do a thousand evaluations, that’s a granularity of 0.1.

**Every time we add a dimension, either the cost of our quadrature goes up by a factor of a thousand, or the cost stays the same but the granularity is enormously coarsened.**

Oh, but it gets worse.

When you are evaluating the hypervolume of a 3-d surface embedded in 4 dimensions, there are a *lot* more points where the function can be zero! There is just so much *room* in high dimensions for stuff to be. **The higher the dimensionality gets, the more important it is that you find the spikes and avoid the flats.**

**Exercise:** Consider an n-dimensional cube of side 1. That thing always has a hypervolume of 1, no matter what n is.

Now consider a concentric n-dimensional cube inside it where the sides are 0.9 long.

- For a 1-dimensional cube — a line — the inner line is 90% of the length of the outer line, so we’ll say that 10% of the length of the outer line is “close to the surface”.
- For a 2-dimensional cube — a square — the inner square has 81% of the area of the outer square, so 19% of the area of the outer square is “close to the surface”.

At what dimensionality is more than 50% of the hypervolume of the outer hypercube “close to the surface”?

**Exercise:** Now consider an n-dimensional cube of side 1 again, and the concentric n-dimensional sphere. That is, a circle that exactly fits inside a square, a sphere that exactly fits inside a cube, and so on. The radius is 1/2.

- The area of the circle is pi/4 = 79% of the area of the square.
- The volume of the sphere is pi/6 = 52% of the volume of the cube.
- … and so on

At what value for n does the volume of the hypersphere become 1% of the volume of the hypercube?
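
If you want to check your answers to both exercises, a short Python sketch suffices; the closed form for the volume of an n-ball of radius r is pi^(n/2) r^n / Γ(n/2 + 1):

```python
import math

def fraction_near_surface(n):
    """Fraction of a unit n-cube outside the concentric cube of side 0.9."""
    return 1.0 - 0.9 ** n

def sphere_to_cube_ratio(n):
    """Volume of the inscribed n-ball (radius 1/2) over the unit n-cube."""
    return math.pi ** (n / 2) * 0.5 ** n / math.gamma(n / 2 + 1)

# Find the smallest n that answers each exercise.
n_surface = next(n for n in range(1, 100) if fraction_near_surface(n) > 0.5)
n_ball = next(n for n in range(1, 100) if sphere_to_cube_ratio(n) < 0.01)
print(n_surface, n_ball)
```

Both answers turn out to be surprisingly small single-digit dimensions, which is exactly the point of the exercises.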

In high dimensions, *any* shape that is *anywhere* on the *interior* of a hypercube is *tiny* when compared to the massive hypervolume near the cube’s *surface*!

That means: if you’re trying to determine the hypervolume bounded by a function that has large values somewhere *inside* a hypercube, the samples *must* frequently hit that important region where the values are big. If you spend time “near the edges” where the values are small, you’ll spend >90% of your time sampling irrelevant values.

That’s why importance sampling is so useful, and why we spend so much effort studying how to find helper distributions for computing expected values. **Importance sampling allows us to numerically solve multidimensional integral calculus problems with reasonable compute resources.**

**Aside**: Now you know why I said earlier that I misled you when I said that the VEGAS algorithm was designed to find helpful distributions for importance sampling. The VEGAS algorithm absolutely does that, but that’s not what it was *designed* to do; *it was designed to solve multidimensional integral calculus problems.* Finding good helper distributions is how it does its job.

**Exercise:** Perhaps you can see how we would extend the algorithms we’ve implemented on distributions of doubles to distributions of tuples of doubles; I’m not going to do that in this series; give it a shot and see how it goes!

**Next time on FAIC:** This has been one of the longest blog series I’ve done, and looking back over the last sixteen years, I have never actually *completed* any of the really big projects I started: building a script engine, building a Zork implementation, explaining Milner’s paper, and so on. I’m going to complete this one!

There is so much more to say on this topic; people spend their careers studying this stuff. But I’m going to wrap it up in the next couple of episodes by giving some final thoughts, a summary of the work we’ve done, a list of some of the topics I did not cover that I’d hoped to, and a partial bibliography of the papers and other resources that I read when doing this series.
