Bean Machine Retrospective, part 4

Did I actually build a compiler? Yes and no.

Traditionally we think of a compiler as a program which takes as its input the text of a program written in one language (C#, say), and produces as its output an “equivalent” program written in another language (MSIL, say).  A more modern conception is: a compiler is a collection of code analysis services, and one of those services is translation.

I made a great many assumptions when embarking on this project. Some of the more important ones were as follows:

Assumption: we must implement everything necessary for a source-to-source translation, but that is not enough. The primary use case of Beanstalk is outputting samples from a posterior, not outputting an equivalent program text. Translation may be a secondary use case, but we will not succeed by stopping there.

We’re building a compiler in order to build a better calculator.

Static and dynamic analysis

Traditional translating compilers such as the C# compiler perform a “static” analysis of the code; that is, they analyze only (1) the text of the program and (2) metadata associated with already-compiled artifacts such as libraries. The code is not executed in a static analysis.

“Dynamic” analysis is by contrast an analysis of code while the code is running. 

Assumption: correct static analysis of Python is impossible for at least two reasons. 

1) Python is an extremely dynamic language; it is difficult to definitively know what any program identifier refers to, for example, without actually running the program to find out. (The sketch after this list illustrates the problem.)

2) The model may depend upon non-stochastic model elements that are unknowable by static analysis. Model parameters or observed values can (and will!) be read from files or queried from databases, and those values must be present in the graph that we deduce.
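
To give a sense of the difficulty, here is a hypothetical fragment (not from any real model) in which no static analysis can resolve an identifier:

import random

def coin_bias():
  return 0.5

if random.random() < 0.5:
  coin_bias = lambda: 0.75  # the global name is rebound at runtime

# What does coin_bias refer to here? There is no way to know
# without running the program.
print(coin_bias())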

We must build a dynamic analyzer; we’re going to run a modified version of the model code to find out what it does.

Pure functions, acyclic graph

For our purposes we define a pure function as one that:

  • Always terminates normally; it does not go into an infinite loop or throw an exception
  • Is deterministic: it always produces the same output when given the same input
  • Its action does not depend on mutating global state
  • Its action does not mutate global state
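
For example, this little function is pure by all four criteria; it always terminates, always produces an equivalent distribution object, and neither reads nor writes global state:

@random_variable
def normal():
  return Normal(1.0, 1.0)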

Examples of impure functions are easily produced:

mean = 1.0
@random_variable
def normal():
  global mean
  mean = mean + 1.0 # UH OH
  return Normal(mean, 1.0)

Or, just as bad:

mean = tensor([1.0])
@random_variable
def normal():
  global mean
  mean[0] += 1.0
  return Normal(mean, 1.0)

Assumptions:

  • Model builders will write pure functions. In particular, model builders will not mutate a tensor “in place” such that the mutation causes a change of behaviour elsewhere in the model
  • Beanstalk is not required to detect impure functions, though it may
  • Beanstalk may memoize model functions (see the sketch after this list)
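
A memoizer is only a few lines of Python; this is a sketch of the idea, not the actual Beanstalk implementation:

def memoize(f):
  cache = {}
  def memoized(*args):
    # Assumes the arguments are hashable; purity makes caching safe.
    if args not in cache:
      cache[args] = f(*args)
    return cache[args]
  return memoized

Because model functions are assumed to be pure, memoizing them cannot change the meaning of a model, and it ensures that two calls such as normal(2) and normal(2) identify the same random variable.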

Also: a Bayesian network never has cycles; the distribution of each node depends solely on the distributions of its ancestor nodes.

Assumptions:

  • Model builders will write models where no random variable depends upon itself, directly or indirectly.
  • Beanstalk is not required to detect cycles (though in practice, it does)

Note that a model may contain recursion if the recursion terminates. This is legal:

@random_variable
def normal(n):
  if n <= 0:
    mean = 0.0
  else:
    mean = normal(n-1)
  return Normal(mean, 1.0)
queries = [ normal(0) ]
observations = { normal(2): tensor(0.0) }

Next time on FAIC: Having made these and other assumptions, I embarked upon an implementation. But before I can explain the implementation, I should probably answer some nagging questions you might have about what exactly that random_variable decorator does.

Bean Machine Retrospective, part 3

Introducing Beanstalk

Last time I introduced Bean Machine Graph, a second implementation of the PPL team’s Bayesian inference algorithm. We can compare and contrast the two implementations:

  • BMG emphasizes mechanisms; BMG programs are all about constructing a graph. BM emphasizes business logic
  • A point which I did not emphasize yesterday but which will come up again in this series: BMG requires that many node constructions be annotated with semantic types like “probability” or “positive real”. BM lacks all “type ceremony”.
  • BMG programs represent operations such as addition as verbose calls to node factories. BM concisely represents sums as “x + y”, logs as “x.log()”, and so on

In short, the BMG user experience is comparatively not a great experience for data scientists, in the same way that writing in machine code is not great for line-of-business programmers. We wished to automatically convert models written in Bean Machine into equivalent programs that construct a graph, thereby getting the improved inference performance of BMG.

OK… how?

A program which translates one language to another is called a compiler, and that’s where I came in. We code-named the compiler “Beanstalk”. I started working on Beanstalk in October of 2019 given this brief:

Construct a compiler which takes as its input a Bean Machine model, queries and observations, and deduces from it the equivalent BMG graph (if there is one).

If we can solve that problem effectively then we get the best of both worlds. Data scientist users can write models using a pleasant Python syntax easily integrated into their existing PyTorch workflows, but get the inference performance afforded by BMG. The “concept count” — the number of APIs that the data scientist must understand — increases only slightly, but their workflows get faster. That is a win all around.

Does it work today?

Yes. Barely. If for some strange reason you want to play around with it, you can obtain the code from GitHub. The error messages are terrible, the compiler is “overfit” to the specific models that we needed to get compiled, and it is quite slow, but for simple models it does work. At the time the team was shut down we were actively extending both BMG and the compiler to handle more general models efficiently.

Early on we realized that of course the compiler is a means to an end, not an end in itself. What users really wanted was a more efficient inference engine that just happened to use BMG as its back end, so that’s the main API. You can call BMGInference().infer(), passing in queries, observations, and so on, just like any other Bean Machine inference engine; the compiler will analyze the source code of the queries, produce a graph, and call BMG’s inference engine to produce samples from the posteriors of the queried variables.
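
For the coin-flip model from this series, a call would look something like this; I am writing the import path and parameter details from memory, so treat this as a sketch rather than gospel:

from beanmachine.ppl.inference.bmg_inference import BMGInference

samples = BMGInference().infer(
    queries=[fairness()],
    observations={flip(0): tensor(1.0), flip(1): tensor(0.0)},
    num_samples=10000,
)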

It has a few other useful APIs as well that are more like those of a traditional source-in, source-out compiler.

  • BMGInference().to_graph() produces a graph object without starting inference.
  • BMGInference().to_python() and to_cpp() produce Python and C++ source code fragments which construct the graph.
  • BMGInference().to_dot() produces a program in the DOT graph visualization language; that’s what I used to make the graph diagrams in this series.
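
These take the same queries and observations as infer(); again the exact signatures here are from memory, so consider this a sketch:

dot = BMGInference().to_dot(queries=[fairness()], observations={})
print(dot)  # DOT source text, ready for a graphviz renderer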

Next time on FAIC: What were my initial assumptions when tasked with writing a “compiler” that extracts a graph from the source code of a model? And is it really a compiler, or is it something else?

Bean Machine Retrospective, part 2

Introducing Bean Machine Graph

Bean Machine has many nice properties:

  • It is integrated with Python, a language often used by data scientists
  • It describes models using the rich, flexible PyTorch library
  • Inference works well even on models where data is stored in large tensors

I’m not going to go into details of how Bean Machine proper implements inference, at least not at this time. Suffice to say that the implementation of the inference algorithms is also in Python using PyTorch; for a Python program it is pretty fast, but it is still a Python program.

We realized early on that we could get order-of-magnitude better inference performance than Bean Machine’s Python implementation if we could restrict values in models to (mostly) single-value tensors and a limited set of distributions and operators.

In order to more rapidly run inference on this set of models, former team member Nim Arora developed a prototype of Bean Machine Graph (BMG).

BMG is a graph-building API (written in C++ with Python bindings) that allows the user to specify model elements as nodes in a graph, and relationships as directed edges. Recall that our “hello world” example from last time was:

@random_variable
def fairness():
  return Beta(2,2)

@random_variable
def flip(n):
  return Bernoulli(fairness())

That model written in BMG’s Python bindings would look like this: (I’ve omitted the queries and observations steps for now, and we’ll only generate one sample coin flip instead of ten as in the previous example, to make the graph easier to read.)

g = Graph()
two = g.add_constant_pos_real(2.0)
beta = g.add_distribution(
  DistributionType.BETA,
  AtomicType.PROBABILITY,
  [two, two])
betasamp = g.add_operator(OperatorType.SAMPLE, [beta])
bern = g.add_distribution(
  DistributionType.BERNOULLI,
  AtomicType.BOOLEAN,
  [betasamp])
flip0 = g.add_operator(OperatorType.SAMPLE, [bern])

That’s pretty hard to read. Here’s a visualization of the graph that this code generates:
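
Sketched in text, the data flows downward through the nodes:

   2.0 (constant)
     |        |
     v        v
   Beta distribution
        |
        v
      Sample
        |
        v
   Bernoulli distribution
        |
        v
      Sample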

These graphs are properly called Bayesian network diagrams, but for this series I’m just going to call them “graphs”.


I should say a little about the conventions we use in this graphical representation. Compiler developers like me are used to decomposing programs into abstract syntax trees. An AST is, as the name suggests, a tree. ASTs are typically drawn with the “root” at the top of the page, arrows point down, “parent nodes” are above “child nodes”, and operators are parents of their operands. The AST for something like x = a + b * c would be
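
        =
       / \
      X   +
         / \
        A   *
           / \
          B   C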

where X, A, B, C are identifier nodes.

Bayesian network diagrams are just different enough to be confusing to the compiler developer. First of all, they are directed acyclic graphs, not trees. Second, the convention is that operators are children of their operands, not parents.

The best way I’ve found to think about it is that graphs show data flow from top to bottom. The parameter 2.0 flows into an operator which produces a beta distribution — twice. That distribution flows into a sample operator which then produces a sample from its parent distribution. That sampled value flows into an operator which produces a Bernoulli distribution, and finally we get a sample from that distribution.

If we wanted multiple flips of the same coin, as in the original Python example, we would produce multiple sample nodes out of the Bernoulli distribution.
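
Using the same factory API as before, that is just more SAMPLE nodes attached to the same distribution node; for instance:

flip0 = g.add_operator(OperatorType.SAMPLE, [bern])
flip1 = g.add_operator(OperatorType.SAMPLE, [bern])
flip2 = g.add_operator(OperatorType.SAMPLE, [bern])
# ...one SAMPLE node per coin flip, each a child of the same Bernoulli node.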


BMG also has the ability to mark sample nodes as “observed” and to mark operator nodes as “queried”; it implements multiple inference algorithms which, just like Bean Machine proper, produce plausible samples from the posterior distributions of the queried nodes given the values of the observed nodes. For the subset of models that can be represented in BMG, the performance of the inference algorithms can be some orders of magnitude faster than the Bean Machine Python implementation.

Summing up: Our team had two independent implementations of inference algorithms; Bean Machine proper takes as its input some decorated Python methods which concisely and elegantly represent models in a highly general way using PyTorch, but the inference is relatively slow. Bean Machine Graph requires the user to write ugly, verbose graph construction code and greatly restricts both the data types and the set of supported operators, but uses those restrictions to achieve large inference speed improvements.


Next time on FAIC: Given the above description, surely you’ve guessed by now what the compiler guy has been doing for the last three years on this team full of data scientists! Can we automatically translate a Bean Machine Python model into a BMG graph to get BMG inference performance without sacrificing representational power?

Bean Machine Retrospective, part 1

As I mentioned in the previous episode, the entire Bean Machine team was dissolved; some team members were simply fired, others were absorbed into other teams, and some left the company. In this series I’m going to talk a bit about Bean Machine and my work on what is surely the strangest compiler I’ve ever written.

I should probably recap here my introduction to Bean Machine from what now seems like an eternity ago but was in fact only September of 2020.

We also have some tutorials and examples at beanmachine.org, and the source code is at github.com/facebookresearch/beanmachine.


We typically think of a programming language as a tool for implementing applications and components: games, compilers, utilities, spreadsheets, web servers, libraries, whatever. Bean Machine is not that; it is a calculator that solves a particular class of math problems; the problems are expressed as programs.

The purpose of Bean Machine is to allow data scientists to write declarative code inside Python scripts which represents relationships between parts of a statistical model, thereby defining a prior distribution. The scientist can then input real-world observations of some of the random variables and pose queries about the posterior distributions. That is, we wish to give a principled, mathematically sound answer to the question: how should we update our beliefs when given real-world observations?

Bean Machine is implemented as some function decorators which modify the behavior of Python programs and some inference engines which do the math. However, the modifications to Python function call semantics caused by the decorators are severe enough that it is reasonable to conceptualize Bean Machine as a domain specific language embedded in Python.


The “hello world” of Bean Machine is: we have a mint which produces a single coin; our prior assumption is that the fairness of the coin is distributed somehow; let’s suppose we have reason to believe that it is drawn from beta(2,2).

@random_variable
def fairness():
  return Beta(2,2)

We then flip that coin n times; each call to flip with a different argument represents a different coin flip:

@random_variable
def flip(n):
  return Bernoulli(fairness())

We then choose an inference algorithm — say, single-site Metropolis — tell it what we observed some coin flips to be, and ask for samples from the posterior distribution of the fairness of the coin. After all, we have much more information about the fairness of the coin after observing some coin flips than we did before.

heads = tensor(1)
tails = tensor(0)
samples = bm.SingleSiteAncestralMetropolisHastings().infer(
    queries=[fairness()],
    # Say these are nine heads out of ten, for example.
    observations={ flip(0) : heads, [...] flip(9): tails },
    num_samples=10000,
    num_chains=1,
)

If we then did a histogram of the prior and the posterior of fairness given these observations, we’d discover that as the number of samples increased, the histograms would conform more and more closely to these graphs:

Prior: Beta(2,2)

Posterior if we got nine heads and one tail in the observations:

When we observe nine heads out of ten, we should update our beliefs about the fairness of the coin by quite a large amount.

I want to emphasize that what this analysis gives you is not just a point estimate — the peak of the distribution — but a real sense of how tight that estimate is. If we had to make a single guess as to the fairness of the coin prior to observations, our best guess would be 0.5. In the posterior our best guess would be around 0.83. But we get so much more information out of the distribution! We know from the graphs that the prior is extremely “loose”; sure, 0.5 is our best guess, but 0.3 would be entirely reasonable. The posterior is much tighter. And as we observed more and more coin flips, that posterior would get even tighter around the true value of the fairness.

Notice also that the point estimate of the posterior is not 0.9 even though we saw nine heads out of ten! Our prior is that the coin is slightly more likely to be 0.8 fair than 0.9 fair, and that information is represented in the posterior distribution.


All right, that’s enough recap. Next time on FAIC: I’m not going to go through all the tutorials on the web site showing how to use Bean Machine to build more complex models; see the web site for those details. Rather, I’m going to spend the rest of this series talking about my work as the “compiler guy” on a team full of data scientists who understand the math much better than I do.

Backyard birds of Seattle

Since I’m staying home all day due to the ongoing pandemic emergency, I’ve decided to document all the different species of birds that arrive in my yard. I am not a great bird photographer but I am enjoying practicing every day.

This will be my last post of 2020 and frankly this year cannot be over soon enough; I hope you are all safe and well. We will pick up in 2021 with more fabulous adventures in coding!

As always, click on any image for a larger version.


Anna’s hummingbird — the only hummingbird that stays in the Pacific Northwest all year round. The male’s head flashes iridescent magenta depending on what angle you look at it; the female has just a few iridescent spots.


Bald eagle — this juvenile showed up in my yard for just a few seconds on election day; fortunately I had my camera handy. Bald eagles do not get their characteristic white head until they are four years old.


Bewick’s wren — I’ve only seen this bird once at my feeder this year; they are easily identified by the prominent white eyebrow stripe.


Black-capped chickadee — messy eaters. We also get chestnut-backed chickadees in the area but I have not seen one in my yard yet.


Bushtit — they travel in flocks of a dozen or more and mob suet feeders for a few minutes before flying off. Super cute, and they fly like they’re constantly about to fall out of the sky.


California scrub jay — tends to fly in, get in a fight with a bunch of much larger Steller’s jays, and leave.

Crow — looks thoroughly metal on a windy day.


Downy woodpecker — easily confused with the hairy woodpecker, which I have not yet seen in my yard. The male has a red cap. The smallest North American woodpecker.


Eastern grey squirrel — HEY YOU’RE NOT A BIRD; GET OUT OF THE BIRD FEEDER


European starling — super invasive, super aggressive, but very pretty little dinosaurs.


House finch — the males are somewhat red, the females are tricky to tell apart from other finches.


Northern flicker — the most common woodpecker in the Pacific Northwest; we typically see the “red-shafted” variety which is in fact orange-shafted. This is a female; the male has a red spot on the face.


Oregon junco — this is the Pacific Northwest coloring of the dark-eyed junco. One of the most common feeder birds.


Pine siskin — these little finches look a lot like house finches but they have a yellow flash on their wings. They tend to arrive in groups.


Raven — ’tis the wind and nothing more. A rare sight in my backyard.


Robin — lives in constant disdain. Not to be confused with the spotted towhee, who thinks you are awesome.


Spotted towhee — looks a bit like a robin, but thinks you are great and that you should give yourself more credit for dealing with a difficult situation this year.


Steller’s jay — the classic Pacific Northwest blue jay. Noisy and territorial. But lovely plumage.


And that’s all the birds in my backyard in the last few months that I managed to get a picture of.

Have a safe and festive holiday season, but not too festive; we want you and your relatives around for more fabulous adventures in 2021!

The VSTO startup sequence

Earlier this week I was looking for an old photo, and while browsing I came across a photo I took of my whiteboard in my office at Microsoft in 2004. Or rather, it was two photos; I’ve crudely stitched them together. Click on the image for a larger version.

OMG. What. The. Heck. Is. All. That. Nonsense?

Let me start with a little history.

Before I was on the C# team and after I was on the scripting languages team, I spent a couple years at Microsoft working on the plumbing for Visual Studio Tools for Office.

The idea of VSTO was to bring the ability to write truly rich, client-server aware, data-driven applications in Office documents using C# or Visual Basic; we wanted to go beyond the scripts and productivity tools typical of VBA customizations.

This was a project with many, many difficulties, both technical and political. On the political side of things, I think it is fair to say that the Office team has historically had (with good reason!) a great deal of resistance to taking any compatibility burden that would possibly slow down their ability to innovate in future releases; “platformizing” Word and Excel into true application hosts by a team external to Office was just such a burden.

The technical difficulties were considerable, in large part due to concerns about the security model. We were deeply, painfully aware of how Office customizations and scripting languages had been misused in the past as vectors for malware, and we did not want to create new vulnerabilities. As mitigation, we designed a mechanism that would isolate any customization code to its own appdomain with a highly restrictive default security policy.

Office, however, was not at the time designed to host the CLR. They were only willing to give us a single callback to our loader code that kicked off the whole process when a customized spreadsheet or document was loaded.

By 2004 we were on the fourth revision to our loading algorithm and I was tasked with coming up with the fifth; to facilitate discussion of options I drew a diagram on my whiteboards which I helpfully titled “HIGHLY SIMPLIFIED STARTUP SEQUENCE v4”.

A few things strike me about this diagram now, over 16 years later.


First: though it looks like a mess, I did actually put some thought into the design elements.

  • The diagram is divided into three sections, separated by heavy blue vertical lines. On the left are components running entirely outside of the CLR; in the middle are components that run in the CLR’s default appdomain, and on the right are components that run in the customization’s restricted appdomain. (And of course on the extreme far left is the edge of my “THE MATRIX” poster. A lot of the code names of the different parts of the project were references to The Matrix, including the team cover band that I played keyboards for. I am sad that I can no longer wear my “The Red Pills” polo shirt in public due to the co-opting of that movie reference by misogynist jerks.)
  • The purple boxes that run along the top are components and the lollipops give the interfaces they implement.
  • The purple boxes and arrows below give the exact sequence of twenty different method calls showing what component is calling what other component with what data, and why. In particular the diagram allows us to easily see when a component running in a more restricted security environment is calling into a less restricted environment; those calls need to be allowed because we need them to happen, but that then means that maybe hostile user code could call them, which could be bad.
  • Design problems, questions, annotations and proposed changes are shown in blue.
  • Red is used to identify one key component and an important question about it.
  • I have no idea what that yellow code fragment is or why it was written over top of everything else. It looks irrelevant.

The purpose of the diagram was originally to clarify in my own mind what the sequence was and what the problems were, but by the time it was in this form it was also for giving context to my coworkers when we were discussing options, so it had to be readable. I probably redrew this diagram a half a dozen times before it got to this state.


Second: we can see that there were a good half dozen or more design problems that I was trying to solve here but the big problems involved dirty documents and application manifests.

When you close Word or Excel, you have I am sure noticed that sometimes you get a “save your work?” dialog and sometimes the app just closes. The app is keeping track of whether the document is dirty — changed since it was last loaded or saved — or clean.

Suppose we load a customized spreadsheet, and initializing the customization causes the installer to notice that there is a newer version that it should be downloading. That might change the manifest information about the customization, so the spreadsheet is now “dirty”. But we do not want to ever unnecessarily dirty a document, because that is confusing and irritating to the user.

In step nine the fake application activator obtains an IAppInfo reference from the appdomain manager, updates the manifest from the customization’s server, and parses the manifest. My comments say:

  • Do not write back at this point; need to maintain dirty state
  • No, don’t do this at all. Host must provide updated manifest. This is not a VSTA feature, it is VSTO. (Meaning here that something here is unique to Visual Studio Tools for Office, and not the generalization of it we were working on, VST for Applications.)
  • Must do both. Don’t write back. AIState object must ensure dirtyness.

Apparently I was deeply conflicted on this point. I don’t recall how it was resolved.

My favourite comment though is the one in red:

Can we take manifest out of doc? Peter: “It would be awesome. If assembly is available offline, so is manifest”.

The scenario here had something to do with both the dirty bit problem, and more generally dealing with locally cached customizations. We did a bunch of work on the security model for “what happens if you’re trying to run a customization while your laptop is in airplane mode and we can’t get to the server to check for updates”. Peter is of course legendary Microsoft PM Peter Torr with whom I worked for many years.

My second favourite was where I said “OFFICE12?” Yeah, what’s going to happen when Office revs? Can we guarantee that all this stuff keeps working?


Third: It’s funny how the mind works. Though I’ve described the organization of the diagram and the major problems, today I remember almost none of what is going on here, what the specific issues were, or how we resolved them. But that whole sequence was intensely important to me for several months of my life; it was the foundational plumbing to the entire endeavor and so I knew it literally forwards and backwards. Those memories are 90% gone now. And yet if someone were to yell the numbers “nine six seven eleven eleven” at me from across the street I would be unable to not think “call Pizza Pizza, right away“. Thanks, 1980s jingle writers.


Fourth: I often think about this sort of thing in the context of those “tell me about a time you solved a design problem” interview questions. This “highly simplified” startup sequence with its twenty method calls has to balance:

  • security
  • performance
  • privacy
  • debuggability
  • code maintainability
  • versioning
  • robustness
  • graceful failure
  • user irritation

and numerous other design criteria. But can you imagine trying to explain any detail of this diagram to someone with no prior knowledge in a 45 minute interview? Real-world design problems are hard precisely because there are so many conflicting goals and messy politics. And worse, too often this is the institutional knowledge that is never written down and then lost.


Coming up on FAIC: Not sure!

  • I want to embark upon a more detailed dive into Bean Machine
  • We have just open-sourced a tool we use for benchmarking PPLs internally; I’d like to talk about that a bit
  • I’ve developed a little AST rewriting library in Python that is kinda fun; I could delve into the ideas behind that.

Let me know in the comments what you think.

Introducing Bean Machine

The final part of my Life series is still in the works but I need to interrupt that series with some exciting news. The new programming language I have been working on for the last year or so has just been announced by the publication of our paper Bean Machine: A Declarative Probabilistic Programming Language For Efficient Programmable Inference.


Before I get into the details, a few notes on attributing credit where it is due and the like:

  • Though my name appears on the paper as a courtesy, I did not write this paper. Thanks and congratulations in particular to Naz Tehrani and Nim Arora who did a huge amount of work getting this paper together.
  • The actual piece of the language infrastructure that I work on every day is a research project involving extraction, type analysis and optimization of the Bayesian network underlying a Bean Machine program. We have not yet announced the details of that project, but I hope to be able to discuss it here soon.
  • Right now we’ve only got the paper; more information about the language and how to take it out for a spin yourself will come later. It will ship when it’s ready, and that’s all the scheduling information I’ve got.
  • The name of the language comes from a physical device for visualizing probability distributions, also known as a Galton board, because that’s what it does.


I will likely do a whole series on Bean Machine later on this autumn, but for today let me just give you the brief overview should you not want to go through the paper. As the paper’s title says, Bean Machine is a Probabilistic Programming Language (PPL).

For a detailed introduction to PPLs you should read my “Fixing Random” series, where I show how we could greatly improve support for analysis of randomness in .NET by both adding types to the base class library and by adding language features to a language like C#.

If you don’t want to read that 40+ post introduction, here’s the TLDR.

We are all used to two basic kinds of programming: produce an effect and compute a result. The important thing to understand is that Bean Machine is firmly in the “compute a result” camp. In our PPL the goal of the programmer is to declaratively describe a model of how the world works, then input some observations of the real world in the context of the model, and have the program produce posterior distributions of what the real world is probably like, given those observations. It is a language for writing statistical model simulations.

A “hello world” example will probably help. Let’s revisit a scenario I first discussed in part 30 of Fixing Random: flipping a coin that comes from an unfair mint. That is, when you flip a coin from this mint, you do not necessarily have a 50-50 chance of getting heads vs tails. However, we do know that when we mint a coin, the distribution of fairness looks like this:

Fairness is along the x axis; 0.0 means “always tails”, 1.0 means “always heads”. The probability of getting a coin of a particular fairness is proportional to the area under the graph. In the graph above I highlighted the area between 0.6 and 0.8; the blue area is about 25% of the total area under the curve, so we have a 25% chance that a coin will be between 0.6 and 0.8 fair.

Similarly, the area between 0.4 and 0.6 is about 30% of the total area, so we have a 30% chance of getting a coin whose fairness is between 0.4 and 0.6. You see how this goes I’m sure.

Suppose we mint a coin; we do not know its true fairness, just the distribution of fairness above. We flip the coin 100 times, and we get 72 heads, 28 tails. What is the most probable fairness of the coin?

Well, obviously the most probable fairness of a coin that comes up heads 72 times out of 100 is 0.72, right?

Well, no, not necessarily right. Why? Because the prior probability that we got a coin that is between 0.0 and 0.6 is rather a lot higher than the prior probability that we got a coin between 0.6 and 1.0. It is possible by sheer luck to get 72 heads out of 100 with a coin between 0.0 and 0.6 fairness, and those coins are more likely overall.


Aside: If that is not clear, try thinking about an easier problem that I discussed in my earlier series. You have 999 fair coins and one double-headed coin. You pick a coin at random, flip it ten times and get ten heads in a row. What is the most likely fairness, 0.5 or 1.0? Put another way: what is the probability that you got the double-headed coin? Obviously it is not 0.1%, the prior, but nor is it 100%; you could have gotten ten heads in a row just by luck with a fair coin. What is the true posterior probability of having chosen the double-headed coin given these observations?
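
If you want to check your intuition, Bayes’ rule settles it; the arithmetic fits in a couple of lines of Python:

# One coin in a thousand is double-headed; the double-headed coin produces
# ten heads with certainty, a fair coin with probability (1/2)**10.
prior = 1 / 1000
posterior = prior / (prior + (1 - prior) * 0.5 ** 10)
print(posterior)  # about 0.506; only a little better than a coin flip!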


What we have to do here is balance between two competing facts. First, the fact that we’ve observed some coin flips that are most consistent with 0.72 fairness, and second, the fact that the coin could easily have a smaller (or larger!) fairness and we just got 72 heads by luck. The math to do that balancing act to work out the true distribution of possible fairness is by no means obvious.

What we want to do is use a PPL like Bean Machine to answer this question for us, so let’s build a model!

The code will probably look very familiar, and that’s because Bean Machine is a declarative language based on Python; all Bean Machine programs are also legal Python programs. We begin by saying what our “random variables” are.


Aside: Statisticians use “variable” in a way very different than computer programmers, so do not be fooled here by your intuition. By “random variable” we mean that we have a distribution of possible random values; a representation of any single one of those values drawn from a distribution is a “random variable”. 


To represent random variables we declare a function that returns a PyTorch distribution object for the distribution from which the random variable has been drawn. The curve above is the beta(2, 2) distribution, and we have a constructor for an object that represents that distribution in the PyTorch library that we’re using, so:

@random_variable
def coin():
  return Beta(2.0, 2.0)

Easy as that. Every usage in the program of coin() is logically a single random variable; that random variable is a coin fairness that was generated by sampling it from the beta(2, 2) distribution graphed above.


Aside: The code might seem a little weird, but remember we do these sorts of shenanigans all the time in C#. In C# we might have a method that looks like it returns an int, but the return type is Task<int>; we might have a method that yield returns a double, but the return type is IEnumerable<double>. This is very similar; the method looks like it is returning a distribution of fairnesses, but logically we treat it like a specific fairness drawn from that distribution.


What do we then do? We flip a coin 100 times. We therefore need a random variable for each of those coin flips:

@random_variable
def flip(i):
  return Bernoulli(coin())

Let’s break that down. The calls flip(0), flip(1), and so on, are distinct random variables; they are outcomes of a Bernoulli process — the “flip a coin” process — where the fairness of the coin is given by the single random variable coin(). But every call to flip(0) is logically the same specific coin flip, no matter how many times it appears in the program.

For the purposes of this exercise I generated a coin and simulated 100 coin tosses to simulate our observations of the real world. I got 72 heads. Because I can peek behind the curtain for the purposes of this test, I can tell you that the coin’s true fairness was 0.75, but of course in a real-world scenario we would not know that. (And of course it is perfectly plausible to get 72 heads on 100 coin flips with a 0.75 fair coin.)

We need to say what our observations are. The Bernoulli distribution in PyTorch produces a 1.0 tensor for “heads” and a 0.0 tensor for “tails”. Our observations are represented as a dictionary mapping from random variables to observed values.

heads = tensor(1.0)
tails = tensor(0.0)
observations = {
  flip(0) : heads,
  flip(1) : tails,
  ...  and so on, 100 times with 72 heads, 28 tails.
}

Finally, we have to tell Bean Machine what to infer. We want to know the posterior probability of fairness of the coin, so we make a list of the random variables we care to infer posteriors on; there is only one in this case.

inferences = [ coin() ]
posteriors = infer(observations, inferences)
fairness = posteriors[coin()]

and we get an object representing samples from the posterior fairness of the coin given these observations. (I’ve simplified the call site to the inference method slightly here for clarity; it takes more arguments to control the details of the inference process.)

The “fairness” object that is handed back is the result of efficiently simulating the possible worlds that get you to the observed heads and tails; we then have methods that allow you to graph the results of those simulations using standard graphing packages:

The orange marker is our original guess of observed fairness: 0.72. The red marker is the actual fairness of the coin used to generate the observations, 0.75. The blue histogram shows the results of 1000 simulations; the vast majority of simulations that produced those 72 heads had a fairness between 0.6 and 0.8, even though only 25% of the coins produced by the mint are in that range.  As we would hope, both the orange and red markers are near the peak of the histogram.

So yes, 0.72 is close to the most likely fairness, but we also see here that a great many other fairnesses are possible, and moreover, we clearly see how likely they are compared to 0.72. For example, 0.65 is also pretty likely, and it is much more likely than, say, 0.85. This should make sense, since the prior distribution was that fairnesses closer to 0.5 are more likely than those farther away; there’s more “bulk” to the histogram to the left than the right: that is the influence of the prior on the posterior!

Of course because we only did 1000 simulations there is some noise; if we did more simulations we would get a smoother result and a clear, single peak. But this is a pretty good estimate for a Python program with six lines of model code that only takes a few seconds to run.


Why do we care about coin flips? Obviously we don’t care about solving coin flip problems for their own sake. Rather, there are a huge number of real-world problems that can be modeled as coin flips where the “mint” produces unfair coins and we know the distribution of coins that come from that mint:

  • A factory produces routers that have some “reliability”; each packet that passes through each router in a network “flips a coin” with that reliability; heads, the packet gets delivered correctly, tails it does not. Given some observations from a real data center, which is the router that is most likely to be the broken one? I described this model in my Fixing Random series.
  • A human reviewer classifies photos as either “a funny cat picture” or “not a funny cat picture”. We have a source of photos — our “mint” — that produces pictures with some probability of them being a funny cat photo, and we have human reviewers each with some individual probability of making a mistake in classification. Given a photo and ten classifications from ten reviewers, what is the probability that it is a funny cat photo? Again, each of these actions can be modeled as a coin flip.
  • A new user is either a real person or a hostile robot, with some probability. The new user sends a friend request to you; you either accept it or reject it based on your personal likelihood of accepting friend requests. Each one of these actions can be modeled as a coin flip; given some observations of all those “flips”, what is the posterior probability that the account is a hostile robot?

And so on; there are a huge number of real-world problems we can solve just with modeling coin flips, and Bean Machine does a lot more than just coin flip models!


I know that was rather a lot to absorb, but it is not every day you get a whole new programming language to explain! In future episodes I’ll talk more about how Bean Machine works behind the scenes, how we traded off between declarative and imperative style, and that sort of thing. It’s been a fascinating journey so far and I can hardly wait to share it.


Installing windows

Episode 34 will be delayed again — sorry! — because once again the time I had set aside for writing this weekend got consumed by a real-world task that could not wait. (I will try for Thursday of this week.)

Some friends who are moving had a handyman failure; as is often the case when renovating a house to be sold, they have a set of build dependencies that required this window to be replaced in a hurry in order to not slip the schedule for other renovations, so I volunteered to take care of it.

Yuck.

Living in a 112-year-old house myself, I am used to doing archaeological investigations of the strange decisions made by previous owners. This window, though obviously an old single-paned window, did not look like it was original to the 120-year-old house. It was the wrong size for the rough opening; the hinges looked more modern than turn-of-the-century, and so on.

Sure enough, when disassembled there was a gap behind the trim that was insulated with crumpled newspapers from 1957. Oddly enough they were Pittsburgh newspapers from different days; perhaps the owners of the house in 1957 moved from Pittsburgh, replaced a window, and insulated the gaps with the packing paper they moved with? It’s a mystery.

Having zero haircuts since quarantine began has done wonders for my hair.

New window in and trimmed — obviously the paint will need to be redone but that’s why the window had to go in before the painters arrived.

And the interior needs a little more drywalling and priming before it is ready for painting, but it is 1000000x better than before at least.

The neighbours in the blue house apparently asked my friends for my contact information as they also have a window that needs replacing. I am quite chuffed. I had my friends pass along that I only do windows as a favour, but I would be happy to design them a programming language for hire should they need one of those.

Next time: Gosper’s algorithm, finally!


Implementing a full fence

Episode 34 of my ongoing series will be slightly delayed because I spent the time I normally spend writing on the weekend rebuilding one of my backyard fences instead.

I forgot to take a before picture, but believe me, it was ruinous when I bought the place in 1997 and to the point in 2020 where it was actually falling to pieces in a stiff breeze.

I replaced it with an identical design and materials:

…and divided and re-homed some sixteen dozen irises in the process.

It’s nice implementing something that requires no typing every now and then.