The names of birds, part 4

The European starling is a lovely looking bird, though territorial, noisy and aggressive up close. Unfortunately, they are very invasive in North America. Most of the hundreds of millions of European starlings now living in the Americas can be found fighting over my suet feeder in winter months.

Last time we said that the kestrel obeys the identity Kxy=x for all birds x and y in the forest. The starling has a slightly more complicated identity: Sxyz=xz(yz).

Again, recall that this is parenthesized ((Sx)y)z=(xz)(yz).

Yes I know I used S for the “successor bird” in part 2; sorry for the confusion, this is a different bird.

If a forest contains both a starling and a kestrel, can you show that there is a bird in the forest which is fond of every bird in the forest? (Note that I’m implying here that there are at least two distinct birds in the forest, and therefore no lonely egocentric kestrel from last episode.)

Give it a try, and then scroll down.








Let’s call K to S; the starling calls back another bird. We’ll then call K to it. That calls back some other bird. We’ll then call any bird z in the forest to that bird.


But Kz is a bird that is fixated on z, so no matter what we call to Kz, we always get z back:


We call SKK the identity bird I, because Iz=z for all z in the forest. Every value is a fixpoint of the identity function bird.

(Aside: Note that SKy names the identity bird for any bird y; we didn’t have to use K.)

Now that we know that any forest with both S and K in it also has I in it, can you prove that the forest also has M in it? (Recall that Mx=xx for all x.) Give it a try and then scroll down.








SIIz=(Iz)(Iz)=zz, so SII is M.

All this whimsey is fun of course, but I’m sure you’ve all realized by now that birds are more conventionally called combinators. Combinators are functions that take a single combinator and return a single combinator; forests are sets of combinators.

The study of combinatory logic is interesting for computer scientists. It turns out that there is a relatively straightforward way to take any expression made up of S and K and find an equivalent lambda calculus expression, and vice versa. Any expression in lambda calculus can be expressed as a string of S and K combinators. You can use the S and K combinators to express Booleans, numbers, arbitrary functions on integers, you name it; if a Turing machine can compute it, there’s an S/K expression which can be interpreted as computing the same thing.

There’s a lot more I could say about that, but I’m not going to go into the details.

Oh, and when I added lambdas to C# 3 way back in the day, the first test case I wrote to ensure the type analysis was working was something like:

delegate C C(C x);
C i = x=>x;
C m = x=>x(x);
C k = x=>y=>x;
C s = x=>y=>z=>x(z)(y(z));

but I’m not going to discuss that further, at least not right now. We’re already way off track from my series on compiling Bean Machine. Let’s get back to it.

Next time on FAIC: We’ll see how to use an approach inspired by combinators to build parts of a compiler, and why ensuring combinators have fixpoints is important.

If you enjoyed these puzzles, the ones that I showed are some of the introductory puzzles in To Mock A Mockingbird, which is delightful and educational, and gets as far as computability theory and Godel incompleteness. Do pick up a copy!

The names of birds, part 3

In the autumn of last year my friend Joan and I went on a little trip up to the Skagit valley north of Seattle to photograph birds of prey; I managed to get a blurry but recognizable shot of this charming little kestrel looking for field mice. It was easily identifiable in the field because kestrels are one of the very few birds of prey that can briefly hover while hunting, almost like a hummingbird.

Well, back to combinatory logic and some puzzles from To Mock a Mockingbird.

We’ve said that if there are birds x and y such that xy=y, then x is fond of y, and y is a fixpoint of x. Smullyan goes on to say that if there are birds x and y such that xz=y for every z in the forest, then x is not just fond of y, it is fixated on y.

The kestrel, K, is an interesting bird. When you call the name of any bird x to it, it calls back the name of a bird in the forest that is fixated on x. That is, the kestrel obeys the identity Kxy=x for all birds x and y in the forest. (Recall from part one that Kxy means (Kx)y, not K(xy).)

Suppose there is a forest which contains a kestrel and moreover in this particular forest K is fond of itself; that is KK=K. Can you prove that a kestrel that is fond of itself is both (1) extremely egocentric, and (2) lonely? Give it a try, and then scroll down.





I need a longer lens! This is at the edge of what my little 70-300 zoom can handle.




(1) A bird which is fond of itself is already probably pretty egocentric. Kx gives the name of a bird which is fixated on x, so KK gives the name of a bird which is fixated on K. But KK=K, so K is the name of a bird that is fixated on K. A bird which is fixated on itself is surely extremely egocentric!

(2) By definition (Kx)y=x for all x and y. Since in this forest K is fixated on K, Kx=K for all x. Therefore Ky=x for all x, y. But again Ky=K for all y so K=x for all x in the forest. Every bird in the forest is the kestrel, and therefore there is only one bird in the forest. No wonder an extremely egocentric kestrel is lonely!

Next time on FAIC: The starling!

The names of birds, part 2

Reader “Joel” had an insightful comment on the first part of this series which I thought deserved a short episode of its own. Recall that we proved the theorem “if a compositional forest contains a mockingbird then every bird in the forest has a fixpoint“. An equivalent way to state that theorem is: “if a forest contains a bird without a fixpoint then either the forest is not compositional or it contains no mockingbird“. Joel’s question was whether there is a more “intuitive” way to understand why the mockingbird is so interesting that the presence of a bird without a fixpoint should prevent a mockingbird from existing in the same (compositional) forest.

The short version is: we’ll say that two birds agree on an input if their outputs are the same for that input. The mockingbird is a bird that agrees with every other bird on at least one input. A bird without a fixpoint is a bird that can always produce a different output than its input. Their composition is a bird which disagrees with every bird in the forest on one input and therefore is not in the forest at all.

The long version is:

I find it easiest to get an intuition for a general rule by looking at some simple examples.

The first thing I’d note is that it is quite easy to construct a non-compositional forest containing a bird without a fixpoint. A simple example would be a non-compositional forest containing exactly two birds, A and M. Define them such that:

AA = M
AM = A
MA = M
MM = M

Plainly these are two different birds, since their responses disagree on input M. Plainly M is a mockingbird, since Mx=xx for all x in the forest. And plainly A has no fixpoint. This forest is not compositional because there is no bird C in the forest such that Cx=A(Mx) for all x. With such a small number of birds in the forest it seems quite reasonable that one might have no fixpoint.

Let’s look at a more interesting example, this one in a compositional forest. Let’s suppose there is a countable infinity of birds in the forest. By “countable infinity” I mean that we can number each bird with a natural number; we’ve got bird 0, bird 1, bird 2, and so on. In fact, let’s just use natural numbers as the names of the birds in this forest.

Since each bird needs a response to hearing the name of every other bird in the forest, we can imagine an infinite lookup table that says what each bird responds when it hears the name of another bird in the forest. We’ll put the “called to” bird on the vertical axis and the “name called” on the horizontal axis:

      0 1 2 3 ...
  0 | 9 5 2 5 ...
  1 | 8 3 6 7 ...
  2 | 0 1 2 3 ...
  3 | 3 1 4 1 ...
... |
  S | 1 2 3 4 ...
... |

So in our notation 0 1=5, 2 3=3, and so on.

Let’s suppose this forest is compositional and contains a bird, call it S, which has no fixpoint. We could choose any such S, but without loss of generality we shall pick a specific S, the bird where S 0=1, S 1=2, S 2=3, S 3=4, … and so on; the “successor” bird. Of course S has some “number” name but we don’t really care what it is, so we’ll just “alias” that one as S.

Now let’s suppose there’s a mockingbird in the forest and deduce a contradiction. The mockingbird tells us what is on the highlighted diagonal! That is, M 0=9, M 1=3, M 2=2, M 3=1, and so on. (As with S, M is an alias for some “number” name but we don’t know or care what it is.)

Now comes the fun bit. The forest is compositional, so therefore there exists a bird C such that Cx=S(Mx) for all x. We do care what the number name of C is. So, which is it?

C 0=10, and therefore C is not 0, because 0 0=9. But similarly, C 1=4, so C is not 1. And similarly C is not 2, 3, 4, 5, … because C definitely disagrees with each of those functions I mean birds at the diagonal. But if C has no number name, then it is not in the forest at all, contradicting the statement that the forest is compositional. We now have deduced a contradiction, and therefore our assumption that M is in the forest is shown to be false.

By looking at a specific example — a countable forest with a successor bird — perhaps you have a better intuition about why Mx=xx and any bird S that has no fixpoint cannot coexist in a compositional forest: if they did then the bird C such that Cx=S(Mx) definitely disagrees with every bird in the forest on at least one input, and therefore the forest would not be closed under composition; there would always be a missing bird.

Next time on FAIC: We’ll look at some more fun examples and then get back to compilation.

The names of birds, part 1

For the next part in my Bean Machine retrospective to make sense I’ll need to make a short digression. In looking back on the almost 20 years I’ve been blogging, it is surprising to me that I’ve only briefly alluded to my appreciation of combinatory logic. In the next couple of episodes, I’ll do a quick introduction based on the delightful book that introduced it to me: To Mock A Mockingbird, by the late Raymond Smullyan.

Imagine a forest containing some birds, possibly finitely or infinitely many. These are unusual birds. When you call the species name of a bird in the forest to a bird in the forest, it calls one back to you. Maybe the same, maybe different, but when you tell a bird the name of a bird, it names a bird back to you. There might be a Red Capped Cardinal in the forest and when you call out Great Blue Heron to it, it calls back Belted Kingfisher.

(Photos by me; click for higher resolution.)

We will notate “I called Q to P and got response R” as PQ = R. If I then called out S to R and R responded with T, we’ll notate that as PQS = RS = T. We’ll use parentheses in the obvious way: PQS = (PQ)S and this might be different from P(QS). The latter is “I called S to Q, and then called Q‘s response to P“. I’ll use capital letters to represent specific bird names, and small letters to represent variables.

The question we’re considering here is: under what circumstances will a bird call back the same name you called to it? That is, for a given bird y, under what circumstances does yx = x?

Smullyan calls birds with this relationship “fondness” — that is, “y is fond of x” means that yx = x. If y is fond of x then x is said to be a “fixpoint” of y.

A forest is said to be “compositional” if for every pair of birds (a, b) — a and b can be the same or different — there is a bird c in the forest such that cx = b(ax) for all x. That is, if we call any name x to a, and then call its response to b, we get the same result as simply calling x to c. c is the composition of “call x to a, and then call that response to b“.

A mockingbird is the bird M with the property that Mx = xx. That is, for any bird name x, M tells you what x‘s response is to its own name. (The attentive reader will note that we have not said what MM is, but at least we know from the definition that MM = MM, so give me some credit for consistency at least.)

Theorem: if a compositional forest contains a mockingbird then every bird in the forest is fond of at least one bird.

Proof: Try to prove it yourself; scroll down for the solution.








Let p be any bird. Let c be the bird that composes p with M. Therefore cx = p(Mx) for all x.

In particular, that’s true for c, so cc = p(Mc). But Mc = cc, thus cc = p(cc), and we’ve found a bird that p is fond of. Since p was any bird, every bird in the forest is fond of at least one bird. Or, put another way, every bird in a compositional forest with a mockingbird has at least one fixpoint.

Next time on FAIC: we’ll take a look at a few more interesting birds, and then discuss why this whimsy is relevant to compilation before getting back to Bean Machine.

Bean Machine Retrospective, part 6

Happy New Year all!

Last time I briefly described the basic strategy of the Beanstalk compiler: transform the source code of each queried or observed function (and transitively their callees) into an equivalent program which partially evaluates the model, accumulating a graph as it goes. In this post I’ll go through the first step of the transformation.

The idea of the first source-to-source transformation is to take a relatively complicated language — Python — and replace it with what I think of as “simplified Python”. The Zen of Python famously says “there should be one — and preferably only one — obvious way to do it” but that is not the goal of this simplification step. The Zen of Python emphasizes obviousness. It is about making the code clear and the language discoverable. My goal with simplified Python is “there should be only one way to do it” — because it is easier to instrument a program to capture its meaning as it runs if there’s only one way to do everything.

Consider for example how to call a function foo with three arguments:

foo(1, 2, 3)
x = [1, 2, 3]
foo(1, 2, bar=3)
x = {"bar":3}
foo(1, 2, **x)

… and so on — there are a great many obvious ways to call a function in Python, but I’d like there to be exactly one in simplified Python.

The first thing we do when analyzing a queried or observed function is to obtain its source code, parse that into an abstract syntax tree (AST), and then do a series of AST transformations that reduce the function into an equivalent but simpler Python program. This is the most traditional-compiler-ish step in the pipeline.

(How do we obtain the source code? Remember, we have an RVID for each observation and query; the RVID contains a reference to the original function object. Python’s runtime can provide the source code for a given function reference.)

Let’s look at an only slightly more complicated version of our “hello world” model:

def fairness():
  return Beta(2,2)

def flip(n):
  return Bernoulli(0.5 * fairness())

In our simplified Python:

  • Every value gets its own variable
  • Every computation gets its own statement
  • Every call has the same syntax: a = b(*c, **d)

Our two methods in the simplified form would be:

def fairness():
  _t1 = 2
  _t2 = 2
  _t3 = [_t1, _t2]
  _t4 = {}
  _t5 = Beta(*_t3, **_t4)
  return _t5

def flip(n):
  _t1 = 0.5
  _t2 = []
  _t3 = {}
  _t4 = fairness(*_t2, **_t3)
  _t5 = _t1 * t4
  _t6 = [_t5]
  _t7 = {}
  _t8 = Bernoulli(*_t6, **_t7)
  return _t8

ASIDE: Readers with some experience writing compilers might ask “is this static-single-assignment form?” SSA form is a transformation whereby every variable is assigned exactly once; it’s used for control flow analysis and other program analyses. I did not write a full-on SSA transformation but when I was doing initial prototyping I knew that I might need SSA form in the future. Think of it as SSA-light if that makes sense.

We can go further; in our simplified Python:

  • logical operators and and or are eliminated
x = y and z

can be simplified to

x = y
if x:
  x = z

and now the graph accumulator does not need to worry at all about logical operators, because there aren’t any in the simplified program.

  • Every while loop is while True.
  • No loop has an else clause. (Did you know that Python loops have else clauses? It’s true!)
  • There are no compound comparisons:
x = a < b < c

can be simplified to

x = a < b
if x:
  x = b < c

and now the graph accumulator does not need to worry about compound comparisons, because there aren’t any.

  • Similarly there are no lambdas, no function annotations, and so on. All the convenient syntactic sugars are desugared into their more fundamental form.

The simplified language is definitely not a language I’d like to write a long program in, but it is a language that is easy to analyze programmatically. Thus far, the transformations make the program longer and simpler, but do not change its meaning; if we compiled and ran the simplified version of the program it should do the same thing as the normal version.

Next time on FAIC: let’s get into the details of how the AST-to-AST transformation code works. I had a lot of fun writing it.

Bean Machine Retrospective, part 5

Let’s take another look at the “hello world” example and think more carefully about what is actually going on:

def fairness():
  return Beta(2,2)

def flip(n):
  return Bernoulli(fairness())

heads = tensor(1)
tails = tensor(0)

observations = { flip(0) : heads, ... flip(9): tails }
queries = [ fairness() ]
num_samples = 10000
results = BMGInference().infer(queries, observations, num_samples)
samples = results[fairness()]

There’s a lot going on here. Let’s start by clearing up what the returned values of the random variables are.

It sure looks like fairness() returns an instance of a pytorch object — Beta(2,2) — but the model behaves as though it returned a sample from that distribution in the call to Bernoulli. What’s going on?

The call doesn’t return either. It returns a random variable identifier, which I will abbreviate as RVID. An RVID is essentially a tuple of the original function and all the arguments. (If you think that sounds like a key to a memoization dictionary, you’re right!)

This is an oversimplification, but you can imagine for now that it works like this:

def random_variable(original_function):
  def new_function(*args):   
    return RVID(original_function, args)
  # The actual implementation stashes away useful information
  # and has logic needed for other Bean Machine inference algorithms
  # but for our purposes, pretend it just returns an RVID.
  return new_function

def fairness():
  return Beta(2,2)
fairness = random_variable(fairness)

The decorator lets us manipulate function calls which represent random variables as values! Now it should be clear how

queries = [ fairness() ]

works; what we’re really doing here is

queries = [ RVID(fairness, ()) ]

That clears up how it is that we treat calls to random variables as unique values. What about inference?

Leaving aside the behavior of the decorators to cause random variables to generate RVIDs, our “hello world” program acts just like any other Python program right up to here:

results = BMGInference().infer(queries, observations, num_samples)

Argument queries is a list of RVIDs, and observations is a dictionary mapping RVIDs onto their observed values. Plainly infer causes a miracle to happen: it returns a dictionary mapping each queried RVID onto a tensor of num_samples values that are plausible samples from the distribution of the posterior of the random variable.

Of course it is no miracle. We do the following steps:

  • Transform the source code of each queried or observed function (and transitively their callees) into an equivalent program which partially evaluates the model, accumulating a graph as it goes
  • Execute the transformed code to accumulate the graph
  • Transform the accumulated graph into a valid BMG graph
  • Perform inference on the graph in BMG
  • Marshal the samples returned into the dictionary data structure expected by the user

Coming up on FAIC: we will look at how we implemented each of those steps.

I want toast

I’ll get back to Bean Machine and Beanstalk in the next episode; today, a brief diversion to discuss a general principle of language design and congratulate some of my former colleagues.

Back when we were all at Waterloo, a bunch of us were sitting around the comfy lounge, eating C&D donuts and talking about possible future user interfaces for “smart appliances”; this was in the 1990s, long before ubiquitous voice recognition, compute and networking that we take for granted today. I think it was my friend Peter who pointed out that there is a “use vs mention” problem in voice-driven home appliances. We want to be able to mention the toaster in conversation without that causing the toaster to be used.

This was a prescient observation. I have not verified this for myself, but I have heard from reliable sources that in some versions of Amazon’s Alexa smart speaker you could have this conversation:

Human: Alexa, add an item to my to-do list.

Alexa: OK, what do you want to add?

Human: Alexa, what is already on my to-do list?

Alexa: OK, adding Alexa what is already on my to-do list to your to-do list.

(At least the response was not “OK, adding an item to your to-do list”. Thank goodness for small mercies.)

The use-mention distinction is tricky, but I pointed out that you can make at least a little progress by parsing for desires rather than directives. Instead of building a system where we issue a series of commands to our appliances, we could instead give a description of the desired state — I want a pastrami sandwich on toast, no mustard — and let the allegedly smart appliances figure out how to achieve that goal.

(I once told this story to a friend who worked on Alexa; he laughed and pointed out that Alexa is certainly not “smart” and in fact is only barely “obedient”. We have a long way to go.)

This was all on my mind today because the Networked Systems Design and Implementation Symposium has just accepted a paper by some of my Meta colleagues about a declarative domain specific programming language for specifying and achieving desired router states in a datacenter. I reviewed some of their earlier work and made a few small suggestions for improvements to the submission. As a courtesy they listed me as an author; I wish to emphasize that I did none of the actual work whatsoever. 🙂

A huge problem that teams managing datacenters face every day is safely making configuration changes; this is most obvious on days when it goes horribly, horribly wrong. Even if you don’t know anything about router configuration — and I believe packets are moved around the wired network by gnomes and the wireless network by fairies — it’s pretty clear that any time you’re taking a router offline for maintenance or replacement, several things need to happen. You need to have a way to undo the operation back to a known good state if something goes wrong halfway through. You need to move user traffic to another router while still maintaining the ability for administrators to communicate with the target router. And so on. Writing imperative code that ensures that a router whose configuration is changing is always in a controllable state is maybe too easy to get wrong.

Instead, we can take the “I want pastrami on toast” approach. Say what you want to happen under what set of constraints; write a compiler that turns those intentions into an imperative program that you can prove works.

The goal of designing domain-specific languages is to create a language where the problems of the domain are naturally expressed in the jargon of the domain — in this example we can look at the grammar of the language in figure 1 of the paper and immediately see that the language is about describing paths, locations, topologies, routing, propagation, preferences and policies. This pays dividends in understandability of the code by domain experts, and has many other nice benefits.

For example, you can then write a static analyzer which looks for logical errors at the domain level in the code, which is much harder to do with, say, an imperative Python script that makes configuration changes. Even better, the team also has a simulator where they can simulate different network conditions, compile and execute configuration changes, and see if the simulated network is reconfigured as expected; there is no need to “test in production” when making any configuration changes.

Many thanks to the other authors who reached out to me and my manager Walid; I learned a lot from our short collaboration. It is delightful to see an example of a successful, innovative real-world at-scale application of the power of DSLs. Congratulations on being accepted to the symposium!

Next time on FAIC: Back to Bean Machine.

Bean Machine Retrospective, part 4

Did I actually build a compiler? Yes and no.

Traditionally we think of a compiler as a program which takes as its input the text of a program written in one language (C#, say), and produces as its output an “equivalent” program written in another language (MSIL, say).  A more modern conception is: a compiler is a collection of code analysis services, and one of those services is translation.

I made a great many assumptions when embarking on this project. Some of the more important ones were as follows:

Assumption: we must implement everything necessary for a source-to-source translation, but that is not enough. The primary use case of Beanstalk is outputting samples from a posterior, not outputting an equivalent program text. Translation may be a secondary use case, but we will not succeed by stopping there.

We’re building a compiler in order to build a better calculator.

Static and dynamic analysis

Traditional translating compilers such as the C# compiler perform a “static” analysis of the code; that is, they analyze only (1) the text of the program and (2) metadata associated with already-compiled artifacts such as libraries. The code is not executed in a static analysis.

“Dynamic” analysis is by contrast an analysis of code while the code is running. 

Assumption: correct static analysis of Python is impossible for at least two reasons. 

1) Python is an extremely dynamic language; it is difficult to definitively know what any program identifier refers to, for example, without actually running the program to find out.

2) The model may depend upon non-stochastic model elements that are unknowable by static analysis. Model parameters or observed values can (and will!) be read from files or queried from databases, and those values must be present in the graph that we deduce.

We must build a dynamic analyzer; we’re going to run a modified version of the model code to find out what it does.

Pure functions, acyclic graph

For our purposes we define a pure function as one that:

  • Always terminates normally; it does not go into an infinite loop or throw an exception
  • Is idempotent: it always produces the same output when given the same input
  • Its action does not depend on mutating global state
  • Its action does not mutate global state

Examples of impure functions are easily produced:

mean = 1.0
def normal():
  global mean
  mean = mean + 1.0 # UH OH
  return Normal(mean, 1.0)

Or, just as bad:

mean = tensor([1.0])
def normal():
  global mean
  mean[0] += 1.0
  return Normal(mean, 1.0)


  • Model builders will write pure functions. In particular, model builders will not mutate a tensor “in place” such that the mutation causes a change of behaviour elsewhere in the model
  • Beanstalk is not required to detect impure functions, though it may
  • Beanstalk may memoize model functions

Also: a Bayesian network never has cycles; the distribution of each node depends solely on the distributions of its ancestor nodes.


  • Model builders will write models where no random variable depends upon itself, directly or indirectly.
  • Beanstalk is not required to detect cycles (though in practice, it does)

Note that a model may contain recursion if the recursion terminates. This is legal:

def normal(n):
  if n <= 0:
    mean = 0.0
    mean = normal(n-1)
  return Normal(mean, 1.0)
queries = [ normal(0) ]
observations = { normal(2): tensor(0.0) }

Next time on FAIC: Having made these and other assumptions, I embarked upon an implementation. But before I can explain the implementation, I should probably answer some nagging questions you might have about what exactly that random_variable decorator does.

Bean Machine Retrospective, part 3

Introducing Beanstalk

Last time I introduced Bean Machine Graph, a second implementation of the PPL team’s Bayesian inference algorithm. We can compare and contrast the two implementations:

  • BMG emphasizes mechanisms; BMG programs are all about constructing a graph. BM emphasizes business logic
  • A point which I did not emphasize yesterday but which will come up again in this series: BMG requires many node constructions be annotated with semantic types like “probability” or “positive real”. BM lacks all “type ceremony”.
  • BMG programs represent operations such as addition as verbose calls to node factories. BM concisely represents sums as “x + y ”, logs as “x.log()”, and so on

In short, the BMG user experience is comparatively not a great experience for data scientists, in the same way that writing in machine code is not great for line-of-business programmers. We wished to automatically convert models written in Bean Machine into equivalent programs which constructed a graph, and then got the improved inference performance of BMG.

OK… how?

A program which translates one language to another is called a compiler, and that’s where I came in. We code named the compiler “Beanstalk”. I started working on Beanstalk in October of 2019 given this brief:

Construct a compiler which takes as its input a Bean Machine model, queries and observations, and deduces from it the equivalent BMG graph (if there is one).

If we can solve that problem effectively then we get the best of both worlds. Data scientist users can write models using a pleasant Python syntax easily integrated into their existing PyTorch workflows, but get the inference performance afforded by BMG. The “concept count” — the number of APIs that the data scientist must understand — increases only slightly, but their workflows get faster. That is a win all around.

Does it work today?

Yes. Barely. If for some strange reason you want to play around with it, you can obtain the code from github. The error messages are terrible, the compiler is “over fit” to specific models that we needed to get compiled, and it is quite slow, but for simple models it does work. At the time the team was shut down we were actively extending both BMG and the compiler to handle more general models efficiently.

Early on we realized that of course the compiler is a means to an end, not an end in itself. What users really wanted was a more efficient inference engine that just happened to use BMG as its back end, so that’s the main API. You can call BMGInference().infer(), passing in queries, observations, and so on, just like any other Bean Machine inference engine; the compiler will analyze the source code of the queries, produce a graph, and call BMG’s inference engine to produce samples from the queries posteriors.

It has a few other useful APIs as well that are more like a traditional source-in-source-out compiler.

  • BMGInference().to_graph() produces a graph object without starting inference.
  • BMGInference().to_python() and to_cpp() produce Python and C++ source code fragments which construct the graph.
  • BMGInference().to_dot() produces a program in the DOT graph visualization language; that’s what I used to make the graph diagrams in this series.

Next time on FAIC: What were my initial assumptions when tasked with writing a “compiler” that extracts a graph from the source code of a model? And is it really a compiler, or is it something else?

Bean Machine Retrospective, part 2

Introducing Bean Machine Graph

Bean Machine has many nice properties:

  • It is integrated with Python, a language often used by data scientists
  • It describes models using the rich, flexible pytorch library
  • Inference works well even on models where data is stored in large tensors

I’m not going to go into details of how Bean Machine proper implements inference, at least not at this time. Suffice to say that the implementation of the inference algorithms is also in Python using PyTorch; for a Python program it is pretty fast, but it is still a Python program.

We realized early on that we could get order-of-magnitude better inference performance than Bean Machine’s Python implementation if we could restrict values in models to (mostly) single-value tensors and a limited set of distributions and operators.

In order to more rapidly run inference on this set of models, former team member Nim Arora developed a prototype of Bean Machine Graph (BMG).

BMG is a graph-building API (written in C++ with Python bindings) that allows the user to specify model elements as nodes in a graph, and relationships as directed edges. Recall that our “hello world” example from last time was:

def fairness():
  return Beta(2,2)

def flip(n):
  return Bernoulli(fairness())

That model written in BMG’s Python bindings would look like this: (I’ve omitted the queries and observations steps for now, and we’ll only generate one sample coin flip instead of ten as in the previous example, to make the graph easier to read.)

g = Graph()
two = g.add_constant_pos_real(2.0)
beta = g.add_distribution(
  [two, two])
betasamp = g.add_operator(OperatorType.SAMPLE, [beta])
bern = g.add_distribution(
flip0 = g.add_operator(OperatorType.SAMPLE, [bern])

That’s pretty hard to read. Here’s a visualization of the graph that this code generates:

These graphs are properly called Bayesian network diagrams, but for this series I’m just going to call them “graphs”.

I should say a little about the conventions we use in this graphical representation. Compiler developers like me are used to decomposing programs into abstract syntax trees. An AST is, as the name suggests, a tree. ASTs are typically drawn with the “root” at the top of the page, arrows point down, “parent nodes” are above “child nodes”, and operators are parents of their operands. The AST for something like x = a + b * c would be

where X, A, B, C are identifier nodes.

Bayesian network diagrams are just different enough to be confusing to the compiler developer. First of all, they are directed acyclic graphs, not trees. Second, the convention is that operators are children of their operands, not parents.

The best way I’ve found to think about it is that graphs show data flow from top to bottom. The parameter 2.0 flows into an operator which produces a beta distribution — twice. That distribution flows into a sample operator which then produces a sample from its parent distribution. That sampled value flows into an operator which produces a Bernoulli distribution, and finally we get a sample from that distribution.

If we wanted multiple flips of the same coin, as in the original Python example, we would produce multiple sample nodes out of the Bernoulli distribution.

BMG also has the ability to mark sample nodes as “observed” and to mark operator nodes as “queried”; it implements multiple inference algorithms which, just like Bean Machine proper, produce plausible samples from the posterior distributions of the queried nodes given the values of the observed nodes. For the subset of models that can be represented in BMG, the performance of the inference algorithms can be some orders of magnitude faster than the Bean Machine Python implementation.

Summing up: Our team had two independent implementations of inference algorithms; Bean Machine proper takes as its input some decorated Python methods which concisely and elegantly represents models in a highly general way using PyTorch, but the inference is relatively slow. Bean Machine Graph requires the user to write ugly, verbose graph construction code and greatly restricts both the data types and the set of supported operators, but uses those restrictions to achieve large inference speed improvements.

Next time on FAIC: Given the above description, surely you’ve guessed by now what the compiler guy has been doing for the last three years on this team full of data scientists! Can we automatically translate a Bean Machine Python model into a BMG graph to get BMG inference performance without sacrificing representational power?