Excessive explanation, part twenty-four

We have four kinds of things in Exp: identifiers, function calls, lambdas and lets, and we have four rules in type inference. Finally, the last recursive rule of the type inference algorithm:

If e is let x = e1 in e2 then let W(A, e1) = (S1, τ1) and
W(S1 Ax ∪ {x:S1 A(τ1), e2) = (S2, τ2); then S = S2 S1 and τ = τ2.

Again let’s look at an example. Let’s say we want to infer a type for

let double_it = λx.(plus x x) in double_it foo

where we have plus as before, and foo:int in our set of assumptions. We begin by recursing on λx.(plus x x) with the current set of assumptions. Note that we do not want to remove double_it from our set of assumptions if it is there because the expression might use double_it, meaning some other definition. If it does, then it uses the definition already in the assumptions; the “new” double_it is only in scope after the in.

Type inference tells us that the expression is of type int → int, so we add an assumption double_it:int → int to our set of assumptions, and compute the type of double_it foo under the new set of assumptions, which of course says that it is int. Since the type of a let expression is the type of the second expression, we’re done.

The reason we need the closure is: suppose type inference tells us that the first expression is of type β→β, and β is free in A. (Remember, we always get a type, not a type scheme.) For the purposes of inferring the type of the right side of the in, we want the expression to have type scheme ∀β β→β.

NOTE: When any of the conditions above is not met W fails.

Though this is true, I think it would have been better to call out the two places in the algorithm where the failure can occur. They are in rule (i), which requires that an identifier have a type in the set of assumptions, and in rule (ii) which requires that type unification succeeds. Rules (iii) and (iv) never fail of their own accord; they only fail when the recursive application of W fails.

Next time: more sketch proofs!


Excessive explanation, part twenty-three


Last time we gave the base case of our recursive algorithm. Today, two of the three non-base cases.

But before we get to them, another annoyance.

The first time I read this algorithm carefully I got to the second step of the inference algorithm and said to myself wait, that cannot possibly be right. I stared at that thing for probably fifteen minutes trying to make sense of it before realizing that there is a footnote at the bottom of the page that says

There are obvious typographic errors in parts (ii) and (iv)
which are in the original publication. I have left the correction
of these as an easy exercise for the reader.

Now, I am hardly one to talk as I leave stuff as exercises for the reader all the time on Stack Overflow. But really? There’s a typo in the paper, you know about it, and you leave the typo in the paper as a challenge? VEXING.

I will remove the typos in my presentation of the algorithm. Anyone who wants the frankly rather dubious challenge of attempting to fix the typos themselves can go read the original paper.


If e is e1 e2 then let W(A, e1) = (S1, τ1) and W(S1 A, e2) = (S2, τ2) 
and U(S2 τ1, τ2 → β) = V where β is new; then S = V S2 S1 and τ = V β.

Let’s unpack this, starting with just the types.

We have a function application e1 e2. We recursively determine that e1 is some function type τ1 and e2 is its argument of type τ2.

Therefore τ1 had better be a function type that takes a τ2 and returns something of some type; call it β. Whatever β is must be the type of the function invocation expression. To find it, we unify τ1 with τ2 → β and we get a substitution that tells us what type to substitute for β. That’s our resulting type.

Let’s take a look at a trivial example. Suppose our assumptions are x:int and i:∀α α→α, and our expression to be inferred is i x. We recursively apply type inference to determine that τ1 is β → β — remember, we need a new type variable for the type τ1 — and τ2 is of course int. Substitutions S1 and S2 are empty substitutions. We unify β → β with int → γ — again we need a new type variable used nowhere else — and get the subtitution V = [β/int, γ/int]. Our result type is V γ which is obviously int.

We can deduce that passing an int to an identity function returns int; awesome.

If e is λx.e1 then let β be a new type variable and
W(Ax ∪ {x:β}, e1) = (S1, τ1); then S = S1 and τ = S1 β → τ1. 

Let’s think of an example. Suppose we have an assumption plus:int → int → int and the expression λx.(plus x x). So we make a new assumption x:β, and run algorithm W on plus x x. That tells us that the expression is of type int, and also returns a substitution [int/β].

The type of the lambda is therefore [int/β](β → int), which obviously is int → int.

Next time: inferring the type of a “let” expression.

Excessive explanation, part twenty-two

All right, at long last, the type inference algorithm itself!

Algorithm W.

W(A, e) = (S, τ) where

This is just a reminder that what we’re doing here is taking a set of assumptions that give types to identifiers, and an expression that might contain some of those identifiers, and from that we deduce a substitution on the identifier types, and a type for the expression.

This reminder however does not remind us that the algorithm can also tell us that there is no type that can be inferred for the expression given those assumptions; properly we ought to be saying that we produce a substitution and a type, or an error.

If e is x and there is an assumption x:∀α1, ... ,αn τ' in A then 
S = Id and τ = [βii]τ' where the βi are new. 

Of course this is the identity (empty) substitution, not the set 
Id of identifiers.

Just a small rant here. First off, we already have a notation for substitutions which is a list of τ/α pairs in square brackets. The notation [ ] is a perfectly concise way to describe the identity type substitution. So why did the authors chose to re-use “Id” to mean “the empty substitution” when it already means “the set of identifiers”, thereby prompting them to insert a note to the effect that this notation is confusing? In code reviews I tell people to never comment confusing code; fix it so it is not confusing! A two-character fix here would ensure that the clarifying note is unneeded.

I also cannot fathom why we’ve gone from notating a type scheme with multiple bindings as ∀α1…αn at the beginning of the paper to ∀α1, … ,αn at the end. Before there were no commas; now there are commas. Small things like this are an impediment to understanding. Attention to detail is important in a paper, and in code.

Rant over. More rants next time!

As you might have guessed, this is going to be a recursive algorithm. Therefore we start with the simplest possible base case. If the expression we’re trying to deduce is an identifier whose type scheme is given in the assumptions, then we just use that type.

There are a few subtleties to consider carefully though.

If we have expression x and an assumption like x:int then it is easy; x is deduced to be int. Similarly if we have a free type variable; if we have x:α then we simply deduce that x is, of course, α.

The tricky case is where we have an assumption that x is quantified. Maybe we have assumption x:∀α1α2 α1→α2. We do not want to deduce that the type of the expression is the type α1→α2 because α1 or α2 might appear free elsewhere, and we don’t want to confusingly associate the type of x with those free variables. We need to make new free type variables, so we deduce that the type of x is β1→β2 where these are brand new not previously used type variables.

When I joined the C# language design team the major design work for LINQ was done but there were still a few details left. The existing generic method type inference algorithm was insufficient to deal with a world that included lambdas, so we needed to fix it up. Complicating matters was the fact that the existing code that did type inference was difficult to follow and did not bear a very close resemblance to the specification in its action. (Its results were fine, but it was hard to look at the code and see how the specification had been translated into code.)

Given that we were making major changes to the algorithm, I decided to scrap the existing implementation and start fresh with an implementation that was (1) very clearly implementing exactly what was in the specification, so that as we tweaked the spec during the design it would be clear what needed to change in the implementation, and (2) written in a style similar to how I would want to write the same algorithm in C#; I was anticipating the future existence of Roslyn and I wanted to make that transition smooth.

And I really did start from scratch, including rewriting the type unification and substitution algorithms from scratch as well.

The first major mistake that got caught after I checked it in was:

void M<T, U>(T t, U u, bool flag) 
  if (flag) 
    M(flag, t, false);

What types should be inferred on the recursive call? Plainly this should be the same as M<bool, T>(flag, t, false);. My first implementation deduced that this was M<bool, bool>. You can probably guess why. The type inferencer deduced that T is bool and U is T, and the substituter said OK, if T is bool and U is T then U is also bool!

The T that is being substituted in for U is the T of the outer invocation of M. The T of the inner invocation of M is a different T entirely, but I used the same object to represent both! When you are implementing these algorithms you have to be super careful about not granting referential identity to things that look the same but are in fact just placeholders for different activations.

Fortunately we had a massive set of test cases which found my bug in short order.

Next time we’ll look at the non-base cases, and some irritating typos.

Excessive explanation, part twenty-one

All right; we have a deductive system whereby the types of expressions can be justified with derivations. What we want is the ability to take an expression and the typing assumptions of an environment and deduce the type scheme of that expression:

6 The type assignment algorithm W

The type inference system itself does not provide an easy 
method for finding, given A and e, a type-scheme σ such that 
A ⊢ e:σ 


We now present an algorithm W for this purpose. In 
fact, W goes a step further. Given A and e, if W succeeds it
finds a substitution S and a type τ, which are most general 
in a sense to be made precise below, such that

                        S A ⊢ e:τ.

Note that this is a bit subtle. Our inputs are a set of assumptions A — what are the type schemes of all the relevant identifiers? — and an expression e, whose type we wish to know. Our outputs are a substitution — what types should some of the type variables have? — and a type. Not, interestingly, a type scheme; a type.

To define W we require the unification algorithm of Robinson [6].

Proposition 3 (Robinson). There is an algorithm U which, given 
a pair of types, either returns a substitution V or fails; further

(i) If U(τ,τ') returns V, then V unifies τ and τ', i.e. V τ = V τ'.

Note that there was a typo in the PDF of the paper here; it said V τ = τ'. I’ve corrected it here.

(ii) If S unifies τ and τ' then U(τ,τ') returns some V and 
there is another substitution R such that S = RV .

Moreover, V involves only variables in τ and τ'.

What does this mean?

The first claim is the important one. Suppose we have two types. Let’s say β→β and int→int. The unification algorithm would find a substitution on the two types that makes them equal; obviously the substitution is [int/β]. That’s an easy one; you can I’m sure see how these could get quite complicated.

A slightly more complicated example illustrates that both types might need substitution: Consider unifying int → γ with β → β. The substitution V = [int/β, int/γ] does make them equal to each other under substitution.

The unification algorithm can return “there is no such substitution” of course. There is no substitution that can turn β→β into int→(int→int), for example.

I’m not going to go through the unification algorithm in detail; perhaps you can deduce for yourself how it goes. I will however say that unification of types in the ML type system is simple compared to unification in a type system like C#’s. In C# we must solve problems like “I have an IEnumerable<Action<Tiger, T>> and I need a unification with List<blah blah blah... and so all the mechanisms of covariant and contravariant reference conversions come into play. In the C# type inference algorithm we not only need to find substitutions, we need to put bounds on the substitutions, like “substitute Mammal-or-any-subtype for T”.

The second claim notes that the substitution that unifies two types may not be unique, but that this fact doesn’t matter. That is, if V is a substitution that unifies two types found by the algorithm, and S is a different substitution that also unifies the types, then there exists a way to transform V into S.

We also need to define the closure of a type τ with respect to
assumptions A;

     A(τ) = ∀α1, ... ,αn τ

where α1, ... ,αn  are the type variables occurring free τ in
but not in A.

I am not super thrilled with this notation.

Basically we are saying here that given a set of assumptions A and a type τ that might contain free type variables, we can identify the free type variables not already free in A, and quantify them to make an equivalent type scheme.

Now, we’ve already said that there is a function U that takes two types and produces a substitution, and we’re going to say that there is a function W that takes a set of assumptions and an expression and produces a substitution and a type. So why on earth do the authors not simply say “there is a function B that takes a set of assumptions and a type and produces a type scheme” ???

I genuinely have no idea why authors of these sorts of papers do not adopt a consistent approach to functions and their notation. Rather than taking a straightforward approach and notating it B A τ or B(A,τ), they notate it as A-with-a-bar-on-top(τ), for no good reason I am aware of.

I started this series because I am asked fairly frequently about how to read the notation of these papers, and boy, do they ever not make it easy. Even a little thing like whether function applications are notated U(τ,τ') as the unification function is notated, or S τ as substitutions are notated, is frequently inconsistent. And now we have yet another kind of function, this time notated as, of all things, a bar over what ought to be one of the arguments.

Sigh. I’ll stick with the notation as it is describe in the paper.

Next time: at long last, the base case of algorithm W!

Happy anniversary Visual Studio!

Oh my goodness, I cannot believe it has been twenty years since I got this t-shirt:

Happy anniversary Visual Studio, and congratulations to the whole team on VS 2017, and a particular shout-out to my Roslyn friends. I can’t wait to take the new C# and VB out for a spin. Nicely done everyone.

Even readers who were following Microsoft dev tools in 1997 would be forgiven for not knowing what the “Da Vinci” is on my shirt. It is a little-known fact that I was on the Da Vinci team for about five minutes. I’ll tell a funny story about that, but some other time.

Sorry for radio silence this past week; it has been crazy busy at work. I’ll finish off my excessive explanation series soon!

Excessive explanation, part twenty

Summing up the story so far: we have a formal system that allows us to start with the types of identifiers in an environment, and provide a series of deductive steps that ends the type of an expression. This is called a “derivation”.

Last time we discussed the “proof” that a “compile-time” series of deductions about the type of an expression is also a guarantee that the expression will always have a value compatible with that type at “run time”. I say “proof” because of course it was a proof by assertion that there exists an inductive proof.

The goal of this paper is to provide an algorithm that takes an expression and a set of typing assumptions for an environment, and produces a derivation that concludes with the type of the expression.

There are two more sketch proofs in this section of the paper, which we’ll briefly review.

We will also require later the following two properties of
the inference system.

Proposition 2. If S is a substitution and A ⊢ e:σ then S A ⊢ e:S σ. 
Moreover if there is a derivation of A ⊢ e:σ : of height n then 
there is also a derivation of S A ⊢ e:S σ of height less [than] or
equal to n.

Proof. By induction on n.

An example might help here. Suppose we have a derivation for

{x:α} ⊢ λy.x:∀β β→α

Now we have a substitution S that means “change all the α to int”. The we can apply the substitution to both sides and still have a true statement:

{x:int} ⊢ λy.x:∀β β→int

Now, remember what that turnstile means. It means that there is actually a derivation: a finite sequence of applications of our six rules, that starts with a bunch of assumptions and ends with “those assumptions entail that this expression has this type”. So if “apply the substitution to both sides” is in fact legal, then there must be a derivation of the substituted type from the substituted assumptions.

The proposition here claims that not only is there such a derivation, but moreover that there is a derivation of equal or shorter length! Substitution apparently can make a derivation shorter, but you never need to make it longer.

The proof-by-assertion here again says that you can prove this by induction, this time by “ordinary” induction on the integer n, the length of the derivation. Let’s sketch that out.

The base case is vacuously true: there are no derivations of length zero, and everything is true of all members of the empty set.

Now, suppose that you have a derivation of A ⊢ e':σ' that is k steps long. Suppose that there is a derivation of S A ⊢ e':S σ' that is k steps long or shorter. And suppose by adding just one more step to our original derivation, we can derive A ⊢ e:σ. Now all we must show is that we can add zero or one additional steps to the derivation of S A ⊢ e':S σ' in order to deduce S A ⊢ e:S σ. There are only six possible steps that could be added, so all we have to show is that each of those possible steps still works under a substitution, or can be omitted entirely.

I’m not going to do that here as it is tedious; noodle around with it for a bit and convince yourself that it’s true.

One more theorem, and then we’re done with this section:

Lemma 1. If σ > σ' and Ax ∪ {x:σ'} ⊢ e:σ0 then also
Ax ∪ {x:σ} ⊢ e:σ0

Incidentally, why have we had two propositions and a lemma?

Propositions, lemmas and theorems are all the same thing: a claim with a formal justification that proves the claim. We use different words to communicate to the reader how important each is.

Propositions are the least interesting; they are typically just there to prove some technical point that is necessary later on in a larger theorem. The proofs are often just hand-waved away as too tedious to mention, as they have been here.

Lemmas are more interesting; they are also in the service of some larger result, but are interesting enough or non-trivial enough that the proof might need to be sketched out in more detail.

Theorems are the major interesting results of the paper, and are all useful in their own right, not just as stepping stones to some larger result.

These lines are blurry of course, since the difference is entirely in emphasizing to the reader what we think is important.

What is this lemma saying? Basically, if we can derive the type of an expression given a small type bound on an identifier, then we can come up with a derivation of the same type for the same expression even if we have a larger bound. For example, suppose we have a derivation for

{y:int, x:int→int} ⊢ (x y):int

Then we can also find a derivation that gives the same result even with a larger type for x:

{y:int, x:∀β β→β} ⊢ (x y):int

Again, remember what the turnstile is saying: that we have a finite sequence of applications of our six rules that gets us from the first set of assumptions to the first type. For this lemma to be true, we must also have a finite sequence of rules that gets us from our new set of assumptions to that same type. The proof sketches out how to create such a sequence:

Proof. We construct a derivation of Ax ∪ {x:σ'} ⊢ e:σ0 from
that of Ax ∪ {x:σ'} ⊢ e:σ0 by substituting each use of TAUT 
for x:σ' with x:σ, followed by an INST step to derive x:σ'. 
Note that GEN steps remain valid since if α occurs free in 
σ then it also occurs free in σ'.

So at most the new derivation gets longer by a finite number of steps, less than or equal to the number of TAUT steps in the original derivation.

And that’s it for section five! Next time we’ll start looking at the algorithm that actually finds the type of an expression. We’ll start by discussing type unification.

Excessive explanation, part nineteen

We’ve just seen that we can use our deductive logic system with its rules TAUT, GEN, and so on, to start from knowing nothing, and end with a correct derivation of the type scheme of a complex expression. But here’s an important question.

We know that A ⊢ e:σ means that there exists a derivation in our deductive system that allows us to conclude that the given expression is of the given type scheme when the given assumptions are true.

And we know that A ⊨ e:σ means that no matter what environment with values we assign to the identifiers named in assumptions A, the value computed for expression e will be a value of type scheme σ.

What we would really like to know is that A ⊢ e:σ logically implies A ⊨ e:σ. That is, if we have a derivation at compile time that the type scheme is correct in our deduction system, that there will be no type violation at runtime! Recall that earlier we defined the “soundness” of a logical deduction as a logically correct derivation where we start with true statements and end with true statements; what we want is to know that our logical system of derivation of types is “sound” with respect to the semantics of the language.

The paper of course provides a detailed proof of this fact:

The following proposition, stating the semantic soundness
of inference, can be proved by induction on e.

Proposition 1 (Soundness of inference). If A ⊢ e:σ then A ⊨ e:σ.

That’s the entire text of the proof; don’t blink or you’ll miss it. It can be proved, and that’s as good as proving it, right? And there’s even a big hint in there: it can be proved by induction.

Wait, what?

Inductive proofs are a proof technique for properties of natural numbers but e is an expression of language Exp, not a number at all!

Let’s consider for a moment the fundamental nature of an inductive proof. Inductive proofs on naturals depend on the following axioms: (by “number” throughout, I mean “non-negative integer”.)

  • Zero is a number.
  • Every number k has a successor number called k + 1.
  • If something is true for zero, and it being true for k implies it is true its successor, then it is true of all numbers.

So the idea is that we prove a property for zero, and then we prove that if the property holds for k, then it also holds for k+1. That implies that it holds for 0+1, which implies that it holds for 1+1, which implies that it holds for 2+1, and so on, so we deduce that it applies to all numbers.

This is the standard form in which inductive reasoning is explained for numbers, but this is not the only form that induction can take; in fact this kind of induction is properly called “simple induction”. Here’s another form of induction:

Prove a property for zero, and then prove that if the property holds for k and every number smaller than k, then it also holds for k+1.

This form of induction is called “strong induction”. Simple and strong induction are equivalent; some proofs are easier to do with simple induction and some are easier with strong, but I hope it is clear that they are just variations on the technique.

Well, we can make the same reasoning for Exp:

  • An identifier is an expression.
  • Every other kind of expression — lambdas, lets, and function calls — is created by combining together a finite number of smaller expressions.
  • Suppose we can prove that a property is true for the base case: identifiers. And suppose that we assume that a property is true for a given expression and all of the finite number of smaller expressions that it is composed of. And suppose we can prove from the assumption that then the property holds on every larger expression formed from this expression. Then the property holds on all expressions.

That’s just a sketch to get the idea across; actually making a formally correct description of this kind of induction, and showing that it is logically justified, would take us very far afield indeed. But suffice to say, we could do so, and are justified in using inductive reasoning on expressions.

Before we go on I want to consider a few more interesting facts about induction. Pause for a moment and think about why induction works. Because I was sloppy before; I left some important facts off the list of stuff you need to make induction work.

Suppose for example I make up a new number. Call it Bob. Bob is not the successor of any number, and it is not zero either. Bob has a successor — Bob+1, obviously. Now suppose we have a proof by induction: we prove that something is true for 0, we prove that if it is true for k then it is true for k+1. Have we proved it for all numbers? Nope! We never proved it for Bob, which is not zero and not the successor of any number!

Induction requires that every number be “reachable” by some finite number of applications of the successor relationship to the smallest number, zero. One of the axioms we need to add to our set is that zero is the only number that is the successor of no number.

The property that expressions in Exp share with integers is: every expression in Exp is either an identifier, or is formed from one or more strictly smaller expressions. There is no “Bob” expression in Exp; there’s no expression that we can’t get to by starting from identifiers and combining them together via lets, lambdas and function calls. And therefore if we can start from identifiers and work our way out, breadth-first, to all the successor expressions, then we will eventually get all expressions. Just as if we start from zero and keep adding one, we eventually get to all numbers.

Induction is really a very powerful technique; it applies in any case where there are irreducible base cases, and where application of successorship rules eventually gets to all possible things.

Back to the paper.

Basically the way the proof would proceed is: we’d start with the base case: does our deductive system guarantee that if A ⊢ x:σ for an identifier x then A ⊨ x:σ? That’s our base case, and it’s pretty obviously true. Suppose we have the assumption A = {x:σ}. By assumption, the value of x at runtime must be from type scheme σ. By the TAUT rule, we can deduce that A ⊢ x:σ. And plainly A ⊨ x:σ is true; by assumption x will have a value from σ! Our base case holds.

Next we would assume that A ⊢ k:σ implies A ⊨ k:σ, and then ask the question “can we show that for every possible successor k’ of k, A ⊢ k':σ' implies A ⊨ k':σ'“? If we can show that, then by the inductive property of expressions, it must be true for all expressions that A ⊢ e:σ implies A ⊨ e:σ.

The actual proof would be a bit trickier than that because of course some of the successors require two expressions, and so we’d have to reason about that carefully.

These proofs are extraordinarily tedious, and you learn very little from them. Thus the proof is omitted from the paper.

There are two more sketch proofs in this section. We’ll cover them next time, and then that will finish off the section on the deductive system.