What is up with transparent identifiers? Part one

A query expression in C# is, as you probably know, just a syntactic sugar for a bunch of method calls. For example,

from customer in customers
where customer.City == "London"
select customer.LastName

is a syntactic sugar for

customers.
  Where(customer => customer.City == "London").
  Select(customer => customer.LastName)

A great many queries are straightforward variations on this pattern: the query is translated into a series of method calls where the arguments are lambdas formed from the range variables and expressions in the query. However, some of them are weird. This one is straightforward:

from parent in parents
from child in parent.Children
select child.LastName

The meaning here is “take the sequence of parents, map each parent onto a sequence of children, concatenate all those sequences, and then project from the concatenation the sequence of last names”. That’s equivalent to this method call:

parents.SelectMany(
  parent => parent.Children, 
  (parent, child) => child.LastName)

The first lambda maps each element of the original sequence onto a new sequence. The second lambda takes each element of the new sequence and the element from which it was created, and does the projection. Note that parent is in scope in the final select clause; we’re not using that fact here, but it is important.

Now let’s make a small variation, and we’ll see that things get very not-straightforward very quickly:

from parent in parents
from child in parent.Children
where child.Height > parent.Height
select child.LastName

Query translation is a multi-step process where each step along the way turns some of the query elements into method calls and lambdas. Since progress is always made towards the goal of eliminating the query altogether, and there is never any ambiguity about what rule to apply next, we know that we can just keep on applying rules until we’re done. The C# specification says that the first step of the translation from this query into method calls is to translate it into this query:

from * in parents.
  SelectMany(
    parent => parent.Children, 
    (parent, child) => new { parent, child })
where child.Height > parent.Height
select child.LastName

And the second step is to translate that into:

from * in parents.
  SelectMany(
    parent => parent.Children, 
    (parent, child) => new { parent, child }).
  Where(* => child.Height > parent.Height)
select child.LastName

And the third step is to translate that into:

parents.
  SelectMany(
    parent => parent.Children, 
    (parent, child) => new { parent, child }).
  Where(* => child.Height > parent.Height).
  Select(* => child.LastName)

Um. That’s not C#.

What on earth are those stars doing in there?

For the most part the query syntax can be transformed syntactically into the method call syntax, but in this query that breaks down. The * is a transparent identifier” and it means:

I am a lambda formal parameter of anonymous type
Members of my type are in scope in the body of the lambda
If any of my members are themselves transparent identifiers, their members are in scope too, and so on, to any level of nesting

(Recall once again as we have discussed many times on this blog, the scope of a particular programming language element is defined as the region of program text in which it may be referred to by its unqualified name. Scope is entirely about figuring out the meaning of names in C#.)

In other words, the star means that there is one more stage of translation to go through, and it’s a semantic, not a syntactic transformation because we must understand the meanings of the identifiers in the lambda bodies. The final transformation is:

parents.
  SelectMany(
    parent => parent.Children, 
    (parent, child) => new { parent, child }). 
  // We've selected a sequence where elements have anonymous type,
  // so "dummy" is of anonymous type in both lambdas:
  Where(dummy => dummy.child.Height > dummy.parent.Height).
  Select(dummy => dummy.child.LastName)

Think about what is going on here. The select-many query is essentially concatenating together many sequences of pairs of parents and children, and then running the remainder of the query on those pairs. Clearly this works but it seems absurdly complicated. Were the language designers crazy when this was added to the specification? Why is there this absolutely bizarre new scoping rule being introduced?

They were not crazy; there is a good reason for this. Let’s deduce what it is by first attempting to do things the naive way, and we’ll see what goes wrong.

We will start over and design a new query translation rule set and LINQ-to-objects implementation from scratch. We need to translate the query shown above somehow such that the following is true of the translated form:

parent is in scope starting at the second from and continuing until the end of the query.
child in in scope starting at the where and continuing until the end of the query.
the semantics of a nested from query are: take a sequence, map each element to a new sequence, and concatenate all the resulting sequences.

This seems easily done with a very straightforward rule. Let’s suppose we have these very simple methods: (Error handling is removed for clarity.)

public static IEnumerable<R> Select<A, R>(
  this IEnumerable<A> items, 
  Func<A, R> projection)
{
  foreach(A item in items)
    yield return projection(item);
}

public static IEnumerable<R> SelectMany<A, R>(
  this IEnumerable<A> items, 
  Func<A, IEnumerable<R>> map)
{
  foreach(A outerItem in items)
    foreach(R innerItem in map(outerItem))
      yield return innerItem;
}

public static IEnumerable<A> Where<A>(
  this IEnumerable<A> items, 
  Func<A, bool> predicate)
{
  foreach(A item in items)
    if (predicate(item)) 
      yield return item;
}

Very straightforward. Let’s now state the following translation rules:

The Where Rule:

 
from x in y 
where z 
...

is translated to

from x in (y).Where(x => z)
...

The Select rule:

from x in y 
select z

is translated to

(y).Select(x => z)

(We’ll ignore optimizing away degenerate queries.)

The Select Many rule:

from x in y 
from q in r 
...

is translated to

(y).SelectMany(x => from q in r ...)

These seem like reasonable rules. Let’s try them on our example:

from parent in parents
from child in parent.Children
where child.Height > parent.Height
select child.LastName

We first apply the SelectMany rule to obtain:

(parents).
  SelectMany( 
    parent => 
      from child in parent.Children
      where child.Height > parent.Height
      select child.LastName )

Seems reasonable. Now we apply the Where rule to that:

(parents).
  SelectMany( 
    parent => 
      from child in (parent.Children).
        Where(child => child.Height > parent.Height)
      select child.LastName)

Then we apply the Select rule to that:

(parents).
  SelectMany( 
    parent => 
      ((parent.Children).
        Where(child => child.Height > parent.Height)).
        Select(child => child.LastName));

This looks great! If you actually compiled the methods, you’d end up with an expression that takes a sequence of parents and returns a sequence of last names, which is just what we want. Why on earth then does the C# spec not do this very simple thing, instead of the crazy complicated nonsense where SelectMany takes multiple lambdas and introduces a “transparent identifier”? I normally push back on “why not?” questions, but this one seems justified; there must be something wrong with the naive approach that prevented the C# design team from going for it, causing them to introduce a far more complicated mechanism. What could that reason be?

Tomorrow on FAIC: The thrilling conclusion!

22 thoughts on “What is up with transparent identifiers? Part one”

treed on July 31, 2014 at 10:32 am said:

The transformation steps are more complicated, but the final expression produced has shallower branches when broken up into an expression tree. This makes the tree easier to convert into a different format, such as SQL.

Reply ↓
- ericlippert on July 31, 2014 at 11:03 am said:
  
  It is the shallowness that is desirable, but the SQL transformation wasn’t the fundamental problem.
  
  Reply ↓
  - treed on July 31, 2014 at 5:25 pm said:
    
    Is it because of the lifted variables with the second set of rules?
    
    Though with the first set, I can’t see a way around creating multiple anonymous types or lifting variables if you nest more levels (such as “from grandchild”).
    
    Unless using the transparent identifier allows you to only create 1 anonymous type with 3+ from clauses.
    
    Reply ↓
- Michael Grier on July 31, 2014 at 11:41 am said:
  
  That presumes that the name resolution in the translation target is the same as would be applied for C# which may or may not be the case. In general it’s better to translate from an exact/precisely defined interpretation of the query than from an unbound/imprecise one. I worked on a SQL query designer that had to map between various dialects and you absolutely had to resolve any ambiguity in the context of the source before attempting to emit a query in the target. The same is true of any compilation/translation process otherwise you get things like first generation human language translation engines which could often be close or otherwise entirely comical.
  
  Reply ↓
Dave Sexton on July 31, 2014 at 11:40 am said:

Correction:
foreach(R innerItem in map(items))
should be
foreach(R innerItem in map(outerItem))

Reply ↓
- ericlippert on July 31, 2014 at 12:44 pm said:
  
  Whoops. Fixed, thanks.
  
  Reply ↓
Dave Sexton on July 31, 2014 at 11:58 am said:

Is it performance and/or side effects?
The former query is evaluated at runtime in its entirety only once, lazily, and then its execution simply boils down to functional composition.
The latter query’s SelectMany is evaluated at runtime only once, lazily, but then its inner query is evaluated for each item that the outer query generates.

Reply ↓
- Joshua on July 31, 2014 at 5:07 pm said:
  
  I’m also thinking along the lines of lazy evaluation performance improvements due to the shallowness of the former query. I wonder if there is a Big-O improvement in either CPU or memory, but I don’t exactly see it at the moment.
  
  Reply ↓
EduardoS on July 31, 2014 at 3:37 pm said:

The nested lambda in the later query, when going a few more steps in C3 compilation process nested lambdas becomes much more complicated than the former query.

Reply ↓
- ericlippert on August 1, 2014 at 10:23 am said:
  
  Of all the commenters so far, you’re closest to identifying the biggest problem.
  
  Reply ↓
Pingback: The Morning Brew - Chris Alcock » The Morning Brew #1664
Pingback: Dew Drop – August 1, 2014 (#1827) | Morning Dew
Eldc on August 1, 2014 at 7:32 am said:

The first way allocates anonymous types. The second way allocates closures.
My guess is: the former is more efficient.

Reply ↓
- ericlippert on August 1, 2014 at 10:23 am said:
  
  Good guess, but it turns out that the way we generate closures is very similar to the way we use anonymous types, so that ends up being a wash.
  
  Reply ↓
Pingback: Transparent Identifiers - The Daily Six Pack: August 4, 2014
Stephen on August 21, 2014 at 9:36 am said:

Weren’t you on the C# design time when this was done? Do you know why that you ask at the end of your post?

Reply ↓
- ericlippert on August 25, 2014 at 8:19 am said:
  
  This is a “rhetorical question”. I know the answer; the answer is in the next post.
  
  Reply ↓
Pingback: What is up with transparent identifiers? Part two | Fabulous adventures in coding
Pingback: Linq Queries Under the Hood | Programming Blog
Pingback: Dummy | My Blog
Pingback: LINQ to Entities: Equivalence of traditional and lambda syntax of a Join with Where clause | 我爱源码网
Pingback: Монады как паттерн переиспользования кода / Хабр