Here’s a question I got from a user recently:
The foreach
loop in C# uses a pattern-based approach to semantic analysis. LINQ in C# also uses a pattern-based approach. Why don’t other features, such as the using
statement, also use a pattern-based approach?
Great question. First off, what do we mean by a “pattern-based approach”?
You might think that in order to use a foreach
loop, the collection you are iterating over must implement IEnumerable
or IEnumerable<T>
. But as it turns out, that is not actually a requirement. What is required is that the type of the collection must have a public method called GetEnumerator
, and that must return some type that has a public property getter called Current
and a public method MoveNext
that returns a bool
. If the compiler can determine that all of those requirements are met then the code is generated to use those methods. Only if those requirements are not met do we check to see if the object implements IEnumerable
or IEnumerable<T>
.
Essentially, C# is doing a fairly strong form of “duck typing” here. That is, if the object is known at compile time to have a Quack()
method then we treat it like a duck even if it isn’t marked as implementing IDuck
. There are a few places in the C# language where we do this sort of “pattern matching”; we don’t care what the exact type is, just so long as the methods we need are available.
But why do we use a pattern-based approach in foreach
in the first place?
Two reasons. First, imagine a world without generic types: the world of C# 1.0. Suppose you have a collection that you yourself have just implemented, a collection of ten square integers:
class MyIntegers : IEnumerable { private class MyEnumerator : IEnumerator { private int index = 0; public object Current { return index * index; } public bool MoveNext() { if (index > 10) return false; ++index; return true; } } public IEnumerator GetEnumerator() { return new MyEnumerator(); } }
When you “foreach” over this thing, every integer that comes out of the collection is boxed. How can you avoid the boxing penalty? Like this:
class MyIntegers : IEnumerable { public class MyEnumerator : IEnumerator { private int index = 0; object IEnumerator.Current { return this.Current; } int Current { return index * index; } public bool MoveNext() { if (index > 10) return false; ++index; return true; } } public MyEnumerator GetEnumerator() { return new MyEnumerator(); } IEnumerator IEnumerable.GetEnumerator() { return this.GetEnumerator(); } }
Now the foreach
loop sees can use the public method GetEnumerator
that returns a MyEnumerator
, that in turn has a public property Current
that returns an int
. No boxing required.
The second reason is because using this approach, the “MyEnumerator
” could also be a mutable struct. Yes, mutable structs are a bad programming practice, but in this case it’s not so bad because the foreach
loop is never going to expose the raw enumerator to possible misuse by user code. Sometimes you really do want to avoid collection pressure, so making the enumerator a struct can be a small but measurable win. Of course, it is a very bad idea to make classes into structs “because structs are faster” unless you have empirical evidence that doing so makes a measurable and important difference. In this case the design team who built iterators for lists did experiments and determined that the risks of using mutable structs were worth the measurable performance gains. (UPDATE: See my follow-up article on some of the pitfalls of this design choice.)
So there you go: we use a pattern-based approach for foreach
because first, we didn’t have the ability to avoid boxing via IEnumerable<T>
in C# 1.0, and because there is a small but perhaps important performance win in avoiding the pressure of allocating a reference-typed enumerator.
What about LINQ? We also use a pattern-based approach in LINQ. When you say “from c in customers where ...
” we do not require that “customers
” have a “Where
” method, or even that it implement IEnumerable
. We don’t have any type requirements on it at all; rather, we require that overload resolution finds an instance method or an extension method called Where
that works with the LINQ pattern.
Again, two reasons. The first reason is that LINQ was invented very late; it would not be a success if we required everyone who wanted to have their data source work with LINQ have to ship a new version of their interface that has Where
and Select
methods. It has to be a feature that “lights up” without any work on behalf of the data source providers.
The second and more highfalutin reason is that LINQ was designed to bake the monad pattern into C#. It was specifically designed to bake in the “sequence” monad, but if you are a crazy person you can actually use query comprehension syntax to represent binding operations on any monad, as Wes points out in his excellent article.
People sometimes ask why there is no IMonad
interface in C# that identifies a type that can be used as a monad. The short answer is that the concept of “I am a type that obeys the rules of a monad” is actually too general for the .NET type system to handle; the monad pattern is a “higher type” than an interface. Therefore, rather than bake it into the type system, we baked it into the compiler. (*)
So now we come to the user’s actual question: given that we use this pattern-based approach in several places in C#, why not use it everywhere? In particular, why require that an object implement IDisposable
to be disposed in a using
block? Why not just require that it implement a public method called Dispose
?
The answer is that we like capturing things in type systems; we do so whenever we can. Types are our first and best solution for solving problems in a statically-analyzed language, and so we want to stick to that whenever possible. In the case of the using
block there is no compelling reason to not require that a disposable object mark itself with a special interface. We’ve already seen that we can elide boxing in things that get disposed, and even if we cannot, the boxing penalty is pretty small for disposing a dozen handle-holding objects compared to the penalty of boxing a million integers coming out of a collection. And the LINQ reasoning doesn’t apply to using
— we do not want you to be able to extend the disposal behaviour of objects you do not own, even as we do want you to be able to extend the sorting, searching and grouping behaviours of objects you do not own. With no compelling reason for a pattern-based approach, and no compelling reason against an interface-based approach, we went with the interface.
However, as the C# language gets more and more high-level features added to it, we are finding that often a pattern-based approach is necessary. The coming “async/await
” feature, for example, will use a LINQ-like pattern-based approach to figuring out how to make an “awaiter” for a particular “task” object. Pattern-based approaches can allow more participation by third parties and enable language designers to embed design patterns in the language that are too rich for their type systems to handle, so I would expect to see more of this sort of thing in the future.
(*) A detailed explanation of why the .NET type system is insufficient to represent higher types would take us pretty far afield. If this subject interests you, see this StackOverflow question. For more ideas on how the generic type system of .NET is insufficient to represent certain desired patterns, see my recent article on the subject. If you want an even more gentle introduction to monads than Wes’s article, see this StackOverflow question.
I know this is an old article, but I found it from your more recent article, and I have a question: Could you elaborate on what you mean by “The second reason is because using this approach, the ‘MyEnumerator’ could also be a mutable struct.” What’s that mean here? Thanks.
Quote:
“So there you go: we use a pattern-based approach for foreach because first, we didn’t have the ability to avoid boxing via IEnumerable in C# 1.0, and because there is a small but perhaps important performance win in avoiding the pressure of allocating a reference-typed enumerator.”
First and second reasons are, essentially the same: both allow to avoid boxing by following a pattern, the difference is that the first does the job for IEnumerator.Current, and the second one for IEnumerable.GetEnumerator
Pingback: Enumerator performance surprises – Blanch on Code
“Only if those requirements are not met do we check to see if the object implements IEnumerable or IEnumerable.” This is very strange, don’t you think so?