We decided early in the Roslyn design process that the primary data structure developers would use when analyzing code via Roslyn would be the syntax tree. And thus one of the hardest parts of the early Roslyn design was figuring out how we were going to implement syntax tree nodes, and what information they would proffer up to the user. We would like to have a data structure that has the following characteristics:
- Immutable.
- The form of a tree.
- Cheap access to parent nodes from child nodes.
- Possible to map from a node in the tree to a character offset in the text.
- Persistent.
By persistence I mean the ability to reuse most of the existing nodes in the tree when an edit is made to the text buffer. Since the nodes are immutable, there’s no barrier to reusing them, as I’ve discussed many times on this blog. We need this for performance; we cannot be re-parsing huge wodges of text every time you hit a key. We need to re-lex and re-parse only the portions of the tree that were affected by the edit because we are potentially re-doing this analysis between every keystroke.
Aside: Determining what those portions of the tree are is quite tricky; I might blog about that at a later date. An edit that, for example, adds async to a method can cause the parse of await(foo); in the method body to change from an invocation to a usage of the await contextual keyword.
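The node-reuse idea behind persistence can be sketched in a few lines. This is an illustration only, not Roslyn's code; the Node class and replace helper are hypothetical. The point is that replacing one leaf rebuilds only the spine of ancestors above it, while every untouched subtree is shared with the old tree rather than copied.

```python
class Node:
    """A hypothetical immutable tree node (not Roslyn's)."""
    __slots__ = ("kind", "children")

    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = tuple(children)  # fixed after construction

def replace(node, path, new_leaf):
    """Return a new tree with the node at `path` swapped out.

    Only the ancestors along the path are rebuilt; all other
    subtrees are reused by reference.
    """
    if not path:
        return new_leaf
    i, rest = path[0], path[1:]
    children = list(node.children)
    children[i] = replace(children[i], rest, new_leaf)
    return Node(node.kind, children)

left = Node("left-subtree")
old = Node("root", [left, Node("old-leaf")])
new = replace(old, [1], Node("new-leaf"))

assert new.children[0] is left  # untouched subtree is the same object
```

Because nothing is mutated, the old and new trees can coexist; an edit costs work proportional to the depth of the changed node, not the size of the tree.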
When you try to put all five of those things into one data structure you immediately run into problems:
- How do you build a tree node in the first place? The parent and the child both refer to each other, and are immutable, so which one gets built first?
- Supposing you manage to solve that problem: how do you make it persistent? You cannot re-use a child node in a different parent because that would involve telling the child that it has a new parent. But the child is immutable.
- Supposing you manage to solve that problem: when you insert a new character into the edit buffer, the absolute position of every node that is mapped to a position after that point changes. This makes it very difficult to make a persistent data structure, because any edit can change the spans of most of the nodes!
But on the Roslyn team we routinely do impossible things. We actually do the impossible by keeping two parse trees:
The “green” tree is immutable, persistent, has no parent references, is built “bottom-up”, and every node tracks its width but not its absolute position. When an edit happens we rebuild only the portions of the green tree that were affected by the edit, which is typically about O(log n) of the total parse nodes in the tree.
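A minimal sketch of a green node, under the description above (the GreenNode class and its fields are my invention, not Roslyn's internals): each node stores only its width, never an absolute position, and an interior node's width is just the sum of its children's widths.

```python
class GreenNode:
    """Hypothetical green node: immutable, no parent, width only."""
    __slots__ = ("kind", "children", "width")

    def __init__(self, kind, children=(), width=0):
        self.kind = kind
        self.children = tuple(children)
        # A leaf (token) supplies its own width; an interior node's
        # width is the sum of its children's widths.
        self.width = width if not children else sum(c.width for c in children)

# "x + 1" as tokens, where each width includes any trailing space.
x = GreenNode("IdentifierToken", width=2)   # "x "
plus = GreenNode("PlusToken", width=2)      # "+ "
one = GreenNode("NumericToken", width=1)    # "1"
expr = GreenNode("AddExpression", [x, plus, one])

assert expr.width == 5
```

Since no green node records where it sits in the file, inserting a character early in the document invalidates nothing after the edit point; only the spine of ancestors above the changed tokens needs rebuilding.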
The “red” tree is an immutable façade that is built around the green tree; it is built “top-down” on demand and thrown away on every edit. It computes parent references by manufacturing them on demand as you descend through the tree from the top. It manufactures absolute positions by computing them from the widths, again, as you descend. You, the consumer of the Roslyn API, only ever see the red tree; the green tree is an implementation detail. (And if you use the debugger to peer into the internal state of a parse node you’ll in fact see that there is a reference to another parse node in there of a different type; that’s the green tree node.)
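The descent-time computation can be sketched like so (again an illustration, not Roslyn's API; a minimal green node is repeated here so the sketch is self-contained). A red node is manufactured on the way down, which is exactly when its parent reference and absolute position become known: a child's position is its parent's position plus the widths of the siblings to its left.

```python
class GreenNode:
    """Minimal stand-in green node: tracks width only."""
    def __init__(self, kind, children=(), width=0):
        self.kind = kind
        self.children = tuple(children)
        self.width = width if not children else sum(c.width for c in children)

class RedNode:
    """Hypothetical red façade, built top-down on demand."""
    def __init__(self, green, parent=None, position=0):
        self.green = green        # the underlying green node
        self.parent = parent      # known because we descended from the top
        self.position = position  # absolute offset, computed on descent

    def child(self, i):
        # Position of child i = our position + widths of siblings to its left.
        pos = self.position + sum(c.width for c in self.green.children[:i])
        return RedNode(self.green.children[i], parent=self, position=pos)

expr = GreenNode("Add", [GreenNode("x", width=2),
                         GreenNode("+", width=2),
                         GreenNode("1", width=1)])
root = RedNode(expr)
one = root.child(2)
assert one.position == 4 and one.parent is root
```

Note that nothing about the green tree changes when a red node is created; the façade can be discarded wholesale on the next edit and rebuilt lazily over the new green tree.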
Incidentally, these are called “red/green trees” because those were the whiteboard marker colours we used to draw the data structure in the design meeting. There’s no other meaning to the colours.
The benefit of this strategy is that we get all those great things: immutability, persistence, parent references, and so on. The cost is that this system is complex and can consume a lot of memory if the “red” façades get large. We are at present doing experiments to see if we can reduce some of the costs without losing the benefits.
Next time on FAIC: What precisely is “implementation-defined behaviour”?
Commentary from 2020:
I did not come up with the idea for red-green trees but I was in the room where it happened and it was a lot of fun coming up with a new twist on an old data structure that would meet our needs while still being “functional style”. I have since used variations on this data structure a few times in different parsers.
There were a number of questions in the comments to the original post that I never answered, oddly enough. Sorry for the eight year delay:
- How many children does each tree node have?
As many as it needs; a binary operator node has two children, a unary operator has one, and so on. I wrote an XML document that lists the children of every tree node type, and a script that automatically generates the tree node classes from that XML doc. I don’t know if this scheme is still used for the Roslyn project or not!
- Is the red tree rebuilt on every edit or every traversal?
Every edit. The red tree is internally mutable but presents an immutable interface to its callers. When you access a child of a red tree node for the first time, a new red node is created for that child green node, and the new red child is then cached in its red parent.
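That caching behaviour can be sketched as follows (hypothetical code, not Roslyn's): each red node starts with empty child slots, manufactures a red child on first access, and hands back the same object on every later access.

```python
class Green:
    """Minimal stand-in green node."""
    def __init__(self, children=()):
        self.children = tuple(children)

class Red:
    """Hypothetical red node with internal, lazily filled child cache."""
    def __init__(self, green, parent=None):
        self.green = green
        self.parent = parent
        self._red_children = [None] * len(green.children)  # the cache

    def child(self, i):
        if self._red_children[i] is None:  # first access: manufacture it
            self._red_children[i] = Red(self.green.children[i], parent=self)
        return self._red_children[i]       # later accesses: cached object

root = Red(Green([Green(), Green()]))
assert root.child(0) is root.child(0)  # same red node both times
```

A production implementation would also have to make this caching safe under concurrent access, since multiple analyzers may walk the same red tree at once; that detail is omitted here.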
- Was GC pressure a consideration?
Good heavens yes. This scheme ends up creating up to twice as many small objects as a normal parse tree would, and therefore a lot of collection pressure. Moreover, there is a real danger that you end up interleaving small short-term allocations with small long-term allocations, which produces the worst case for data locality while maximizing the work the compacting collector must do. We measured the GC cost of this solution very carefully.
- How bad a GC penalty is discarding the red tree on every edit?
Since the red tree is built on demand, the red tree is usually a very small fraction of the green tree; it consists of just the “spines” that get to whatever nodes were of interest to IntelliSense or whatever analyzers are running. If for some reason you are having to traverse every node in a red tree on an edit, then it is expensive to build that tree and then throw it away.