String concatenation behind the scenes, part one

The first algorithm I ever worked on in the C# compiler was the optimizer that handles string concatenations.(Unfortunately I did not manage to port these optimizations to the Roslyn codebase before I left; hopefully someone will get to that!) It’s all pretty straightforward, but there are a few clever bits that I thought I might discuss today.

First off, let’s consider just straight up concatenation of string variables; in fact, let’s assume that all variables are strings for the purposes of this article. If you have:

r = a + b;

then clearly that is going to be generated as

r = String.Concat(a, b);

But what about

r = a + b + c;

? String addition, like all addition in C#, is left associative, so this is the same as

r = (a + b) + c;

So you might think that would be generated as

r = String.Concat(String.Concat(a, b), c);

but the compiler can be more efficient than that because there is an overload of String.Concat which takes three operands and produces the same result:

r = String.Concat(a, b, c);

Ditto for four operands. However when we get to five or more operands the compiler switches to a different overload:

r = a + b + c + d + e;

is generated as

r = String.Concat(new String[5] { a, b, c, d, e });

It’s a bit unfortunate that you have to take on the cost of allocating an array, but there it is.

That’s all very straightforward, but there are more optimizations that the C# compiler performs. The main optimization depends upon the fact that string concatenation is fully associative. (And that String.Concat has no observable side effects.) That is, the parentheses don’t matter. You might think that an oddity like

r = a + (b + c);

would have to be generated as

r = String.Concat(a, String.Concat(b, c));

but of course we can observe that since string concatenation is associative, this is no different than

r = (a + b) + c;

which we already know how to generate efficiently. The C# compiler’s string optimizer automatically rewrites nested string concatenations into the left-parenthesized form no matter how you parenthesize them. Except when it doesn’t, which is our next optimization. Let’s now consider the impact of constants:

r = "A" + "B" + c;

The compiler will parse this as:

r = ("A" + "B") + c;

and of course that first term is a compile-time constant according to the spec. The compiler automatically computes values of compile-time constants, so this is the same as:

r = String.Concat("AB", c);

But what about this?

r = a + "B" + "C";

There are two compile-time constants, but this is parsed as

r = (a + "B") + "C";

Since both addition operators have a non-constant element on their left side, you might think this would have to be generated as

r = String.Concat(a, "B", "C");

But again associativity comes to the rescue. The optimizer notices that the original expression is the same as

r = a + ("B" + "C");

and generates

r = String.Concat(a, "BC");

And it will even do the right consolidation of constants “in the middle”:

r = a + "B" + "C" + d;

is generated as though you’d said:

r = (a + ("B" + "C")) + d;

to produce

r = String.Concat(a, "BC", d);

The C# compiler also knows that empty string constants are the identity elements of concatenation; they are eliminated. If you say

r = a + "" + b;

Then the code generated is not

r = String.Concat(a, "", b);

but rather simply

r = String.Concat(a, b);

Null string constants are likewise identity elements:

const string NullString = null;
r = a + NullString + b;

is similarly generated as

r = String.Concat(a, b);

OK I must admit I just made a bald-faced lie there; did you catch it?

The null and empty strings are not identities of concatenation. An identity of an operator has the property that it leaves the other operand unchanged; zero is the identity of addition of numbers because any number plus zero gives you the original number. But there is one possible operand that is not preserved by addition with an empty or null string: the null string! NullString + NullString and NullString + "" both result in the empty string, not the null string, so neither is an identity.

You might think then that

r = a + NullString;

must be generated as

r = String.Concat(a, NullString);

because the compiler needs to ensure that if a is null then the result is the empty string. In fact the C# compiler anticipates what String.Concat will do and simply inlines it. This is generated the same as:

r = a ?? "";

which I think is a neat trick, to elide the concatenation entirely.


Next time on FAIC: An optimization the C# compiler does not perform.

About these ads

42 thoughts on “String concatenation behind the scenes, part one

  1. Does that mean that, if x is *not* a string, (“” + x) is literally equivalent to x.ToString() and does not take on the cost of a concatenation at all?

    • Not quite. It is equivalent to x == null ? "" : x.ToString(). I didn’t mention that corner case but the optimization there is that it is reduced to String.Concat(x), using this overload, which is a convenient method for this case.

      • What about value types? If x is an integer, wouldn’t the use of that overload create unnecessary garbage due to boxing of value types?

        • That is correct. For integers it is better to call ToString directly on them. The C# compiler could optimize “”+someInt into someint.ToString() but does not. I missed that particular trick.

  2. For larger concatenations, why don’t you use the

    public static String Concat(Object arg0, Object arg1, Object arg2, Object arg3, __arglist)

    overload? (to save the array allocation)

    (At least if the project isn’t CLS-compliant)

  3. Pingback: The Morning Brew - Chris Alcock » The Morning Brew #1379

  4. Does this mean that it is OK to have things like “r = a + b;” in your code? I’ve once read that you should replace it with “r = String.Concat(a, b);” because that’s (marginally) more performant. But if the compiler is clever enough to optimize it, I might as well keep using the + operator, right?

    • Readability is more important than pre-optimizations.
      “String.Concat” clutters your codebase for no valuable gain.
      Let the compiler optimize your code. Refactor to less readable code only if you need to.

    • The + operator has to be implemented somehow, and apparently it is implemented in terms of string.Concat. So I’d say replacing it with String.Concat is not useful. However, in case you are building a larger string with many concatenations, you should probably use a StringBuilder.

    • The only reason (in both Java and C#) to use a StringBuilder directly is if you’re using loops, because last time I checked at least the Java JIT wasn’t clever enough to generate optimal code.

      That would probably be a rather useful optimization though (and would allow us programmers to make the code clearer), but I’m sure it’s not as simple as I’m thinking right now, certainly some evil edge case in there.

  5. This optimization is detectable by checking String.IsInterned. The documentation for interning says that the intern pool consists of “a single instance of each unique literal string constant declared in a program, as well as any unique instance of String you add programmatically.” In the case of “A” + “B”, the string that is interned is “AB”, which appears to contradict the documentation as stated, since the literal string constant “AB” does not appear in the program. It was introduced by the optimizer. I assume this is a documentation error, but it might surprise people who are using IsInterned for business logic.

    Here’s another surprise with IsInterned:

    string s = “AB”; // removing this line alters behavior
    StringBuilder sb = new StringBuilder();
    sb.Append(“A”);
    sb.Append(“B”);
    string guessme = string.IsInterned(sb.ToString());

    I also have a pathological example where a string literal never enters the intern table (nor is any compiler-optimized concatenation of that string with any other string), which this margin is too narrow to contain.

  6. Back in 2008 I had to write an application that did a lot of string manipulations and I ran into a lot of performance issues. When I used the Visual Studio Profiler, it told me that the chief offender was the string concatenation and the advice I had googled up was to use StringBuilder instead of string1 + string2. I did just that and the performance improved by 50%. Since then I almost always use StringBuilder when I need to concatenate more than 2 strings. I wonder, is there anything special about the way StringBuilder does the concatenation? The article that recommended it to me said that it does not create temporary copies (whereas the ‘normal’ concatenation does). So this is my question: is there a difference between the ‘normal’ concatenation and the way StringBuilder.Append() works? Thanks in advance.

    • Strings are immutable, so normal string concatenation creates a *new* string. At least naively, this means string concatenation has to copy the contents of both source strings to a completely new string object (string could be using some fancy data structure internally to avoid this, but my guess is that it does not). At the least it means that a new string object is allocated.
      StringBuilder is mutable, it contains a buffer that can be changed / appended to directly. The Append() method does exactly that, it changes the StringBuilder by appending something to its internal buffer. The original contents of the StringBuilder don’t need to be copied, and there is no extra object creation.
      So to summarize, a = b + c involves three String objects. The strings referred to by b and c remain unchanged, and a *new* String is assigned to a.
      sb.Append(c) only involves one StringBuilder and one String, and the StringBuilder itself is changed.
      Usually the simplicity of working with plain strings outweighs the performance benefit of StringBuilder, but if you concatenate long strings together or do many concatenations (in particular in loops) the advantage of StringBuilder can become significant.

    • If code says `someString = someString+”1″;` [both variables of type `string`], no other variable anywhere in the entire universe holds a reference to the string object referred to by variable `someString`, and the amount of space allocated for that object is large enough to accommodate another character, then having the code add another character to the the string object referred to by `someString` would be equivalent to generating a new string object containing the characters in `someString` plus the digit ‘1’. If, however, there exists somewhere in the universe some other reference to that same string object, modifying it cause code which expected to hold a reference to the unmodified sequence of characters to instead hold a reference to the modified sequence. Neither the compiler nor runtime have facilities for keeping track of whether a variable holds the only extant reference to an object, and so both are generally required to assume that such references might exist.

      When code says `myStringBuilder.Append(“1″);` the `StringBuilder` knows that it holds the only reference to its buffer anywhere in the universe. Since nobody who holds a reference to the `StringBuilder` will be expecting to hold a reference to its unmodified value, it’s free to modify its buffer as it sees fit without having to create a new object.

      Although there is some “magic” within the StringBuilder class which allows better performance than could be achieved without it, such magic is a relatively small part of why code that uses it can perform much better than code which builds strings by repeated concatenation.

      • CPython uses reference counting, so it can know that there is only a single reference to a given string. Thus, it can make the optimization of not having to copy a string if there is enough room after it to do the concatenation. Presumably other runtimes with immutable strings that use reference counting perform a similar optimization.

  7. What surprises me a bit in this story, is the choice of what to optimize. For example,

    r = a + null;

    is optimized, though it’s hard to imagine anyone actually writing such code.

    On the other hand,

    r = a + ‘.';

    is not, although I have seen that code appear in real world programs a lot.

    If you look at other string manipulation methods, things like

    a.IndexOf(“c”) -> a.IndexOf(‘c’)
    a.ToString() -> a
    ‘c’.ToString() -> “c”

    could be interesting as well, even if I they do assume compiler knowledge of the framework (but then again, so do the string concatenation optimizations).

    I often have to review code written by others, and have written custom Code Analysis rules for these (and many other more interesting issues). Obviously, I am now experimenting with and counting on Roslyn CodeIssueProviders for this sort of thing.

    In fact, it could be argued that things like

    r = a + null;

    might be better dealt with in a CodeIssueProvider, than in the compiler itself.

    • On the other hand, people might write

      #if DEBUG
      const suffix = ” (debug)”;
      #else
      const suffix = null;
      #endif


      s = s + suffix;

      And in that case, we obviously do want the optimization.

  8. I stopped at the Coverity booth at the Detroit Telematics Update in Novi, MI and mentioned your name, and that I’m a fan of your blog! They laughed.

  9. So what about side-effects? Does it matter whether the string expressions to be concatenated are local variables or function calls?

  10. Wow. That awakward day when you learn that strings in C# don’t make a monoid… or do they? Is “null” even a string? If not, then “pure” strings do form the free monoid (over the set of Unicode characters, sans issues with non-BMP stuff). However, “NullString + NullString != NullString” still breaks down Nullable<T> monad. I can understand why it was made this way, however — I still remember that one program where I concatenated like 30 strings together in SQL, and was extremely frustrated that I when I got a sudden NULL as a result, I had to look thorugh all 30 operands to find that one unlucky NULL-string. Wrapping every string in COALESCE(s, “”) solved the problem.

  11. Pingback: String Concatenation using '+' operator | BlogoSfera

  12. Pingback: Windows 8 Tips Featured In The Daily Six Pack: June 26, 2013

  13. The C# compiler doesn’t know that String.Empty is actually “” and the identity of concatenation, so it doesn’t simplify “foo” + String.Empty. Another point for “” over ugly String.Empty.

  14. Pingback: Why isn't the empty string an identity of concatenation? | BlogoSfera

  15. Pingback: String Concatenation using ‘+’ operator | Ask Programming & Technology

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s