String concatenation behind the scenes, part two

It is, I hope, well-known that naïve string concatenation in a loop is a quadratic “Schlemiel” algorithm. (If you haven’t read Joel’s essay, stop reading this and go read it now.)

Even though strings in .NET are length-prefixed as Joel discusses in his essay, it’s still the case that naïve string concatenation is slow in C#. If you have:

string s = "";
foreach(var item in list)
    s = s + M(item);

then we have a Schlemiel algorithm. If each item produces, say, 10 chars, then the first time through the loop we copy 10 chars. The second time we copy the original 10 and the new 10, for 20. The third time we copy the last 20 and the new 10, for 30. The fourth time, we copy 40. The fifth time, 50. Add those all up: 10 + 20 + 30 + 40 + 50 … and in order to get a 1000 char string, you end up copying over 50000 chars. That’s a lot of copying and a lot of pressure on the garbage collector. The right way to do this is of course:

StringBuilder sb = new StringBuilder();
foreach(var item in list)
    sb.Append(M(item));
string s = sb.ToString();

(Alternatively, string.Concat(list.Select(item=>M(item)) would also work, but that uses similar techniques behind the scenes anyways.)

The StringBuilder object is carefully designed to have linear, not quadratic, performance in this scenario. (How it achieves this has changed over the years; in the past it has used a double-when-full strategy. More recently I believe it uses a linked list of large blocks. This is a nice illustration of how a clean API design allows you to radically change implementation details without breaking users.) The question then is: why does the C# compiler not simply transform the first code into the second code automatically for you, if that’s so much more efficient?

That’s a good question. When I was on the JScript.NET team back in the day we added this optimization to that language. But in that case we had the expectation that JScript.NET programmers would not be familiar with the .NET framework, coming from a JavaScript background. We have the expectation that C# programmers will be more familiar with the tools in the .NET framework, which is therefore points against the optimization. Also, it’s not at all clear when the optimization is actually an optimization. Both string concatenation and string builders are very fast and there definitely are scenarios where one is faster than the other. Making that choice on behalf of the user seems wrong; C# developers make the reasonable assumption that the code they write is the code that runs. On yet a third hand, some Java compilers apparently perform a similar optimization by default, so maybe there are good reasons for this to happen in C# as well. In any event, the optimization is not there today, so use StringBuilder if you have a lot of concatenation to do.

17 thoughts on “String concatenation behind the scenes, part two

  1. Not to go too far off topic, but this construct:

    string.Join(“”, list.Select(item=>M(item))

    has always irritated me. With a trivial extension method, this becomes:

    list.Select(item=>M(item)).Join()

    The result is a more linear expression with less punctuation and with the operations in one logical order. A number of standard functions come out much prettier using extensions (IMO, of course), including the `string.IsNullOrEmpty` family, and I’ve long wondered why they never got the extension treatment when extension methods became available. Was it because it might be confusing to have an additional officially-endorsed way to call them?

    I guess the most likely possibility is that it simply wasn’t deemed worth the time and effort, because anyone who cared could simply write their own extension.

    • Yeah, or string.Format(“{0}”, x) instead of “{0}”.Format(x) — null is not even acceptable as a composite format string parameter for Format, so this change wouldn’t break any contract. One of the warts of the original NET 1.0 library.

      • I think String.Format is a static method because it’s more of a factory method that *produces* a string than it is a method which acts upon a format string. That having been said, there are some static methods which really gain no benefit from being in classes, such as Abs and Format, and it would be helpful if classes could expose them as “free” functions [Math.Abs(expr) is ugly, and arguably (expr).Abs is better, but Abs(expr) would be IMHO better yet].

        • By the same logic, “String.Split is a static method because it’s more of a factory method that *produces* a string than it is a method which acts upon the splitted string”. But Split isn’t static. And Format *does* act upon the format string: it takes it and derives another string from it.

          And since strings are immutable, the whole “acts upon” argument doesn’t hold in the first place.

          And nope, no “free” functions for you. A function outside of a class is a too scary thing to exist.

          • To my mind, the purpose of `String.Format(“({0}, {1})”, X, Y);` is to produce a string containing X and Y, formatted as indicated by “({0}, {1})”. While the string literal is important, it is in many cases the least “interesting” part of the expression (I would consider “Format” the most interesting part). As for “free” functions, they shouldn’t be declared willy-nilly, but I don’t really see why `Math.Sin(x)` or `x.Sine()` should be considered superior to `Sine(x)`. After all, the purpose of the expression wouldn’t normally be described as “Do some math, namely a sine computation, on x”, nor as “Take some number x and compute its sine”, but rather as “Take the sine of X”.

    • Fortunately nobody prevents you from adding your extension methods to achieve what you all wrote so far. Did you ever found a Microsoft product or library that was ready for consumption without any extra work or buying any add-ons from 3rd party vendors? 😉

  2. “More recently I believe it uses a linked list of large blocks”

    Looking at it in dotPeek, it seems to be a linked list of, not large blocks, but actually of StringBuilders. Pretty cool!

    Rather than creating a whole new type to represent the underlying linked list, they just take advantage of the existing implementation. If it runs out of room, it just makes another of itself.

    I didn’t look closely, but I’ll bet this simplifies the implementation somewhat, in that there’s very little additional code to manage the list of buffers, and of course adding new characters winds up mostly just the same code StringBuilder always had anyway.

    Ironically, this is a similar strategy that many people take when rolling their own collection types to get past .NET’s own 2GB limit. Just link a bunch of the original collections together (lists can just be a list of lists, dictionaries get a little more complicated, but not much, etc.)

  3. Just out of curiosity, how much does it factor into the equation that you are giving the compiler knowledge of a particular type? Obviously there can be big benefits from doing this (expression trees), but it seems like you would need a big benefit in order to justify the cost.

    • String is already quite a special type in C# (it can be const; it has two kinds of literals; the + operator compiles as string.Concat, including the optimization from previous part; probably other things), so I think making it even more special wouldn’t be a big deal.

  4. It bothers me a bit to see the usage: string.Join(“”, […]

    You’re not actually joining anything, so why use the Join operator? What you’re doing is Concatting the strings together. There’s already an operator that does *exactly* that, so using it makes much more semantic sense:

    string.Concat(list.Select(item=>M(item))

  5. Eric, you left out an important point in this post which I have seen you make previously (I believe on StackOverflow). Specifically, that the C# compiler *does* do this kind of magic “do what I mean, not what I say” substitution with a switch statement on strings by changing it into a Dictionary lookup rather than a series of equals comparisons.

    • Indeed, I have mentioned that before, thanks for the reminder. The specification does not call out how switch statements or string concatenations are implemented, so an implementation has wide latitude to optimize them as it sees fit.

  6. Pingback: String concatenation behind the scenes, part one | Fabulous adventures in coding

Leave a comment