The first algorithm I ever worked on in the C# compiler was the optimizer that handles string concatenations.(Unfortunately I did not manage to port these optimizations to the Roslyn codebase before I left; hopefully someone will get to that!) It’s all pretty straightforward, but there are a few clever bits that I thought I might discuss today.
First off, let’s consider just straight up concatenation of string variables; in fact, let’s assume that all variables are strings for the purposes of this article. If you have:
r = a + b;
then clearly that is going to be generated as
r = String.Concat(a, b);
But what about
r = a + b + c;
? String addition, like all addition in C#, is left associative, so this is the same as
r = (a + b) + c;
So you might think that would be generated as
r = String.Concat(String.Concat(a, b), c);
but the compiler can be more efficient than that because there is an overload of String.Concat
which takes three operands and produces the same result:
r = String.Concat(a, b, c);
Ditto for four operands. However when we get to five or more operands the compiler switches to a different overload:
r = a + b + c + d + e;
is generated as
r = String.Concat(new String[5] { a, b, c, d, e });
It’s a bit unfortunate that you have to take on the cost of allocating an array, but there it is.
That’s all very straightforward, but there are more optimizations that the C# compiler performs. The main optimization depends upon the fact that string concatenation is fully associative. (And that String.Concat
has no observable side effects.) That is, the parentheses don’t matter. You might think that an oddity like
r = a + (b + c);
would have to be generated as
r = String.Concat(a, String.Concat(b, c));
but of course we can observe that since string concatenation is associative, this is no different than
r = (a + b) + c;
which we already know how to generate efficiently. The C# compiler’s string optimizer automatically rewrites nested string concatenations into the left-parenthesized form no matter how you parenthesize them. Except when it doesn’t, which is our next optimization. Let’s now consider the impact of constants:
r = "A" + "B" + c;
The compiler will parse this as:
r = ("A" + "B") + c;
and of course that first term is a compile-time constant according to the spec. The compiler automatically computes values of compile-time constants, so this is the same as:
r = String.Concat("AB", c);
But what about this?
r = a + "B" + "C";
There are two compile-time constants, but this is parsed as
r = (a + "B") + "C";
Since both addition operators have a non-constant element on their left side, you might think this would have to be generated as
r = String.Concat(a, "B", "C");
But again associativity comes to the rescue. The optimizer notices that the original expression is the same as
r = a + ("B" + "C");
and generates
r = String.Concat(a, "BC");
And it will even do the right consolidation of constants “in the middle”:
r = a + "B" + "C" + d;
is generated as though you’d said:
r = (a + ("B" + "C")) + d;
to produce
r = String.Concat(a, "BC", d);
The C# compiler also knows that empty string constants are the identity elements of concatenation; they are eliminated. If you say
r = a + "" + b;
Then the code generated is not
r = String.Concat(a, "", b);
but rather simply
r = String.Concat(a, b);
Null string constants are likewise identity elements:
const string NullString = null; r = a + NullString + b;
is similarly generated as
r = String.Concat(a, b);
OK I must admit I just made a bald-faced lie there; did you catch it?
The null and empty strings are not identities of concatenation. An identity of an operator has the property that it leaves the other operand unchanged; zero is the identity of addition of numbers because any number plus zero gives you the original number. But there is one possible operand that is not preserved by addition with an empty or null string: the null string! NullString + NullString
and NullString + ""
both result in the empty string, not the null string, so neither is an identity.
You might think then that
r = a + NullString;
must be generated as
r = String.Concat(a, NullString);
because the compiler needs to ensure that if a
is null
then the result is the empty string. In fact the C# compiler anticipates what String.Concat
will do and simply inlines it. This is generated the same as:
r = a ?? "";
which I think is a neat trick, to elide the concatenation entirely.
Next time on FAIC: An optimization the C# compiler does not perform.
Does that mean that, if x is *not* a string, (“” + x) is literally equivalent to x.ToString() and does not take on the cost of a concatenation at all?
Not quite. It is equivalent to
x == null ? "" : x.ToString()
. I didn’t mention that corner case but the optimization there is that it is reduced toString.Concat(x)
, using this overload, which is a convenient method for this case.What about value types? If x is an integer, wouldn’t the use of that overload create unnecessary garbage due to boxing of value types?
That is correct. For integers it is better to call ToString directly on them. The C# compiler could optimize “”+someInt into someint.ToString() but does not. I missed that particular trick.
> x = null ? “” : x.ToString()
I think you’re missing an “=” in there! 😛
For larger concatenations, why don’t you use the
public static String Concat(Object arg0, Object arg1, Object arg2, Object arg3, __arglist)
overload? (to save the array allocation)
(At least if the project isn’t CLS-compliant)
I pretty sure that using arglist allocates an array as well.
It does not; SLaks is correct, that would be moderately more performant but not CLS-compliant.
Not with the current implementation it wouldn’t, as that method calls Concat(object[]) in its implementation, and allocates the array to be able to do so.
Why would it not be CLS compliant? My understanding of CLS compliance is that a CLS-compliant assembly must be for all practical purposes fully usable by any language that implements the minimum requirements. Abstract or virtual methods should not expose non-CLS types, and functionality exposed by methods or overloads that include non-CLS types should be duplicated by methods or overload which do not. I don’t think there’s any prohibition against a CLS-compliant class using non-CLS types, or calling overloads of outside code that use non-CLS types. If one will be linking with a framework which is guarantee to include the __arglist overload, and if it’s more performant than the Object[] one, why not use it?
No; you’re thinking of
params
(which is an array; it’s merely syntactic sugar).I’m pretty sure somebody once wrote:
“The sole by-design purpose of __arglist is so that you can safely do interop calls to and from libraries written in C++ that use C-style var args. Do not use it for anything but that.”
(See http://stackoverflow.com/q/910585/124386 – first comment on the question.)
If you right-click on the date of the comment you get a direct link to the comment
http://stackoverflow.com/questions/910585/c-sharp-variable-length-args-which-is-better-and-why-arglist-params-array-o#comment720168_910585
Of course you can’t get a link sharing badge that way.
Oh, sure you can: http://stackoverflow.com/q/910585/304138#comment720168_910585 😉
I stand corrected
http://stackoverflow.com/help/badges/260/announcer?userid=304138
I blame the lack of of string identity on the existece of the evil, I mean, null references.
indeed, a billion dollars looks like a huge underestimate at this point
Pingback: The Morning Brew - Chris Alcock » The Morning Brew #1379
Does this mean that it is OK to have things like “r = a + b;” in your code? I’ve once read that you should replace it with “r = String.Concat(a, b);” because that’s (marginally) more performant. But if the compiler is clever enough to optimize it, I might as well keep using the + operator, right?
Readability is more important than pre-optimizations.
“String.Concat” clutters your codebase for no valuable gain.
Let the compiler optimize your code. Refactor to less readable code only if you need to.
The + operator has to be implemented somehow, and apparently it is implemented in terms of string.Concat. So I’d say replacing it with String.Concat is not useful. However, in case you are building a larger string with many concatenations, you should probably use a StringBuilder.
The only reason (in both Java and C#) to use a StringBuilder directly is if you’re using loops, because last time I checked at least the Java JIT wasn’t clever enough to generate optimal code.
That would probably be a rather useful optimization though (and would allow us programmers to make the code clearer), but I’m sure it’s not as simple as I’m thinking right now, certainly some evil edge case in there.
Thank you very much for this. My colleagues and I discuss this topic quite often at work.
This optimization is detectable by checking String.IsInterned. The documentation for interning says that the intern pool consists of “a single instance of each unique literal string constant declared in a program, as well as any unique instance of String you add programmatically.” In the case of “A” + “B”, the string that is interned is “AB”, which appears to contradict the documentation as stated, since the literal string constant “AB” does not appear in the program. It was introduced by the optimizer. I assume this is a documentation error, but it might surprise people who are using IsInterned for business logic.
Here’s another surprise with IsInterned:
string s = “AB”; // removing this line alters behavior
StringBuilder sb = new StringBuilder();
sb.Append(“A”);
sb.Append(“B”);
string guessme = string.IsInterned(sb.ToString());
I also have a pathological example where a string literal never enters the intern table (nor is any compiler-optimized concatenation of that string with any other string), which this margin is too narrow to contain.
(I should have made my guessme example more surprising by putting the line string s = “AB” *after* the call to string.IsInterned, to give the appearance of time travel.)
The word “literal” is unnecessary and misleading; it should simply say “string constant”.
Back in 2008 I had to write an application that did a lot of string manipulations and I ran into a lot of performance issues. When I used the Visual Studio Profiler, it told me that the chief offender was the string concatenation and the advice I had googled up was to use StringBuilder instead of string1 + string2. I did just that and the performance improved by 50%. Since then I almost always use StringBuilder when I need to concatenate more than 2 strings. I wonder, is there anything special about the way StringBuilder does the concatenation? The article that recommended it to me said that it does not create temporary copies (whereas the ‘normal’ concatenation does). So this is my question: is there a difference between the ‘normal’ concatenation and the way StringBuilder.Append() works? Thanks in advance.
Strings are immutable, so normal string concatenation creates a *new* string. At least naively, this means string concatenation has to copy the contents of both source strings to a completely new string object (string could be using some fancy data structure internally to avoid this, but my guess is that it does not). At the least it means that a new string object is allocated.
StringBuilder is mutable, it contains a buffer that can be changed / appended to directly. The Append() method does exactly that, it changes the StringBuilder by appending something to its internal buffer. The original contents of the StringBuilder don’t need to be copied, and there is no extra object creation.
So to summarize, a = b + c involves three String objects. The strings referred to by b and c remain unchanged, and a *new* String is assigned to a.
sb.Append(c) only involves one StringBuilder and one String, and the StringBuilder itself is changed.
Usually the simplicity of working with plain strings outweighs the performance benefit of StringBuilder, but if you concatenate long strings together or do many concatenations (in particular in loops) the advantage of StringBuilder can become significant.
If code says `someString = someString+”1″;` [both variables of type `string`], no other variable anywhere in the entire universe holds a reference to the string object referred to by variable `someString`, and the amount of space allocated for that object is large enough to accommodate another character, then having the code add another character to the the string object referred to by `someString` would be equivalent to generating a new string object containing the characters in `someString` plus the digit ‘1’. If, however, there exists somewhere in the universe some other reference to that same string object, modifying it cause code which expected to hold a reference to the unmodified sequence of characters to instead hold a reference to the modified sequence. Neither the compiler nor runtime have facilities for keeping track of whether a variable holds the only extant reference to an object, and so both are generally required to assume that such references might exist.
When code says `myStringBuilder.Append(“1”);` the `StringBuilder` knows that it holds the only reference to its buffer anywhere in the universe. Since nobody who holds a reference to the `StringBuilder` will be expecting to hold a reference to its unmodified value, it’s free to modify its buffer as it sees fit without having to create a new object.
Although there is some “magic” within the StringBuilder class which allows better performance than could be achieved without it, such magic is a relatively small part of why code that uses it can perform much better than code which builds strings by repeated concatenation.
CPython uses reference counting, so it can know that there is only a single reference to a given string. Thus, it can make the optimization of not having to copy a string if there is enough room after it to do the concatenation. Presumably other runtimes with immutable strings that use reference counting perform a similar optimization.
This will be the subject of part two.
What surprises me a bit in this story, is the choice of what to optimize. For example,
r = a + null;
is optimized, though it’s hard to imagine anyone actually writing such code.
On the other hand,
r = a + ‘.’;
is not, although I have seen that code appear in real world programs a lot.
If you look at other string manipulation methods, things like
a.IndexOf(“c”) -> a.IndexOf(‘c’)
a.ToString() -> a
‘c’.ToString() -> “c”
could be interesting as well, even if I they do assume compiler knowledge of the framework (but then again, so do the string concatenation optimizations).
I often have to review code written by others, and have written custom Code Analysis rules for these (and many other more interesting issues). Obviously, I am now experimenting with and counting on Roslyn CodeIssueProviders for this sort of thing.
In fact, it could be argued that things like
r = a + null;
might be better dealt with in a CodeIssueProvider, than in the compiler itself.
On the other hand, people might write
#if DEBUG
const suffix = ” (debug)”;
#else
const suffix = null;
#endif
…
s = s + suffix;
And in that case, we obviously do want the optimization.
Too useful, Thanks
I stopped at the Coverity booth at the Detroit Telematics Update in Novi, MI and mentioned your name, and that I’m a fan of your blog! They laughed.
So what about side-effects? Does it matter whether the string expressions to be concatenated are local variables or function calls?
Wow. That awakward day when you learn that strings in C# don’t make a monoid… or do they? Is “null” even a string? If not, then “pure” strings do form the free monoid (over the set of Unicode characters, sans issues with non-BMP stuff). However, “NullString + NullString != NullString” still breaks down Nullable<T> monad. I can understand why it was made this way, however — I still remember that one program where I concatenated like 30 strings together in SQL, and was extremely frustrated that I when I got a sudden NULL as a result, I had to look thorugh all 30 operands to find that one unlucky NULL-string. Wrapping every string in COALESCE(s, “”) solved the problem.
Pingback: String Concatenation using '+' operator | BlogoSfera
Pingback: Windows 8 Tips Featured In The Daily Six Pack: June 26, 2013
The C# compiler doesn’t know that String.Empty is actually “” and the identity of concatenation, so it doesn’t simplify “foo” + String.Empty. Another point for “” over ugly String.Empty.
Pingback: Why isn't the empty string an identity of concatenation? | BlogoSfera
Pingback: String Concatenation using ‘+’ operator | Ask Programming & Technology
Pingback: How to: What is the difference between String.Empty and "" (empty string)? | SevenNet
Pingback: Passing parameters by reference? Let me fix that for you | CL-UAT
Pingback: Mutability of string when string doesn't change in C#? | 我爱源码网
Pingback: Optimizing associative operations | Fabulous adventures in coding
I was wondering about string interning implications of this. In the code below why does A + B not get optimised and then Interned where “E” + “F” does ?
class Program
{
static void Main(string[] args)
{
string value1 = Console.ReadLine();
string A = “A”;
string B = “B”;
string a = A + B;
string b = value1 + A + B;
string c = A + B + value1;
string d = “D” + value1;
string ef = “E” + “F”;
Console.WriteLine(A);
Console.WriteLine(B);
Console.WriteLine(a);
Console.WriteLine(b);
Console.WriteLine(c);
Console.WriteLine(d);
Console.WriteLine(ef);
while (true)
{
Console.WriteLine(“Enter string value to check:”);
string value = Console.ReadLine();
Console.WriteLine(String.IsInterned(value)??”NotInterned”);
}
}
}
}
Output:
C
A
B
AB
CAB
ABC
DC
EF
Enter string value to check:
A
A
Enter string value to check:
B
B
Enter string value to check:
C
NotInterned
Enter string value to check:
AB
NotInterned
Enter string value to check:
CAB
NotInterned
Enter string value to check:
ABC
NotInterned
Enter string value to check:
D
D
Enter string value to check:
DC
NotInterned
Enter string value to check:
E
NotInterned
Enter string value to check:
F
NotInterned
Enter string value to check:
EF
EF
Enter string value to check:
Pingback: Конкатенация строк с использованием оператора «+» — Вопросы и ответы по программированию