Wednesday, July 13. 2011
String Concatenation with XML and Entity Encoding Problems
The Problem
Unfortunately, SQL Server does not provide a built-in string concatenation aggregate operator, a la GROUP_CONCAT
from MySQL. Because of this deficiency, many workarounds, hacks, and custom code has been written to accomplish this goal. An excellent survey of some of the available methods was compiled by Anith Sen in his post Concatenating Row Values in Transact-SQL. For anyone using "The blackbox XML methods" to concatenate strings, there is at least one more important caveat to be aware of: XML Entity Encoding. Consider the following example:
INSERT INTO @Strings VALUES ('x < y when: ');
INSERT INTO @Strings VALUES ('x = 1 &&');
INSERT INTO @Strings VALUES ('y = 2');
SELECT Text + ' '
FROM @Strings
FOR XML PATH('')
This results in the text x < y when: x = 1 && y = 2
. But why? The answer is that when an xml value is cast to a string (as it is implicitly in the above query, because the TYPE directive in the FOR XML statement is omitted) it is entity-encoded for interpretation by an XML parser. Without entity encoding, the string would not parse into equivalent XML, if it was even valid XML at all.
A Solution
To avoid this problem, we can query the value of the resulting XML expression in a way that avoids entity encoding. This can be done using the value() method of the xml type as follows:
INSERT INTO @Strings VALUES ('x < y when: ');
INSERT INTO @Strings VALUES ('x = 1 &&');
INSERT INTO @Strings VALUES ('y = 2');
SELECT (SELECT Text + ' '
FROM @Strings
FOR XML PATH(''), TYPE).VALUE('.', 'NVARCHAR(4000)')
Note the addition of the TYPE directive to the FOR XML statement, it is required. Also note that NVARCHAR(4000)
can be replaced with NVARCHAR(MAX)
or any other SQL data type, as appropriate. Since the value method does not escape its output, the text is returned unchanged.
For the curious, I have attempted to benchmark the differences between these two methods and encountered inconsistent and conflicting results. In most scenarios using the value method was slightly slower than the alternative, but the difference seems to be overwhelmed by other minor differences in test configuration. As always, your mileage may vary.