Saturday, October 22. 2011
Problems with NULL Large Object Data Types and 64-bit Replication
I recently encountered an interesting problem where constraint checks on VARBINARY(MAX)
columns were failing during synchronization for merge-replicated clients. After a lot of investigation, I have found that when a 64-bit SQL Server 2008 R2 subscriber synchronizes a table which contains large object data types on an article published with @stream_blob_columns = N'true'
any constraint checks which depend on the values being NULL will fail. Note that 32-bit clients work properly and that without the constraint checks the final value would be NULL, but apparently during synchronization it is in a temporary state of being non-NULL and fails before reaching that point.
Although the release notes in KB2463333 do not mention a fix for this problem, it appears that Service Pack 1 for SQL Server 2008 R2 at least contains a workaround. When testing with SP1, the insert fails due to the constraint violation but the retry succeeds. However, even with the workaround, performance problems related to the retry are a legitimate concern. I would suggest considering disabling @stream_blob_columns
on the affected articles for anyone encountering this problem.
Wednesday, July 13. 2011
String Concatenation with XML and Entity Encoding Problems
The Problem
Unfortunately, SQL Server does not provide a built-in string concatenation aggregate operator, a la GROUP_CONCAT
from MySQL. Because of this deficiency, many workarounds, hacks, and custom code has been written to accomplish this goal. An excellent survey of some of the available methods was compiled by Anith Sen in his post Concatenating Row Values in Transact-SQL. For anyone using "The blackbox XML methods" to concatenate strings, there is at least one more important caveat to be aware of: XML Entity Encoding. Consider the following example:
INSERT INTO @Strings VALUES ('x < y when: ');
INSERT INTO @Strings VALUES ('x = 1 &&');
INSERT INTO @Strings VALUES ('y = 2');
SELECT Text + ' '
FROM @Strings
FOR XML PATH('')
This results in the text x < y when: x = 1 && y = 2
. But why? The answer is that when an xml value is cast to a string (as it is implicitly in the above query, because the TYPE directive in the FOR XML statement is omitted) it is entity-encoded for interpretation by an XML parser. Without entity encoding, the string would not parse into equivalent XML, if it was even valid XML at all.
A Solution
To avoid this problem, we can query the value of the resulting XML expression in a way that avoids entity encoding. This can be done using the value() method of the xml type as follows:
INSERT INTO @Strings VALUES ('x < y when: ');
INSERT INTO @Strings VALUES ('x = 1 &&');
INSERT INTO @Strings VALUES ('y = 2');
SELECT (SELECT Text + ' '
FROM @Strings
FOR XML PATH(''), TYPE).VALUE('.', 'NVARCHAR(4000)')
Note the addition of the TYPE directive to the FOR XML statement, it is required. Also note that NVARCHAR(4000)
can be replaced with NVARCHAR(MAX)
or any other SQL data type, as appropriate. Since the value method does not escape its output, the text is returned unchanged.
For the curious, I have attempted to benchmark the differences between these two methods and encountered inconsistent and conflicting results. In most scenarios using the value method was slightly slower than the alternative, but the difference seems to be overwhelmed by other minor differences in test configuration. As always, your mileage may vary.
Friday, May 6. 2011
Corrupted MSmerge_contents90_forall.bcp in Dynamic Snapshot
Executive Summary
A problem with dynamic snapshot generation may cause some BCP files to contain stale data at the end of the file. Re-creation of the dynamic snapshot does not solve the problem. Removing the old dynamic snapshot before re-creating it may solve the problem temporarily.
Background
I am using Merge Replication with SQL Server 2008 R2 (server and clients) and recently encountered the following error message during synchronization on one of the subscribers:
This message only appeared on one of the subscribers. It was completely reproducible and occurred every time the subscriber attempted to synchronize. Reinitializing the subscriber and recreating the snapshot had no effect. The error message would also appear on a test machine set to use the same partition (via setting the HOST_NAME property to match the subscriber with the errors).
Note that this problem appears to be the same as the one reported in Microsoft Connect #646157, that was closed as a duplicate of #646156... which appears to be inaccessible. How very frustrating!
Investigation Technical Details
One of the initial steps that I used to try to isolate the problem was to copy MSmerge_contents90_forall.bcp (which is listed in the bcp invocation in the error message) to a test machine and attempt to load it. This can be done using the SQL BULK INSERT statement, or using the bcp utility. I tried the following SQL:
FROM 'C:\File\Path\MSmerge_contents90_forall.bcp'
WITH (DATAFILETYPE = 'widenative')
Which produced the following output:
It confirms that the file is invalid and we have a row number that may be close to where the error occurs in the file, but not enough to isolate the problem yet. So, next a try with bcp. Unfortunately, the bcp command in the synchronization error message is not applicable to replication configurations using snapshots in native format (the default). Instead, I used the following command:
A description of the meaning of these command options can be found on the bcp Utility page on MSDN. The command produced the following output:
At least some of the rows were successfully loaded. The rows saved to errfile.dat, which could not be loaded, do not appear to be sane (e.g. negative values in the tablenick column), suggesting some sort of data corruption. But again, no real indication of what is happening.
At this point I was lost. I looked at SQL Profiler traces during snapshot creation and poked around in the data without success. I decided to write a bcp file parser to determine the exact source and nature of the corruption. What I found was 11 bytes which were invalid:
001c820: ffff ff10 48a2 984a 33cb c046 a44e 1d9d ....H..J3..F.N.. 001c830: b826 5368 ffff ff84 c1b3 d97a 56cd e5ff .&Sh.......zV... 001c840: ffff d088 0b05 109b 6e58 9e16 c611 e0ba ........nX...... 001c850: 8800 145e 281b 87ed 0c00 0000 0000 0008 ...^(........... 001c860: 700c 0000 0000 0000 0b00 ece6 e53e de7e p............>.~ 001c870: 0400 0000 ffff ff10 7482 32a1 50b6 4d4a ........t.2.P.MJ
If these bytes were removed, the file would parse completely without error. Now we are closer to a real cause, and perhaps a solution.
Next I recreated the dynamic snapshot for the partition while running Process Monitor (a tool I highly recommend) and looked for accesses to MSmerge_contents90_forall.bcp, particularly at the offset where the corruption begins. What I found is that data is written up to the byte before this offset, but not at this byte or after it (by any process). Looking back further in the log revealed that the file was opened with disposition OpenIf, rather than CreateIf, meaning that it was not truncated. Also, there is no call to SetEndOfFileInformationFile/SetAllocationInformationFile made to truncate the file or resize it after writing. Eureka! We've found the problem!
If the size of MSmerge_contents90_forall.bcp (or any other bcp file) shrinks between the previous snapshot and the current snapshot, stale data will be left at the end of the file and errors will occur (unless it happens to end on a row boundary, in which case the stale data will be loaded, potentially causing future errors). The workaround was simple: Delete the folder for the dynamic partition (or individual bcp files with errors) and recreate the snapshot.
Best of luck in solving your problems.
Sunday, February 6. 2011
Very Slow Data Repartitioning in SQL Server Replication with Precomputed Partitions
Background
I am using Merge Replication with SQL Server 2008 R2 (server and clients). I have setup a publication with a rather complex filtering hierarchy rooted in a table with Parameterized Row Filters extended many times through Join Filters. Making modifications to data in a table near the root of this hierarchy would trigger repartitioning of the replicated data which would never complete and would cause deadlock errors in the other connections to the database while it ran (I let it run overnight once in single user mode, but had to kill it after 13 hours...).
Investigation Technical Details
After a lot of watching in SQL Profiler and digging around in the triggers/procedures which perform the repartitioning I found the culprit. The replication DML trigger on the table (MSMerge_op_GUID
) called the repartitioning procedure (MSmerge_expand_sp_GUID
) which included several queries with the following subquery:
FROM dbo.MSmerge_current_partition_mappings cpm WITH (ROWLOCK)
INNER JOIN dbo.MSmerge_contents mc2 WITH (ROWLOCK) ON cpm.rowguid = mc2.rowguid AND mc2.marker = @marker
Looking at the execution plan for any of the queries showed that this subquery was responsible for at least 40% of the total query cost. Both of these tables are quite large (~800k and ~425k rows respectively in my DB) and neither had indexes to cover this (sub-)query.
Solution
So, of course, I added the following indexes (with naming conventions to match the existing indexes):
ON dbo.MSmerge_current_partition_mappings (rowguid, partition_id);
CREATE INDEX nc6MSmerge_contents
ON dbo.MSmerge_contents (marker, rowguid);
After adding these indexes, the repartitioning operations completed in under 20 minutes!
Caveat
Both of these tables are heavily used and often modified (depending on the workload), so adding more indexes may not be the best solution for databases with high-performance requirements where repartitioning is rare and non-repartitioning operations are the norm. If that is the case, I suggest creating the above indexes before major repartitioning operations and removing them once the repartition is complete. However, that being said, I have been able to leave these indexes on the tables with no noticeable performance impact and a significant reduction in execution time for data modifications which involve repartitioning data.
Friday, June 25. 2010
SQL Server Missing from Synchronization Manager
Symptoms
After (re-)creating a subscription to a pull merge replication publication from SQL Server Express 2005, the subscription fails to appear in Synchronization Manager. After further investigation, the symptom was determined to be restricted to non-Administrators.
Things To Check
- Make sure
sp_addmergepullsubscription_agent
was run with@enabled_for_syncmgr = 'TRUE'
. This requirement differs from previous SQL Server versions where this was the default. When this parameter is not set to 'TRUE', the subscription will not appear in Synchronization Manager - Make sure the subscription can be synchronized outside of Synchronization Manager (to confirm that it is a problem when run through replsync.dll - in Synchronization Manager). The easiest way to do this is using replmerg from the command-line.
- Make sure the user has permissions to write to
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\80\Replication\Subscriptions
. Without write permission, SQL Server will silently fail to appear in Synchronization Manager. Also, by default in many configurations, non-Administrators do not have write access to this key so it must be adjusted manually.
The last item is particularly important and required quite a bit of my time to determine... which resulted in the need for this post. Hopefully one of the above suggestions will help you avoid spending the same amount of time that I did to solve this problem.
Sunday, June 20. 2010
Triggers, NOT FOR REPLICATION Identity, and Merge Replication
Initial Note/Warning
I hope that there is a more elegant, simple, and robust method for dealing with the issues described in this post than the one that I present herein. My method is a rather ugly kludge that relies on undocumented features of SQL Server and is not nearly as parallelizable as I would like. If you are aware of a better method (which I hope exists), please don't hesitate to post it in the comments. Until a better method is posted, I invite you to use and/or learn from the method presented in this post.
Background
Consider an SQL Server 2005 database which is published for merge
replication from a single publisher with multiple subscribers. The database
contains 2 tables which we are considering:
Forms
and
InboxForms
.
Forms
contains some sort of form data and
InboxForms
references all forms which are
present in a user's "inbox". Each of the tables contains an
INT IDENTITY
primary key column and several other
columns of data that are not relevant to the problem at hand. The publication
is filtered based on the Forms that each user is able to view (determined by
some separate mechanism not described here). When a new row is inserted into
Forms
, a trigger is used to route the forms
into the appropriate inbox(es).
The Problem
The routing can not occur (exclusively) at the subscribers, because the
filter for the subscriber will not necessarily include
user information for the recipient of the form and the form can not be placed
into InboxForms
if it recipient user does not
exist on the subscriber. So, the trigger must run on the publisher during
synchronization when inserts
are performed by the merge agent (i.e. the trigger must not be marked
NOT FOR REPLICATION
). However, in this configuration,
when the merge agent runs, the following error message is produced:
Explicit value must be specified for identity column in table 'InboxForms' either when IDENTITY_INSERT is set to ON or when a replication user is inserting into a NOT FOR REPLICATION identity column.
The problem is that the INT IDENTITY
column in
InboxForms
was marked NOT FOR
REPLICATION
when the publication was created in order to facilitate
automatic identity range management, which is described in the
Replicating
Identity Columns article in the SQL Server Books Online.
NOT FOR REPLICATION
behaves very similarly to
IDENTITY_INSERT
(as hinted at in the error message), such that when a row is inserted by the
merge agent, the ideneity seed value is not incremented and the value of the
identity column must be explicitly specified. Note, however, that it is not
the same mechanism as IDENTITY_INSERT, so changing IDENTITY_INSERT in the
trigger will not remove the requirement for explicitly specified identity
values.
The Solution
One method to solve this problem is to disable NOT FOR
REPLICATION
, as suggested in
KB908711 (which
specifically addresses this issue). However, using this option will
interfere with automatic identity range management, since the identity values
generated on the subscribers can not be copied to the publisher, and other
steps will need to be taken to manually manage identity values. For me, this
was an unacceptably high price to pay and another solution was required.
A solution which does not interfere with automatic identity range
management is to calculate values for the identity columns and explicitly
specify those values when they are required. In order to accomplish this,
understanding several features of T-SQL is required: In order to determine
when the values are required, the trigger needs to test if it is being run
from the merge replication agent. This can be done by testing for the
'replication_agent'
SESSIONPROPERTY
.
In order to determine appropriate values for the identity column, use
IDENT_CURRENT
and
IDENT_INCR
.
Note that using the maximum value for the identity column is not necessarily
correct because the maximum identity range will not necessarily be allocated
to the publisher.
DBCC
CHECKIDENT
can be used to update the identity seed value (which
is not affected by explicitly inserted identity values).
One other complicating factor in our implementation is that there is no way to atomically insert explicit identity values and update the identity seed value. Therefore, locking is required to prevent multiple connections from simultaneously updating the values and causing collisions. (Or collision errors must be caught and retried) In the following implementation, an exclusive table lock is acquired which prevents any inserts from occurring on the table when the trigger is running. This is a serious performance problem as it prevents any other operations on the locked table from completing while the trigger is executing. Keep this in mind when designing the queries that will run while the lock is held.
Now, without further ado, here's the trigger:
ALTER TRIGGER TR_RouteForms
ON dbo.Forms
AFTER INSERT
AS
BEGIN
-- Client isn't expecting routing counts from their insert
SET NOCOUNT ON;
IF SESSIONPROPERTY('replication_agent') <> 0
BEGIN
-- Running from the replication agent
-- Need explicit value for NOT FOR REPLICATION IDENTITY columns
-- Use transaction to limit lock scope
BEGIN TRAN
-- Variables for IDENT_CURRENT and IDENT_INCR, required
-- because DBCC CHECKIDENT syntax won't support nested parens
DECLARE @Ident INT, @Incr INT;
-- RowCnt used to preserve @@ROWCOUNT
DECLARE @RowCnt INT;
-- Must acquire exclusive lock on InboxForms to prevent other
-- inserts (which would invalidate the identity and cause
-- collisions in the identity column).
-- Select into variable to prevent resultset going to client
-- WHERE clause quickly evaluated, returns small (empty) result
DECLARE @Dummy INT;
SELECT @Dummy = InboxFormID
FROM InboxForms WITH (TABLOCK, XLOCK, HOLDLOCK)
WHERE InboxFormID = 0;
-- Perform the form routing (inserts into InboxForms)
SET @Ident = IDENT_CURRENT('InboxForms');
SET @Incr = IDENT_INCR('InboxForms');
INSERT INTO InboxForms (InboxFormID, FormID, ...)
SELECT @Ident + @Incr * ROW_NUMBER() OVER (ORDER BY FormID) AS InboxFormID, FormID, ...
FROM inserted
WHERE ...routing criteria...
SET @RowCnt = @@ROWCOUNT;
IF @RowCnt > 0
BEGIN
-- At least 1 form was routed, update the identity seed value
-- Note: Can't use MAX(InboxFormID) since publisher may not
-- have been allocated the maximum identity range
SET @Ident = @Ident + @Incr * @RowCnt;
DBCC CHECKIDENT (InboxForms, RESEED, @Ident)
WITH NO_INFOMSGS;
END
COMMIT TRAN
END
ELSE
BEGIN
-- NOT running from the replication agent
-- Can insert normally into NOT FOR REPLICATION IDENTITY columns
-- Perform the form routing (inserts into InboxForms)
INSERT INTO InboxForms (InboxFormID, FormID, ...)
SELECT @Ident + @Incr * ROW_NUMBER() OVER (ORDER BY FormID) AS InboxFormID, FormID, ...
FROM inserted
WHERE ...routing criteria...
END
END
Tuesday, May 11. 2010
Making Custom Replication Resolvers Work in SQL Server 2005
Background
SQL Server provides a very convenient method for implementing custom business logic in coordination with the synchronization/merge process of replication. For tasks which need to be done as data is synchronized, or decisions about resolving conflicts which are business-specific, implementing a custom resolver is a surprisingly straight-forward way to go. For more information, check out the following resources:
- How to: Implement a Business Logic Handler for a Merge Article (Replication Programming)
- How to: Implement a COM-Based Custom Conflict Resolver for a Merge Article (Replication Programming)
- BusinessLogicModule Class
Making It Work
Things are never quite as easy as they seem... Chances are, some sort of error message was generated once the DLL was deployed and the instructions in (1) were completed. For example, the following error is common:
Don't Panic. First, check that the DLL has been placed in the directory containing the merge agent on the subscriber (assuming this is being done with pull - place the DLL on the server for push) or registered in the GAC. This message can also indicate dependency problems for the DLL where dependent libraries can't be found/loaded. One way to test that the assembly can be loaded is to compile and run the following program in the same directory as the merge agent:
class Program
{
static void Main(string[] args)
{
TryLoadType(ASSEMBLY_NAME, CLASS_NAME);
// Leave window visible for non-CLI users
Console.ReadKey();
}
static void TryLoadType(string assemblyname, string typename)
{
try
{
Assembly asm = Assembly.Load(assemblyname);
if (asm == null)
{
Console.WriteLine("Failed to load assembly");
return;
}
Type type = asm.GetType(typename);
if (type == null)
{
Console.WriteLine("Failed to load type");
return;
}
ConstructorInfo constr = type.GetConstructor(new Type[0]);
if (constr == null)
{
Console.WriteLine("Failed to find 0-argument constructor");
return;
}
object instance = constr.Invoke(new object[0]);
Console.WriteLine("Successfully loaded " + type.Name);
}
catch (Exception ex)
{
Console.Error.WriteLine("Error loading type: " + ex.Message);
}
}
}
Note: It is very important to correctly determine where the merge agent executable is and where it is being run from when testing. The DLL search path includes both the directory in which the executable file exists and the directory from which it is run (for weak-named assemblies). replmerg.exe usually lives in C:\Program Files\Microsoft SQL Server\90\COM, but mobsync.exe (if you are using Synchronization Manager or Sync Center) is in C:\WINDOWS\system32, and this will have an effect on the assembly search path.
Make sure the names are exactly as they were specified in sp_registercustomresolver
. If the problem was a misnamed assembly or class (because you are like me and fat-fingered the name...) here's how you fix it: sp_registercustomresolver
can be re-run with the same @article_resolver
parameter to overwrite the information for that resolver. This overwrites the information stored in the registry at HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\90\Replication\ArticleResolver
(or a similar location for different versions/configurations). However, if the resolver has already been attached to an article, the information is also stored in sysmergearticles
in the article_resolver
(assembly name), resolver_clsid
(CLSID), and resolver_info
(.NET class name) columns. So, run an UPDATE on these columns to fix errors, as appropriate.
Good Luck!
Tuesday, April 27. 2010
identityrangemanagement and read-only tables
As a follow-up to yesterday's problems with exhausted identity ranges, I have come to a realization which I am now adopting as a best practice and I highly recommend for others:
@subscriber_upload_options = 2
), always set @identityrangemanagementoption = 'none'
.
The reasoning behind this advice is that when @identityrangemanagementoption = 'auto'
(the default in Management Studio), identity ranges will be allocated for all subscribers, even though they will never use these ranges because the table article is read-only. Yet, if the table has a small datatype for a primary key (such as tinyint or smallint) it is quite likely that the identity range will be exhausted (with the default identity range being 1000, this doesn't take many subscribers) and when this happens synchronization will cease.
If there is a reason for allocating these ranges to read-only table articles or for leaving automatic identity range mangement enabled, please post it in the comments.
Monday, April 26. 2010
Troubleshooting Failed to allocate new identity range
- The Publisher failed to allocate a new set of identity ranges for the subscription. This can occur when a Publisher or a republishing Subscriber has run out of identity ranges to allocate to its own Subscribers or when an identity column data type does not support an additional identity range allocation. If a republishing Subscriber has run out of identity ranges, synchronize the republishing Subscriber to obtain more identity ranges before restarting the synchronization. If a Publisher runs out of identit (Source: MSSQL_REPL, Error number: MSSQL_REPL-2147199417) Get help: http://help/MSSQL_REPL-2147199417
- Not enough range available to allocate a new range for a subscriber. (Source: MSSQLServer, Error number: 20668) Get help: http://help/20668
- Failed to allocate new identity range. (Source: MSSQLServer, Error number: 21197) Get help: http://help/21197
First, some background: I am using a (very) simple replication topology with a single publisher running SQL Server 2005 and ~10 subscribers running SQL Server 2005 Express (all with the latest service packs and patches, of course). No re-publishing servers, multiple publications, etc. The error occurred not long after I had added 2 new subscribers (which turns out to be quite important).
Note: For replication topologies which do include a re-publishing server, the error may be explained by KB941989, which can be solved by installing Cumulative Update 4 (or later updates including the fixes).
To find the cause of this error message, some technical information about how identity values are managed during replication is needed. A key source of information for table articles with identity columns is the MSmerge_identity_range table. This table contains a row for each range of values assigned to each subscriber for each article as well as an additional row holding the entire range of identity values for the table article (this is the row with is_pub_range = 1). In order to make sense of this table, the artid column can be joined to the sysmergearticles table and subid can be joined to the sysmergesubscriptions table. Take note of the max_used column (which only has a value for pub_range rows); this column holds the end of the maximum range that has been allocated for the article (technically 1-past-the-last, since it holds the max of next_range_end, but conceptually it is the end). When this value is near the end of the pub_range, then the available identity ranges are almost exhausted.
One quick way to determine if the problem is due to exhaustion of the available identity range, is with a query similar to the following:
SELECT name, range_begin, range_end, max_used
FROM MSmerge_identity_range mir
INNER JOIN sysmergearticles sma ON mir.artid = sma.artid
WHERE is_pub_range = 1 AND range_end <= max_used + pub_range
Note: In the (very unusual) case that the subscribers are given larger ranges than the publisher(s), replace pub_range with sub_range in the above query. If this query returns any rows, then the problem has been identified. The table articles listed have exhausted their available identity ranges. If the range_end value is below the maximum value for the identity datatype, then it should be possible to increase this value and solve the problem quite easily (Note: I have not found a documented way to do this "safely" and modifying this table directly is probably dangerous, but it solved the problem for me. YMMV). If the range_end value is the maximum for the datatype, then the underlying datatype will likely need to be changed (or significant compaction of the values will need to be done, somehow) and the article dropped and re-added to the publication.
Other sources that I have come across suggested running the sp_adjustpublisheridentityrange stored procedure. It should be safe to run in any case, although I didn't find it particularly useful.
Another suggestion for similar errors that was suggested on the MSDN Forums is to check the identity value on the publisher and that it is within the range allocated to the publisher (both in MSmerge_identity_range and in the check constraint that replication adds to the table for checking the identity range value).
If all of the above suggestions fail, my best suggestion is to watch the synchronization in SQL Profiler (for server versions where SQL Profiler is not available, AnjLab SQLProfiler is a nice alternative) and to examine some of the replication stored procedures using Management Studio (right click->Modify to see the SQL of the SP) to get ideas for what might be going wrong. In particular, consider examining sp_adjustpublisheridentityrange, MSget_identity_range_info, MSpub_adjust_identity, and similar identity management stored procedures.
Good Luck!