Friday, May 6, 2011
Corrupted MSmerge_contents90_forall.bcp in Dynamic Snapshot
Executive Summary
A problem with dynamic snapshot generation may cause some BCP files to contain stale data at the end of the file. Re-creation of the dynamic snapshot does not solve the problem. Removing the old dynamic snapshot before re-creating it may solve the problem temporarily.
Background
I am using Merge Replication with SQL Server 2008 R2 (server and clients) and recently encountered the following error message during synchronization on one of the subscribers:
This message only appeared on one of the subscribers. It was completely reproducible and occurred every time the subscriber attempted to synchronize. Reinitializing the subscriber and recreating the snapshot had no effect. The error message would also appear on a test machine set to use the same partition (via setting the HOST_NAME property to match the subscriber with the errors).
Note that this problem appears to be the same as the one reported in Microsoft Connect #646157, which was closed as a duplicate of #646156... which appears to be inaccessible. How very frustrating!
Investigation Technical Details
One of the initial steps that I used to try to isolate the problem was to copy MSmerge_contents90_forall.bcp (which is listed in the bcp invocation in the error message) to a test machine and attempt to load it. This can be done using the SQL BULK INSERT statement, or using the bcp utility. I tried the following SQL:
FROM 'C:\File\Path\MSmerge_contents90_forall.bcp'
WITH (DATAFILETYPE = 'widenative')
Which produced the following output:
This confirms that the file is invalid and gives a row number that may be close to where the error occurs in the file, but that is not enough to isolate the problem yet. Next, I tried bcp. Unfortunately, the bcp command in the synchronization error message is not applicable to replication configurations using snapshots in native format (the default). Instead, I used the following command:
A description of the meaning of these command options can be found on the bcp Utility page on MSDN. The command produced the following output:
At least some of the rows were successfully loaded. The rows saved to errfile.dat, which could not be loaded, do not appear to be sane (e.g. negative values in the tablenick column), suggesting some sort of data corruption. But again, no real indication of what is happening.
At this point I was lost. I looked at SQL Profiler traces during snapshot creation and poked around in the data without success. I decided to write a bcp file parser to determine the exact source and nature of the corruption. What I found was 11 bytes which were invalid:
001c820: ffff ff10 48a2 984a 33cb c046 a44e 1d9d  ....H..J3..F.N..
001c830: b826 5368 ffff ff84 c1b3 d97a 56cd e5ff  .&Sh.......zV...
001c840: ffff d088 0b05 109b 6e58 9e16 c611 e0ba  ........nX......
001c850: 8800 145e 281b 87ed 0c00 0000 0000 0008  ...^(...........
001c860: 700c 0000 0000 0000 0b00 ece6 e53e de7e  p............>.~
001c870: 0400 0000 ffff ff10 7482 32a1 50b6 4d4a  ........t.2.P.MJ
If these bytes were removed, the file would parse completely without error. Now we are closer to a real cause, and perhaps a solution.
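For anyone who wants to look at the bytes around a suspect offset in their own bcp file, here is a minimal Python sketch that prints a region in the same offset/hex/ASCII layout as the dump above. It is not the parser I wrote, and it knows nothing about the bcp row format; the function names hexdump and dump_region are made up for this example.

```python
def hexdump(data: bytes, base_offset: int = 0) -> list[str]:
    """Format bytes as lines of: 7-digit hex offset, 2-byte hex groups, ASCII."""
    lines = []
    for i in range(0, len(data), 16):
        chunk = data[i:i + 16]
        # Group the hex digits two bytes at a time, e.g. "ffff ff10 ...".
        groups = [chunk[j:j + 2].hex() for j in range(0, len(chunk), 2)]
        # 8 groups of 4 digits plus 7 separating spaces = 39 columns.
        hex_part = " ".join(groups).ljust(39)
        # Non-printable bytes are shown as "." in the ASCII column.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{base_offset + i:07x}: {hex_part}  {ascii_part}")
    return lines


def dump_region(path: str, offset: int, length: int = 96) -> list[str]:
    """Read `length` bytes starting at `offset` and return hexdump lines."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hexdump(f.read(length), base_offset=offset)
```

Calling print("\n".join(dump_region(path, 0x1c820))) on a copy of the file shows 96 bytes starting at that offset, where path is wherever you copied the bcp file.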
Next I recreated the dynamic snapshot for the partition while running Process Monitor (a tool I highly recommend) and looked for accesses to MSmerge_contents90_forall.bcp, particularly at the offset where the corruption begins. What I found is that data is written up to the byte before this offset, but no process writes at this byte or after it. Looking back further in the log revealed that the file was opened with disposition OpenIf (open the existing file, or create it if it is missing) rather than a truncating disposition such as OverwriteIf, meaning that the existing contents were not truncated. Also, no SetEndOfFileInformationFile/SetAllocationInformationFile call is made to truncate or resize the file after writing. Eureka! We've found the problem!
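The effect is easy to reproduce outside of SQL Server. The following Python sketch is only an analogy: it uses os.open flags rather than the Windows APIs the snapshot agent actually calls, and the file name, row contents, and function names are invented for illustration. Opening with O_CREAT alone behaves like open-or-create without truncation, while adding O_TRUNC discards the old contents first.

```python
import os
import tempfile

# Hypothetical "rows"; real bcp files are binary and look nothing like this.
OLD = b"OLD-ROW-1|OLD-ROW-2|OLD-ROW-3|"
NEW = b"NEW-ROW-001|"

# O_BINARY only exists on Windows; fall back to 0 elsewhere.
_BINARY = getattr(os, "O_BINARY", 0)


def write_file(path: str, data: bytes, truncate: bool) -> None:
    """Open-or-create `path` and write `data` from offset 0."""
    flags = os.O_WRONLY | os.O_CREAT | _BINARY
    if truncate:
        flags |= os.O_TRUNC  # discard any existing contents on open
    fd = os.open(path, flags)
    # Wrapping an existing descriptor does not truncate the file.
    with os.fdopen(fd, "wb") as f:
        f.write(data)


def read_file(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()


def demo() -> tuple[bytes, bytes]:
    """Return (contents without truncation, contents with truncation)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "example.bcp")
        write_file(path, OLD, truncate=False)  # the "previous snapshot"
        write_file(path, NEW, truncate=False)  # a smaller "new snapshot"
        stale = read_file(path)
        write_file(path, NEW, truncate=True)
        clean = read_file(path)
    return stale, clean
```

Without truncation the file keeps its old length, and everything past the end of the new data is left over from the previous write, starting in the middle of an old row.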
If the size of MSmerge_contents90_forall.bcp (or any other bcp file) shrinks between the previous snapshot and the current snapshot, stale data will be left at the end of the file and errors will occur (unless the stale data happens to begin on a row boundary, in which case it will be loaded without complaint, potentially causing future errors). The workaround was simple: delete the folder for the dynamic partition (or the individual bcp files with errors) and recreate the snapshot.
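If you would rather script the cleanup than delete files by hand, something along these lines should do. It is a sketch under assumptions: the partition folder path is something you supply yourself (I am not assuming any particular snapshot folder layout), the function name is made up, and it defaults to a dry run so you can check the list before anything is removed.

```python
from pathlib import Path


def remove_bcp_files(partition_dir: str, dry_run: bool = True) -> list[Path]:
    """Find (and optionally delete) every .bcp file under `partition_dir`.

    Returns the list of matching files. Nothing is deleted unless
    dry_run is False.
    """
    root = Path(partition_dir)
    if not root.is_dir():
        raise NotADirectoryError(partition_dir)
    targets = sorted(p for p in root.rglob("*.bcp") if p.is_file())
    if not dry_run:
        for p in targets:
            p.unlink()
    return targets
```

Run it once with the default dry_run=True to review the list, then again with dry_run=False, and recreate the dynamic snapshot afterwards so the files are written from scratch.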
Best of luck in solving your problems.