Merge replication stopping without errors in SQL 2008 R2

A non-SQL MVP friend of mine, who also happens to be a client, asked me for some help again last week. I was planning on writing this up even before Rob Volk (@sql_r) listed his T-SQL Tuesday topic for this month.

TSQL2sDay150x150Earlier in the year, I (well, LobsterPot Solutions, although I’d been the person mostly involved) had helped out with a merge replication problem. The Merge Agent on the subscriber was just stopping every time, shortly after it started. With no errors anywhere – not in the Windows Event Log, the SQL Agent logs, not anywhere. We’d managed to get the system working again, but didn’t have a good reason about what had happened, and last week, the problem occurred again. I asked him about writing up the experience in a blog post, largely because of the red herrings that we encountered. It was an interesting experience for me, also because I didn’t end up touching my computer the whole time – just tapping on my phone via Twitter and Live Msgr.

You see, the thing with replication is that a useful troubleshooting option is to reinitialise the thing. We’d done that last time, and it had started to work again – eventually. I say eventually, because the link being used between the sites is relatively slow, and it took a long while for the initialisation to finish. Meanwhile, we’d been doing some investigation into what the problem could be, and were suitably pleased when the problem disappeared.

So I got a message saying that a replication problem had occurred again. Reinitialising wasn’t going to be an option this time either.

In this scenario, the subscriber having the problem happened to be in a different domain to the publisher. The other subscribers (within the domain) were fine, just this one in a different domain had the problem.

Part of the problem seemed to be a log file that wasn’t being backed up properly. They’d been trying to back up to a backup device that had a corruption, and the log file was growing. Turned out, this wasn’t related to the problem, but of course, any time you’re troubleshooting and you see something untoward, you wonder.

Having got past that problem, my next thought was that perhaps there was a problem with the account being used. But the other subscribers were using the same account, without any problems.

The client pointed out that that it was almost exactly six months since the last failure (later shown to be a complete red herring). It sounded like something might’ve expired. Checking through certificates and trusts showed no sign of anything, and besides, there wasn’t a problem running a command-prompt window using the account in question, from the subscriber box.

…except that when he ran the sqlcmd –E –S servername command I recommended, it failed with a Named Pipes error. I’ve seen problems with firewalls rejecting connections via Named Pipes but letting TCP/IP through, so I got him to look into SQL Configuration Manager to see what kind of connection was being preferred… Everything seemed fine. And strangely, he could connect via Management Studio. Turned out, he had a typo in the servername of the sqlcmd command. That particular red herring must’ve been reflected in his cheeks as he told me.

During the time, I also pinged a friend of mine to find out who I should ask, and Ted Kruger (@onpnt) ‘s name came up. Ted (and thanks again, Ted – really) reconfirmed some of my thoughts around the idea of an account expiring, and also suggesting bumping up the logging to level 4 (2 is Verbose, 4 is undocumented ridiculousness). I’d just told the client to push the logging up to level 2, but the log file wasn’t appearing. Checking permissions showed that the user did have permission on the folder, but still no file was appearing. Then it was noticed that the user had been switched earlier as part of the troubleshooting, and switching it back to the real user caused the log file to appear.

Still no errors. A lot more information being pushed out, but still no errors.

Ted suggested making sure the FQDNs were okay from both ends, in case the servers were unable to talk to each other. DNS problems can lead to hassles which can stop replication from working. No luck there either – it was all working fine.

Another server started to report a problem as well. These two boxes were both SQL 2008 R2 (SP1), while the others, still working, were SQL 2005.

Around this time, the client tried an idea that I’d shown him a few years ago – using a Profiler trace to see what was being called on the servers. It turned out that the last call being made on the publisher was sp_MSenumschemachange. A quick interwebs search on that showed a problem that exists in SQL Server 2008 R2, when stored procedures have more than 4000 characters. Running that stored procedure (with the same parameters) manually on SQL 2005 listed three stored procedures, the first of which did indeed have more than 4000 characters. Still no error though, and the problem as listed at http://support.microsoft.com/kb/2539378 describes an error that should occur in the Event log.

However, this problem is the type of thing that is fixed by a reinitialisation (because it doesn’t need to send the procedure change across as a transaction). And a look in the change history of the long stored procs (you all keep them, right?), showed that the problem from six months earlier could well have been down to this too.

Applying SP2 (with sufficient paranoia about backups and how to get back out again if necessary) fixed the problem. The stored proc changes went through immediately after the service pack was applied, and it’s been running happily since.

The funny thing is that I didn’t solve the problem. He had put the Profiler trace on the server, and had done the search that found a forum post pointing at this particular problem. I’d asked Ted too, and although he’d given some useful information, nothing that he’d come up with had actually been the solution either.

Sometimes, asking for help is the most useful thing you can do. Often though, you don’t end up getting the help from the person you asked – the sounding board is actually what you need.

@rob_farley

24 Hours of PASS coming up soon!

Massive thanks to all the people that have been shouting about this event already. I’ve seen quite a number of blog posts about it, and rather than listing some and missing others, please assume I’ve noticed your blog and accept my thanks.

But in case this is all news to you – the next 24 Hours of PASS event is less than a fortnight away (Sep 20/21)! And there’s lots of info about it at http://www.sqlpass.org/24hours/fall2012/ 

(Don’t ask why it’s “Fall 2012”. Apparently that’s what this time of year is called in at least two countries. I would call it “Spring”, personally, but do appreciate that it’s “Autumn” in the Northern Hemisphere…)

Yes, I blogged about it on the PASS blog a few weeks ago, but haven’t got around to writing about it here yet.

As always, 24HOP is going to have some amazing content. But it’s going to be pointing at the larger event, which now less than two months away. That’s right, this 24HOP is the Summit 2012 Preview event. Most of the precon speakers are going to be represented, as are half-day session presenters, quite a few of the Spotlight presenters and some of the Microsoft speakers too. When you look down the list of sessions at http://www.sqlpass.org/24hours/fall2012/SessionsbySchedule.aspx, you’ll find yourself wondering how you can fit them all in. Luckily, that’s not my problem. For me, it’s just about making sure that you can get to hear these people present, and get a taste for the amazing time that you’ll have if you can come to the Summit.

I see this 24HOP as the kind of thing that will just drive you crazy if you can’t get to the Summit. There will be so much great content, and every one of these presenters will be delivering even more than this at the Summit itself. If you tune into Jason Strate’s 24HOP session on the Plan Cache and are impressed – well, you can get to a longer session by him on that same topic at the Summit. And the same goes for all of them.

If you’re anything like me, you’ll find yourself looking at the Summit schedule, wishing you could get to several presentations for every time slot. So get yourself registered for 24HOP and help yourself make that decision.

And if you can’t go to the Summit, tune in anyway. You’ll still learn a lot, and you might just be able to help persuade someone to send you to the Summit after all (before the price goes up after Sep 30).