A non-SQL MVP friend of mine, who also happens to be a client, asked me for some help again last week. I was planning on writing this up even before Rob Volk (@sql_r) listed his T-SQL Tuesday topic for this month.
Earlier in the year, I (well, LobsterPot Solutions, although I'd been the person mostly involved) had helped out with a merge replication problem. The Merge Agent on the subscriber was just stopping every time, shortly after it started, with no errors anywhere – not in the Windows Event Log, not in the SQL Agent logs, not anywhere. We'd managed to get the system working again, but didn't have a good explanation for what had happened, and last week, the problem occurred again. I asked him about writing up the experience in a blog post, largely because of the red herrings that we encountered. It was an interesting experience for me, also because I didn't end up touching my computer the whole time – just tapping on my phone via Twitter and Live Msgr.
You see, the thing with replication is that a useful troubleshooting option is to reinitialise the thing. We'd done that last time, and it had started to work again – eventually. I say eventually, because the link being used between the sites is relatively slow, and it took a long while for the initialisation to finish. Meanwhile, we'd been doing some investigation into what the problem could be, and were suitably pleased when the problem disappeared.
So I got a message saying that a replication problem had occurred again. Reinitialising wasn't going to be an option this time.
In this scenario, the subscriber having the problem happened to be in a different domain to the publisher. The other subscribers (within the domain) were fine; only this one, in a different domain, had the problem.
Part of the problem seemed to be a log file that wasn't being backed up properly. They'd been trying to back up to a backup device that had a corruption, and the log file was growing. Turned out, this wasn't related to the problem, but of course, any time you're troubleshooting and you see something untoward, you wonder.
Having got past that problem, my next thought was that perhaps there was a problem with the account being used. But the other subscribers were using the same account, without any problems.
The client pointed out that it was almost exactly six months since the last failure (later shown to be a complete red herring). It sounded like something might've expired. Checking through certificates and trusts showed no sign of anything, and besides, there wasn't a problem running a command-prompt window using the account in question, from the subscriber box.
…except that when he ran the sqlcmd -E -S servername command I recommended, it failed with a Named Pipes error. I've seen problems with firewalls rejecting connections via Named Pipes but letting TCP/IP through, so I got him to look into SQL Server Configuration Manager to see what kind of connection was being preferred… Everything seemed fine. And strangely, he could connect via Management Studio. Turned out, he had a typo in the servername of the sqlcmd command. That particular red herring must've been reflected in his cheeks as he told me.
During this time, I also pinged a friend of mine to find out who I should ask, and Ted Kruger (@onpnt)'s name came up. Ted (and thanks again, Ted – really) reconfirmed some of my thoughts around the idea of an account expiring, and also suggested bumping up the logging to level 4 (2 is Verbose, 4 is undocumented ridiculousness). I'd just told the client to push the logging up to level 2, but the log file wasn't appearing. Checking permissions showed that the user did have permission on the folder, but still no file was appearing. Then it was noticed that the user had been switched earlier as part of the troubleshooting, and switching it back to the real user caused the log file to appear.
Still no errors. A lot more information being pushed out, but still no errors.
Ted suggested making sure the FQDNs were okay from both ends, in case the servers were unable to talk to each other. DNS problems can lead to hassles which can stop replication from working. No luck there either – it was all working fine.
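If you want a quick sanity check of name resolution from each end, a few lines of Python will do it. This is only a sketch, not what we ran on the day: the function name and the host names in the comments are made up for illustration, and all it tells you is whether a name resolves from the box you run it on, not whether SQL Server can actually connect.

```python
import socket


def resolve_addresses(host, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated addresses for host, or [] if the lookup fails.

    The resolver argument exists only so the lookup can be swapped out
    (for example in a test); by default the operating system's resolver is used.
    """
    try:
        results = resolver(host, None)
    except socket.gaierror:
        # The name could not be resolved from this machine.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address is the first element of sockaddr.
    return sorted({entry[4][0] for entry in results})


# Hypothetical usage, run once on the publisher and once on the subscriber:
#   resolve_addresses("subscriber01.otherdomain.example")
#   resolve_addresses("publisher01.maindomain.example")
```

An empty list from either end would point at DNS; matching addresses from both ends suggest the names are fine, which is where we ended up.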
Another server started to report a problem as well. These two boxes were both SQL 2008 R2 (SP1), while the others, still working, were SQL 2005.
Around this time, the client tried an idea that I'd shown him a few years ago – using a Profiler trace to see what was being called on the servers. It turned out that the last call being made on the publisher was sp_MSenumschemachange. A quick interwebs search on that showed a problem that exists in SQL Server 2008 R2, when stored procedures have more than 4000 characters. Running that stored procedure (with the same parameters) manually on SQL 2005 listed three stored procedures, the first of which did indeed have more than 4000 characters. Still no error though, and the problem as listed at http://support.microsoft.com/kb/2539378 describes an error that should occur in the Event log.
However, this problem is the type of thing that is fixed by a reinitialisation (because it doesn't need to send the procedure change across as a transaction). And a look in the change history of the long stored procs (you all keep them, right?), showed that the problem from six months earlier could well have been down to this too.
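If you do keep your procs scripted out in source control, a rough check along these lines could flag the long ones before they bite. It's a sketch under some assumptions of mine: one procedure per .sql file, files saved as UTF-8, and the function names and the 4000 default are simply taken from the description above rather than from anything official. The length of a script file won't exactly match what SQL Server stores for the definition, so treat anything near the limit as suspect.

```python
from pathlib import Path

LIMIT = 4000  # character threshold mentioned in the write-ups of this problem


def load_definitions(folder):
    """Read every .sql file in folder into a {file name: text} mapping.

    Assumes UTF-8 files; scripts saved in another encoding would need a
    different encoding argument.
    """
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in Path(folder).glob("*.sql")
    }


def long_definitions(definitions, limit=LIMIT):
    """Return (name, length) pairs for definitions longer than limit, longest first."""
    flagged = [
        (name, len(text))
        for name, text in definitions.items()
        if len(text) > limit
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical usage against a folder of scripted procedures:
#   for name, length in long_definitions(load_definitions("procs")):
#       print(f"{name}: {length} characters")
```

Running it over the change history as well as the current scripts would show when a procedure first crossed the limit, which is how the six-month-old failure lined up for us.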
Applying SP2 (with sufficient paranoia about backups and how to get back out again if necessary) fixed the problem. The stored proc changes went through immediately after the service pack was applied, and it's been running happily since.
The funny thing is that I didn't solve the problem. He had put the Profiler trace on the server, and had done the search that found a forum post pointing at this particular problem. I'd asked Ted too, and although he'd given some useful information, nothing that he'd come up with had actually been the solution either.
Sometimes, asking for help is the most useful thing you can do. Often though, you don't end up getting the help from the person you asked – the sounding board is actually what you need.