Recovering a client’s Active Directory

2010/07/16

Ahhh Fridays. The day I most look forward to in the week.

As I usually do, I spent my train trip to work listening to some music, reading my book and thinking about which of my tickets I could get wrapped up before the weekend.

At 0850 my desk phone rang. Being the only person who had arrived at work by that time, I took the call and found myself talking to one of my preferred clients, who unfortunately works in one of the sectors I dislike: the rag trade.

The poor guy sounded pretty stressed and right away I knew something was up.

“I installed a new 2008R2 DC VM yesterday and I think it did something bad to my AD, this morning no one can logon and all my stores are offline! Can you come out on site and help me out?”

I spent the next ~15 minutes going through the timeline of events from Thursday with him. It didn’t sound like he had done anything particularly nasty so I was a little surprised he was in such a state. The more he told me about what had happened and the things he had tried to repair the problems the more certain I became that my weekend was about to become a lot less enjoyable.

An hour later I was on site and looking at the event logs on his two DCs. The plan was to spend an hour or two going through the logs to work out the sequence of events, so I could then devise a plan to get things running again.

The first thing I found was that the event logs on the two running DCs only went back to late yesterday afternoon. Odd. I dug out the c:\windows\debug\dcpromo.log on the 2K8R2 DC that was suspected of causing the faults and had a look through it. There were some errors present indicating that the promotion may not have gone as smoothly as it should have, and there were further errors when the 2K8R2 DC was demoted later on Thursday afternoon.

Since the original faults on Thursday afternoon, the client had already performed system state restores of the two 2K3 DCs, on Thursday night and Friday morning. Still no one could log on, and that was when he called me.

The event logs I did have showed me that FRS wasn’t happy, that there were issues locating a Global Catalog, and that the DCs kept logging that they couldn’t find the domain. There wasn’t a lot to tell me what the root cause of the problem was.

After two hours of trying to piece together what was causing the problem and not getting anywhere I logged a fault with Microsoft to talk to one of their techs. I was promised a call back within 8 business hours. It was midday at this stage and we weren’t all that sure that we would hear from MS before the weekend. A weekend of downtime was not something the client wanted to contemplate. A rebuild of AD from scratch was being considered.

What really threw my troubleshooting today was all the different errors and warnings I was getting. I couldn’t tell which one was the root error causing the problems. I had multiple symptoms:

  • No one could log on to the domain. I was lucky to be able to log on to the DCs.
  • Event logs saying the DCs couldn’t locate a GC in the domain.
  • No sysvol or netlogon shares on the DCs.
  • Thankfully, there were no indications of a USN rollback.

With no better options, I decided to concentrate on the first error in the event logs after a reboot. I would work that problem until I had a solution. Event Source: ntfrs, Event ID: 13508. Eventid.net was useful: I found other people with a similar issue, missing sysvol and netlogon shares. I then found my way over to KB958804. Scenario 2 seemed to match my situation.
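For what it’s worth, the “work the first problem after a reboot” approach can be sketched in a few lines of Python. This is only an illustration, assuming the event log has already been exported into simple records. The `Event` class, the `first_problem_after` function and the sample entries (including their timestamps and levels) are all made up for this example; none of it is part of any Windows tooling, and it is not what I actually ran on the day.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """One exported event log record (hypothetical structure)."""
    when: datetime
    source: str
    event_id: int
    level: str  # e.g. "Information", "Warning" or "Error"


def first_problem_after(
    events: Iterable[Event],
    boot_time: datetime,
    levels: Tuple[str, ...] = ("Error", "Warning"),
) -> Optional[Event]:
    """Return the earliest event at or after boot_time whose level is in levels.

    Returns None when nothing matches.
    """
    candidates = [e for e in events if e.when >= boot_time and e.level in levels]
    return min(candidates, key=lambda e: e.when, default=None)


if __name__ == "__main__":
    # Invented sample data, deliberately out of order.
    boot = datetime(2010, 7, 16, 12, 30)
    log = [
        Event(datetime(2010, 7, 16, 12, 36), "ExampleSourceB", 2002, "Error"),
        Event(datetime(2010, 7, 16, 12, 31), "ExampleSourceA", 1001, "Information"),
        Event(datetime(2010, 7, 16, 12, 33), "NtFrs", 13508, "Warning"),
        Event(datetime(2010, 7, 16, 9, 15), "ExampleSourceB", 2002, "Error"),
    ]
    first = first_problem_after(log, boot)
    if first is not None:
        print(first.source, first.event_id, first.when.isoformat())
```

The point of filtering on the boot time is to ignore everything logged before the restart, so that the noise from earlier failed attempts doesn’t drown out the first thing that goes wrong on a clean start.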

I did as the fix instructed and 10 minutes later we were up and running again. That was it.

I was relieved and so was the client. He was up and running, his Line of Business apps all started fine, staff were able to log on and everything suddenly just started working. It felt like one of those movie scenes where the hero throws a simple switch and all of a sudden the super hyperatomic defense system comes online and saves humanity.

What could I have done, in hindsight, to resolve this quicker? I’m not really sure. I think I relied too much on the event logs giving me useful information. I also need to be better prepared to start from a fresh starting point and focus the troubleshooting on the first error.

I want to think on this for a while longer.
