Best Practices

2010/07/22

I was asked some questions recently about virtualisation best practices, and it really got me thinking about what “Best Practice” actually means.

It’s a common enough term in IT. A vendor releases a piece of technology along with supporting documentation on how to install and configure it. If you’re lucky, the vendor also provides some use cases, which probably highlight the technology’s strengths.

As the people who use the technology become more familiar with it over time, they learn, memorise and hopefully document the best way to use it, how to avoid its pitfalls and common mistakes, and how best to tune it for whatever needs to be achieved. As this now-explicit knowledge is shared, it can start to take on the “Best Practice” feeling. Vendors may even take this knowledge and documentation from the public realm and turn it into a whitepaper or similar document. This sort of documentation from a vendor accelerates the acceptance of the knowledge as “Best Practice”.

I would call this example “Industry Best Practice”. It is a collection of knowledge learnt over time and through experience that the best way to deploy/configure/maintain this piece of technology to achieve A, B or C is by doing things X, Y and Z.

This differs from something I suggest calling “Vendor Best Practice”, where the vendor has said clearly in their documentation that a particular feature or setting should be set in a precise way in a given scenario. A good example of this is VMFS alignment in VMware datastores.

There is a third realm of “Best Practice”, which I hesitate to call “Practice Best Practice”. I would describe this as the way a business or person goes about operating their IT (not just virtualisation) environments. This should not be confused with an organisation’s policies or procedures. Policies are overriding business rules such as “Gifts from clients must go into the company gift registry”. Procedures are rules or recipes that describe how specific tasks are started, conducted and completed. I suggest that a “Practice Best Practice” describes the high-level, holistic way in which an organisation operates its IT. An example of this would be an inclusive goal such as “All new virtual machines that go into the production cluster are automatically added to the backup regime in such a way as to guarantee recovery of the virtual machine in a disaster recovery scenario”.
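A goal like that backup example is only useful if it can be checked. As a rough sketch (the VM names and the two lists are made up, and how you would actually pull them from your cluster and backup product is left out), the check boils down to a set difference:

```python
def missing_from_backup(production_vms, backed_up_vms):
    """Return the production VMs that no backup job covers, sorted by name."""
    return sorted(set(production_vms) - set(backed_up_vms))


# Hypothetical inventories for illustration only.
production = ["app01", "db01", "web01"]
backed_up = ["db01", "web01"]

for vm in missing_from_backup(production, backed_up):
    print(f"{vm} is in production but not in the backup regime")
```

An empty result means the practice is being met. Anything else is a VM that slipped through.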

For a few decades now, IT has grown used to doing things in the physical world: one server doing one or several things, with little contention for resources. In fact, in the Windows world, it became normal to accept a server running at an average of ~2% CPU usage (for example). In larger IT environments it became accepted that a new app MUST live on its own dedicated server so that it could be silo’d in its own little operating system world and not interfere with the other children in the room. Lead times for server builds were measured in weeks or even months, and business units that dealt with IT had come to expect that as normal (and often bemoaned how slowly IT can move).

Now we have virtualisation, which turns these ways of doing things on their head. It allows us to do more with less. We can provision new servers in a matter of hours instead of weeks, it’s now far more economical to silo our apps into individual servers, and our Windows boxes still only use ~2% of the CPU while the other ~98% is used to run other servers! What a fabulous world we live in. And that has barely scratched the surface of the benefits of virtualisation.

So, how do “Industry Best Practice”, “Vendor Best Practice” and “Practice Best Practice” apply to virtualisation?

I have my thoughts to share on the matter and will do so in the coming days.


Killing Windows Server 2008 Processes remotely

2010/07/22

This is a quick and easy one.

Found a client’s Windows Server 2008 box this morning that I couldn’t RDP into. When I attempted to RDP in, I would be prompted for the username and password and then the connection would just die silently.

The host was a VM so I fired up the VMware vSphere Client and took a look at the console of the server. The screen was displaying “Shutting down Acronis Scheduler2 Service service”. The server was trying to end this process/service to do a restart from the scheduled Windows Update install from Monday morning at 3am.

I arranged a quick outage window during the client’s lunch hour and started trying to resolve this. I figured that if I could kill the hung service remotely then the reboot would probably carry on normally.

I jumped on to the client’s terminal server, fired up Computer Management and connected it to the server that was having problems. Found three services in a “Stopping” state.

“C:\Program Files (x86)\Common Files\Acronis\Schedule2\schedul2.exe”
“C:\Program Files (x86)\Acronis\AMS\ManagementServer.exe”
“C:\Program Files (x86)\Acronis\BackupAndRecovery\mms.exe”

So, I now knew which processes were causing the hang-up. The question was how to kill them remotely, since I couldn’t log on.

Some quick Google-fu turned up the answer. From the terminal server I used tasklist.exe and taskkill.exe to kill the schedul2.exe process.
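The exact commands aren’t recorded here, so take the following as a minimal sketch rather than a transcript. It assumes a hypothetical host name (SERVER01) and an account that already has admin rights on the remote box, and it only builds the command lines, using the standard /S (remote system), /FI (filter), /IM (image name) and /F (force) switches. The Python wrapper is just for illustration.

```python
import subprocess


def tasklist_cmd(host, image):
    """Build a tasklist.exe command that lists matching processes on a remote host."""
    return ["tasklist", "/S", host, "/FI", f"IMAGENAME eq {image}"]


def taskkill_cmd(host, image, force=True):
    """Build a taskkill.exe command that ends matching processes on a remote host."""
    cmd = ["taskkill", "/S", host, "/IM", image]
    if force:
        cmd.append("/F")  # forcefully terminate the process
    return cmd


def run_remote(cmd):
    """Run one of the commands above; only works from a Windows machine."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True)


host = "SERVER01"  # hypothetical name for the hung server
# Print the command lines as they would be typed at a command prompt.
print(subprocess.list2cmdline(tasklist_cmd(host, "schedul2.exe")))
print(subprocess.list2cmdline(taskkill_cmd(host, "schedul2.exe")))
```

Listing first and killing second is deliberate: it confirms the process name and that the remote connection works before anything gets terminated. If your current login doesn’t have rights on the remote server, both tools also accept /U and /P for alternate credentials.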

Sure enough the console then carried on with the shutdown and restart of the server.

The more I use Windows Server 2008, the more impressed I am with the features and tools it offers admins these days.

Update (20100812): I’ve had to repeat a similar process on Server 2003 and found that tasklist.exe and taskkill.exe are also present in 2003. They even have the same usage switches.


Recovering a client’s Active Directory

2010/07/16

Ahhh Fridays. The day I most look forward to in the week.

As I usually do, I spent my train trip to work listening to some music, reading my book and thinking about my tickets and which ones I could get wrapped up before the weekend.

At 0850 my desk phone rang. Being the only person who had arrived at work by that time, I took the call and found myself talking to one of my preferred clients, who unfortunately works in one of the sectors I dislike: the rag trade.

The poor guy sounded pretty stressed and right away I knew something was up.

“I installed a new 2008R2 DC VM yesterday and I think it did something bad to my AD, this morning no one can logon and all my stores are offline! Can you come out on site and help me out?”

I spent the next ~15 minutes going through the timeline of events from Thursday with him. It didn’t sound like he had done anything particularly nasty so I was a little surprised he was in such a state. The more he told me about what had happened and the things he had tried to repair the problems the more certain I became that my weekend was about to become a lot less enjoyable.

An hour later I was on site and looking at the event logs on his two DCs. The plan was to spend an hour or two going through the logs and trying to determine what the sequence of events was so I could then devise a plan to get things running again.

The first thing I found was that the event logs on the two running DCs only went back to late yesterday afternoon. Odd. I dug out the c:\windows\debug\dcpromo.log on the 2K8R2 DC that was suspected of causing the faults and had a look through it. There were some errors present that indicated that the promotion may not have gone as smoothly as it should have, and there were further errors when the 2K8R2 DC was demoted later in the afternoon on Thursday.

Since the original faults on Thursday afternoon, the client had already performed system state restores of the two 2K3 DCs on Thursday night and Friday morning. Still no one could log on, and that was when he called me.

The event logs I did have showed me that FRS wasn’t happy, that there were issues locating a Global Catalog and that the DCs kept logging that they couldn’t find the domain. There wasn’t a lot to tell me what the root cause of the problem was.

After two hours of trying to piece together what was causing the problem and not getting anywhere I logged a fault with Microsoft to talk to one of their techs. I was promised a call back within 8 business hours. It was midday at this stage and we weren’t all that sure that we would hear from MS before the weekend. A weekend of downtime was not something the client wanted to contemplate. A rebuild of AD from scratch was being considered.

What really threw my troubleshooting today was all the different errors and warnings I was getting. I couldn’t tell which one was the root error causing the problems. I had multiple symptoms:

  • No one could log on to the domain. I was lucky to be able to log on to the DCs.
  • Event logs saying the DCs couldn’t locate a GC in the domain.
  • No sysvol or netlogon shares on the DCs.
  • Thankfully, there were no indications of a USN rollback.

With no better options I decided to concentrate on the first error in the event logs after a reboot. I would work that problem till I had a solution. Event Source: ntfrs, Event ID: 13508. Eventid.net was useful. I had found other people with a similar issue, missing sysvol and netlogon shares. I then found my way over to KB958804. Scenario 2 seemed to match my situation.
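The idea of ignoring the noise and working only the first problem logged after a reboot can be sketched in a few lines. This is an illustration, not what I ran on the day: the Event record, the boot time and the sample entries are all made up (including the level I gave the ntfrs entry), and it assumes the log has already been exported into some list you can loop over.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    time: datetime
    source: str
    event_id: int
    level: str  # "Information", "Warning" or "Error"


def first_problem_after(events, boot_time, levels=("Error", "Warning")):
    """Return the earliest event at or after boot_time whose level is in levels, or None."""
    candidates = [e for e in events if e.level in levels and e.time >= boot_time]
    return min(candidates, key=lambda e: e.time, default=None)


# Made-up sample data for illustration only.
boot = datetime(2010, 7, 16, 12, 30)
events = [
    Event(datetime(2010, 7, 16, 12, 0), "SourceA", 1, "Error"),        # before the reboot, ignored
    Event(datetime(2010, 7, 16, 12, 31), "SourceB", 2, "Information"),  # not a problem, ignored
    Event(datetime(2010, 7, 16, 12, 40), "SourceC", 3, "Error"),        # a later symptom
    Event(datetime(2010, 7, 16, 12, 33), "ntfrs", 13508, "Warning"),    # level assumed for the example
]

first = first_problem_after(events, boot)
print(first.source, first.event_id)
```

Everything logged after that first entry is treated as a possible knock-on symptom until the first one is explained.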

I did as the fix instructed and 10 minutes later we were up and running again. That was it.

I was relieved and so was the client. He was up and running, his line of business apps all started fine, staff were able to log on and things all just suddenly started working. It felt like one of those movie scenes where the hero throws a simple switch and all of a sudden the super hyperatomic defense system comes online and saves humanity.

What could I have done in hindsight to try and resolve this quicker? I’m not really sure. I think I relied too much on the event logs giving me useful information. I also need to be better prepared to start from a fresh point and focus the troubleshooting on the first error.

I want to think on this for a while longer.


First!

2010/07/16

After today’s experience unfscking a client’s AD upgrade I thought it was about time I set up a place to capture and share my experiences.

I’ve been reading other sysadmin blogs via Google Reader for a long while now, and it’s always been at the back of my mind that I would like to join the party and share what I know and learn.

I can’t promise fireworks, I can’t promise earthshaking revelations, I sure as heck can’t promise eloquent writing, and hell, I probably can’t even promise regular posts.

I will try though.