Russell Coker posted a piece on how to build and run a cluster of computers and I think it aligns with VMware clusters. I have paraphrased some of his points and will comment on how they apply to VMware.
Try and ensure your cluster nodes are the same hardware.
This seems like a simple enough suggestion, and for a new cluster I would say it’s easily achievable. Over longer periods though, as a company’s VMware cluster ages, it can be hard to get the same make and model of server to add as a new node. Adding a different node at a later date can introduce driver and firmware compatibility issues. It can also make HA and DRS behave differently from what admins have become accustomed to, because the new host is likely to have more resources and hence more slots.
A different strategy to handle this growth problem is to create multiple VMware clusters. Once your first VMware cluster is full, instead of adding new nodes, create a new cluster of hosts and scale things out horizontally. This has the generational benefit of allowing you to build the new cluster using all the knowledge you gained from building your first cluster (you captured that knowledge, right?) and so not repeat the same mistakes.
A drawback to this iterative generational approach is that you can lose your economies of scale and not make full use of the storage and network IO capabilities of your first cluster.
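To make the slot point concrete, here is a rough Python sketch of slot arithmetic. It is a simplified model, not VMware's actual algorithm: the slot size (500 MHz, 2048 MB) and the host sizes are made-up numbers, and I'm assuming the "host failures the cluster tolerates" style of admission control, where capacity is counted as if the largest hosts were the ones to fail.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str      # hypothetical host name
    cpu_mhz: int   # CPU capacity available to VMs
    mem_mb: int    # memory capacity available to VMs

def slots_per_host(host, slot_cpu_mhz, slot_mem_mb):
    """A host holds as many slots as its scarcer resource allows."""
    return min(host.cpu_mhz // slot_cpu_mhz, host.mem_mb // slot_mem_mb)

def usable_slots(hosts, slot_cpu_mhz, slot_mem_mb, host_failures=1):
    """Slots left once the largest `host_failures` hosts are set aside."""
    counts = sorted(slots_per_host(h, slot_cpu_mhz, slot_mem_mb) for h in hosts)
    keep = max(0, len(counts) - host_failures)
    return sum(counts[:keep])

# Three identical older hosts, then one newer host with double the resources.
old = [Host(f"esx0{i}", 20000, 65536) for i in (1, 2, 3)]
new = Host("esx04", 40000, 131072)

print(usable_slots(old, 500, 2048))          # 64
print(usable_slots(old + [new], 500, 2048))  # 96
```

In this model the new host brings 64 slots but the cluster only gains 32 usable ones, because the biggest host is now the one you have to plan to lose. That is the sort of surprise a mixed-hardware cluster can spring on you.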
You don’t need decent hardware but you should consider at least RAID 1 if you’re doing DAS. You also need to consider redundant PSUs etc.
I would argue that you should have redundant PSUs in your cluster members because it means you can move them between PDUs/UPSs as needed. Multiple NICs (I think four is a bare minimum for VMware hosts) should be required, and then at least 2, preferably 4, SAN HBA ports as well.
You can boot your ESX(i) nodes from a USB key plugged into the mobo. I like this approach for its simplicity but it introduces a single point of failure to the node and that is not good design.
Set up test/use cases for testing the cluster and actually do test them!
Ahh testing. The first thing to get sliced from the project plan when money and time are getting tight. I’ve had clients ask to take all testing out of their project plan just to save money.
You MUST do testing on your cluster and there are many facets to target in your testing. I approach testing from the top down. Start at the top layer with your application and work your way down through the stack, through the middle, to the back end. Down through the operating system, then the network and storage and finally down into the physical infrastructure and connections.
You want to design test cases for each layer and find answers to questions such as:
Q: What does my app do if I lose a node and HA has to restart the VM?
Q: How do my middleware and backend servers cope if a front end server goes away for a while as HA gets it started on another node?
Q: How does my front end cope if the middle and backend servers are restarted on a different node?
Q: Is the order in which my back end servers start up important and does my cluster understand this?
Q: How well does my operating system cope with ungraceful shutdowns? I’ve seen Windows servers that won’t boot because of underlying filesystem corruption in the C:\Windows\System32\config\system registry data. That blocks the benefit of HA right there.
Q: Is my network (virtual and physical) set up properly so that HA events aren’t going to start my VMs on isolated networks?
Q: Are my physical network and storage links redundant? Have I proven it by pulling cables out of hosts and SANs?
Q: What effect does a loss of network and storage redundancy have on IO loads?
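For the start-up order question above, it helps to write the dependencies down and let something work out a valid order, which you can then compare with what the cluster actually does. A minimal Python sketch, assuming made-up VM names and Python 3.9 or later for the standard library's graphlib module:

```python
from graphlib import TopologicalSorter

# Hypothetical VMs: each one maps to the VMs that must be up before it starts.
depends_on = {
    "db01": set(),
    "app01": {"db01"},
    "web01": {"app01"},
    "web02": {"app01"},
}

def startup_order(deps):
    """Return the VMs in an order where every dependency starts first.

    Raises graphlib.CycleError if the dependencies loop back on themselves.
    """
    return list(TopologicalSorter(deps).static_order())

print(startup_order(depends_on))
```

The order of web01 and web02 relative to each other is not guaranteed, only that both come after app01. If your back end really does care about order, this list is what your cluster's restart settings need to reproduce.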
You should focus carefully on the money & time allocated for testing, and agree with the business owners on what testing is performed and how the results of the tests are measured.
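One way to agree on how results are measured is to poll the service during each test and report the outage you actually saw. The Python sketch below assumes you already have a list of (timestamp, is_up) samples from some probe of your own; the sample data and the 120 second limit are invented for illustration.

```python
def outage_windows(samples):
    """samples: list of (timestamp_seconds, is_up), sorted by time.

    Returns a list of (down_at, up_at) pairs. up_at is None if the
    service was still down when sampling stopped.
    """
    windows = []
    down_at = None
    for ts, is_up in samples:
        if not is_up and down_at is None:
            down_at = ts                      # outage begins
        elif is_up and down_at is not None:
            windows.append((down_at, ts))     # outage ends
            down_at = None
    if down_at is not None:
        windows.append((down_at, None))
    return windows

def longest_outage(samples):
    """Length in seconds of the longest outage that recovered, or 0."""
    closed = [up - down for down, up in outage_windows(samples) if up is not None]
    return max(closed, default=0)

# Made-up probe results from pulling the power on a host at around t=10.
samples = [(0, True), (5, True), (10, False), (15, False), (95, True), (100, True)]
agreed_limit = 120  # seconds; an assumed figure agreed with the business

print(outage_windows(samples))                  # [(10, 95)]
print(longest_outage(samples) <= agreed_limit)  # True
```

A number like this gives the business owners something concrete to sign off on, rather than "HA seemed to work".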
Your applications need to play nice in a cluster
VMware makes it easy to run non-cluster-savvy applications in a cluster environment because it’s the hypervisor layer doing the clustering. In a Fault Tolerant VM the app won’t ever see an outage; for High Availability the outage is limited by the time taken for the cluster to detect the failed host and start the VM on another cluster node (assuming available slots and no split-brain problems).
An example I’m thinking of here is the file based accounting application. You know the one where the user creates an invoice and when they press their F9 key to accept the creation, the app goes off and writes the changes to a dozen different files for things such as stock, GL, Journal and so on? This particular design of application is highly susceptible to interruption. If the host it’s running on dies in the middle of an update then you have some files with the updated data and not others. I’ve seen this happen.
This type of application will most likely not survive a VMware HA event where the host dies and another host has to restart the VM. In this scenario your app is going to need further attention when the VM it runs on comes back to life. Sure you could make the VM a FT one but that brings up other design issues.
It’s better that your app is designed and written in such a way as to be able to handle this type of event.
Your admins need to know how to operate the cluster.
VMware makes this appear easy, but when a cluster has problems admins need to know how the cluster is put together and what makes it tick. To understand VMware HA and DRS I recommend Duncan Epping’s HA/DRS Deepdive.
Your admins need to be savvy enough to know that if you’re upgrading the firmware on a node in the cluster you MUST upgrade the firmware on all the nodes. Plan for this carefully. Test carefully.
A good example of how this could bite you would be when you upgrade the firmware on an HBA in a node, only to find that when the node comes back online it glitches your storage controller and causes problems for the other nodes in the cluster. You shouldn’t upgrade firmware in production unless you really really need to. Pedantic and/or not so great admins will try and make sure their servers always have the latest BIOS and firmware, just for the sake of it. That’s not a good enough reason. If it ain’t broke, don’t fix it.
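The "same firmware on every node" rule above lends itself to a simple check. This Python sketch assumes you have already collected an inventory of component versions per host by whatever means you use; the host names, component names and version strings are all invented.

```python
from collections import defaultdict

def firmware_mismatches(inventory):
    """inventory: {host: {component: version}}.

    Returns {component: {version: [hosts]}} for every component that is
    running more than one version across the cluster.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for host, components in inventory.items():
        for component, version in components.items():
            seen[component][version].append(host)
    return {c: dict(v) for c, v in seen.items() if len(v) > 1}

# Hypothetical inventory: esx03 had its HBA firmware upgraded on its own.
inventory = {
    "esx01": {"bios": "2.1", "hba": "1.2"},
    "esx02": {"bios": "2.1", "hba": "1.2"},
    "esx03": {"bios": "2.1", "hba": "1.3"},
}

print(firmware_mismatches(inventory))
# {'hba': {'1.2': ['esx01', 'esx02'], '1.3': ['esx03']}}
```

An empty result before and after a maintenance window is a cheap way to confirm nobody left the cluster half upgraded.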
Russell says “Running a cluster is something that you should either do properly or not at all. If you do it badly then the result can easily be less uptime than a single well-run system.”
I couldn’t agree more. A complicated beast needs quality equipment and quality people to allow the beast to reach its maximum potential.
Hopefully I’ve explored the relevant parts of Russell’s cluster thoughts and how I think they can apply to VMware clusters.