Cold Starting a Datacenter

How would you prepare for a complete datacenter power outage?

I had the opportunity to consider this recently — last weekend, some of the VMware labs needed to be shut down completely for widespread power maintenance.  It was a minor inconvenience, but fortunately these occasions are few and far between.  At least I fared better than Scott Lowe who recently powered off all of his gear for nothing.  Ouch!

In this particular lab the primary hardware is a couple dozen HP servers and two CLARiiON storage arrays.  Naturally, all of the supporting infrastructure is virtualized, so I looked at this as an opportunity to see how I could best manage such an event.  Oh, by the way, this lab is two states away, so everything is done remotely.

When I returned to the office after the weekend, I was pleased to find things up and running again.  These were my main considerations:

Make vCenter Server easy to find

While some will still debate the merits of running management servers on physical machines, I say virtualize them.  However, with the wonders of VMware DRS automatically balancing workloads across a cluster, it may not be entirely clear where the vCenter Server management VM is running. This is an important piece of information because if something goes wrong and vCenter Server is not running, it may be necessary to connect directly to a host for manual startup.

I used one of the newer vSphere features to create a DRS group, consisting of one VM and one host, that would keep my vCenter VM on a desired host:

Another option would have been to simply disable DRS for the vCenter VM, but when the servers all suddenly powered on, I could not control where that VM would be initially started.  Using the DRS groups technique, the VM can start and then migrate to the preferred host.  If a problem prevented the host from booting up, then at least vCenter would start on another host in the cluster.
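The fallback behavior described above can be modeled in a few lines. This is only a toy sketch of a soft "should run on" preference, not the vSphere API: the function name `pick_host` and the host names are hypothetical, and real placement is decided by DRS itself.

```python
from typing import Iterable, Optional


def pick_host(preferred: Iterable[str], powered_on: Iterable[str]) -> Optional[str]:
    """Return a preferred host if one is up, else any powered-on host, else None.

    Hypothetical model of a soft VM-to-host preference; not a vSphere call.
    """
    up = sorted(powered_on)       # deterministic order for the fallback choice
    up_set = set(up)
    for host in preferred:        # honor the preference list first
        if host in up_set:
            return host
    # Soft rule: fall back to another host rather than refuse to start the VM.
    return up[0] if up else None


if __name__ == "__main__":
    print(pick_host(["esx01"], ["esx01", "esx02"]))  # preferred host is up
    print(pick_host(["esx01"], ["esx02", "esx03"]))  # preferred host is down
```

Passing more than one host in `preferred` models a larger host group: the VM stays easy to find without depending on a single machine booting cleanly.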

The DRS rules looked like this:

Also note that I have an anti-affinity rule that strives to keep the two domain controllers on different hosts to better tolerate an outage.  Overkill for a lab, perhaps, but it’s easy in vSphere — why not take advantage of it?
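To make the anti-affinity idea concrete, here is a minimal sketch that checks whether a set of VMs ended up sharing a host. The VM and host names and the `anti_affinity_violations` helper are assumptions for illustration; in vSphere the rule is evaluated by DRS, not by a script like this.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def anti_affinity_violations(placement: Dict[str, str],
                             group: Iterable[str]) -> List[List[str]]:
    """Return lists of group VMs that share a host; an empty list means no conflict.

    `placement` maps VM name to host name (hypothetical inventory data).
    """
    by_host = defaultdict(list)
    for vm in group:
        host = placement.get(vm)
        if host is not None:      # ignore VMs that are not currently placed
            by_host[host].append(vm)
    return [sorted(vms) for _, vms in sorted(by_host.items()) if len(vms) > 1]


if __name__ == "__main__":
    placement = {"dc01": "esx01", "dc02": "esx01", "vc01": "esx02"}
    print(anti_affinity_violations(placement, ["dc01", "dc02"]))
```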

Start your engines!

The only other thing that I configured before shutting everything down was automatic startup for the domain controllers and the vCenter Server VMs, as seen here:

The VMs started as requested when power was restored to the datacenter.  Everything came up, so I consider this a success.  It’ll probably be years before something like this happens again, but it’s good to know that there are some features in vSphere to accommodate whatever IT challenges are thrown your way.
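For anyone who wants more control than the built-in automatic startup offers, the same idea can be sketched as a tiered startup loop. Everything here is an assumption for illustration: `power_on` and `is_ready` are hypothetical callables you would supply yourself (wrapping whatever tooling you use), and the tier order of domain controllers first, then vCenter, is my guess at a sensible order rather than something the feature dictates.

```python
import time
from typing import Callable, List, Sequence


def start_in_order(tiers: Sequence[Sequence[str]],
                   power_on: Callable[[str], None],
                   is_ready: Callable[[str], bool],
                   timeout_s: float = 600.0,
                   poll_s: float = 5.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Power on VMs tier by tier, waiting for each tier to report ready.

    Raises TimeoutError if a tier is not ready within timeout_s seconds.
    """
    started: List[str] = []
    for tier in tiers:
        for vm in tier:
            power_on(vm)          # hypothetical hook, e.g. a CLI or API wrapper
            started.append(vm)
        deadline = clock() + timeout_s
        pending = set(tier)
        while pending:
            # Keep only the VMs that still fail the readiness check.
            pending = {vm for vm in pending if not is_ready(vm)}
            if not pending:
                break
            if clock() >= deadline:
                raise TimeoutError(f"not ready in time: {sorted(pending)}")
            sleep(poll_s)
    return started


if __name__ == "__main__":
    order = start_in_order([["dc01", "dc02"], ["vc01"]],
                           power_on=lambda vm: print("power on", vm),
                           is_ready=lambda vm: True)
    print(order)
```

The `sleep` and `clock` parameters are injectable only so the loop can be exercised without real delays; in normal use the defaults are fine.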

Is your datacenter ready?

What measures have you put in place to recover from a complete datacenter power event?  Have you deployed a small portion of key infrastructure on physical machines or virtualized them on a separate management cluster?  Will you trust the automatic startup feature, and if you use VMware DRS, do you have rules in place to steer things in a predictable direction?


11 comments

  1. Brandon’s avatar

    Hrm. I think you should use two hosts in your DRS group. What if the host you wanted vCenter on had issues coming back up, as in it died during the abrupt power failure? The single “preferred” host wouldn’t be around and you’d have to find it again… not that it would be difficult. I also like to document the datastore the vCenter VM is on, just in case for some reason it doesn’t get reregistered… perhaps you put HA in maintenance mode and someone made a bad mistake… NO HA. That way you can bring up a host, log in directly with the vSphere Client, browse to the specific datastore… register the vCenter VM on the host and boot it up.

    I’ve never tested the individual host’s configuration for automatic startup of VMs in an HA cluster configuration. I suppose that makes things nice if you shut things down gracefully, but if power was lost and HA was enabled… HA would start powering things back up when one of the primary hosts became available again. Duncan has a post on this scenario; I’d hate to interfere with that. Plus, isn’t that list static per host, so I guess you would need to stick not only vCenter to a host, but also the other infrastructure servers you might want? If DRS moved one of those servers to another host, the automatic startup rule defined in the host’s configuration wouldn’t follow it to one of the other hosts, right? I’ve been known to be wrong; if so, let me know because I’m curious.

    Sorry, I prefer to think of things in a non-graceful event kind of way. Hopefully you have a good UPS setup and a backup generator. There are smaller 3-4 host environments that don’t have that though.

    1. Eric Gray’s avatar

      Brandon,

      I appreciate your perspective. True, two hosts in that DRS group would be preferred. In this environment, I have a 4-host management cluster where all the infrastructure VMs live, so it would not have been too challenging to find the VM — mostly wanted to exercise the feature. If my environment was bigger I’d do as you suggest.

      On the plus side, the automatic startup rule does migrate to another host along with the VM – just make sure the feature is enabled on appropriate hosts so it will function.

      Thanks for the feedback.

      Eric

    2. Brandon’s avatar

      Learn something new every day.

    3. Denis Baturin’s avatar

      There are some problems with the “automatic start up and shutdown” option. If your virtual machines move to another host, you lose the startup order. The VMs are placed in the “Any Order” section.

      Is your cluster HA-enabled? If so, then after the HA agent is reconfigured (which happens on every restart of a host), the “automatic start up and shutdown” option will be disabled.

      1. Brandon’s avatar

        Great information. As I said, I’ve never tested the host’s automatic startup/shutdown options in an HA cluster, but I had put it on my to-do list after Eric’s response. I need to go ahead and give myself a test script and go through all the situations I can think of. Unfortunately, I’ve been through a complete power loss before. It happened when the building our DC was in was hit by lightning. The strike screwed up the UPS to the point it was interfering with street power. At the time, there was no way to bypass the UPS (good grief). Again, at the time, there was no SRM or any kind of warm/cold site to rely on either. It just goes to show you that no matter how well you think you’ve got it planned, something can still cause the unexpected. That is why understanding how to recover from a worst-case scenario is important.

      2. Eric Gray’s avatar

        Denis,

        Thanks for your feedback. It’s true that there are some limitations with the automatic startup feature, and it’s not really intended to replace HA for unplanned VM restarts.

        You are correct that reconfiguring HA disables automatic startup, but for whatever reason, simply rebooting a host does not disable it. That makes it a fine way to start VMs after a planned outage.

        Eric

        1. Denis Baturin’s avatar

          Unfortunately, every time an HA cluster member is rebooted, it reconfigures the HA agent!

          IMHO the only solution for a cold start of an HA cluster is to disable autostart and use two scripts:
          one on each host that tries to start the VM running vMA, and a second on vMA with commands to start DC, SQL, VC, etc. in the correct order.

          1. LJ’s avatar

            I’ll add a story similar to Brandon’s … our DC suffered a total power loss as a combination of a faulty UPS and a scheduled generator test.

            But we ran into headaches at a lower level, and so I’m curious how you address those.

            For us, once power was restored, our ESX hosts rebooted automatically as they were configured, as did all the network equipment and the SANs. However, the SANs weren’t fully up (or done with consistency checks) by the time the ESX hosts were ready, and so the ESX hosts marked the SANs offline and then didn’t boot any of the VMs.

            Here our story meets yours… with NO vm hosts… we didn’t have DHCP, DNS, or domain services. We could only log into physical servers with local credentials and static IPs, and had to scramble to remember IPs for all the critical stuff (e.g. the ESX hosts).

            After we got to the ESX host consoles, we began the rebuilding process of marking SANs available, and manually booting VMs. Finally, after several hours we got our vCenter instance back up.

            For us, those ‘lessons learned’ are that we need to put more labels on servers (MACs, IPs, etc), and that even if it’s old fashioned, have a couple of physical and independent servers around (vCenter, DNS).

            But I’d be glad to hear how you dealt with asynchronous boots of network, SAN, ESX, vCenter?

            To everybody else… Good Luck. 🙂

          2. dale scriven’s avatar

            LJ,

            Good point (sorry, I realise this thread is a little old but I still think it’s worth carrying on). There are devices you can buy that delay power being applied to various outlets, so you could, for example, have a time delay of about 30 minutes to give the array enough time to come online before the hosts boot (which can be combined with the auto power-on features included on hosts such as HP 380s). That should resolve the issue.

            But I am still quite surprised that there appears to be no graceful way of performing an intelligent power-on of VMs from a complete cold start. HA will just start everything and only provides 3 levels of priority. Even so, DCs normally take longer to come online than other service VMs, so your environment is still in a mess anyway, and autostart may or may not work.

            Perhaps we should suggest this be added to 5 u2 as it just seems that HA and autostart need to be a little more tied together and everything will be rosy.

            Dale

Comments are now closed.