Easy recovery from a full VMware ESX datastore

This is the third article in a series on VMware vSphere thin-provisioned virtual disks.  With the earlier articles behind you, you may be nearly convinced to start using thin provisioning, but still wondering…

What happens if a datastore fills up?

When a datastore runs out of space, thin-provisioned virtual disks can no longer grow dynamically to accommodate additional storage demand.  When VMware ESX detects this condition, virtual machines that need additional storage are instantly paused to prevent their guest operating systems from failing.  Conversely, VMs that read and write only to already-allocated storage blocks continue running without issue; not every virtual machine is paused just because a datastore is out of space.
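To make the distinction concrete, here is a minimal toy model in Python. It is only a sketch of the behavior described above, not of how ESX or VMFS is implemented; the class names, fields, and sizes are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Datastore:
    capacity_gb: int
    used_gb: int = 0

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - self.used_gb


@dataclass
class ThinDisk:
    name: str
    allocated_gb: int      # blocks actually backed by the datastore
    paused: bool = False

    def write(self, datastore: Datastore, new_blocks_gb: int = 0) -> str:
        """Simulate a write; new_blocks_gb > 0 means the disk has to grow."""
        if self.paused:
            return "paused"
        if new_blocks_gb == 0:
            # Rewriting already-allocated blocks needs no extra datastore space.
            return "ok"
        if new_blocks_gb > datastore.free_gb:
            # No room to grow: the VM is paused instead of the guest failing.
            self.paused = True
            return "paused"
        self.allocated_gb += new_blocks_gb
        datastore.used_gb += new_blocks_gb
        return "ok"


ds = Datastore(capacity_gb=100, used_gb=98)
vm_a = ThinDisk("vm-a", allocated_gb=10)
vm_b = ThinDisk("vm-b", allocated_gb=10)
print(vm_a.write(ds, new_blocks_gb=5))  # needs 5 GB but only 2 GB is free -> paused
print(vm_b.write(ds))                   # touches existing blocks only -> ok
```

In this model only `vm-a` is paused; `vm-b` keeps running because it never asks the datastore for new blocks.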

If you ever find yourself in this situation, it’s not hard to fix.  Here is one simple approach, step-by-step:

  1. Free up some space by deleting or moving files — ISO images or powered-off VMs would be perfect
  2. Resume one of the paused VMs
  3. Use Storage VMotion to move the disks for that VM to another datastore
  4. Resume the remaining VMs
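
The four steps above can be sketched as a small planning function. This is an illustrative Python model, not a vSphere API call: the `PausedVM` type, the sizes, and the choice to migrate the VM that holds the most space are all assumptions of mine (the procedure only says to resume "one of the paused VMs").

```python
from typing import List, NamedTuple, Tuple


class PausedVM(NamedTuple):
    name: str
    allocated_gb: int  # space the VM's disks currently occupy on the datastore


def plan_recovery(free_gb: int, reclaimable_gb: int,
                  paused: List[PausedVM]) -> Tuple[List[str], int]:
    """Return the ordered actions and the free space expected afterwards."""
    actions = []

    # Step 1: delete or move files such as ISO images or powered-off VMs.
    free_gb += reclaimable_gb
    actions.append(f"free {reclaimable_gb} GB by deleting or moving files")
    if not paused:
        return actions, free_gb

    # Steps 2 and 3: resume one VM and migrate its disks elsewhere.
    # Assumption: moving the largest VM frees the most space in one migration.
    remaining = sorted(paused, key=lambda vm: vm.allocated_gb, reverse=True)
    mover = remaining.pop(0)
    actions.append(f"resume {mover.name}")
    actions.append(f"migrate {mover.name} to another datastore")
    free_gb += mover.allocated_gb

    # Step 4: resume everything that is still paused.
    actions.extend(f"resume {vm.name}" for vm in remaining)
    return actions, free_gb


vms = [PausedVM("vm-a", 10), PausedVM("vm-b", 30), PausedVM("vm-c", 20)]
plan, free_after = plan_recovery(free_gb=0, reclaimable_gb=4, paused=vms)
for step in plan:
    print(step)
print(f"free space afterwards: {free_after} GB")
```

With these made-up numbers the plan migrates `vm-b` and ends with 34 GB free, which the remaining VMs can then grow into.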

Watch the procedure in action:

Depending on the size and storage demand of each VM, additional migrations may be needed.  An alternative resolution would be to expand the SAN LUN and then grow the VMFS volume into the new space.

The Experiment

To simulate a sudden storage demand by the thin-provisioned VMs in the above video, I simply copied a large file from a network share to each Windows Server 2003 VM simultaneously.

For the curious, I used a PowerShell script for the task.  It can be run from anywhere, since it uses Sysinternals psexec to remotely initiate a file copy on each VM from a network share.
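
The original PowerShell script is not reproduced here. As a rough illustration of the same idea, the Python sketch below builds one psexec command line per VM and, only when asked, launches them all without waiting. The VM names, share path, and destination folder are hypothetical placeholders; `-d` is the psexec switch that returns without waiting for the remote process to finish.

```python
import subprocess
from typing import List


def build_copy_command(vm: str, source: str, destination: str) -> List[str]:
    """Build a psexec command line that runs `copy` on the remote VM."""
    return ["psexec", rf"\\{vm}", "-d", "cmd", "/c", "copy", source, destination]


def start_copies(vms: List[str], source: str, destination: str,
                 dry_run: bool = True) -> List[List[str]]:
    """Return the commands; launch them too when dry_run is False."""
    commands = [build_copy_command(vm, source, destination) for vm in vms]
    if not dry_run:
        for command in commands:
            # Popen does not block, so the copies start at about the same time.
            subprocess.Popen(command)
    return commands


# Hypothetical names: replace with your own VMs, share, and target folder.
for command in start_copies(["vm01", "vm02"],
                            r"\\fileserver\share\bigfile.iso",
                            r"C:\temp"):
    print(" ".join(command))
```

The default `dry_run=True` only prints the commands, which is a safer way to check them before filling a datastore on purpose. Depending on your environment, psexec may also need explicit credentials for the remote process to reach the network share.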

VMware ESX is Resilient

You may have been surprised at how easy it is to recover from a full datastore, without so much as a guest OS reboot.  It’s a testament to the rock-solid architecture behind VMware ESX and VMFS.  No other virtualization platform comes close.  Try for yourself: see what happens if a group of thin-provisioned Hyper-V virtual machines suddenly runs out of storage.  It’s not going to be pretty.


11 comments

  1. Jason Boche

    Another option for the back pocket: Find a VM that can be powered off. Chances are (and by default) it has no memory reservation configured. The net result is that once the VM is powered off, a VMkernel swap file equal to the size of its assigned RAM can safely be removed.

    Alternatively, creating a reservation equal to assigned memory on the fly will zero out the swap file but the zeroing won’t actually happen until the next power operation of the VM.

  2. Ben Thomas

    Cool post! I would be interested to try that with some preventative alarms to automatically move powered-off VMs elsewhere with PowerCLI. Seems to make a case for an “emergency” empty LUN for situations like this.

  3. Eric Gray

    Jason, thanks for pointing out those additional recovery choices.

    Ben, interesting idea. If you come up with something let me know. Nice blog, by the way.

  4. Stu Fox

    By not pretty you mean worst case when the machines get paused just like on VMware?

    Of course you’d be monitoring the host volume with OpsMgr so it would proactively alert you anyway and you’d have planned for that. Hell, even perfmon would probably do that for you.

    1. Eric Gray

      Stu,

      That would not be the worst case — more like best case. If it ever happens to you, let me know.

      Eric

    2. Stu Fox

      You’ll get warned in the event log at 2GB free, and at 200MB free the VMs will get paused on Hyper-V. That would be the worst case: that you ignored the warnings and didn’t take action. Of course with OpsMgr you would get warned earlier.

    3. Eric Gray

      That is good to know, thanks for the details. So all of the VMs on a LUN get paused at the same time whether or not they need additional storage — then it is a race against the 200MB clock. For the VMs that don’t make it… unpredictable results.

      Monitoring virtualization storage is crucial, which is why vSphere accounts not only for space currently used but for space allocated by thin provisioning — critical for managing risk. As far as I can tell, SCOM doesn’t address this aspect.

    4. Amal

      I’ve not tried this in 4.1 but does ESX perform the same pause function if the datastore goes offline completely? If not, it should. A couple years ago I had a catastrophic loss of service with one of my iSCSI storage arrays. Recovery was simple, but the VM guests had all hard-crashed. It would be nice if those machines could have been put into instantaneous suspension (halt and suspend all CPU and storage calls) until the storage array was able to be brought back online.

    5. subit

      I deleted some servers, but I still don’t see any improvement in free space on the datastore. What could be the reason?

Comments are now closed.