The previous post in this series on Red Hat Enterprise Virtualization (RHEV) explained that the RHEV Manager is not just a mission-critical component of the infrastructure; it’s a huge single point of failure as well.
What happens when the other major component of a RHEV infrastructure fails? Can you rely on RHEV High Availability (HA) to quickly and reliably restart affected VMs when a RHEV Hypervisor fails? It depends – as you will see.
First, let’s make sure everyone is up to speed on HA capabilities provided by the gold standard in virtualization:
VMware HA
VMware HA is a robust feature that was first introduced with VMware Infrastructure 3 in 2006.
VMware vCenter Server is required to configure HA options and add VMware ESX hosts to a cluster, but after that vCenter is hands-off — ESX hosts communicate among themselves to reliably restart virtual machines. In fact, VMware HA can even restart vCenter Server if it is running inside a protected VM — wrap your head around that one.
Powerful options are available for administrators, such as specifying the restart priority of virtual machines and whether or not to force VMs to power off if a host becomes isolated from the rest of the cluster.

VMware has heavily invested in this technology, reducing risk for customers that virtualize with vSphere. For even more information on VMware HA, take a look at Duncan Epping’s HA Deep Dive.
RHEV HA [ha ha]
Looking at this Red Hat Enterprise Virtualization competitive comparison, you might assume that RHEV and vSphere are on equal footing when it comes to protecting virtual machines with HA:

Unsightly details behind the marketing
RHEV HA sounds great in the marketing brochure, but there are a few problems with the execution. RHEV Manager is a single point of failure — running on a physical Windows box — and it’s also the actual brain behind HA. Yes, RHEV-M is responsible for restarting virtual machines when a host fails. If the manager is down, no HA for you!
That alone makes RHEV HA something less than “HA” for most production environments, but there are a few other key weaknesses:
- HA must be manually enabled for each virtual machine — no cluster-wide settings
- No cluster admission control — administrators must manually ensure sufficient capacity would be available in a cluster to accommodate a host failure
- No VM restart priority to ensure the most critical workloads and dependencies are brought online first
- Primitive split-brain protection requires IPMI or other out-of-band management interface to force a host shutdown
- Cannot protect the RHEV Manager itself — chicken-and-egg situation
Wow, I didn’t notice those details in the comparison brochure.
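Without admission control, the capacity check in the list above is arithmetic the administrator has to do by hand. Here is a minimal sketch of that check in Python, under stated assumptions: the `Host` class, the function name, and the memory figures are hypothetical and not part of any RHEV or vSphere API. It looks at memory only and compares totals, ignoring whether each VM fits on a single surviving host, so passing it is necessary but not sufficient.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Host:
    """Hypothetical host record; memory_gb is the memory usable by guests."""
    name: str
    memory_gb: int


def tolerates_host_failures(
    hosts: Sequence[Host],
    ha_vm_memory_gb: Sequence[int],
    failures: int = 1,
) -> bool:
    """Return True if the HA-protected VMs still fit, in aggregate,
    after the `failures` largest hosts are lost (the worst case)."""
    if failures >= len(hosts):
        return False  # no host left to restart anything on
    # Sort ascending and drop the largest hosts to model the worst case.
    surviving = sorted(h.memory_gb for h in hosts)[: len(hosts) - failures]
    return sum(ha_vm_memory_gb) <= sum(surviving)


if __name__ == "__main__":
    cluster = [Host("a", 64), Host("b", 64), Host("c", 128)]
    print(tolerates_host_failures(cluster, [32, 32, 48]))      # True
    print(tolerates_host_failures(cluster, [32, 32, 48, 32]))  # False
```

In the second call the protected VMs need 144 GB, but losing the 128 GB host leaves only 128 GB, so a restart of everything cannot succeed.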
Decide
Whether your datacenter is running Windows Server or the mighty Red Hat Enterprise Linux, doesn’t it make sense to trust the proven leader in virtualization? VMware vSphere is simply the most reliable platform for consolidating workloads and building your private cloud. Going beyond exceptional HA is VMware FT – mirroring mission-critical VMs on backup hosts means zero downtime from host failures.
-
Can’t argue with your findings here, but I can argue with this:
“Going beyond exceptional HA is VMware FT – mirroring mission-critical VMs on backup hosts means zero downtime from host failures.”
You mean mirroring mission-critical 1-vCPU VMs…
-
Hmm, looking at the vmware pricing PDF, it seems that “HA” is available in “Standard” edition, it’s also available in the higher end versions of essentials, for smaller installs I believe.
Good post though, I’ve been a devoted vmware fan/user for about 11 years now, though mid-longer term I have grave concerns about where EMC is trying to take them, I plan to stick with vSphere for at least v4, and re-evaluate options again in 2011/2012 whenever the next refresh comes around.
-
You must be deathly afraid of RHEV with all this bashing…
VMware HA is nothing more than a rebirth of Legato AAM with some tweaks for esx.
Unless something has changed, your comment “ESX hosts communicate among themselves to reliably restart virtual machines.” isn’t entirely accurate unless you add a caveat of “under most conditions”. They can only reliably restart virtual machines if a primary node is still alive.
Marcel has done a good job of covering the basics:
http://up2v.wordpress.com/2009/03/04/dc28-vmware-ha-cluster-in-enterprise-deep-dive-and-best-practises/
-
Yes … you need a primary node … so if you are unlucky enough to lose all 5 primary nodes at the same time, then HA will not work.
Very likely situation, huh???
-
Depends on the definition of “likely situation”. Depending on the size of the cluster, the type of servers, etc., yes, it’s an entirely realistic scenario.
If your definition is “happens on a weekly basis” then the answer is no. But with that sort of approach, why have redundant power, raid-6, or redundant fabrics? Screw it, let’s just pretend there’s no failure modes and hope customers never ask.
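For a rough sense of scale, the odds in this exchange can be sketched with a hypergeometric calculation. This assumes, purely for illustration, that the 5 primaries are placed uniformly at random among the hosts; the function name and the cluster sizes are hypothetical.

```python
from math import comb


def p_all_primaries_lost(cluster_hosts: int, failed_hosts: int, primaries: int = 5) -> float:
    """Probability that every primary node is among the failed hosts,
    assuming primaries are spread uniformly at random across the cluster."""
    if failed_hosts < primaries:
        return 0.0  # fewer failures than primaries: at least one survives
    return comb(failed_hosts, primaries) / comb(cluster_hosts, primaries)


if __name__ == "__main__":
    # One enclosure holding half of the cluster fails.
    print(f"{p_all_primaries_lost(16, 8):.4f}")   # 0.0128
    print(f"{p_all_primaries_lost(32, 16):.4f}")  # 0.0217
```

Under that assumption, an enclosure failure that takes out half of a 16-host cluster loses every primary about 1.3% of the time. Real placement is not random, though: if the hosts that became primaries all sit in one enclosure, the probability for that enclosure is 1, which is why host placement matters more than the raw odds.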
-
So, you are suggesting that losing 5 ESX hosts at the same time is realistic and might happen anytime?
-
You’re suggesting a blade enclosure has never failed in the field, never will, and if it does happen, it’ll be during a scheduled outage?
I’m not sure if you’re trolling or just have 0 experience in a datacenter. Either way, yes, I’m saying it’s entirely possible to lose 5 servers in one shot. I hope you aren’t in charge of architecture.
-
No personal offenses please, let’s maintain professional discussions here.
It is a well-known best practice to spread blades across different enclosures to avoid this problem. Blade enclosure failures are rare, a minimal possibility. If that happens, even with HA, you are in serious trouble, since you will lose many, many hosts at the same time, and capacity will be an issue (unless you have, let’s say, 40-50% idle capacity on your cluster).
-
So why exactly do you find it important to spread blades across enclosures, but find it completely unimportant to inform end-users of the primary node issue? It would seem to me those two actions/beliefs are entirely in conflict with one another.
Lots of things are “rare”, that doesn’t mean they should be ignored or dismissed with a sarcastic response, as you chose to do.
-
Again, spreading blades across enclosures is a well-known practice; what is your concern here?
You pointed out a possible problem, and I showed that well-designed environments will never suffer from it.
This primary node “issue” is not a secret; it is very well known. It will happen very, very rarely, so you cannot say it is a general problem that will hit everyone.
And again, blade enclosure failure rates are minimal to none, given a decent datacenter infrastructure.
-
Maybe I missed something, but as Fernando said, the primary node issue is not a secret, and I saw no effort to conceal it from end users.
Actually, if an admin is not too lazy to open Google, he would know that. If an admin is not too lazy to search the internet for what “slot size” means in HA Runtime Info, he would know exactly how HA works, what a primary node is, and what an HA slot is.
-
LOL…I’ve seen a chassis die, which is why I balance ESX hosts across 2 chassis
I’ve seen facilities electricians cut UPS shutoff switch lines, entire racks blow BOTH circuits, sewer lines back up and flood raised floors. I dunno maybe I just have bad luck, but ANYTHING is possible!
-
Agree ! But in this case, HA will not save you !!!
-
I mean, nothing will save you in a situation like this.
-
My Red Hat KVM clusters are completely bulletproof (unless we lose both geographically separated datacenters).
I use LVM for my virtual disks and mirror them to storage at the other datacenter, create bonded links across PCI NICs to separate switches, and build my cluster with blades in different chassis at both locations.
And, by the way, you don’t need IPMI or out-of-band mgmt interfaces to fence nodes. SCSI persistent reservations isolate physical machines from shared storage quite elegantly.
Think about this scenario though. Let’s assume you have an HA VMware VM running web services and the VM is fine, but the HTTP service on the guest crashes. For my KVM clusters, I can configure a cluster of virtual machines running XVM fencing and configure a clustered HTTP service, using shared storage, on them. This way, even if my KVM guests are all healthy from an OS perspective, my HTTP services migrate automatically on failure. I could also just run the HTTP directly on the physical blades using the same cluster and shared storage as my HA KVM guests.
Everything I need for this, from nuts to bolts on the software side, you get with Red Hat. That means I have one support channel to deal with. This is really the beauty of it all.
The per socket cost of VMware ESX is outrageously expensive when you consider the alternatives. Especially if you aren’t fully utilizing the ESX server. On top of the substantial cost of VMware, you still need to pony up for the OS and OS support you will need to put on your virtual guests. And if you have issues with your guests, don’t be surprised that if you call VMware they tell you you have an OS issue or vice versa from Microsoft, Red Hat, Sun, etc. Who needs a middle man?
The support model for OS vendors creating their own virtualization technologies ends the finger pointing games, the bringing down of guests to install new versions of VMware tools, etc. In my opinion, VMware should be a bit nervous right now and they should be slashing the pricing model of their products to prevent further bleeding. When heavy hitters like IBM start using KVM/RHEV for their clouds like I was just reading about, the writing is on the wall.
-
Hey dude, what the fuck is wrong with you? RHEV-M is a work in progress and costs a fraction of what VMware does. I suffered using VM Infrastructure 3, but now the new vSphere is better. RHEV-M is still an infant; give the developers some time and then compare. Most of your facts are outdated. The latest RHEV comes with KSM and other features that work well for me. Are you using RHEV-M? I am. I have implemented it in at least 10 data centers and all my customers are happy. So RHEV-M works and it is simple.
-
According to this document, RHEV-M can be HA’d
http://www.redhat.com/f/pdf/rhev/final2.2/DOC255_RH_WP_RHEV_D_2832287_0610_ma_web.pdf
It’s also possible to provide HA for RHEV-M using RHEL’s Cluster Manager.
-
On any RHEL server, KVM or RHEV-M, you can set up “shared nothing” HA clusters using DRBD (Distributed Replicated Block Device) http://www.drbd.org/.
I used DRBD on a KVM cluster as the shared storage and I have complete HA… without even shared storage. Since I have a complete copy of the data, I am literally sharing nothing. I am pretty sure that you can’t use DRBD on ESX because the VMkernel handles ALL access to the VMFS datastores. The Linux kernel for the SC can’t get to that data.
So, for RHEV-M, just set up DRBD on the raw partitions where RHEV-M resides and you are all set… at the block level. You will need access to all the VLANs at both locations, though, if you want the KVM guests to have networking when you switch over.
-
John, DRBD is great software with one problem: you can’t buy support for it, and you can’t get any guarantees on time-to-reply and time-to-fix.
This is critical for some enterprises.
-
-
Do they provide support in all time zones, in all languages, and 24x7? Do they have representatives in Europe and, what especially concerns me, in Russia?
-
They are headquartered in Austria. Not sure on the Russian support, but your English seems to be good enough on this forum.
I will concede, and this is for VMware as well, if you don’t have skilled on-site IT people, you probably don’t deserve to be running on the enterprise level. DRBD is no different. For me, it is just an extra level of data protection. I still do regular backups of the primary storage. If DRBD fails, you still have access to your primary node for whatever that’s worth.
-
Don’t just comment and not update. Who the hell asked you to compare with something in beta release? The brochure is for the complete release 3.0. The things you said RHEV doesn’t have are already there. It looks like everything about VMware is the best. Are you paid by them to write this? Check the release updates of RHEV and comment.
-
RHEV HA sounds great in the marketing brochure, but there are a few problems with the execution. RHEV Manager is a single point of failure — running on a physical Windows box — and it’s also the actual brain behind HA. Yes, RHEV-M is responsible for restarting virtual machines when a host fails. If the manager is down, no HA for you!
>>> Incorrect, the manager can be made highly available.
That alone makes RHEV HA something less than “HA” for most production environments, but there are a few other key weaknesses:
* HA must be manually enabled for each virtual machine — no cluster-wide settings
>>> So?
* No cluster admission control — administrators must manually ensure sufficient capacity would be available in a cluster to accommodate a host failure
>>> Are your administrators unable to perform the simple math necessary to determine if enough capacity exists to handle Virtual Machine HA?
* No VM restart priority to ensure the most critical workloads and dependencies are brought online first
>>> Yes, in RHEV 2.2 which is released now
* Primitive split-brain protection requires IPMI or other out-of-band management interface to force a host shutdown
>>> Fencing a physical host is something that is used in many cluster solutions. How is this primitive?
* Cannot protect the RHEV Manager itself — chicken-and-egg situation
>>> You can cluster a Windows virtual machine using KVM and Red Hat Cluster Suite to provide a highly available Manager. You could also cluster the application server and database to provide HA at that layer.
Wow, I didn’t notice those details in the comparison brochure.
>>> You also didn’t mention that it’s 25% the price of VMWare, the next version of the hypervisor scales to 4096 cores, and is going to leverage SELinux, something that VMWare will never do.
Have a nice day!
-
Thanks Bro.. Finally somebody who understand RHEV. You forgot to mention, it is also OPEN SOURCE. You can request for the source code from REDHAT, something VMWARE will never do.
Bye..
-
The point is that every organization makes tall claims about its product’s features and no one highlights the weaknesses.
VMware claims FT is such an important feature. How often is the customer actually told:
That it can have only 1 vCPU, and that with up to 20% overhead. I’ve heard VMware claim that FT can make RAC redundant. Imagine running a DB VM with a single vCPU and up to 20% overhead.
That it requires a dedicated Gigabit Ethernet network between the physical servers, and that 10 Gigabit Ethernet should be considered if VMware FT is enabled for many virtual machines on the same host.
That you can’t use memory over-commit, thin provisioning, hot-plugging of devices, or even snapshots with FT. All the features that VMware claims are critical for a virtual environment and charges you $$$s for.
Xen 4.0 also provides FT, possibly with a lot less overhead and with multiple-vCPU support.
-
If only that were true.. There is this wonderful thing called human error, like the DC engineer who throws the wrong circuit breaker after a power feed failure and turns your entire rack off and off goes your fancy blade enclosure (and I had this happen last week so it does happen!)
-
Hi Eric, love your blog!
Silly question: how do you make these images/screenshots look like this?
Like it’s paper that has been cut.
TIA
-
Thank you Eric.
-
Guys, I hope the Red Hat vs. VMware argument is still alive here, since I need to make a decision on the same. I have some critical web servers internal to my business that would have squid-apache-mysql installed. I am looking for an HA solution, either from Red Hat or VMware HA/DRS, but I’m not too sure which one to choose considering reliability, license cost, and how quickly failover of services can be triggered. I proposed a RHEL cluster with two servers hosting ESX in which VMs would be running, so that one server can fail over to the other in the event of a hardware failure, but I had one VMware guy propose an HA/DRS solution. Let me know honestly which one I should go with, mainly in terms of implementation complexity, reliability/stability, and response
thanks.
-
RHEV is fully capable of HA both at the guest level, as a feature, and at the management level, through environment design.
http://www.berezins.com/2012/09/18/highly-available-rhev-manager-3-0/
-
Bottom line: VMware is Virtualization for Dummies (with alot of money to waste). KVM/RHEV is for real admins/engineers who know how to do stuff. It is infinitely more configurable than vmware with its Windoze-style pull-down menus and check boxes – it even has a native virtualization shell so you can script virt. related stuff (virsh). Try that vmware (and no, that vmware-cli crap is not native). Dont know what – if anything – vmware is doing with powershell, but Im sure it will be windows-style retarded.
I have watched vmware go from a pretty open platform running on linux to a sort of vm-hosting appliance running on a crippled linux base. Vmware is on the forefront of dumbing down system admin (a trail microsoft blazed in the 90s), and thus chasing away real admins who have technical prowess and ability to do stuff competently on an enterprise level (see former unix admins like myself). I forsook (past tense of forsake) vmware as soon as I saw the brilliance and true ‘nix-type brilliance of KVM. Nuff said. JG
