It’s not Hyper-V — it’s the drivers!

VMware vSphere experts know that the ESX architecture has a critical advantage over other hypervisors like Xen, KVM, and Hyper-V.  Instead of relying on general-purpose third-party device drivers, VMware ESX comes with hardened, stress-tested drivers — ready for your toughest enterprise workloads.

Windows deserves applause for reliability improvements in recent years.  Unfortunately, the most reliable Windows design will never be able to counteract the damage that can be inflicted by a misbehaving device driver.  In fact, take a look at this slide from a Mark Russinovich TechEd session where he makes the point that the majority of Windows blue screens (BSODs) are caused by third-party drivers:

Experts agree — Windows reliability is a function of driver reliability.

What about those drivers included with Windows?

You might be under the impression that these third-party drivers are for off-brand NICs purchased from the clearance bin at Fry’s.  That’s not exactly the case — plenty of drivers are included right on the Windows DVD.

Consider this recent situation detailed by Mark Wilson.  He set up a Hyper-V test machine with some Intel PRO/100 NIC cards and when he plugged in an Ethernet cable — instant BSOD!

Mark’s conclusion:

It’s good to know that Hyper-V was not at fault here: sure, it shows that a rogue device driver can bring down a Windows system but that’s hardly breaking news…

Meanwhile, back at the datacenter

Microsoft Hyper-V poster child Crutchfield recently found that the Broadcom NICs embedded in every one of their Dell servers were randomly losing virtual switch bindings after upgrading to Hyper-V R2.  Hello downtime!

The administrator left us with the following thoughts:

Currently, Microsoft doesn’t have a work around for this issue and nor is there an ETA for resolution. After all, it may not be Microsoft, but could very well be Broadcom.

So, what do we do? Well, in my configurations I can’t afford these little gotchas and I will be working only with my trusted Intel NIC’s for my virtual networks.

Uh, you might want to check with Mark Wilson first about Intel drivers and Hyper-V.

The chain breaks when the weakest link fails

Notice the consistent theme in these two scenarios?  The administrators are not blaming Hyper-V, instead focusing on the third-party hardware and drivers.  Folks, it just does not matter — a hypervisor failure is incredibly disruptive.

When selecting the foundation for your private cloud, will you choose a hypervisor based on a general-purpose OS that aims to support hundreds of hardware variations from across the spectrum?  Or will you build on a rock-solid, purpose-built platform that is specifically designed to aggregate and pool resources in your datacenter?


49 comments

  1. TimC’s avatar

    It’s not just Microsoft. I have a pair of supermicro servers in the lab that will reboot if you attempt to rescan for a LUN on any version of ESX.

    To claim VMware has some sort of superiority in the driver arena is bs. I had a customer just the other day who has some of the new QLogic 10Gbit cards that are fully supported. When they updated to vsphere4u1, which came with a new driver, the cards stopped working. Something in the new drivers causes VLANs not to route at all. VMware support had no answer as to what was wrong, and no ETA on a fix.

    1. Anton Zhbankov’s avatar

      Are these SuperMicro servers and their HBAs on HCL?

      1. TimC’s avatar

        The HBA’s were, I don’t recall on the servers.

        The QLogic cards are, as are the multiple servers they were tested in, so clearly being on the HCL doesn’t make a lick of difference.

        1. Anton Zhbankov’s avatar

          HCL for servers actually means NICs + RAID. In 99% of cases ESX doesn’t really care about CPUs or chipset.
          Sad to hear that VMware doesn’t even have an ETA for this problem.

          I have a friend in another big telecom company and he deleted Hyper-V after spending a week trying to force Hyper-V to work with Broadcom NIC teaming. ESXi works perfectly, but you have to perform some wild dances naked in a night forest under a full moon to figure out how to make it work in Hyper-V.

          1. TimC’s avatar

            I can tell you the NIC’s were for sure on the HCL, they were intel gigE’s and worked fine.

          2. Ben Armstrong’s avatar

            This one always makes me laugh:

            VMware: Our drivers are better – because they are built just for virtualization

            Microsoft: Our drivers are better because we have eons of experience testing and qualifying drivers, formal programs for qualification, have been used for critical workloads longer than x86 virtualization has existed, and they are deployed on much larger scales than VMware ever does.

            Even if I did not work at Microsoft (which I do) I would see that Microsoft drivers receive more thorough testing and validation than VMware drivers do.

            Cheers,
            Ben

            1. Anton Zhbankov’s avatar

              What exactly does “we have eons of experience testing and qualifying drivers” mean? Spending 10 times less time testing each driver, but for 100 times more drivers, only means that 10 times more time was spent on testing in total.
              I don’t care about this number. I can actually afford one reboot of my desktop a month because of drivers, but one reboot of virtualization host? No, sir!

              What I want as big telecom customer – stability. Super stability. 24*7*365.

            2. Alberto Farronato’s avatar

              Ben,

              Microsoft may as well have a ton of experience testing and qualifying drivers as you say, but the data that Eric shows at the beginning of the post suggests that it is not making a difference from a stability stand point.

              Ciao
              Alberto (I work for VMware)

            3. Mark Wilson’s avatar

              @Eric You have taken my situation out of context. As my blog post describes, the BSOD only occurred with a particular checkbox active (that’s new for Hyper-V R2) and using old, unsupported, and frankly too-slow-for-any-decent-virtualisation-workload (100Mbps) network adapters. Turning off the Large Send Offload in the adapter properties also fixed the problem so I didn’t even need to switch out the drivers.

              Here’s the other side of the coin (which I decided not to highlight in my post in order to avoid riling ESX fanboys) – I can’t even run ESX or ESXi on some of my hardware, because the ESX architecture integrates drivers into the hypervisor – see http://www.markwilson.co.uk/blog/2009/08/failing-to-run-vmware-esxi-on-a-notebook-computer.htm

              At least Hyper-V gives me the flexibility to do something about it, rather than waiting eons for VMware to give me a solution (or not).

              1. Anton Zhbankov’s avatar

                Mark, please tell me – is it really a problem that an enterprise virtualization solution can’t run on some notebook? Would you prefer a cluster of notebooks instead of IBM or HP blades in your datacenter?

                The only reason I see to run it on notebook – demonstration purposes. But you need hardware virtualization for Hyper-V and you can not run Hyper-V in Hyper-V VM, while ESX runs perfectly under ESX and can even run its own VMs.

                1. Mark Wilson’s avatar

                  Hardware virtualisation support is available in most modern PCs and servers – that’s not a problem. And yes, the example I cited of not being able to run ESXi was for demonstration purposes: in this case, ESX in a hosted VM would not have helped us.

                  In a datacentre, of course I’d use the right hardware for the job. Incidentally, I’d also use the right software – which might be vSphere, or it might be Hyper-V/SCVMM (or even XenServer).

                  We’re drifting off point here though: Eric’s post is about drivers causing hypervisor failure. Except that he only picked up on the points in my post that suited his purpose, not that the NICs in question were out of support from both Intel and Microsoft, and that I needed to make particular configuration changes to cause a problem. The issue here is a VMware employee picking up on a given situation and twisting it to suit his employer’s PR. Informed? Maybe. Virtualization? Well, yes. Criticism? Certainly! Objective? Certainly not!

                  There are two vendors in this conversation with two architectures. Both maintain that theirs is better. In reality, sometimes one is a better fit for customer requirements than the other, and vice versa – but, as a service provider, I’d like to have some flexibility.

                  1. Anton Zhbankov’s avatar

                    >Hardware virtualisation support is available in most modern PCs and servers – that’s not a problem.

                    Do you really think so?

                    I have brand new HP desktop with dual core CPU *without* hardware virtualization. So I can not run even Windows XP Mode under my Windows 7. I also have brand new SONY notebook with CPU that does not support hardware virtualization.

                    Actually it was not Eric who spoke about drivers in this case – it was Mark Russinovich. 70% of BSODs caused by drivers – that’s a horrible number for a solution with very wide hardware support. I suppose even Realtek NICs are supported, but is it an advantage over ESXi?

                    >The issue here is a VMware employee picking up on a given situation and twisting it to suit his employer’s PR.

                    Not sure. Eric is a VMware employee, yeah, but I haven’t seen him fulfilling PR requests. There is no need to fabricate arguments against Hyper-V with the good old MS marketing team around.
                    Hyper-V is an interesting product for sure. It is very interesting for SMBs because it sits on a general-purpose OS with wide hardware support, but that is the same reason Hyper-V is not really welcome in big enterprise. There is one Hyper-V installation in Russia that was shown by MS as a perfect case, but…
                    1) MS supports them for $0. Just to be able to wave a flag
                    2) There were no critical services under Hyper-V
                    3) The IT manager of this customer was at the vSphere Launch in Russia and he asked me about options to migrate to vSphere. Even free support and dedicated engineers were not enough to continue Hyper-V usage.

                    1. Grigory Orlov’s avatar

                      >There is one Hyper-V installation in Russia that was shown by MS as perfect case, but…
                      Strange, but here (http://www.microsoft.com/casestudies/Case_Study_Search_Results.aspx?Type=1&ProTaxID=14973&CouTaxID=1489&LangID=64) I can find 19 cases with Hyper-V, half of them cases for big companies (like GazPromNeft) – you can compare with only six cases for VMware in Russia (http://www.vmware.com/a/customers/country/178/Russian+Federation).

                      1. Anton Zhbankov’s avatar

                        Grigory, what do you find strange – that VMware published fewer cases than Microsoft?

                        1) Cases for big companies are not always big cases.
                        2) It is not strange at all that after 2 years of very aggressive marketing, including lies, Microsoft has some cases.

                        It wouldn’t be news for you that most such decisions are made by top-management for political reasons or under influence of professional sales and there was no technical comparisons, no technical experts were asked etc.

                        I said what I heard, nothing more.

                        After all, Hyper-V, especially R2, is not THAT bad a solution, but it seriously lacks a lot of features that vSphere has. As Eric said, the problems with drivers and patches are not THAT bad with Hyper-V, but they exist! We’re engineers, so let all the marketing bullshit be thrown away.

                        Actually it is not Hyper-V we discuss 90% of time in such conversations, but Microsoft marketing statements about it.

                      2. Eric Gray’s avatar

                        Is it really out of context? I summarized the situation, linked to the article and quoted the portion that supports my thesis — Hyper-V relies on third-party general purpose OS drivers that are the most common cause of Windows crashes.

                        Additionally, it is amusing that Hyper-V fans are very generous when confronted with trouble. “Whew, it wasn’t Hyper-V, it was the network driver [SMB stack, kernel, IE,…].”

                        1. Mark Wilson’s avatar

                          Sorry it’s taken a couple of days to reply Eric – I should be grateful that you linked to my post (thank you).

                          I don’t think we’ll agree on the outcome but the discussion has been… interesting. And I’m glad to provide some amusement 😉 even if I did try to make the point in another comment that I work on both sides of the divide… and that this VMware vs. Microsoft nonsense is not helpful for customers (or partners)!

                        2. Fernando’s avatar

                          Mark,

                          The inability to run ESX on some old and crap HW is by design. Exactly to avoid problems like the ones Eric mentions.

                          Ben,

                          All these years of driver certification, and still MS itself acknowledges that drivers are responsible for 70% of BSODs.

                          1. TimC’s avatar

                            It’s not just “old and crap HW”. I’ve got a pair of rackable systems servers that worked just fine under esx 3.5, and VMware just randomly decided to drop support for the onboard SATA controllers in 4.0, leaving me no upgrade path for servers that aren’t really all that old (opteron 2216 CPU’s, amd based motherboard chipset).

                            Granted, the servers aren’t “on the HCL”, but Hyper-V, Citrix/Xen, Solaris/Xen, RHEL/KVM all work fine. Had they never worked, fine, VMware is a young company and can’t cover every piece of hardware under the sun, but to remove support for something that previously worked? That’s more than a bit frustrating. And in a home lab I’m not going to drop $5k on a pair of comparable new dell’s.

                            1. Anton Zhbankov’s avatar

                              >leaving me no upgrade path for servers that aren’t really all that old

                              Buy some cheap supported controllers. Isn’t it an option? You don’t have to buy new servers with hardware virtualization to run ESX 4. But you can’t run even Hyper-V 1.0 without it.

                            2. Fernando’s avatar

                              Which server are you talking about ?
                              I must agree with you if VMware removed support for HW devices.

                            3. Mark Wilson’s avatar

                              Yep. It’s by design. And sometimes that design choice is limiting. Other times it works well. Incidentally, ESX would run on the “old and crap” NICs that caused me some issues… but not on some newer (admittedly not server-class) hardware.

                            4. Alberto Farronato’s avatar

                              Mark,

                              ESXi is not a video game. The fact that you can’t run it on your notebook is actually a good thing.

                              Alberto

                              1. Mark Wilson’s avatar

                                Sure, ESXi is not a video game. But our consultants can’t (practically) take servers from the VMware HCL on the road to see customers and hosted virtualisation solutions (e.g. VMware Workstation) are not always suitable either. That’s just one scenario where ESX/ESXi would not suit our business but Hyper-V did. Equally there are others where we need vSphere functionality that’s not available with Hyper-V/SCVMM.

                                My issue is not with ESX or Hyper-V. My issue is that this VMware vs. Microsoft nonsense is not helpful for customers.

                                Mark

                                (VCP and MVP for Virtual Machine technology… not employed by VMware or Microsoft… working for a global systems integrator in the real world with real customers)

                                1. Anton Zhbankov’s avatar

                                  Bad, bad ESXi. It can’t run on certain notebooks you’ve bought for consultants.

                                  I’m end customer and I don’t care that ESXi can or can’t run on notebooks at all. What I care about – how stable ESXi is on my brand servers with top Xeons and a lot of expensive memory, not about comfort of some consultants that in most cases know less than me about things they want to sell me.

                                  1. Anton Zhbankov’s avatar

                                    Sorry, forgot to mention my titles.
                                    MCSA, VCP, VMware vExpert. Senior infrastructure administrator of real big telecom customer.

                                    1. Mark Wilson’s avatar

                                      (I only mentioned my titles to say I work on both sides of the fence)

                                    2. Shawn’s avatar

                                      I love the fact that I can plug my GPS into my Hyper-V server and update it.
                                      My kids love the fact that they can play all of their games.
                                      Having access to my receipt scanner is always nice, I can sit in the server room and do my expenses.
                                      I can watch movies, play online poker…
                                      There’s no end to the usefulness of the Hyper-V parent partition!
                                      Who wants a boring ESXi console?

                                      1. Fernando’s avatar

                                        LOLLLLLLLLLLLL !!! I assume you are kidding obviously, otherwise, this sets the record for the most fun competitive argument ever !

                                        1. Anton Zhbankov’s avatar

                                          Actually I saw once a problem report that Windows games were missing in new build of well known high end voice mail solution running on NT4 for high end PBX.

                                        2. Eric Gray’s avatar

                                          That was one of the funniest comments I have seen. Thanks for your contribution!

                                        3. jamnose’s avatar

                                          LoL m not sure whether u work for the company to play games in Hyper-V, is that really helps you to play games umm… mostly i have to agree with you, instead Hyper-V mostly useful to play games nothing more 😛

                                        4. Ben Armstrong’s avatar

                                          Fernando / et al.,

                                          Another perspective on this – what is missing here is some scope. Yes, 70% of crashes are caused by drivers, but without understanding the % of systems crashing compared to systems deployed this number is academic.

                                          Think of it this way (leaving out company names):

                                          Company 1: Of 10,000 deployments 1,000 crashed in our code and 700 crashed in the drivers.

                                          Company 2: Of 10,000 deployments 300 crashed in our code and 700 crashed in the drivers.

                                          Company 1 says: ~40% of crashes happen in the driver code on our system.
                                          Company 2 says: ~70% of crashes happen in the driver code on our system.

                                          Given all of this data – we can actually see that the driver quality is the same (in this hypothetical scenario). Unfortunately in this case we do not have the full numbers on how many deployments VMware / Microsoft has, how many crashes there are total, etc…
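The hypothetical above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the comment: the deployment and crash counts are the made-up figures quoted above, and the function and key names are illustrative, not taken from any Microsoft or VMware data.

```python
# Sketch of the hypothetical two-company comparison from the comment above.
# All counts are the invented numbers from that comment; names are illustrative.

def driver_stats(deployments, own_code_crashes, driver_crashes):
    """Return the two ratios being contrasted in the comment."""
    total_crashes = own_code_crashes + driver_crashes
    return {
        # What a "% of crashes caused by drivers" slide reports.
        "driver_share_of_crashes": driver_crashes / total_crashes,
        # What you would need in order to compare driver quality across platforms.
        "driver_crashes_per_deployment": driver_crashes / deployments,
    }

company_1 = driver_stats(deployments=10_000, own_code_crashes=1_000, driver_crashes=700)
company_2 = driver_stats(deployments=10_000, own_code_crashes=300, driver_crashes=700)

for name, stats in (("Company 1", company_1), ("Company 2", company_2)):
    print(
        f"{name}: {stats['driver_share_of_crashes']:.0%} of crashes are in drivers, "
        f"{stats['driver_crashes_per_deployment']:.0%} of deployments hit a driver crash"
    )
```

The driver share of crashes differs (roughly 41% versus 70%) only because the non-driver crash counts differ; the driver crash rate per deployment is 7% in both cases.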

                                          The purpose of this presentation (when originally given) was to talk about the health of the Windows ecosystem and platform as a whole. This is why the figure is a comparison of driver crashes to total crashes.

                                          To make an assertion about the quality of drivers on one platform versus another platform, what you need is the percentage of driver crashes compared to total number of deployments.

                                          Neither Microsoft nor VMware has provided this information. And I believe it is foolish of Eric to try and use the data above to try and paint a picture of Windows driver quality.

                                          Cheers,
                                          Ben

                                          1. Fernando’s avatar

                                            Well, of course we do not have such numbers, but ESX is well known for its reliability, to the point that Redmond Magazine has recognized it.
                                            And this includes the drivers.
                                            Regardless of the exact percentage, any reasonable person can conclude that carefully tested and hardened drivers for fewer, enterprise-class-only HW devices are more stable.
                                            And facts can support this, as Eric points out.

                                          2. Eric Gray’s avatar

                                            If I gave you the impression that this article was about extrapolating the absolute quantity of system crashes and that ESX has fewer than Windows, then I need to be more clear next time.

                                            I am trying to get people to think about Hyper-V as part of an irreducibly complex system that depends on Windows drivers, kernel, networking, and other services to function.

                                            Last month there was a fair amount of “first Hyper-V patch ever!” rhetoric, which completely dismisses the reality that the module that consists of the Windows hypervisor is not an autonomous system — the rest of the Windows OS is just as much a part of the hypervisor host. This is not a patching discussion, so I will stop there.

                                            I do find it amusing that you are so quick to jump on me about misusing numbers when Microsoft execs have been telling bold-faced lies about Hyper-V stats since the beginning. “One million Hyper-V downloads” and “24% market share after one month” are two favorites.

                                          3. Alberto Farronato’s avatar

                                            Mark,

                                            I recognize that ESXi may not be as good as Hyper-V when it comes to demos – kudos to Microsoft for getting Hyper-V to be the best in this use case – however, I am pretty sure that your consultants are more interested in being able to solve customers’ problems using a reliable virtualization platform rather than just showcasing software.

                                            Ciao
                                            Alberto

                                            1. Mark Wilson’s avatar

                                              Alberto,
                                              “more interested in being able to solve customers’ problems using a reliable virtualization platform”

                                              Yep. And ESXi, Hyper-V, Xen or many other hypervisors could do that for me, although reading this blog post (and some of the comments), one could be forgiven for thinking that there was only one reliable hypervisor in the world.

                                              It’s late on this side of the pond, and I’m tired, so there will be no more comments here from me tonight. Have fun.

                                              Mark

                                            2. Jason Boche’s avatar

                                              It seems to me that the issue is with Intel and Broadcom. I use Olicom Token-Ring drivers and thus am not impacted.

                                              ::passes the token to the next responder::

                                              1. Eric Gray’s avatar

                                                I hear that Live Migration really flies over 10Gb Token Ring.

                                              2. Brian’s avatar

                                                Still bringing the Token ring love. Ha

                                              3. Stu Fox’s avatar

                                                Which just goes to show that no good deed goes unpunished…

                                              4. Eric Gray’s avatar

                                                Great discussion — I really appreciate that we could have a debate like this without dipping into personal attacks.

                                                Fernando, Anton: you guys are the best. Thanks for being Friends of VCritical(tm).

                                              5. Rob McShinksy’s avatar

                                                Seems this article is a little bit of a soap box, Eric. Love the site, but fact is, with any hypervisor upgrade or any upgrade on any platform there can be issues, drivers or not. VMware is no stranger to this fact. I will agree that Microsoft has had driver issues in the past, but the mass of hardware it “can” install Hyper-V on is both a strength and a weakness, and people definitely push this limit. In a large virtual enterprise environment, I would expect that there would be upgrade testing on exact hardware with appropriate backout plans. The latter should always be considered to make sure downtime is reduced.

                                              6. Nik Simpson’s avatar

                                                Two points,

                                                1. The slide showing those driver crash statistics includes the desktop versions of Windows where bad drivers or flaky hardware are much more common.

                                                2. The %age is useless anyway. What you care about is frequency. Let’s say I have a server OS that crashes ten times in a year, with seven of them related to driver problems, now somebody wants me to run a server OS that crashes 100 times a year, with 70 of them due to driver problems. Both have the same percentage of crashes related to drivers, but I know which one I’d want!
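Point 2 can be put in numbers with a short Python sketch. The crash counts are the ones from the example above; the class and field names are purely illustrative.

```python
# Sketch of the two hypothetical server OSes from point 2 above.
# Crash counts come from that example; the names here are illustrative only.
from dataclasses import dataclass

@dataclass
class CrashProfile:
    crashes_per_year: int
    driver_crashes_per_year: int

    @property
    def driver_share(self) -> float:
        """Fraction of all crashes attributed to drivers."""
        return self.driver_crashes_per_year / self.crashes_per_year

os_a = CrashProfile(crashes_per_year=10, driver_crashes_per_year=7)
os_b = CrashProfile(crashes_per_year=100, driver_crashes_per_year=70)

# Same percentage of driver-related crashes...
print(f"driver share: A={os_a.driver_share:.0%}, B={os_b.driver_share:.0%}")
# ...but very different crash frequency.
print(f"crashes per year: A={os_a.crashes_per_year}, B={os_b.crashes_per_year}")
```

Both profiles report a 70% driver share, yet the second one crashes ten times as often, which is why the percentage alone says little about reliability.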

                                                Frankly, I’d be amazed if device drivers weren’t the cause of the majority of crashes in any x86 OS, that’s where you get closest to the hardware and where errors are almost invariably unrecoverable for the OS.

                                              7. Anton Zhbankov’s avatar

                                                Another MS marketing bullshit in Russia: http://virt.cnews.ru/info

                                                *Virtualization* championship

                                                Here are rules:
                                                Number of servers – 1 point for server, 5 points max
                                                Hyper-V – 10 points
                                                Hyper-V R2 – 10 points
                                                System Center – 5 points
                                                High Availability – 3 points
                                                Web server – 5 points

                                                100 hosts vSphere project with High Availability and Site Disaster Recovery will lose 18 to 25 to Hyper-V project with 2 low-end hosts and 5-6 VMs.

                                                What is it? It’s not just a usual marketing bullshit, it’s highly purified and concentrated bullshit.

                                              8. Rob McShinsky’s avatar

                                                It’s not VMware…..It’s the Drivers…
                                                Since you pointed out Roger’s apparent Hyper-V trouble with the Broadcom drivers.

                                                http://bit.ly/aHawrH

                                                http://bit.ly/aHsUQx

                                                I will not argue that drivers are a problem at times, but when researching problems with drivers or firmware within my Hyper-V environment, I usually find potential solutions from a VMware site or blogger. Thanks for blazing the trail.
                                                As for the marketers of Hyper-V, their job is to create interest in the product. I am sure you have worked with enough “technical” sales reps to know that their representation of the product is approximately accurate. To imply that VMware does not share in this common tactic is a bit naive.

                                                1. Fernando’s avatar

                                                  The second link you provide does not show a driver problem, but a HW problem.
                                                  The first one indeed is a driver problem, and here is the interesting part: A VMware patch for it, since VMware controls the entire stack.
                                                  If that problem happens with Hyper-V, there’s no other solution than waiting for the NIC vendor to release a new driver.

                                                  With your post, you basically reinforced the superiority of the VMware driver model.

                                                  1. Stu Fox’s avatar

                                                    So what you’re saying is that it’s better to wait for VMware to release a patch for the entire hypervisor layer than it is to wait for the manufacturer of the device to release a driver level patch? Either way you’re waiting for someone to release a fix, it’s just a question of who. You have no guarantee either way of this happening quickly – unless you have evidence to the contrary?

                                                    Just looking at the Broadcom drivers for Windows Server 2008R2 shows that they release driver updates reasonably regularly (monthly it would appear) so it’s not like you’re waiting long.

                                                    I’m sure there are positives and negatives to both approaches, and as usual the truth is a casualty.

                                                    1. tb’s avatar

                                                      If Broadcom releases updates for their drivers that often, one would question why they didn’t get them stable in the first place (and still haven’t?).

                                                      For me, being a n00b in the Hyper-V area (not trusting Microsoft in the first place), is there anything better than the WHQL to ensure stability?
                                                      How do I find the best hardware for my Hyper-V machine?
                                                      I wouldn’t want to invest in a system just to find that my hypervisor crashes once in a while (same goes for ESX, but they have recommended systems and control over their own driver updates).

                                                      If I buy two identical machines, but install them with a month’s difference, I could get different versions of drivers, causing issues on one, but not the other.
                                                      How would I know which one is the right one?

Comments are now closed.