VMware vCenter really needs an agent

One of the great features of VMware vCenter Server (a.k.a. VirtualCenter, or VC) is the performance charts for virtual machines and other objects in the inventory.  Here is a typical chart:

VMware vCenter Server performance chart

Managing the data behind these charts is actually a very demanding responsibility of vCenter: receiving, storing, and later aggregating to keep the database size under control.  Starting with VC 2.5, a couple of big changes were made in this area to help with scalability: the schema was modified to divide stats data into multiple tables, and the rollup duty (data aggregation) was delegated to the database server’s native job scheduling mechanism.  This article focuses on SQL Server and the SQL Server Agent; there is an equivalent capability in Oracle as you would suspect.

SQL Server Agent vCenter jobs

The VC 2.5 installer creates a set of three jobs (for past day, month, and week) and throws up a notice reminding you to make sure the SQL Server Agent is running on your database server — for good reason, ignore it at your own peril.  You can see the jobs in the SQL Server Management Studio.

You’ll notice in this example there are actually three instances of each of the three jobs because there are multiple vCenter databases hosted on this SQL Server.  While not typical, it makes perfect sense in a test environment.

Backstory In the initial VC 2.5 release, the jobs always had a static name, thus making it impossible to run multiple vCenter Servers against one SQL Server instance.  I knew that was going to annoy a lot of people, so I filed a bug to have it changed — finally got it in 2.5 Update 2.

What happens if the SQL Server Agent is stopped and these rollup jobs quit running?  A picture is worth a thousand words:

SQL Agent jobs disabled

In this chart, the abrupt increase in database rows corresponds to the point in time that the rollup jobs were disabled.  vCenter continues receiving performance stats from hosts and dutifully depositing them in VPX_HIST_STAT1.

Essentially, the most granular data (VPX_HIST_STAT1) starts piling up, never aggregating into the summarized tables for longer-term storage.  In a large environment, it is not uncommon to see millions of rows in these tables.  If something goes wrong with the rollup jobs, those millions start multiplying in a hurry.

One nice thing about having the rollups controlled by the SQL Agent is that VC does not even need to be running for the process to take place, which also takes some load off of the VC server.  After resuming the jobs, the data flood quickly subsides and things go back on track:

SQL Agent jobs resumed

One note about SQL Server Express, the bundled database included with VC: There is no SQL Agent in that product.  Before you panic, please realize that VMware engineers know this.  For those deployments, the rollup stored procedures are simply kicked off by vCenter Server (vpxd), just like VC 2.0.x.  I could not believe the creative workarounds some well-meaning folks have attempted to implement because of this situation — it had something to do with the Windows task scheduler.  Please do not do that!

Hopefully now you have a little better understanding of the relationship between your database job scheduler, vCenter Server, and the size of your vCenter database.

Tags: , , ,

1 comment

Comments are now closed.