By James Bagley, Grid Computing / SW Developer, Mentor Graphics

Editor’s note: This article is part of an ongoing series of guest posts by HP Software customers about Automation and Cloud Management use cases.

Mentor Graphics works in Electronic Design Automation (EDA), making tools for designing, testing and verifying computer processors, printed circuit boards, automotive electrical systems such as infotainment, and even the wire routing for a modern jumbo jet.

One of Mentor’s products allows designers to virtualize and simulate a computer processor, send it commands and examine the output. Considering that the latest CPU chip designs have about 5 billion transistors, the cost of tooling up to make one is pretty significant; you want to be reasonably sure it’s going to work first.

Not surprisingly, these simulations and verifications are very CPU- and memory-intensive. Other types of regression testing take only a small amount of resources per test, but the tests number in the hundreds of thousands.

That’s why Mentor (and most of our customers) use some variant of grid computing to manage these engineering processes and get the right resources to the right process.

So. Many. Operating Systems.

But the broad range of large and small tasks also requires a broad range of small and large computers in our data center, and compatibility testing dictates that we also keep a fairly broad range of operating systems available. Finally, to get the broadest support for a compiled binary, you need to compile on the oldest available OS. As a result, we are often bumping up against both the oldest OS we can still get support for and the latest bleeding-edge OS available from vendors.

Our computing environment has 50 unique hardware models in one data center, not including Sun or IBM (non-Intel) systems, with 31 unique operating systems and 35 different configurations. Altogether, that gives us more than 50,000 possible combinations (50 × 31 × 35 = 54,250).

Unreliable OS Provisioning

The challenge we faced was in the ability of automated OS provisioning tools to successfully complete their tasks. Initially, our efforts in automating OS provisioning experienced unacceptable failure rates due to the inability of the tool or process to establish compatibility between hardware and software. When a critical-path deployment for a high-profile division was blocked by issues with the system, my boss came to me and said, “Just make it work.” (That’s a very dangerous thing to say to a programmer like myself.)

So what was causing the high failure rate?

I found technicians were selecting operating systems in HP Server Automation (SA) that were not compatible with the hardware being provisioned. So in some ways, the failure rate was artificially high since we would see 3-5 failures for one host as they hunted around trying one OS after another until one worked.

My programmer background told me that this was a pretty inefficient way of deriving compatibility. Surely I could come up with some way to relate operating systems to hardware models and avoid the error in the first place.

We worked up some new hardware compatibility logic in the form of a dynamic dropdown menu system that would offer only the operating systems compatible with the hardware being provisioned. Figure 1 (below) shows the SQL schema we came up with. “mp” in the bottom right stands for “management port”; mptype would be HP, Dell, VM or IPMI, while “productName” is the server model name.

(Note the “gfs_boot_minutes” and “osbp_minutes” fields in the bottom left of the graphic; they come into play later when we talk about timeouts.)

 


 

Fig. 1: Mentor Graphics SQL schema to relate operating systems to hardware.
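
For concreteness, here is a minimal sketch of what such a schema could look like, assuming SQLite. Apart from mptype, productName, gfs_boot_minutes and osbp_minutes, which appear in Figure 1, every name below is a hypothetical placeholder, not our actual schema.

```python
import sqlite3

# Hypothetical sketch of a compatibility schema. Only mptype, productName,
# gfs_boot_minutes and osbp_minutes come from Figure 1; everything else is
# a placeholder name.
conn = sqlite3.connect("os_compat.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS hardware (
    hw_id       INTEGER PRIMARY KEY,
    productName TEXT NOT NULL,   -- server model name
    mptype      TEXT NOT NULL    -- management port type: HP, Dell, VM or IPMI
);

CREATE TABLE IF NOT EXISTS os (
    os_id   INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    version TEXT NOT NULL
);

-- One row per validated hardware/OS pair, with observed timings.
CREATE TABLE IF NOT EXISTS compatibility (
    hw_id            INTEGER NOT NULL REFERENCES hardware(hw_id),
    os_id            INTEGER NOT NULL REFERENCES os(os_id),
    gfs_boot_minutes REAL,       -- boot-phase duration (see Figure 1)
    osbp_minutes     REAL,       -- OS build plan duration
    PRIMARY KEY (hw_id, os_id)
);
""")
conn.commit()
```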

This hardware compatibility logic helped us achieve a 94.35 percent success rate, with only 5.65 percent of deployments failing.

That is clearly a big improvement!

Populating an OS Compatibility Database

In order to develop this dynamic dropdown system, we created a database. But how do you populate or maintain an OS compatibility database? Nobody would really want the job of hand-entering 50,000 possible combinations; that’s like one full-time employee doing data entry for a year!

Instead, we populated the database using a validation process in Server Automation. Validation is a sort of regression test: each combination is actually provisioned, and each successful installation is recorded in the database.
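
A minimal sketch of that recording step, using the hypothetical schema above; the function name, its arguments and the upsert behavior are my assumptions for illustration, not the actual SA hook:

```python
def record_validation(conn, hw_id, os_id, gfs_boot_minutes, osbp_minutes):
    """Record one successful install observed by the SA validation run.

    Re-running validation simply refreshes the timings for a known pair.
    """
    conn.execute(
        """
        INSERT INTO compatibility (hw_id, os_id, gfs_boot_minutes, osbp_minutes)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (hw_id, os_id) DO UPDATE SET
            gfs_boot_minutes = excluded.gfs_boot_minutes,
            osbp_minutes     = excluded.osbp_minutes
        """,
        (hw_id, os_id, gfs_boot_minutes, osbp_minutes),
    )
    conn.commit()
```

Failed installs never produce a row, so the absence of a pair is itself the compatibility signal.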

The validation also recorded various time measurements. This turned out to be a side benefit that addressed another problem: a technician would typically need to wait for a deployment to fail via timeout before trying again. Now we use the recorded times to create timeout values appropriate for whatever hardware and operating system combination is being deployed. This has helped reduce wait times in cases where the process was going to fail for other reasons anyway.
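
One plausible way to turn those measurements into per-pair timeouts; the padding factor and fallback default here are illustrative assumptions, not our actual policy:

```python
def timeout_minutes(conn, hw_id, os_id, padding=1.5, default=120):
    """Derive a deployment timeout for one hardware/OS pair.

    Padding the recorded time lets a slightly slow run still finish, while
    a genuinely stuck deployment fails far sooner than a one-size-fits-all
    timeout would allow.
    """
    row = conn.execute(
        "SELECT gfs_boot_minutes + osbp_minutes FROM compatibility"
        " WHERE hw_id = ? AND os_id = ?",
        (hw_id, os_id),
    ).fetchone()
    if row is None or row[0] is None:
        return default  # unvalidated pair: fall back to a generic timeout
    return row[0] * padding
```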

Time to Provision

A second side benefit is that the regression helped us measure the speed of provisioning for each hardware/OS pair. We found that a full regression test of every OS typically takes between 10 and 12 hours. When it finishes, whoever is running the test gets an email that looks like Figure 2.


Fig. 2: Mentor Graphics OS Build Plan Regression Results

This simple report shows which operating systems work on which hardware, and also the reverse: which hardware is compatible with which operating system. Now we can use that pairing data to suggest the right hardware for management to purchase. We also manage inventory by capacity, so we can deliver different service levels in terms of time to provision, and pick the right pair based on time-to-market needs.
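
In query terms, both views fall out of the same pairing table. Here is a sketch against the hypothetical schema above; sorting the reverse view by provisioning time reflects the service-level point:

```python
def compatible_os(conn, product_name):
    """Operating systems validated on a given server model: this is the
    data the dynamic dropdown draws from."""
    return conn.execute(
        """
        SELECT os.name, os.version
        FROM compatibility c
        JOIN hardware h ON h.hw_id = c.hw_id
        JOIN os ON os.os_id = c.os_id
        WHERE h.productName = ?
        ORDER BY os.name, os.version
        """,
        (product_name,),
    ).fetchall()


def compatible_hardware(conn, os_name, os_version):
    """The reverse view: server models validated for a given OS, fastest
    to provision first -- handy when advising on purchases."""
    return conn.execute(
        """
        SELECT h.productName,
               c.gfs_boot_minutes + c.osbp_minutes AS minutes
        FROM compatibility c
        JOIN hardware h ON h.hw_id = c.hw_id
        JOIN os ON os.os_id = c.os_id
        WHERE os.name = ? AND os.version = ?
        ORDER BY minutes
        """,
        (os_name, os_version),
    ).fetchall()
```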

With a little ingenuity, we have been able to substantially improve OS provisioning reliability and make valuable use of timeout values and time-to-provision data, all using HP Server Automation.

Learn more

Read more about HP Server Automation

Read the other blogs in this series:

About the author: James Bagley is an accomplished software developer and IT professional at Mentor Graphics. With a background in both programming and system administration, James has the experience to develop automation for IT.