Microsoft: Thinking small for resiliency

Microsoft’s David Gauthier spends a lot of time thinking about service availability and how physical infrastructure contributes to it. As Director of Data Center Architecture for Microsoft’s Global Foundation Services, Gauthier is responsible for the technical strategy and direction of Microsoft’s global data center footprint.

While designing some of the world’s largest data centers, Gauthier is always thinking about “failing small”. This is because, he says, scale can breed complexity. “When you are running hundreds of thousands of servers, equipment failure is a normal operating condition. At any given moment, we can have a thousand or more machines in a failed state. Our cloud services are engineered with this in mind to ensure that we can deliver on our SLAs,” Gauthier says.

Microsoft has been running data centers since 1989 and large-scale online services since 1995, so it is no stranger to the traditional enterprise IT space. “In the old days, we subscribed to the ‘mission critical’ enterprise mindset. Fault-tolerant and concurrently maintainable facilities, redundant power supplies, hot-swap everything. With hardware redundancy, a developer could scale up his workload on a small number of (very expensive) machines.” But this can lead to a false sense of security, Gauthier says.

“My machines are in a Tier IV data center, they’re totally protected, right? Maybe not.” Gauthier says despite investment in hardware redundancy, failures and outages can still occur.

“Complex maintenance routines with myriad electrical wrap-arounds are a recipe for human errors. Scale-up servers are an easy target for Distributed Denial of Service (DDoS) attacks and large L2 network domains can DDoS themselves with broadcast storms.”

The answer lies in resilient software. “There is a core imperative for CIOs and IT pros to recognize that no amount of money will abate hardware failures or human errors. As cloud providers and a new generation of developers embrace this, service availability is increasingly engineered at the software platform and application level.”

Rather than scaling up hardware with faster processors and more RAM, cloud platforms such as Microsoft’s Windows Azure enable seamless scale-out of applications across many commodity servers in multiple data centers.

“PaaS and IaaS services like Azure abstract the physical hardware away from the application software, enabling developers to create resilient software abstracted from the vagaries of the physical world,” Gauthier says.

“The telemetry and tools available today to debug software are several orders of magnitude more advanced than even the best data center commissioning program or standard operating procedure. Software error handling routines can resolve an issue far faster than a human with a crash cart. During a major storm, smart algorithms can decide in the blink of an eye to migrate users to another data center because it is less expensive than starting the generators.”

Getting smart with apps
In a hardware-abstracted environment, Gauthier thinks there is room for the data center to become an active participant in the real-time availability decisions made in software.

“In the Cloud, applications should really be able to understand the context of their environment. Smartly engineered apps can migrate around different machines and different data centers almost at will, but the availability of the service is dependent on how that workload is placed on top of the physical infrastructure. There needs to be a GPS guiding the workload to the right physical destination. Data centers, servers and networks need to be engineered in a way that deeply understands failure and maintenance domains to eliminate the risk of broadly correlated failures within the system,” Gauthier says. “Narrowly correlated failures, uncorrelated failures? Now we’re talking fail small.”
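To make the placement idea concrete, here is a minimal sketch of spreading a workload's replicas across failure domains so that losing any one domain takes out as few replicas as possible. It is an illustration only, not Microsoft's placement logic: the function names `place_replicas` and `worst_case_loss`, the server names and the domain labels are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def place_replicas(servers: Dict[str, str], replicas: int) -> List[str]:
    """Pick `replicas` servers, using as many distinct failure domains as possible.

    `servers` maps a (hypothetical) server name to the failure domain it sits in,
    for example the power feed or network segment it shares with its neighbours.
    """
    if replicas > len(servers):
        raise ValueError("not enough servers for the requested replica count")

    # Group servers by failure domain, in a deterministic order.
    by_domain: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(servers):
        by_domain[servers[name]].append(name)

    # Round-robin over the domains: a domain only receives another replica
    # after every domain that still has a free server has been visited.
    chosen: List[str] = []
    domains = sorted(by_domain)
    while len(chosen) < replicas:
        for domain in domains:
            if by_domain[domain] and len(chosen) < replicas:
                chosen.append(by_domain[domain].pop(0))
    return chosen


def worst_case_loss(servers: Dict[str, str], chosen: List[str]) -> int:
    """Largest number of chosen replicas lost if a single failure domain goes down."""
    counts: Dict[str, int] = defaultdict(int)
    for name in chosen:
        counts[servers[name]] += 1
    return max(counts.values())


# Hypothetical inventory: four servers behind three power feeds.
inventory = {"s1": "feed-a", "s2": "feed-a", "s3": "feed-b", "s4": "feed-c"}
placement = place_replicas(inventory, 3)
print(placement, worst_case_loss(inventory, placement))
```

With three replicas and three domains, each replica lands in its own domain, so a single electrical or network failure costs at most one replica. A real scheduler would also have to weigh capacity, maintenance windows and overlapping domain types, which this sketch ignores.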

Fail small is a straightforward concept. “Accept that things will fail and work to constrain that failure to the smallest impact domain that is economically reasonable. In Microsoft’s older enterprise data centers, the electrical failure domain was around 300 servers. In some of our newer data centers with high degrees of software resiliency, it’s over 2,000.”

“Once you start managing service availability in software and aligning to small failure domains, you start seeing things that are no longer necessary from your hardware. This shift offers an inflection point to re-examine how data centers are designed and operated. The first thing to go is the redundancy inside the server. The second, the cord to the rack and a bunch of electrical maintenance complexity. Then maybe you handle higher temperatures and save a ton of energy by curbing usage on the hottest days.”

As the software gets more robust and resilient, more CAPEX and OPEX efficiencies start showing up. “It’s a virtuous cycle – hardware fails, software improves to mitigate hardware dependencies, hardware simplified, costs saved, rinse and repeat,” Gauthier says.

Sometimes, traditional pieces of data center hardware disappear entirely. “We actually have a significant number of megawatts that are not backed up by diesel generators and have been running successfully for over three years. We’ve lost utility a couple of times, but the resiliency of the software ensured that the end-user never noticed.”

Microsoft’s infrastructure is informed by a robust TCO model that balances a multi-tiered view of failure correlations across electrical, mechanical, network, fire, weather and EPO domains with relevant industry and historical MTBF and MTTR data.
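The article does not describe the model itself, but the textbook relationship between MTBF, MTTR and availability gives a feel for the arithmetic such a model builds on. The sketch below uses the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The helper names and all figures are hypothetical, and the redundancy calculation assumes independent failures, which is exactly the assumption that correlated failure domains break.

```python
import math


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def in_series(*parts: float) -> float:
    """Availability of a chain in which every part must be up."""
    return math.prod(parts)


def redundant(*parts: float) -> float:
    """Availability when any one part suffices.

    Assumes the parts fail independently; a shared (correlated) failure
    domain makes the real figure worse than this estimate.
    """
    return 1.0 - math.prod(1.0 - a for a in parts)


# Hypothetical figures, for illustration only.
utility_feed = availability(mtbf_hours=4_000, mttr_hours=4)
server = availability(mtbf_hours=50_000, mttr_hours=24)

single_site = in_series(utility_feed, server)
two_sites = redundant(single_site, single_site)
print(f"one site:  {single_site:.6f}")
print(f"two sites: {two_sites:.9f}")
```

Under these assumptions, running the service from two independent sites raises availability more than hardening a single site would, which is in line with the article's argument for handling failure in software across data centers.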

“We view everything inside the data center as a converged system and as such, we’ve got a different mindset on how to optimize it. By leveraging investments in software resiliency and continuous improvement, we’ve avoided significant capital investments, driven our energy efficiency up and reduced carbon footprint.”

David will be speaking at DatacenterDynamics Converged London on Day 2 (15 November) in Hall 1 at 9:30. More London articles from FOCUS 26 appear in the digital edition.
