Datacenters: The Need For A Monitoring Framework

For proper use of, and collaboration between, BMS, DCIM, CMDB and similar systems, the use of an architectural framework is recommended.

CONTEXT

A datacenter is basically a value stack: a supply chain of stack elements where each element is a service component (people, process and technology that together add up to a service). For each element in the stack the IT organization has to assure the quality that was agreed on. In essence these quality attributes are performance/capacity, availability/continuity, confidentiality/integrity, and compliance, and nowadays also sustainability. One of the greatest challenges for the IT organization was and is to coherently manage these quality attributes for the complete service stack or supply chain.

Currently a mixture of management systems is used to manage the datacenter service stack: BMS (Building Management Systems), DCIM (Data Center Infrastructure Management), CMDB (Configuration Management Database), and system & network management systems.

GETTING RID OF THE SILOS

As explained in “Datacenters: blending BIM, DCIM, CMDB, etc.” we are still talking about working in silos, where each participant involved in the life cycle of the datacenter uses its own information sets and systems. To achieve real overall improvements (instead of local optimizations), better collaboration and information exchange between the different participants is needed.

FRAMEWORK

To steer and control datacenter usage successfully, a monitoring system must be in place. Accepting the fact that the participants use different systems, we have to find a way to improve the collaboration and information exchange between those systems. Therefore we need some kind of reference: an architectural framework.

When designing an efficient monitoring framework, it is important to assemble a coherent system of functional building blocks or service components. Loose coupling and strong cohesion, encapsulation, and the use of the Facade and Model–View–Controller (MVC) patterns are strongly recommended because of the many proprietary solutions that are involved.
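As a minimal sketch of this encapsulation idea (the class and method names below are hypothetical, not taken from any particular product), a Facade can hide a vendor-specific metering API behind a standard ‘Facility usage service’ interface:

```python
from abc import ABC, abstractmethod

# Standard interface every 'Facility usage service' exposes (assumed names).
class FacilityUsageService(ABC):
    @abstractmethod
    def power_usage_kw(self) -> float:
        """Current power draw in kW."""

# Hypothetical proprietary vendor API with its own units and call style.
class AcmeMeterClient:
    def read_register(self, register: str) -> int:
        return 12500  # e.g. watts, as the vendor device reports them

# Facade: encapsulates the vendor specifics behind the standard interface.
class AcmePowerFacade(FacilityUsageService):
    def __init__(self, client: AcmeMeterClient):
        self._client = client

    def power_usage_kw(self) -> float:
        return self._client.read_register("active_power") / 1000.0

if __name__ == "__main__":
    service = AcmePowerFacade(AcmeMeterClient())
    print(service.power_usage_kw())  # 12.5
```

Reporting applications only ever see the standard interface, so a metering vendor can be swapped without touching the rest of the framework.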

BUILDING BLOCKS

Based on an earlier blog about energy monitoring, a short description of the most common building blocks is given below:

  • Most vendors have their own proprietary APIs to interface with the metering devices. Because metering differs within and between data centers, these differences should be encapsulated in standard ‘Facility usage services’: services for the primary, secondary and tertiary power supply and usage, the cooling, and the air handling.
  • For the usage of the IT infrastructure (servers, storage and network components) we have the same kind of issues, so the same recipe, encapsulation of proprietary APIs in standard ‘IT usage services’, must be used.
  • Environmental conditions outside the data center, the weather, influence the data center, so proper information about them must be made available by a dedicated Outdoor service component.
  • For a specific data center a DC Usage Service Bus must be available to provide a common interface for exchanging usage information with reporting systems.
  • The DC Data Store is a repository (Operational Data Store or Data Warehouse) for datacenter usage data across data centers.
  • The Configuration Management Database(s) (CMDB) is a repository with the system configuration information of the facility infrastructure and the IT infrastructure of the data centers.
  • The Manufacturers’ specification databases store the specifications/claims of components as provided by the manufacturers.
  • The IT capacity database stores the capacity (processing power and storage) that is available for a certain time frame.
  • The IT workload database stores the workload (processing power and storage) that must be processed in a certain time frame.
  • The DC Policy Base is a repository with all the policies, rules, targets and thresholds concerning datacenter usage.
  • The Enterprise DC Usage Service Bus provides a common interface for exchanging the policies, workload, capacity, CMDB, manufacturers’ and usage information of the involved data centers with reporting systems.
  • The Composite services deliver different views and reports of the energy usage by assembling information from the different basic services by means of the Enterprise Bus.
  • The DC Usage Portal is the presentation layer for the different stakeholders who want to know something about the usage of the datacenter.

 DC Monitoring Framework

ARCHITECTURE APPROACH

Usage of an architectural framework (reference architecture) is a must to get a monitoring environment working. The modular approach, focused on standard interfaces, gives the opportunity to “rip and replace” components. It also gives the possibility to extend the framework with other service components. The service bus provides a standard exchange of data (based on messages) between the applications and prevents the creation of dedicated, proprietary point-to-point communication channels. To get this framework working, a standard data model is also mandatory.
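As a minimal sketch of such a standard data model and message-based exchange (the field names and the bus topic are assumptions for illustration, not an existing standard), every service could publish its usage readings in one common format on the DC Usage Service Bus:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Assumed common message format shared by all usage services.
@dataclass
class UsageReading:
    datacenter: str       # which data center the reading belongs to
    component: str        # e.g. "cooling", "ups", "server-rack-12"
    metric: str           # e.g. "power_kw", "inlet_temp_c"
    value: float
    timestamp: str        # ISO 8601, UTC

def publish(bus_topic: str, reading: UsageReading) -> str:
    """Serialize a reading as it would be put on the service bus."""
    message = json.dumps({"topic": bus_topic, "payload": asdict(reading)})
    # A real implementation would hand this message to the bus middleware here.
    return message

if __name__ == "__main__":
    reading = UsageReading(
        datacenter="DC-Amsterdam-1",
        component="cooling",
        metric="power_kw",
        value=42.7,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    print(publish("dc.usage", reading))
```

Because every producer uses the same message format, consumers such as the DC Data Store or the Composite services do not need to know which proprietary system produced the reading.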

Amazon post mortem on Dublin DC outage reconfirms domino theory

As promised, Amazon released a post mortem report on the data center outage in Dublin. It reconfirmed the domino theory: if one availability zone comes under the influence of errors and outages, other availability zones can follow in a domino effect.

Amazon stated, as mentioned in an earlier blog entry about the Dublin outage, that the utility provider now believes it was not a lightning strike that brought down the 110kV 10 megawatt transformer. This outage was the start of a series of incidents that would finally bring the Dublin data center to its knees.

“With no utility power, and backup generators for a large portion of this Availability Zone disabled, there was insufficient power for all of the servers in the Availability Zone to continue operating. Uninterruptable Power Supplies (UPSs) that provide a short period of battery power quickly drained and we lost power to almost all of the EC2 instances and 58% of the EBS volumes in that Availability Zone. We also lost power to the EC2 networking gear that connects this Availability Zone to the Internet and connects this Availability Zone to the other Availability Zones in the Region.”

Within 24 minutes Amazon “were seeing launch delays and API errors in all (emphasis by Infrarati) EU West Availability Zones.” The reason for this was: “The management servers which receive requests continued to route requests to management servers in the affected Availability Zone. Because the management servers in the affected Availability Zone were inaccessible, requests routed to those servers failed. Second, the EC2 management servers receiving requests were continuing to accept RunInstances requests targeted at the impacted Availability Zone. Rather than failing these requests immediately, they were queued and our management servers attempted to process them.”

“Fairly quickly, a large number of these requests began to queue up and we overloaded the management servers receiving requests, which were waiting for these queued requests to complete. The combination of these two factors caused long delays in launching instances and higher error rates for the EU West EC2 APIs.” Later on, Amazon was able to restore power to enough of the network services to re-connect. However, they then found that their database cluster was in an unstable condition. The final blow was that “Separately, and independent from issues emanating from the power disruption, we discovered an error in the EBS software that cleans up unused storage for snapshots after customers have deleted an EBS snapshot.”
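The queueing behaviour described above is the crux of the domino effect. As a purely hypothetical sketch (this is not Amazon’s control plane, just an illustration of fail-fast versus queueing requests for an unavailable zone):

```python
from collections import deque

# Hypothetical illustration of the failure mode described above:
# queueing requests for an unavailable zone instead of failing fast.
class ManagementServer:
    def __init__(self, fail_fast: bool, max_queue: int = 1000):
        self.fail_fast = fail_fast
        self.queue = deque()
        self.max_queue = max_queue
        self.available_zones = {"eu-west-1a", "eu-west-1b"}  # third zone is down

    def run_instances(self, zone: str) -> str:
        if zone in self.available_zones:
            return "launched"
        if self.fail_fast:
            return "error: zone unavailable"   # caller gets an immediate answer
        # Otherwise the request is queued and ties up server resources.
        if len(self.queue) >= self.max_queue:
            return "error: management server overloaded"
        self.queue.append(zone)
        return "queued"  # the queue only grows while the zone stays down

if __name__ == "__main__":
    server = ManagementServer(fail_fast=False)
    print(server.run_instances("eu-west-1c"))  # "queued": work piles up instead of failing
```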

In their description of actions to prevent recurrence Amazon stated that “Over the last few months, we have been developing further isolation of EC2 control plane components (i.e. the APIs) to eliminate possible latency or failure in one Availability Zone from impacting our ability to process calls to other Availability Zones.” (emphasis by Infrarati).

The Dublin incident shows that Amazon is still developing and improving the isolation between availability zones. Services in one zone are not yet safeguarded from incidents in other availability zones. That is a ‘must know’ for the customer instead of a ‘nice to know’.

Amazon and Microsoft in Dublin down, resiliency of cloud computing

A lightning strike in Dublin, Ireland, has caused downtime for many sites using Amazon’s EC2 cloud computing platform, as well as for users of Microsoft’s BPOS.
Amazon said that lightning struck a transformer near its data center, causing an explosion and fire that knocked out utility service and left it unable to start its generators, resulting in a total power outage.
Some quotes from the Amazon dashboard (Amazon Elastic Compute Cloud (Ireland)):

11:13 AM PDT We are investigating connectivity issues in the EU-WEST-1 region.

3:01 PM PDT A quick update on what we know so far about the event. What we have is preliminary, but we want to share it with you. We understand at this point that a lighting strike hit a transformer from a utility provider to one of our Availability Zones in Dublin, sparking an explosion and fire. Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators. The transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronizes the backup generator plant, disabling some of them. Power sources must be phase-synchronized before they can be brought online to load. Bringing these generators online required manual synchronization. We’ve now restored power to the Availability Zone and are bringing EC2 instances up. We’ll be carefully reviewing the isolation that exists between the control system and other components. The event began at 10:41 AM PDT with instances beginning to recover at 1:47 PM PDT.

Notice the 30-minute difference between the first issue message on the dashboard (11:13 AM) and the stated start of the event (10:41 AM).

11:04 PM PDT We know many of you are anxiously waiting for your instances and volumes to become available and we want to give you more detail on why the recovery of the remaining instances and volumes is taking so long. Due to the scale of the power disruption, a large number of EBS servers lost power and require manual operations before volumes can be restored. Restoring these volumes requires that we make an extra copy of all data, which has consumed most spare capacity and slowed our recovery process. We’ve been able to restore EC2 instances without attached EBS volumes, as well as some EC2 instances with attached EBS volumes. We are in the process of installing additional capacity in order to support this process both by adding available capacity currently onsite and by moving capacity from other availability zones to the affected zone. While many volumes will be restored over the next several hours, we anticipate that it will take 24-48 hours (emphasis made by Infrarati) until the process is completed. In some cases EC2 instances or EBS servers lost power before writes to their volumes were completely consistent. Because of this, in some cases we will provide customers with a recovery snapshot instead of restoring their volume so they can validate the health of their volumes before returning them to service. We will contact those customers with information about their recovery snapshot.

Remarkably, as stated in the Irish Times, a spokeswoman of the Electricity Supply Board (ESB), Ireland’s premier electricity utility, said the incident occurred at 6:15 PM on Sunday and caused a power outage in the area for about an hour. However, she said power to Amazon was interrupted for “less than a second before an automatic supply restoration kicked in.”

Microsoft doesn’t use a public dashboard, but their Twitter feed stated “on Sunday 7 august 23:30 CET Europe data center power issue affects access to #bpos”. Then, 4 hours later, there was the tweet “#BPOS services are back online for EMEA customers”. It is a pity that there is no explanation of how their data center went down as well. Was it the same cause as the one that brought the Amazon data center down?

The idea of cloud computing is basically that the offered services are location independent. The customer doesn’t have to worry about, or even know, in which location the services are produced. He doesn’t even have to know how the services are provided (the inner workings of the provided services).
The incident in Dublin shows that, at the moment, these assumptions are wrong. As a customer of cloud computing services you still need a good understanding of the location and working of the provided services to get a good understanding of the risks at stake in terms of resiliency and business continuity. Only then can you make proper choices about how cloud computing services can help your organization or business without business continuity surprises. Proper risk management when using cloud computing services deserves better attention.

UPDATE 10 August

Now, three days later, Amazon is still struggling with recovery, as shown on their dashboard. InformationWeek has an interesting article about the complexity of failover design. According to this article:

It’s still possible that having the ability to fail-over to a second availability zone within the data center would have saved a customer’s system. Availability zones within an Amazon data center typically have different sources of power and telecommunications, allowing one to fail and others to pick up parts of its load. But not everyone has signed up for failover service to a second zone, and Amazon spokesman Drew Herdener declined to say whether secondary zones remained available in Dublin after the primary zone outage.(emphasis made by Infrarati)

UPDATE 11 August

ESB Networks confirmed that it suffered a failure in a 110kV transformer in City West, Dublin, at 6:16 p.m. local time on Sunday, August 7. The Irish Times revealed some new information about the outage.

“The cause of this failure is still being investigated at this time, but our initial assessment of lightning as the cause has now been ruled out,” a statement from ESB Networks said. The ESB also said on Monday that Amazon was interrupted for “less than a second before an automatic supply restoration kicked in”. Yesterday the ESB confirmed that Amazon was one of about 100 customers affected by the ongoing service interruption. “This initial supply disruption lasted for approximately one hour as ESB Networks worked to restore supply. There was an ongoing partial outage in the area until 11pm. The interruption affected about 100 customers in the Citywest area, including Amazon and a number of other data centers.”

A second Amazon data centre in south Dublin experienced a “voltage dip which lasted for less than one second”, the ESB said yesterday. However, this data centre was not directly affected by the power cut. This one-second voltage dip that had been cited added to the confusion about Sunday’s events.

ESB made it clear that in referencing a lightning strike, Amazon was sharing its best information at the time. “Amazon accurately reported the information which had been passed to them from workers at the site,” said Marguerite Sayers, Head of Asset Management at ESB Networks. “Both the explosion and fire were localized to the bushings or insulators of the transformer and did not require the attendance of the emergency services. The extent of associated internal damage to the transformer was serious and resulted in supply interruption to a number of customers, and also impacted Amazon’s systems, as they have reported.”

The article confirms the statement that you, as a customer, need a good understanding of the location and working of the provided services to get a good understanding of the risks at stake in terms of resiliency and business continuity. The question arises, however, to what level you must have this knowledge. It looks like the outage in Dublin is not only about IT design but also about facility engineering. So how far does the responsibility of the customer extend and where does the responsibility of the provider begin?

Datacenters, an architecture approach Part II

Recap of part 1 of Datacenters, an architecture approach

“Improving a data center starts with a proper understanding of the business model and the business architecture that is being used and not by a simple roll out of the newest technology.”

To meet the business and technology needs of a data center, an architecture development method is needed. To take an architecture approach in constructing a data center you first have to start with:

  1. A proper comprehension of the business model that is being used …
  2. then you can formulate the business architecture and
  3. finally you can start with designing the IT architecture and defining a technical design and finding appropriate products.

Business Architecture

As a follow-up, we dive into the business architecture of a data center. In making the business model, as discussed earlier, a general statement on the value proposition was made. Therefore we now zoom in on this value proposition that was defined in the business model. But before we start we have to take some current developments into account. The widespread adoption of a service-oriented business approach, by means of virtualization technology and service-oriented architecture (SOA), has a profound impact on the data center. It is therefore wise to describe a service-oriented data center offering in terms of cloud computing service models. Even if the proposed data center is for internal use, it is still wise to use the general cloud computing service models. NIST has made the following draft service model definitions:

  • Cloud Software as a Service (SaaS). The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based email). The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
  • Cloud Platform as a Service (PaaS). The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.
  • Cloud Infrastructure as a Service (IaaS). The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).

We can add that in practice Platform as a Service currently has two flavours: a simple one, where the platform is a standardized operating system, and an extended one, where a standardized database and/or middleware environment is also offered.
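As a minimal sketch of how these service models differ in who manages which layer of the stack (the layer names and the mapping are a simplification for illustration, including the extended PaaS and Colocation variants discussed in this series, and are not part of the NIST text):

```python
# Illustrative mapping of service models to the stack layers the consumer
# still manages; a simplification of the NIST definitions quoted above.
STACK = ["facility", "network", "servers", "storage", "os", "middleware", "application"]

CONSUMER_MANAGED = {
    "Colocation": ["network", "servers", "storage", "os", "middleware", "application"],
    "IaaS": ["os", "middleware", "application"],
    "PaaS (simple)": ["middleware", "application"],
    "PaaS (extended)": ["application"],
    "SaaS": [],  # only limited, user-specific configuration settings
}

def provider_managed(model: str) -> list[str]:
    """Layers the provider manages for a given service model."""
    return [layer for layer in STACK if layer not in CONSUMER_MANAGED[model]]

if __name__ == "__main__":
    for model in CONSUMER_MANAGED:
        print(model, "-> provider manages:", provider_managed(model))
```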

Another interesting cloud computing document used as input describes a value framework. The Value Framework, as described in “Do Clouds Compute? A Framework for Estimating the Value of Cloud Computing” by M. Klems et al., presents an easy and very generally usable framework for cloud computing. I have been working on the idea that, with some small changes and modifications, it can also be used to describe different kinds of data center value creation. This modified framework, see below, forms the business architecture of a data center.

Data Center Business Architecture Framework

It starts with defining and explaining the business domain; this forms the link between the customer one has in mind and the value proposition. Then we get to the business objectives: is it about reducing costs, reducing the time to create value, or increasing the quality and functions of the data center? In other words, what is the main business objective we want to achieve? We also have to pay attention to the demand behavior: can we predict the demand of the customer in some way or not? The next question to be answered is which kind of service model you want to provide. There we use the NIST service model definitions, with two extensions: the extended PaaS model and the Colocation service. So is it just square meters (Colocation as a Service) or are you offering Software as a Service? Each standard service offering should have a well-defined interface. Beware that the consumer does not manage or control the underlying components of the platform that is provided. The consequence is that by choosing one of the service models you are in fact ruling out certain customers, because they want to manage or control certain components. The next step is to define the so-called “technical requirements”. At this level of abstraction it is all about the service and/or quality you want to offer, so you have to be explicit about capacity/performance, availability, continuity, security and, last but not least, sustainability demands.

Now that we have all this information we can define the computing service: its function, its qualities and its quantities. Depending on the proposed service model you define your needs for software, processing power, storage, network and facility (power, cooling, etc.) services, where ‘service’ stands for the sum of people, process & technology. This part forms the input for the resource costs. Here we can differentiate between capital expenditure (CAPEX) and operational expenditure (OPEX), and between direct and indirect costs. To make ‘what-if’ analysis possible for different scenarios, you can build and use a computing service reference model that has the same building blocks as the proposed scenarios.
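As a minimal sketch of such a what-if comparison (all numbers and cost categories below are made up for illustration), assuming every scenario is described with the same CAPEX/OPEX building blocks:

```python
from dataclasses import dataclass

# Hypothetical cost building blocks shared by every scenario, so that
# 'what-if' comparisons stay apples-to-apples.
@dataclass
class Scenario:
    name: str
    capex: float          # one-off investment, e.g. build-out, hardware
    opex_per_year: float  # recurring costs: energy, staff, maintenance

    def total_cost(self, years: int) -> float:
        """Total cost of ownership over a given horizon."""
        return self.capex + self.opex_per_year * years

if __name__ == "__main__":
    scenarios = [
        Scenario("own IaaS platform", capex=2_000_000, opex_per_year=400_000),
        Scenario("colocation", capex=250_000, opex_per_year=700_000),
    ]
    horizon = 5  # years
    for s in sorted(scenarios, key=lambda s: s.total_cost(horizon)):
        print(f"{s.name}: {s.total_cost(horizon):,.0f} over {horizon} years")
```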

By using this framework as a business architecture you have a tool for discussing your data center ideas or proposal with senior management.

Next time some words on performance management…

Datacenters, an architecture approach

Many IT people underestimate the fact that a great new technology can be insufficient to build a successful and sustainable business. Because of their trust in a technology’s superiority, they fail to spend enough time exploring business drivers, (alternative) business models and business architectures. They often go with the first IT product they come up with. Yet history is littered with great technologies and products that didn’t succeed.

IT people could greatly improve their chances of success by spending more time understanding the currently used business model and business architecture, or searching for and finding new business models and business architectures. Every IT technology, product or service can be used, but the challenge is to find the IT solution that fits best with the business model and business architecture that is being used. Improving a data center starts with a proper understanding of the business model and the business architecture that is being used and not with a simple roll-out of the newest technology.

To give some examples: virtualization is a great technology for improving the utilization of your servers, but if the people of the datacenter have an incentive to maximize the utilization of floor space, you can expect resistance to your virtualization plans.

Another example: part of the problem of getting IT green is that one of the involved parties can make a choice or transaction that has an effect on other parties which is not accounted for in the market price. For instance, a firm using excessive energy and thereby emitting carbon will typically not take into account the costs that its carbon emissions impose on others. As a result, carbon emissions in excess of the socially optimal level may occur. In economic terminology, there is an externality.

And, using cloud computing terminology, what is your datacenter really offering? Is it SaaS, PaaS, IaaS or just a facility on top of which third parties can build their propositions? And what are the (technical) implications when you are offering a SaaS, PaaS or IaaS service?

Bottom line: your technology proposals must be in sync with the business drivers and the business model.

Architecture

To meet the business and technology needs of a data center, an architecture development method is needed. To take an architecture approach (see for example TOGAF) in constructing a datacenter you first have to start with:

  1. A proper comprehension of the business model that is being used …
  2. then you can formulate the business architecture and
  3. finally you can start with designing the IT architecture and defining a technical design and finding appropriate products.

Business model

A method of describing a business model that has become extremely popular this last year is the business model canvas of Osterwalder. If you are unfamiliar with this concept, have a look at this site or this slideshow. Basically, the concept introduces a standard language and format for talking about business models. Nine key items serve as the building blocks for all business models:

  1. Customer segments: Who will use the product?
  2. Value proposition: Why will they use the product?
  3. Channels: How will the product be delivered to the customers?
  4. Customer relationships: How will you develop and maintain contact with your customers in each segment?
  5. Revenue streams: How is revenue generated from which customer segments?
  6. Key activities: What are the key things that you need to do to create and deliver the product?
  7. Key resources: What assets are required to create and deliver the product?
  8. Key partners: Who will you want to partner with (e.g. suppliers, outsourcing)?
  9. Cost structure: What are the main sources of cost required to create and deliver the product?

These building blocks are laid out on a page (canvas) in a very specific way, referred to as a ‘business model canvas’. The business model canvas can be used to describe any of a wide variety of business models.

Business model canvas of Osterwalder

There are two extensions of this model. One of them takes the “extended enterprise perspective” and pays more attention to the partners and the customers.

Extended business model canvas of Fielt

The other one takes the issues of economic externality and sustainability into account by adding two building blocks: societal costs and societal benefits.

Extended business model canvas from Businessmodeldesign

To summarize, this canvas is a great tool for an architect to start a conversation with the business and gain a good understanding of what they want to achieve and how they want to achieve their goals.

Next time some words on business architecture …

Follow Up: Datacenters, an architecture approach Part 2

Datacenters, big is ugly?

OZZO data center

In the book “The Big Switch” Nicholas Carr makes a historical analysis to build the idea that data centers/the Internet are following the same developmental path as electric power did 100 years ago. At that time companies stopped generating their own power and plugged into the newly built electric grid, which was fed by huge generic power plants. The big switch is from today’s proprietary corporate data centers to what Carr calls the world wide computer: basically the cloud, with some huge generic data centers providing web services that will be as ubiquitous, networked and shared as electricity now is. This modern cloud computing infrastructure follows the same structure as the electricity infrastructure: the plant (data center), transmission network (Internet) and distribution networks (MAN, (W)LAN) that give processing power and storage capacity to all kinds of end devices. A nice metaphor, but is the metaphor right? Is the current power grid architecture able to accommodate the ever rising energy demands? And by taking the current power grid architecture as an example for the IT infrastructure architecture, do we really get a sustainable, robust IT infrastructure? Not everybody follows this line of reasoning.

Instead of relying only on centralized data centers there is another solution, another paradigm, that focuses much more on intelligent localized delivery of services: the nano data center, as discussed in an earlier blog entry. These two kinds of solutions can even be mixed in a hybrid service model, where a macro, centralized delivery model works together with a localized delivery model using intelligent two-way digital technology to control the power supply. An example of this hybrid approach is being developed in Amsterdam by the OZZO project. The OZZO project’s mission is to ‘Build an energy-neutral data center in the Amsterdam Metropolitan Area before the end of 2015. This data center produces by itself all the energy it needs, with neither CO2 emission nor nuclear power.’

According to OZZO the data center should function within a smart, three-layer grid: for data, electrical energy, and thermal energy. These are preconditions, as is full encryption of all data involved, for security and privacy reasons. Possible and actual energy generation and reuse at a given point in the grid serve as drivers for data center or node allocation, size, capacity, and use. Processing and storage move fluidly over the grid in response to real-time local facility and energy intelligence, always looking for optimum efficiency.

The motto of OZZO is ‘Data follows energy’. In their HotColdFrozenData(™) concept, an intelligent distinction is made between high-frequency use data and low-frequency use data. On average, offices and individuals use and change 11% of their data intensively, i.e., every day (hot); 15% is seldom accessed (cold); and 74% is practically never looked at any more (frozen). Special software can classify data streams in real time. After classification and segmentation, data is deduplicated, consolidated, and stored separately on appropriate media. Data can change classifications from hot to cold to frozen, but frozen data can also become hot at times.
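As a minimal sketch of such a classification step (the thresholds and tier names are assumptions for illustration; OZZO’s actual classification software is not described here):

```python
from datetime import datetime, timedelta

# Hypothetical access-frequency thresholds; purely illustrative numbers.
HOT_WINDOW = timedelta(days=1)      # touched within a day   -> hot
COLD_WINDOW = timedelta(days=90)    # touched within 90 days -> cold, else frozen

def classify(last_access: datetime, now: datetime) -> str:
    """Assign a data object to a storage tier based on its last access time."""
    age = now - last_access
    if age <= HOT_WINDOW:
        return "hot"      # keep on fast, always-on media
    if age <= COLD_WINDOW:
        return "cold"     # move to slower, denser media
    return "frozen"       # candidate for spun-down or offline media

if __name__ == "__main__":
    now = datetime(2011, 8, 11)
    for days_ago in (0, 30, 400):
        print(days_ago, "days ago ->", classify(now - timedelta(days=days_ago), now))
```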

In this way a sustainable Information Smart Grid is built, based on several kinds of nodes. OZZO is not following the evolutionary path described by Nicholas Carr: a traditional scale-up of capacity by centralization and a simple-minded reach for economies of scale, neglecting the trade-offs of growing management complexity of the central node and capacity issues of the network (Internet and the power grid). In answering the question ‘Build an energy-neutral data center’, or ‘How to eat an elephant’, OZZO chooses a divide-and-conquer strategy. By creating a new architecture with different types of nodes (‘data centers’) they want to create a sustainable distributed grid and take care of the issues that accompany a centralization approach.

Energy Elasticity: The Adaptive Data Center

Data centers are significant consumers of energy and it is also increasingly clear that there is much room for improved energy usage. However, there does seem to be a preference for focusing on energy efficiency rather than energy elasticity. Energy elasticity is the degree to which energy consumption changes when the workload to be processed changes. For example, an IT infrastructure with a high degree of energy elasticity is one characterised by consuming significantly less power when it is idle than when it is running at its maximum processing potential. Conversely, an IT infrastructure with a low degree of energy elasticity consumes almost the same amount of electricity whether it is in use or idle. We can use this simple equation:

Elasticity = (% change in energy usage / % change in workload)

If elasticity is greater than or equal to one, the curve is considered to be elastic. If it is less than one, the curve is said to be inelastic.
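A minimal sketch of this calculation, following the definition above (the workload and power numbers are made up for illustration):

```python
def energy_elasticity(workload_change_pct: float, energy_change_pct: float) -> float:
    """Energy elasticity: % change in energy usage per % change in workload."""
    return energy_change_pct / workload_change_pct

if __name__ == "__main__":
    # Workload rises from 10% to 75% utilization (+650%), while the server's
    # power draw barely moves, say from 200 W to 210 W (+5%).
    elasticity = energy_elasticity(workload_change_pct=650.0, energy_change_pct=5.0)
    print(f"elasticity = {elasticity:.3f}")  # ~0.008 -> highly inelastic
```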

Given that it is not unusual for servers to operate under ten per cent average utilization, and that most servers do not have a high energy elasticity (according to IDC, a server operating at 10% utilization still consumes the same power and cooling as a server operating at 75% utilization), it is worthwhile to focus more on energy elasticity. A picture can say more than words, so this energy elasticity issue is very well visualized in a presentation by Clemens Pfeiffer, CTO of Power Assure, at the NASA IT Summit 2010. As you can see, without optimization for energy elasticity, power consumption is indifferent to changes in application load.

Load Optimization (c)Power Assure

Servers

Barroso and Hölzle of Google have made the case for energy-proportional (energy-elastic) computing based on the observation that servers in data centers today operate at well below peak load levels on average. According to them, energy-efficiency characteristics are primarily the responsibility of component and system designers: “They should aim to develop machines that consume energy in proportion to the amount of work performed.” A popular technique for delivering some degree of energy-proportional behavior in servers right now is consolidation using virtualization. By abstracting your application from the hardware, you can shift things across a data center dynamically. These techniques (a minimal sketch follows the list):

  • utilize heterogeneity to select the most power-efficient servers at any given time
  • utilize live Virtual Machine (VM) migration to vary the number of active servers in response to workload variation
  • provide control over power consumption by allowing the number of active servers to be increased or decreased one at a time.
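As the minimal sketch announced above (the capacity numbers and headroom are illustrative assumptions, not a production VM scheduler):

```python
import math

# Hypothetical sizing: how many servers must stay powered on for the
# current aggregate workload, with the rest powered down.
SERVER_CAPACITY = 100.0   # workload units one server can handle
TOTAL_SERVERS = 40

def active_servers_needed(total_workload: float, headroom: float = 0.2) -> int:
    """Number of servers to keep active, with some headroom for spikes."""
    needed = math.ceil(total_workload * (1 + headroom) / SERVER_CAPACITY)
    return min(max(needed, 1), TOTAL_SERVERS)

if __name__ == "__main__":
    for workload in (250.0, 1200.0, 3000.0):
        n = active_servers_needed(workload)
        print(f"workload {workload:>6}: keep {n} servers on, "
              f"power down {TOTAL_SERVERS - n}")
```

A real consolidation system would also decide which VMs migrate where; the point here is only that the number of powered-on servers follows the workload.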

Although servers are the biggest consumers of energy, storage and network devices are consumers as well. The EPA Report to Congress on Server and Data Center Energy Efficiency suggests that servers will on average account for about 75 percent of total IT equipment energy use, storage devices for around 15 percent, and network equipment for around 10 percent. For storage and network devices, energy elasticity is therefore also a relevant issue.

Storage

Organizations have an increased demand for storing digital data, both in terms of amount and duration, due to new and existing applications and to regulations. As stated in research by Florida University and IBM, storage energy consumption is expected to continue to increase in the future as data volumes grow and disk performance and capacity scaling slow:

  • storage capacity per drive is increasing more slowly, which will force the acquisition of more drives to accommodate growing capacity requirements
  • performance improvements per drive have not and will not keep pace with capacity improvements.

Storage will therefore consume an increasing percentage of the energy used by the IT infrastructure. Of the data that is being stored, only a small set is active, so it is the same story as for the servers: on average, storage operates at well below peak load levels. A potential energy reduction of 40-75% by using an energy-proportional system is claimed. According to the same research, several storage energy saving techniques are available (a minimal sketch of one of them, opportunistic spindown, follows the list):

  • Consolidation: Aggregation of data onto fewer storage devices whenever performance requirements permit.
  • Tiering/Migration: Placement/movement of data onto the storage devices that best fit its performance requirements.
  • Write off-loading: Diversion of newly written data to enable spinning down disks for longer periods.
  • Adaptive seek speeds: Trading off performance for power reduction by slowing the seek and waiting an additional rotational delay before servicing the I/O.
  • Workload shaping: Batching I/O requests to allow hard disks to enter low power modes for extended periods, or to allow workload mix optimizations.
  • Opportunistic spindown: Spinning down hard disks when idle for a given period.
  • Spindown/MAID: Keeping disks with unused data spun down most of the time.
  • Dedup/compression: Storing smaller amounts of data by removing duplicates and using very efficient compression.
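A minimal sketch of opportunistic spindown, as announced above (the idle threshold and the disk interface are assumptions for illustration):

```python
# Illustrative opportunistic spindown: spin a disk down once it has been
# idle longer than a threshold. Threshold and disk API are assumptions.
IDLE_THRESHOLD_S = 600.0  # ten minutes of inactivity

class Disk:
    def __init__(self, name: str):
        self.name = name
        self.spinning = True
        self.last_io_time = 0.0

    def on_io(self, now: float) -> None:
        self.last_io_time = now
        if not self.spinning:
            self.spinning = True  # spin-up penalty would be paid here

    def maybe_spin_down(self, now: float) -> None:
        if self.spinning and now - self.last_io_time > IDLE_THRESHOLD_S:
            self.spinning = False  # enter low-power mode

if __name__ == "__main__":
    disk = Disk("archive-01")
    disk.on_io(now=0.0)
    disk.maybe_spin_down(now=700.0)
    print(disk.spinning)  # False: idle past the threshold, spun down
```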

Storage virtualization can also help, but component and system designers should aim to develop machines that consume energy in proportion to the amount of work performed. There is still a way to go to get energy-elastic storage.

Network

According to a paper from the USENIX NSDI’10 conference, “today’s network elements are also not energy proportional: fixed overheads such as fans, switch chips, and transceivers waste power at low loads. Even though the traffic varies significantly with time, the rack and aggregation switches associated with these servers draw constant power.” And again the same recipe comes up: component and system designers should aim to develop machines that consume energy in proportion to the amount of work performed. In addition, as explained in the paper, some kind of network optimizer must monitor traffic requirements, choose and adjust the network components to meet the energy, performance and fault tolerance requirements, and power down as many unneeded links and switches as possible. In this way, savings of 25-40% of the network energy in data centers are claimed on average.
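As a purely hypothetical sketch of such an optimizer’s decision step (the utilization threshold and the minimum number of active links are assumptions, not the algorithm from the paper):

```python
# Hypothetical decision step of a network optimizer: power down lightly
# used redundant links, but always keep a minimum number up per switch pair.
MIN_LINKS_UP = 2          # fault-tolerance requirement (assumed)
UTILIZATION_THRESHOLD = 0.1

def links_to_power_down(link_utilization: dict[str, float]) -> list[str]:
    """Return the links that can be switched off given current traffic."""
    sorted_links = sorted(link_utilization, key=link_utilization.get)
    candidates = [l for l in sorted_links if link_utilization[l] < UTILIZATION_THRESHOLD]
    keep_up = max(MIN_LINKS_UP, len(link_utilization) - len(candidates))
    return sorted_links[: len(link_utilization) - keep_up]

if __name__ == "__main__":
    utilization = {"link-a": 0.02, "link-b": 0.05, "link-c": 0.40, "link-d": 0.01}
    print(links_to_power_down(utilization))  # e.g. ['link-d', 'link-a']
```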

Cooling

When making servers, storage and the network in data centers energy-proportional, we will also need to take air-conditioning and cooling needs into account. Fluctuations in energy usage are equivalent to fluctuations in heat, and the question is whether air-conditioning can be quickly zoned up and down to cool the particular data center zones that see increased server, storage or network use. As Dave Craven of Spinwave Systems stated in a recent editorial article in Processor, “Unfortunately, the mechanical systems used to cool and ventilate large data centers haven’t kept up with technological advances seen in the IT world”. “Many buildings where they are putting newer technology and processes are still being heated and cooled by processes designed 20 years ago,” Craven adds. Given that the PUE is driven by cooling efficiency (see for example the white paper of Trendpoint), it looks like cooling is the weak spot in creating an energy-elastic data center.

Next step

The idea of ‘disabling’ critical infrastructure components in data centers has long been considered taboo. Any dynamic energy management system that attempts to achieve energy elasticity (proportionality) by powering off a subset of idle components must demonstrate that the active components can still meet the currently offered load, as well as provide a rapid inactive-to-active mode transition and/or meet changing load in the immediate future. The power savings must be worthwhile, performance effects must be minimal, and fault tolerance must not be sacrificed.

Energy management has emerged as one of the most significant challenges faced by data center operators. This energy management control knob, to tune between energy efficiency, performance, and fault tolerance, must come from a combination of improved components and improved component management. The data center is a dynamic, complex system with many interdependencies. Managing and orchestrating these kinds of systems asks for sophisticated mathematical models and software that uses algorithms to automatically make the necessary adjustments in the system.