Courtesy Pexels Marcus Aurelius

Data Centres: The Sleeping Beauty Problem

Temperatures, heavy workloads and pressure for business uptime on one hand, underutilisation and capacity issues on the other: it looks like data centres need more than a ‘Good Morning’ peck to wake them up to their full force.

Amid signs of positive shifts in efficiency and power savviness, a drop in outages and an uptick in uptime, data centres and IT infrastructure boxes are still struggling with the coma of underutilisation, carbon footprint and unplanned downtime. However, the story is slightly different for hyperscalers, co-location set-ups and IT teams with sharp focus and goals, compared with an average enterprise. We chat with Jay Dietrich, Research Director of Sustainability, Uptime Institute, and unravel why a creature that is both the beauty and the beast can stay chained in glass-box constraints, most of which connect to a simple snooze button.

Pratima H

Uptime Institute’s 13th Annual Global Data Center Survey shows that average global Power Usage Effectiveness (PUE) levels have remained flat for four years. Can we expect a pattern of improvement in this area?

The average PUE from the 2024 Global Data Center Survey was 1.56. It was 1.58 in the 2023 survey and 1.55 in the 2022 survey. PUE numbers, in certain areas, are coming down. New data centres, by and large, come in the range of 1.2 to 1.3. Large co-location centres and hyperscalers, too, fall in that ambit. It is all about the efficiency of your IT team and how well an enterprise can utilise the capacity. At 10 per cent utilisation, there are limited energy savings and a lot of capital is also wasted. When you take it up to 30 per cent utilisation, you do the same amount of work with less equipment. According to some surveys, 40 per cent of operators have set goals for improving utilisation.
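To put rough numbers on the two ideas in this answer, here is a minimal sketch. It relies on the standard definition of PUE (total facility energy divided by IT equipment energy) and on a simplifying assumption that the work done scales linearly with utilisation. The function names, the 1,000 kWh IT load and the 300 units of work are hypothetical illustrations, not figures from Uptime Institute.

```python
def facility_energy_kwh(it_energy_kwh: float, pue: float) -> float:
    """Total facility energy implied by a PUE value.

    PUE = total facility energy / IT equipment energy,
    so total = IT energy * PUE.
    """
    return it_energy_kwh * pue


def servers_needed(work_units: int, capacity_per_server: int, utilisation_pct: int) -> int:
    """Servers required to do `work_units` of work at a given average utilisation.

    Assumption (hypothetical model): each server delivers
    capacity_per_server * utilisation_pct / 100 units of useful work.
    Integer ceiling division avoids floating-point rounding surprises.
    """
    useful_per_100_servers = capacity_per_server * utilisation_pct
    return -(-work_units * 100 // useful_per_100_servers)


if __name__ == "__main__":
    it_load = 1000.0  # hypothetical IT energy in kWh
    for pue in (1.56, 1.3, 1.2):
        total = facility_energy_kwh(it_load, pue)
        print(f"PUE {pue}: {total:.0f} kWh total, {total - it_load:.0f} kWh overhead")

    work = 300  # hypothetical units of work to be done
    for pct in (10, 30):
        print(f"{pct} per cent utilisation: {servers_needed(work, 1, pct)} servers")
```

Under these assumptions, moving from 10 to 30 per cent utilisation cuts the equipment needed for the same work to a third, which is the capital saving the answer points to. Real workloads rarely scale this cleanly, so treat it as an order-of-magnitude illustration.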

Do hyperscalers have better habits and approaches vis-à-vis the usual enterprises?

Hyperscalers do not provide a lot of operating data about their IT, so it is hard to say. But if you look at SaaS areas, players like Google and AWS gain from full control of their infrastructure and achieve 60-65 per cent utilisation rates. For IaaS areas, the number can be around 20 per cent. Most enterprises operate at 20 per cent levels. They hesitate when it comes to refresh and consolidation of IT equipment; otherwise, they could operate at the same level as hyperscalers. Financial enterprises have twin data centres to ensure reliability and uptime. They are exactly alike in workloads and capacity, with workloads split across different regions and separate power supplies.

Why do most enterprises hesitate to consolidate their IT footprint?

Most software and tools for tracking and workload placement are available, but they cost money. They can also be a black box. But I guess the main reason is that enterprises do not want to take risks. We need to change that attitude because it will bring them capital savings too. The general refresh ratio is around 1:1. That expands their capacity, especially with new generations of processors. I do not understand why enterprises do not take some risk.

The recent Uptime survey also shows outages on a slightly downward slope. More than half of operators reported that they have had an outage at their site in the past three years, the lowest number yet recorded (compared with 69 per cent in 2021 and 60 per cent in 2022). This continues a trend of steady improvement. Power continues to be cited as the single biggest cause of outages. Do you think the pattern will continue?

There is more attention on optimisation because it has become integrated with Ops, and there is a strong focus on reliability, especially with goals, metrics and training happening on a continuous basis. There is emphasis on identifying and anticipating disruptions and moving past just reacting to them. If training, system management and Black Swan events do not interfere, we can see better uptime, especially now, when systems have become so important to companies. That said, there is always going to be some downtime. That is, perhaps, also why some CIOs are hesitant about consolidation. They may wonder why they should risk reliability for other factors. Reliability and uptime are their key focus areas.

I do not understand why enterprises do not take some risk.

Jay Dietrich, Uptime Institute

Have Telcos seen a rise in outages in the latest survey? Is it because of rise in demand or something else? Are they more susceptible to disruptions in downtime due to extreme weather?

Uptime Institute has an outage tracking database that combines outages we find in the literature and through our network with outages reported to us by members. In this database, Telco data centres averaged 21 per cent of the outages from 2016 to 2023 and were 30 per cent of the outages in 2023. The data indicate that Telco outages have increased as a percentage of overall data centre outages over the past several years. However, the input does not provide a strong statistical basis to reach conclusions.

The causes of 2023 outages for which we had information are shown in the graphic here. We would assume the distribution would be relevant to Telcos. Cooling was identified as the cause of only six per cent of the outages Uptime recorded in 2023. Again, there is not an adequate sample size to draw statistically valid conclusions, but the order of impact is largely the same over the 2016 to 2023 period. (See: Infographic)

What’s your advice to CIOs? For short-term solutions?

Guide your IT teams and facilities and strengthen the capability of your cooling system. Have periodic assessments and evaluate enterprise networks, storage and all other kinds of infrastructure. Optimise their use. Give goals to your IT teams. If you are at a 10 per cent utilisation level, it can take you four to five years to go higher.

What should be done as a long-horizon approach?

Using new-age tools and approaches can make a huge difference, not just from an efficiency standpoint but also from the angles of reliability and sustainability. You can anticipate problems instead of just reacting to them.


 
