
It’s been a rough couple of weeks for the IT and Telecoms industry, particularly for customers based within London Docklands data centres. 

There have in fact been not one, but two data centre failures, albeit one with significantly more impact than the other:

  1. Equinix LD8  (Tuesday 18th August). Power loss.  
  2. Telstra (Thursday 27th August). Power loss.   

First up we have the Equinix failure. 

What made this event particularly painful for many is that this facility is home to LINX (the London Internet Exchange). LINX is one of the world’s largest internet exchanges and home to some 700+ ISPs, including BT and Virgin Media.

Events kicked off in the small hours at 4:23 a.m. with loss of electrical power to many customer server racks, on both the resilient A and B power feeds. Many customers (ISPs) did not have power restored until well into the evening (10 p.m.) – that is some 16+ hours of outage.

LINX themselves confirmed that at least 150 of their member ISPs suffered as a direct consequence of the failure, which in turn affected thousands of those ISPs’ customers. Typical casualties would have been internet and telephony services for business customers, by the thousand.

Can we suppose these affected service providers had another data centre that their service could failover to?

A good portion did, for example to the popular “Telehouse North” data centre facility – good news. The bad news is that the sudden increase in load on Telehouse North and other facilities caused some weird and wonderful behaviours for the ISPs and their customers alike. In effect, the diversion of network traffic away from Equinix LD8 caused the wider ecosystem to bog down; not too uncommon, unfortunately.


So what caused this eye-watering outage?


In short, they had a UPS (uninterruptible power supply) failure – something that, bizarrely, also happened in the same facility back in 2016, albeit when it was running under the Telecity brand.

This time round, it seems as though the facility was midway through a roll-out to a new UPS system, during which customers were being transferred from the old system to the new. 

However, the old UPS system experienced a failure, which in turn triggered the comprehensive fire detection system, and that subsequently shut down the power to a large subset of customer racks.

Why did it take so long to bring services back online?

In short, a combination of component failure, coupled with health and safety requirements.

Sometimes with data centre power failures, restoring service can be as simple as conducting system resets, coupled with the necessary health and safety and technical procedures. However, the health and safety element is not to be underestimated.

As you might expect, the power systems within a data centre are complex, and they run at dangerously high voltages (400V and above) that require specialist engineers to manage in emergency scenarios.

So whilst it might just be a figurative trip switch which requires resetting, no chances can be taken and a full risk assessment must be carried out, by law, prior to resets taking place. (This was the case with another DC power outage that occurred at Pulsant South London in 2018.)

Coming back to the Equinix failure, our understanding is that, to get customers back online, data centre technicians pressed ahead with migrating power supplies to the new UPS system, as opposed to restoring power via the old (failed) UPS system – which says something about the extent of the technical failure!

How did Equinix handle all of this?

Unfortunately for customers located within the failed areas of the data centre, Equinix’s own communication with them was poor, as can be seen on Giganet’s status page feed throughout the event.

To make matters worse, customers trying to reach their ailing computer racks struggled to gain physical access to the building, as Equinix’s own building access systems were down as a consequence of the outage. Crumbs.


Next up we have the Telstra DC failure

Telstra is an Australia-based telecoms organisation with many data centres worldwide. Its Docklands facility is 115,000 square feet in size, with total capacity for 1,800 racks.

On 27th August, this facility reportedly experienced a power outage in the lower half of the building. According to the London Fire Brigade, 25 crew members arrived on site at 9:26 a.m. to attend what they reported as a fire. Yikes.

Unconfirmed reports from CBR suggest that a UPS caught fire, which in turn tripped the breakers supporting the bus bar. Customers then lost power to their racks.

Fortunately, the fire was contained, partially damaging only one of the plant rooms, and engineers were able to restore power to customer racks soon after.

But let’s not play it down: it’s another complete power loss for customers, who in turn are providing critical services to their own customers.

Common themes in both of these events

It’s almost uncanny that two revered data centre facilities in the illustrious London Docklands, a relative stone’s throw apart, have both gone down with power failures within days of each other.

What’s more, it’s the old chestnut of a UPS-system failure which has affected them both (a recurring issue, it seems, for Equinix LD8). 


But the elephant in the room is…


Despite the inherent redundancies baked into these data centres, they failed their customers by cutting off the lifeblood (electricity) to their equipment.

For those of you who are not so well versed in the power provision within data centre racks: as a customer with server equipment in a rack, you are given two independent power feeds, known as an “A feed” and a “B feed”.

Having these two feeds means you have two unconnected supplies powering your precious server equipment. Each feed ought to be backed by its own parallel chain of UPS system, diesel generator, electricity substation and so on. So if a UPS system were to fail (catch fire, etc.!) on the A feed, the B feed ought to be unaffected, meaning you don’t lose power to your server rack.

But in both of these data centre incidents, there was a single point of failure, which contradicted the very purpose of a data centre – to keep the lights on.
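
To make that concrete, here is a minimal, purely illustrative sketch in Python (the component names are made up and it does not model either facility’s real topology) of why a dual-fed rack only stays resilient if the A and B feeds depend on separate UPS systems:

```python
# Illustrative only: a rack stays powered while at least one of its feeds
# traces back to a healthy UPS. Sharing one UPS across both feeds quietly
# re-introduces a single point of failure.
from dataclasses import dataclass


@dataclass
class UPS:
    name: str
    healthy: bool = True


@dataclass
class Feed:
    name: str
    ups: UPS  # the UPS this feed ultimately depends on


def rack_has_power(feed_a: Feed, feed_b: Feed) -> bool:
    """A dual-fed rack only goes dark when both feeds lose their UPS."""
    return feed_a.ups.healthy or feed_b.ups.healthy


# Properly segregated design: each feed has its own UPS system.
ups_a, ups_b = UPS("UPS-A"), UPS("UPS-B")
segregated = (Feed("A", ups_a), Feed("B", ups_b))

# Flawed design: both feeds converge on the same UPS.
shared = UPS("UPS-SHARED")
converged = (Feed("A", shared), Feed("B", shared))

ups_a.healthy = False    # simulate a UPS failure on the A side
shared.healthy = False   # simulate the shared UPS failing

print(rack_has_power(*segregated))  # True  – the B feed keeps the rack alive
print(rack_has_power(*converged))   # False – one failure takes out both feeds
```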

There will no doubt be a few DC-savvy readers at this point thinking,

“Come on, there has to be an eventual pinch point, where power items on A and B feeds have to share common infrastructure”. 

For example, from server rack power socket to power station, there will be a convergence eventually. And, yes, this is true enough…  But it shouldn’t be at the UPS level.

Data centres of Equinix’s and Telstra’s calibre ought to be supplying customers with truly resilient A and B feeds, at the very least via separate UPS systems.

Is this just a rare double yolker, two data centres, two weeks, one Isle of Dogs?

Sadly not… There have been multiple failures over the past few years, often stemming from UPS faults, but ultimately highlighting single points of failure. Some more examples:

  1. Telecity – Power Failure (service impacting).
  2. British Airways – Power Failure (service impacting).
  3. The Bunker Secure Hosting – Power Failure (service impacting).

So why are these highly engineered facilities going offline?


Put simply, data centres are a collection of moving parts managed by human beings. 

They ought to be highly resilient: power, cooling, internet connectivity, geographical risk – all of these items go into the melting pot when a data centre is built and operated. “No single point of failure”, “N+1” and “N+2” are the terms that get used.
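
For readers unfamiliar with the shorthand: “N” is the number of units (UPS modules, generators and so on) needed to carry the load, and the “+1” or “+2” is the number of spares held on top. A quick illustrative calculation in Python (the figures are hypothetical, not from either facility):

```python
# Back-of-the-envelope sketch of what "N+1" and "N+2" mean in practice.
import math


def units_required(load_kw: float, unit_kw: float, spares: int) -> int:
    """N is the unit count needed to carry the load; add 'spares' on top."""
    n = math.ceil(load_kw / unit_kw)
    return n + spares


load_kw, ups_module_kw = 1000, 300  # hypothetical hall load and UPS module size

print(units_required(load_kw, ups_module_kw, 0))  # N   -> 4 modules, no headroom
print(units_required(load_kw, ups_module_kw, 1))  # N+1 -> 5 modules, rides out one failure
print(units_required(load_kw, ups_module_kw, 2))  # N+2 -> 6 modules, rides out two failures
```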

But as long as we have humans designing and operating them, they will, from time to time, fail. More often than not, it’s a failure that can be absorbed without impact on the customer (and their customers). However, as we know, they can also succumb to more catastrophic failures – the service-impacting kind.

It’s just that, at the moment, it seems to be happening too often.

What can colocation customers do to avoid being affected by these power failures?

There are a few basic checks that colocation customers can conduct to help sound out the health of their power systems. For example:

  1. Ask a data centre provider to reveal to you the extent of separation between the A and B power feeds.  Can they show you building plans and plant rooms for example?
  2. Check the up-time history of the facility, at least for the past five years, preferably more.
  3. Enquire about the frequency of maintenance on facilities such as power and cooling. Also whether such maintenance has been deferred in recent years and, if so, why?
  4. If a facility has been acquired by the existing owner, probe for the changes that have since been put in place and for what reasons.
  5. Ask to speak with a long-standing tenant of the facility (ensure it’s for the same data hall which you intend to occupy). 

And if any of the above is daunting, find a specialist adviser who will happily undertake this work for you and report back with their findings (there are many great independent folks out there with such skills).

The audit is worth doing, as it’s not quick or easy to move your equipment into a data centre and then back out again once you find a shortcoming (or worse still, experience an outage). (Unfortunately I speak from experience, as we’ve been in and out of a data centre within a year! Incidentally, no sooner had Cloud2Me moved out than that facility suffered a complete power failure.)

So does moving our services to a public cloud provider, i.e. AWS or Microsoft Azure, eliminate our risk?

Unfortunately not. The large public cloud providers use the same calibre of data centre as those we have so far been discussing. And they too have had many high-profile failures. Take this example, in which a lightning strike caused a large outage. A good job that lightning never strikes twice!


So what’s the solution to these failing facilities?

reliable parts network data centre

Well, one thing is for sure: data centres will not stop experiencing failures. Another thing is equally certain: they will continue to overstate the magnitude of their redundancy mechanisms, as many of the failed facilities did in their marketing.

It is instead down to every service provider that inhabits a data centre to build a service fabric for its customers that accounts for a data centre going offline. One of my favourite sayings to this effect is “build a reliable system from unreliable parts”. That is, to know that bits will fail and fall off, but to keep the plates spinning regardless.

Whether it’s an internet, telephony or cloud services provider that is utilising a data centre, they ought to be using multiple data centre facilities in order to absorb the effect of one going offline.  For example, we at Cloud2Me utilise four data centres for our managed hosting platform, so that we can keep services such as our Hosted Desktop and Hosted Exchange system online for our customers.
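
As a hedged sketch of that principle (the hostnames and port below are placeholders, not a description of our actual estate), a provider might continuously health-check an endpoint in each facility and steer traffic to the first one that responds:

```python
# Illustrative multi-data-centre health check: probe an endpoint in each
# facility and pick the first that answers. Real platforms layer on DNS or
# anycast steering, replication and far richer health signals.
import socket
from typing import Optional, Tuple

DC_ENDPOINTS = [
    ("dc1.example.net", 443),  # hypothetical facility endpoints
    ("dc2.example.net", 443),
    ("dc3.example.net", 443),
    ("dc4.example.net", 443),
]


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Crude health check: can we open a TCP connection to the endpoint?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_active_dc() -> Optional[Tuple[str, int]]:
    """Return the first facility that answers; None means a full outage."""
    for host, port in DC_ENDPOINTS:
        if is_reachable(host, port):
            return host, port
    return None


if __name__ == "__main__":
    active = pick_active_dc()
    print(f"Routing traffic via {active}" if active else "All facilities down!")
```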

Another example (albeit on a considerably different scale!) is Facebook. Like us, it has intentionally taken a data centre offline to stress-test its disaster recovery mechanisms. Disaster recovery testing is never identical to a real failure, as you have prior warning, but, conducted correctly, it goes a long way towards ensuring that your service can survive the simulated failure.
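
A simplified way to rehearse such a drill (illustrative only – a real test exercises live traffic, replication and failback, not just a model) is to simulate losing each facility in turn and assert that the service would still be served from the remainder:

```python
# A toy "game day" drill: take each facility offline in the model and check
# that the service would survive on the rest. Facility names are made up.
from typing import Dict, List


def service_survives(dc_health: Dict[str, bool]) -> bool:
    """The service stays up as long as at least one facility is healthy."""
    return any(dc_health.values())


def run_drill(dcs: List[str], victim: str) -> bool:
    """Simulate taking 'victim' offline and report whether service survives."""
    health = {dc: (dc != victim) for dc in dcs}
    return service_survives(health)


facilities = ["london-1", "london-2", "manchester-1", "slough-1"]

# Pull each facility in turn and confirm the platform would ride through it.
for dc in facilities:
    assert run_drill(facilities, dc), f"Losing {dc} would take the service down"
print("Drill passed: no single facility is a single point of failure.")
```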


Summary

As the director of a managed hosting outfit, I don’t find it easy reading every time we see another data centre go offline. This is especially so when the biggest names in the business are suffering the failures, but it’s a dose of reality for us service providers and customers alike.

Such incidents sharpen our minds to ensure that we build our services and platforms “above the hardware”. That is, we’re accustomed to having masses of redundant equipment inside our server racks, but beyond this, we must treat the data centre itself as a hardware item, ready to fail.

If you would like to speak to us with regards to making your IT infrastructure more resilient, please drop us a line at info@cloud2me.co.uk or 01737 304210.

Written by Jack Twomey

