by Sean Burlington

Why do Airline Computer Systems Crash? Examining the recent BA Downtime

BA joins Southwest Airlines and Delta in suffering a very public IT failure. Here are some reasons this kind of downtime happens, and some tips for protecting essential airline systems.

Any frequent traveler knows delays happen. But when thousands of travelers around the globe are kept waiting for hours and hours for their British Airways flights – that’s a little more serious. CNN Money called it the result of a “computer glitch,” while the Daily Mail called it “a problem within the hub of their system [that] led to a power outage.”

I’m not trying to criticise British Airways here. The fact that this is even news is a testament to its otherwise rock-solid systems and services.

But something has clearly gone awry with BA’s IT systems, and for other airlines and travel companies concerned about facing similar disruption and downtime, I have some theories on how it happened, and a few pearls of wisdom to help prevent it happening to you.

So what’s really going on with BA?

Diagnosing exactly what went wrong is a little tricky seeing as I have no access to BA’s systems and processes. I certainly have a few theories though, and one I can debunk straight away:

  1. It’s a power outage problem – I’ll eat my hat if this is actually the case. For a mission-critical system like this, there’s no way it wouldn’t have some kind of backup power or external site protecting it from a simple power failure.
  2. It’s an update gone wrong – This definitely has a whiff of update issue about it. Likely a single-point-of-truth like a database was updated to a new version, and the problems only cropped up later – at a point when new code depended on the database, and rollback wasn’t possible.
  3. It’s a problem with BA’s new Global Distribution System – The Independent reported that BA recently deployed a new Global Distribution System (GDS), FLY. This new system could be the root of the problem, too.

While it would be great to point the finger at a single issue and be done with it, this is a complex case. To my mind, the cause of BA’s outage is a mix of new-system teething problems and update dilemmas – a bit of theory 2, and a bit of theory 3.

Good problems come in twos

So FLY. It’s a large GDS, and one of its sub-programs is a Departure Control System called Altea. To my mind, that’s the likely culprit: the devil behind the downtime.

Why do I think that? Because customers could still check in to their BA flights online. The problem was only with the systems that handle onsite check-ins and coordinate plane departures – exactly the ground Altea covers. So Altea is the most likely point of failure.

While BA staff saw “inadequacies compared with the previous system” after Altea’s recent deployment (according to the Independent), it’s unlikely this system alone caused everything to grind to a halt. After all, these systems hold up a huge portion of airline operations, and they’re given the rigorous testing you’d expect as a result.

The upgrade issue

What’s more likely is that an upgrade to the new Altea system knocked things out of place. By the time the problems were noticed, new code had been piled on top of the creaky update. Suddenly, it was too late to roll back to a previous version, and the whole thing fell down like a chocolate fireguard.

Not convinced? Here’s some evidence:

  • BA only just finished deploying Altea – and since many big follow-up updates roll out about a month after launch, this looks like more than coincidence.
  • The downtime started early Tuesday morning – developers almost always choose early Tuesday mornings to push updates. Everyone’s asleep or at work, so it helps minimise disruption.

The timing is a little too perfect for this not to be an upgrade issue with Altea.

What if I’m rolling out an upgrade soon?

The big thing to consider during your upgrade process is that staging environments do not 100% reflect what will happen in production. So when you test an upgrade in staging, your production systems might react completely differently to it. Why? Because when the chaos of live, human-driven market data is introduced, weird stuff can happen.

That’s how many of these problems start. An update is fine in staging, you push it to production, and everyone’s smiling. Then some actual real data goes into the system, and the whole thing craps itself. By that point though, the customer data is in the system – and the customer is expecting to be able to board their flight as usual. So you can’t just strip their transaction out and revert to an old software version.
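To make that concrete, here’s a deliberately simplified sketch in Python – with invented field names, not BA’s actual schema or code – of how an upgrade that sails through staging can still fall over the moment real, human-entered data arrives:

```python
# A deliberately simplified illustration -- invented field names, not BA's schema.
from datetime import datetime

def parse_departure(record: dict) -> datetime:
    # Suppose the upgraded code assumes every booking stores an ISO 8601
    # timestamp, because that's all the staging fixtures ever contained.
    return datetime.fromisoformat(record["departure"])

# Staging data: clean, machine-generated, always well-formed.
staging_booking = {"flight": "BA0117", "departure": "2016-09-06T08:25:00"}
print(parse_departure(staging_booking))  # parses happily

# Production data: entered by humans and legacy systems over many years.
real_booking = {"flight": "BA0117", "departure": "06/09/16 0825"}
try:
    parse_departure(real_booking)
except ValueError as err:
    # This only surfaces once live data hits the upgraded code.
    print(f"Falls over on real-world data: {err}")
```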

Here are a few ways to get around this issue:

  • Run automated tests throughout the entire software development lifecycle: Pretty obvious, this one – more tests will make it more likely you’ll catch any problems.
  • Create fake markets to test: Populate test markets with a list of fictional airports, flights, schedules and customers (there’s a rough sketch of this just after the list). You’ll never completely replicate human-driven data, but it gets you as close as you can manage.
  • Upgrade in isolation: Roll out an update to just one component of your wider GDS, and wait to see how it handles the update before you start updating other parts of the system (see the canary sketch below).
  • Virtualize a mini-production environment: You might not have the spare resources to clone your entire production environment for testing, but with virtualization you can clone a scaled-down replica at a fraction of the size (and a fraction of the cost).
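For the fake-market idea, here’s one way you might generate fictional airports, flights, schedules and passengers to push through a system under test. Everything in this sketch – codes, names, volumes – is invented for illustration:

```python
# Minimal "fake market" fixture generator. All codes, names and numbers are fictional.
import random
from datetime import datetime, timedelta

AIRPORTS = ["AAA", "BBB", "CCC", "DDD"]               # fictional IATA-style codes
FIRST_NAMES = ["Alex", "Sam", "Jo", "Chris"]
LAST_NAMES = ["Smith", "Jones", "O'Brien", "Müller"]  # include awkward characters on purpose

def fake_flight(flight_no: int) -> dict:
    origin, destination = random.sample(AIRPORTS, 2)
    departure = datetime(2016, 9, 6) + timedelta(minutes=random.randint(0, 24 * 60))
    return {
        "flight": f"ZZ{flight_no:03d}",
        "origin": origin,
        "destination": destination,
        "departure": departure.isoformat(),
        "passengers": [
            {"name": f"{random.choice(FIRST_NAMES)} {random.choice(LAST_NAMES)}"}
            for _ in range(random.randint(0, 350))    # include empty and packed flights
        ],
    }

# A day's worth of fictional departures to run through the upgraded system.
fake_market = [fake_flight(n) for n in range(1, 201)]
```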
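And for upgrading in isolation, one common approach is a canary-style rollout: send a small slice of traffic to the upgraded component and keep the rest on the old version until you trust it. This is a rough sketch with made-up service names, not a description of any airline’s real setup:

```python
# Rough canary-routing sketch -- the service URLs and percentage are made up.
import random

LEGACY_DCS = "https://dcs-legacy.example.internal/checkin"
UPGRADED_DCS = "https://dcs-v2.example.internal/checkin"
CANARY_FRACTION = 0.05  # start by sending 5% of check-ins to the new version

def pick_backend() -> str:
    """Route a small share of requests to the upgraded component."""
    return UPGRADED_DCS if random.random() < CANARY_FRACTION else LEGACY_DCS

# Widen CANARY_FRACTION gradually as the new component proves itself;
# drop it back to 0.0 to "roll back" without touching customer data.
print(pick_backend())
```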

None of this will ever guarantee a smooth upgrade, but follow these tips and you’ll at least stack the deck in your favour.

Want some more tips on safely deploying new software upgrades? Get in touch with me here, or contact my awesome team at Lola Tech.