On Wednesday, July 20th, Southwest Airlines suffered a massive systems outage that essentially grounded the airline, with over a thousand flights cancelled [1]. The incident forced the airline to fall back temporarily on manual procedures, which could not handle the normal volume and caused widespread delays and cancellations. The airline had previously suffered a significant outage in 2015, when over 500 flights were cancelled due to another systems glitch. Southwest CEO Gary Kelly was quoted as saying that the cause of the latest outage was a backup system that failed to take over when a router failed. Ironically, this more recent incident occurred shortly after the airline announced record profits, raising the question of whether enough effort is being made to upgrade the airline's critical support systems. Kelly was further quoted as saying, “Southwest has an aging technology infrastructure,” but “the airline has been making significant investments to upgrade it.” Southwest reportedly says that “it expects to replace the longstanding reservations system next year and replace other key systems over the next three to five years.” Let’s hope that the airline will embrace industry best practices, including DevOps, for software and systems upgrades. Otherwise, future reports will likely describe outages caused by the attempted improvements themselves.
Editor's footnote added July 26th, 2016
[1] According to an article published July 25th, 2,300 flights were cancelled and thousands more delayed. The cost of the outage may be as high as 10 million dollars.