How to prevent global system outages
By Bob Aiello
The recent global outage caused by a faulty update to Windows systems has raised the question of whether we are simply too dependent upon software for mission-critical operations, including airlines, banking systems and hospitals. Many folks talked about returning to paper pads and pens instead of relying upon complex software systems that can break with one faulty software update.
I have often written that we really do know how to reliably build, package and deploy code. While updating software is a very technical endeavor, the weak link in the chain may have more to do with organizational culture, management, leadership and a true commitment to quality than with any gap in our technical processes. I was once asked during an interview whether I was confident enough in my skills to manage the updates to a life support system that my loved one depended upon. Of course, my immediate response was “Sure, I can do that.” That seemingly innocuous interview question has haunted me ever since. I have had loved ones dependent upon life support systems, and somebody handled the updates to those systems. Did they know what they were doing? Could I have done a better job? And what if I had made a mistake and the system went down?
Configuration management (CM) best practices, including effective source code management, automated build engineering, environment management, change control and automated release and deployment engineering, are just the start. Code reviews and continuous testing are also fundamental best practices, which many organizations do well. The cool kids at school have advanced our discipline with continuous delivery and deployment, along with strategies such as blue/green deployments and canary releases (named for that poor canary in the coal mine). So if we do know how to reliably release software, why do we have so many serious incidents, from security breaches to global outages?
In many ways, I think that the problem is not in our technical processes but rather in how we run our organizations, starting with who is in charge and making the decisions. Too often, folks with positional power do not listen to the technology experts who actually understand the complexities of the work that we do. I have held leadership positions in large organizations where I tried my best to be a positive influence. I have worked to be an industry thought leader, including through my role in the IEEE, with efforts to create international standards in configuration management, DevOps and, more recently, SRE. High-level managers often lack the technical detail required to make the best decisions, which is why it is essential to be a leader who enables the smart people around you to provide their expert guidance and knowledge. Too often, organizational culture silences dissenting views, losing an important perspective from people who have deep technical knowledge. Deming said it best when he declared that we need to “drive out fear.” Many organizations also lack a true commitment to quality, focusing instead on short-term profits.
I still believe firmly that we can create and update complex systems in a reliable and secure way. We need to focus on building organizations whose culture has an absolute commitment to quality and where anyone feels empowered to pull the red cord and stop the assembly line. What say you? What do you feel that we need to do in order to improve our quality and reliability? (You can comment back on the original LinkedIn post.)