Behaviorally Speaking – Software Safety
by Bob Aiello
Software impacts our world in many important ways. Almost everything that we touch from the beginning to the end of our day relies upon software. For example, airline flight controls and nuclear power plants rely upon complex software code that must be updated from time to time, tested and supported. Past incidents have impacted the 911 emergency dispatch system, and emergency dispatch systems have suffered outages on more than one occasion, affecting the response time for police, ambulance and fire department services. The software that enables the anti-missile defense system known as the Iron Dome in Israel has been credited with saving lives and underwent an extensive testing and validation effort. But the number of software glitches impacting trading systems and other complex financial systems could cause us to question whether our software configuration management capabilities are really where they should be.
Many years ago, I was interviewed by a very smart technology manager for a position supporting a major New York based stock exchange. I went into the interview feeling pretty confident that I had the requisite skills, and I had actually been recommended by a manager for whom I had previously worked at another company. I was surprised when, during the interview, I was asked a very pointed question about my capabilities. The manager asked me to imagine that I was supporting the software for a life support system which my loved one depended upon. He then asked me if I was confident that I would never make a mistake that could potentially impact the person (presumably my child, parent or spouse) who was dependent upon the life support system. I was pretty shocked to be asked this during a job interview, but I managed to stay positive. I told the manager that my methods worked and that yes, I would trust them on a life support system that could potentially impact someone I cared about. But the question stayed with me for years to come. The truth is that someone has to upgrade the software used by life support systems, and I am not fully confident that our industry has completely reliable methods to handle this work.
Some time ago I gave a full-day class at a Nuclear Information Technology Strategic Leadership (NITSL) conference. NITSL is a nuclear industry group of all nuclear generation utilities that exchange information related to information technology management and quality issues. I am pleased to say that these colleagues valued software safety to such a degree that it was an ingrained part of their culture and shaped every aspect of their daily work.
From a configuration management perspective, the first step in software safety must be to establish the trusted base, from the systems software up to the applications that are integrated with the hardware devices. The trusted base must start from the lowest levels of the system, including the firmware, the operating system and even the hardware itself. Applications must be built, packaged and deployed deterministically to the trusted base in a manner that ensures that we know exactly what code is to be deployed and that we can verify that the correct code was indeed deployed to the target environment. Equally important is verifying that no unauthorized changes have occurred and that the trusted base is verifiable and fully tested. If you had a pacemaker that required software updates, it would obviously be essential that you could rely upon a trusted base that enables the pacemaker to function reliably and correctly.
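One way to picture "verify that the correct code was deployed, and that nothing unauthorized changed" is a checksum manifest: record a cryptographic hash of every file at build time, then compare the target environment against that record. The following is a minimal sketch in Python, not a description of any particular tool. The function names (`build_manifest`, `audit_deployment`) and the shape of the report are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest.

    Hypothetical helper: run it against the packaged build output and
    store the result as the baseline for later audits.
    """
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit_deployment(expected: dict, deployed_root: Path) -> dict:
    """Compare a deployed tree against the expected manifest.

    Reports files that are missing, files that were never part of the
    build (unexpected), and files whose content differs (modified).
    """
    actual = build_manifest(deployed_root)
    shared = set(expected) & set(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "modified": sorted(n for n in shared if expected[n] != actual[n]),
    }
```

A deployment would be considered verified only when all three lists come back empty. A sketch like this is only as trustworthy as the manifest itself, so in practice the manifest would need to be stored and protected (for example, signed) outside the environment being audited.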
Past outages at major stock exchanges and trading firms have shown that many complex financial systems obviously do not have an established trusted computing base, and that has directly resulted in very steep losses for some firms and impacted thousands of people. The good news is that we actually do know how to build, package and deploy software reliably. We also know how to verify that the right code was deployed and that there are no unauthorized changes. These are precisely the best practices of application build, package and deployment, including DevOps, although many firms struggle to implement them successfully. The key to success is to start from the beginning.
In my consulting work, I often find that companies actually do know what has to be done to reliably build, package and deploy software successfully. The problem is that they often begin doing the right thing much too late in the application lifecycle. Deming teaches us that quality must be built in from the beginning. The same is especially true when considering software safety.
Successful build and release engineers understand that smoke testing after a deployment is essential for a successful build and release process. When the software matters, you need to verify and validate the code from the very beginning to the end of the lifecycle. This means that your build stream should include unit testing, functional testing, non-functional testing (e.g. performance testing) and of course comprehensive regression testing. Good configuration management practices allow you to build a version of the code that can be instrumented for comprehensive code analysis and exhaustive automated testing. The truth is that these best practices are most successful when they are supported from the very beginning of the lifecycle and are a fundamental part of the culture of the organization. Don’t forget that the build and deploy pipeline must also be verifiable and trusted.
When I create an automated build and deployment system, I start from the ground up, verifying the operating system itself and all of the system dependencies. I only trust the trusted base if I am able to verify it on a continuous basis, and this becomes for me part of environment management (and monitoring). For example, the Center for Internet Security (CIS) provides an excellent consensus standard that explains in great detail exactly how to create a secure Linux operating system. You will find that the consensus standard also provides example code for verifying that the security baseline is configured as it should be. Successful security engineering involves both configuring the operating system correctly and verifying on an ongoing basis that it stays configured in a secure way. This is fundamentally a core aspect of environment monitoring and is essential for ensuring the trusted base.
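To make the idea of ongoing baseline verification concrete, here is a minimal sketch in Python of a configuration drift check. It is not the example code from the CIS standard. The function names, the simple "keyword value" file format, the rule that the first occurrence of a keyword wins, and the two baseline settings shown are all assumptions chosen for illustration; a real baseline would come from the standard your organization has adopted.

```python
# Hypothetical baseline: illustrative values only, not quoted from any benchmark.
BASELINE = {
    "PermitRootLogin": "no",
    "MaxAuthTries": "4",
}


def parse_settings(text: str) -> dict:
    """Parse 'keyword value' lines, ignoring blank lines and # comments.

    Keywords are lower-cased for comparison, and the first occurrence of
    a keyword wins (an assumption of this sketch).
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        settings.setdefault(parts[0].lower(), value)
    return settings


def find_drift(baseline: dict, config_text: str) -> list:
    """Return (keyword, expected, found) for every setting that has drifted.

    'found' is None when the setting is absent from the configuration.
    """
    actual = parse_settings(config_text)
    drift = []
    for keyword, expected in baseline.items():
        found = actual.get(keyword.lower())
        if found != expected:
            drift.append((keyword, expected, found))
    return drift


if __name__ == "__main__":
    sample = "PermitRootLogin yes\n# MaxAuthTries 4\n"
    for keyword, expected, found in find_drift(BASELINE, sample):
        print(f"DRIFT {keyword}: expected {expected!r}, found {found!r}")
```

Run on a schedule by the monitoring system, a check of this kind turns "verify it on a continuous basis" into an alert whenever the list of drifted settings is not empty.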
Software safety requires that systems be built and configured in a secure and reliable way. Changes need to be tracked and verified, which is essentially the purpose of the physical configuration audit. There’s more to software safety, and I hope that you will contact me to share your views on software safety best practices and get involved with the community-based efforts to update software safety standards!