Behaviorally Speaking – Building Reliable Systems
by Bob Aiello
Anyone who follows technology news is keenly aware that there have been a remarkable number of high-profile system glitches over the last several years, at times with catastrophic results. Major trading exchanges in both the US and Tokyo have suffered serious outages that call into question the reliability of the world financial system itself. Knight Capital Group has essentially ceased to exist as a corporate entity after what was reported to be a configuration management error that resulted in a one-day loss of 440 million dollars. These incidents highlight the importance of effective configuration management best practices and place a strong focus on the need for reliable systems. But what exactly makes a system reliable, and how do we implement reliable systems? This article describes some of the essential techniques necessary to ensure that systems can be upgraded and supported while enabling the business through frequent and continuous delivery of new system features.
Mission-critical and enterprise-wide computer systems today are often very complex, with many moving parts and even more interfaces between components, all of which present special challenges even for expert configuration management engineers. These systems are growing more complex as the demand for features and rapid time to market creates challenges that many technology professionals could not have envisioned even a few years ago. Computer systems do more today, and many seem to learn more about us every day, evolving into complex knowledge management systems that seem to anticipate our every need. High-frequency trading systems are just one example of complex computer systems that must be supported by industry best practices that ensure rapid and reliable system upgrades and implementation of market-driven new features. These same systems can cause severe consequences when glitches occur, especially as a result of a failed system upgrade.
FINRA is a highly respected regulatory authority that recently issued a targeted examination letter to ten firms that support high-frequency trading systems. The letter requests that the firms provide information about their “software development lifecycle for trading algorithms, as well as controls surrounding automated trading technology”. Some organizations may find it challenging to demonstrate adequate IT controls, although the real goal should be to implement effective IT controls that help ensure systems reliability. Many industries enjoy a very strong focus on quality and reliability.
A few years ago, I had the opportunity to teach configuration management best practices at an NITSL conference for nuclear power plant engineers and quality assurance professionals. Everyone in the room was committed to software safety, including reliable safety systems. In the IEEE, we have working groups that help update the related industry standards defining software reliability, measures of dependability and safety. Make sure that you contact me directly if you are interested in hearing more about participating in these worthwhile endeavors. Standards and frameworks are valuable, but it takes more than just guidelines to make reliable software. Most professionals focus on the importance of accurate requirements and well-written test scripts, which are essential but not sufficient to create reliable software. What really needs to happen is that we build in quality from the very beginning, an essential teaching that many of us learned from quality management guru W. Edwards Deming.
The key to success is to build the automated deployment pipeline from the very beginning of the application development lifecycle. We all know that software systems must be built with quality in mind from the beginning, and this includes the deployment framework itself. Using effective source code management practices along with automated application build, package and deployment is only the beginning. You also need to understand that building a deployment factory is a major systems development effort in itself. It has been my experience that many CM professionals forget to develop their automated build, package and deployment systems with the same rigor that they would apply to a trading system. As the old adage says, “the chain is only as strong as its weakest link”, and inadequate deployment automation is indeed a very weak link.
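One way to picture deployment automation written with the same rigor as application code is a small, testable configuration audit. The sketch below is only an illustration under stated assumptions, not a method described in this article: the function names `build_manifest` and `audit_deployment` are hypothetical, and the approach (recording SHA-256 digests at package time and comparing them after deployment) is one common technique among many.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir: Path) -> dict[str, str]:
    """Hypothetical package-time step: map each relative file path to its digest."""
    return {
        p.relative_to(package_dir).as_posix(): sha256_of(p)
        for p in sorted(package_dir.rglob("*"))
        if p.is_file()
    }


def audit_deployment(deployed_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Hypothetical post-deployment step: list every deviation from the manifest."""
    problems = []
    # Files the manifest expects: each must exist and match its recorded digest.
    for name, expected in manifest.items():
        target = deployed_dir / name
        if not target.is_file():
            problems.append(f"missing: {name}")
        elif sha256_of(target) != expected:
            problems.append(f"modified: {name}")
    # Files present on the target that the manifest does not know about.
    deployed = {
        p.relative_to(deployed_dir).as_posix()
        for p in deployed_dir.rglob("*")
        if p.is_file()
    }
    for name in sorted(deployed - set(manifest)):
        problems.append(f"unexpected: {name}")
    return problems
```

An empty list from `audit_deployment` means the deployed tree matches what was packaged. A pipeline could run this check against every target server and stop the release if any server reports a discrepancy. Because the audit is ordinary code, it can be version controlled and unit tested like the application itself.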
Successful organizations understand that quality has to be a cultural norm. This means that development teams must take seriously everything from requirements management to version control of test scripts and release notes. These organizations take the time to train and support developers in the use of robust version control solutions and automated application build tools such as Ant, Maven, Make and MSBuild. The tools and plumbing used to build, package and deploy the application must be a first-class citizen and a fundamental component of the application development effort.
Agile development and DevOps provide some key concepts and methodologies for achieving success, but the truth is that every organization has its own unique requirements, challenges and critical success factors. If you want to be successful, you need to approach this effort with the knowledge and perspective that critical systems are complex to develop and also complex to support. Building the automated deployment framework should not be an afterthought or an optional task started late in the process. Building quality into the development of complex computer systems requires what Deming described in the first of his 14 points as “create constancy of purpose for continual improvement of products and service to society”.
We all know that nuclear power plants, medical life support systems and missile defense systems must be reliable, and they obviously must be upgraded from time to time, often due to uncontrollable market demands. Efforts by responsible regulatory agencies such as FINRA are essential for helping financial services firms realize the importance of creating reliable systems. DevOps and configuration management best practices are fundamental to the successful creation of reliable software systems. You need to start this journey from the very beginning of the software and systems delivery effort. Make sure that you drop me a line and let me know what you are doing to develop reliable software systems!
W. Edwards Deming, Out of the Crisis, MIT Press, 1986
Bob Aiello and Leslie Sachs, Configuration Management Best Practices: Practical Methods that Work, Addison-Wesley Professional, 2011
Bob Aiello and Leslie Sachs, Agile Application Lifecycle Management – Using DevOps to Drive Process Improvement, Addison-Wesley Professional, 2016