Monitoring your runtime environment is an essential function that helps you identify potential issues before they escalate into incidents and outages. But environment monitoring can be challenging to do well. Unfortunately, it is often overlooked and, even when addressed, usually handled only in the simplest way. Keeping an eye on your environment is actually one of the most important functions in IT operations. If you spend the time to understand what really needs to be monitored and establish effective ways of communicating events, your systems will be much more reliable, and you will likely get a lot more sleep without so many of those painful calls in the middle of the night. Here’s how to get started with environment management.
The ITIL v3 framework provides good guidance on how to implement an effective environment management function. The first step is to identify which events should be monitored and establish an automated framework for communicating the information to the stakeholders who are responsible for addressing problems when they occur. The most obvious environment dependencies are basic resources such as available memory, disk space, and processor capacity. If you are running low on memory, disk space, or any other physical resource, your IT services may be adversely impacted. Most organizations also understand that they need to monitor key processes and to identify and respond to abnormal process termination. Nagios is one popular tool for monitoring processes and communicating events, such as a process being terminated unexpectedly.
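As a rough illustration of a basic resource check, here is a minimal Python sketch that classifies free disk space against two thresholds. The function names and the 20% and 10% thresholds are my own assumptions for the example, not values recommended by ITIL or used by Nagios; a real monitoring tool would also handle scheduling and notification.

```python
import shutil


def classify_free_space(free_bytes, total_bytes,
                        warn_fraction=0.20, crit_fraction=0.10):
    """Map a free/total pair to a status string.

    The thresholds are illustrative assumptions; tune them per system.
    """
    if total_bytes <= 0:
        return "UNKNOWN"
    free_fraction = free_bytes / total_bytes
    if free_fraction < crit_fraction:
        return "CRITICAL"
    if free_fraction < warn_fraction:
        return "WARNING"
    return "OK"


def check_disk(path="."):
    """Check the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return classify_free_space(usage.free, usage.total)


if __name__ == "__main__":
    print(check_disk("."))
```

Keeping the threshold logic separate from the measurement makes it easy to reuse the same classification for memory or other resources.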
There are many other environment dependencies, such as ports that must remain open, that also need to be monitored continuously. I have seen production outages caused by a security group closing a port because there was no record that the port was needed for a particular application. These are fairly obvious dependencies, and most IT shops are well aware of these requirements. But what about the more subtle environment dependencies that need to be addressed?
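A port dependency like the one described above can be checked with a simple TCP connection attempt. The sketch below is only one way to do it, using Python's standard socket module; the function names and the list of required ports are hypothetical, and a successful connection shows only that something is listening, not that the application behind it is healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def closed_ports(required, timeout=2.0):
    """Return the (host, port) pairs from `required` that cannot be reached."""
    return [(host, port) for host, port in required
            if not port_is_open(host, port, timeout)]


if __name__ == "__main__":
    # Hypothetical dependency list; replace with the ports your application needs.
    required = [("127.0.0.1", 5432), ("127.0.0.1", 8080)]
    for host, port in closed_ports(required):
        print(f"ALERT: {host}:{port} is not reachable")
```

Keeping the list of required ports next to the check also gives you the written record that was missing in the outage I described.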
I have seen situations where databases stopped working because the user account used by the application to access the database was locked. Upon investigation, we found that the UAT user account was the same account used in production. In most ways, you want UAT and production to match, but in this case locking the user account in UAT took down production. You certainly don’t want to use the same account for both UAT and production, and it may be a good idea to set up a job that regularly verifies that the database account can still log in.
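Here is a minimal sketch of what such a job might look like. It assumes a hypothetical `connect` callable that wraps whatever database driver you use and returns a DB-API style connection; the sqlite3 in-memory database is only a stand-in so the example runs, and the account names are invented. The second helper flags any account name shared across environments, which is the condition that caused the outage above.

```python
import sqlite3  # stand-in driver for the demo only


def check_db_login(connect, probe_query="SELECT 1"):
    """Try to log in and run a trivial query; return (ok, detail)."""
    try:
        conn = connect()
    except Exception as exc:  # locked account, bad password, network failure
        return False, f"login failed: {exc}"
    try:
        cursor = conn.cursor()
        cursor.execute(probe_query)
        cursor.fetchone()
        return True, "ok"
    except Exception as exc:
        return False, f"probe query failed: {exc}"
    finally:
        conn.close()


def shared_accounts(accounts_by_env):
    """Return accounts used by more than one environment."""
    environments_by_user = {}
    for env, user in accounts_by_env.items():
        environments_by_user.setdefault(user, []).append(env)
    return {user: envs for user, envs in environments_by_user.items()
            if len(envs) > 1}


if __name__ == "__main__":
    print(check_db_login(lambda: sqlite3.connect(":memory:")))
    # Hypothetical account names for illustration.
    print(shared_accounts({"uat": "app_svc", "prod": "app_svc"}))
```

One caution: if the check itself uses a wrong password, repeated runs could lock the account, so the job should use known-good credentials and run at a modest interval.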
Market data feeds are another example of an environment dependency that may impact your system. This one can be tricky because you may not have control over a third-party vendor who supplies you with data. This is all the more reason to monitor your data feeds and notify the appropriate support people if there is a problem. Cloud-based services may also pose some challenges because you may not always be in control of the environment and might have to rely on a third party for support. Establishing a service-level agreement (SLA) is fundamental when you are dependent on another organization for services. You may also find yourself trying to figure out how your cloud-based resources actually work and what you need to do when your service provider makes changes that may be unexpected and not completely understood. I had this experience myself when trying to puzzle my way through all of the options for Amazon Cloud. In fact, it took me a few tries to figure out how to turn off all of the billable options such as storage and fixed IPs when the project was over. I am not intending to criticize Amazon per se, but even their own help desk had trouble locating what I needed to remove so that I would stop getting charged for resources that I wasn’t using.
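Going back to the data feed example, one simple way to monitor a feed you do not control is to track when you last received an update and raise an alert when it goes quiet. The sketch below assumes you can obtain a timestamp for the last message received; how you get it depends entirely on your vendor, and the function name and five-minute limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def feed_status(last_update, max_age, now=None):
    """Return ("OK" or "STALE", age) for a feed's last update time.

    `last_update` and `now` should both be timezone-aware datetimes.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - last_update
    status = "STALE" if age > max_age else "OK"
    return status, age


if __name__ == "__main__":
    # Hypothetical values: the last tick arrived 10 minutes ago and the limit is 5.
    last_tick = datetime.now(timezone.utc) - timedelta(minutes=10)
    status, age = feed_status(last_tick, max_age=timedelta(minutes=5))
    print(f"feed is {status}, last update {age} ago")
```

A sensible limit depends on the feed; a quiet period that is normal outside market hours would be alarming during trading, so the threshold may need to vary by time of day.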
To be successful with environment management, you need to establish a knowledge base that captures the essential technical information that may be understood by only a few people on the team. Documenting and communicating this information is an important task and often requires effective collaboration among your development, data security, and operations teams.
Many organizations, including financial services firms, are working to establish a configuration management database (CMDB) to facilitate environment management. The ITIL framework provides a considerable amount of guidance on how to establish a CMDB and the supporting configuration management system (CMS), which helps to provide some structure for the information in the CMDB. The CMDB and the CMS must be supported by tools that continuously monitor the environment and report on the status of key dependencies. These capabilities are essential for ensuring that your critical infrastructure is safe and secure.
Many organizations monitor port-level scans and attacks. Network intrusion detection tools such as Snort can help to monitor and identify port-level activity that may indicate that an attempt to compromise your system is underway. Ensuring that your runtime environment is secure is essential for maintaining a trusted computing environment. There have been many high-profile incidents that resulted in serious system outages related to port-level system attacks. Monitoring and recognizing this activity is a first step in addressing these concerns.
In complex technology environments, you may find it difficult to really understand all of the environment requirements. This is where tying together your support and application lifecycle is essential. When bad things happen, your help desk will receive the calls. Reviewing and understanding incidents can help the entire team identify and address environment-related issues. Make sure that you never have the same problem twice by having every reported incident fully investigated, with any newly identified environment dependencies monitored on an ongoing basis.
Conclusion
Environment management is a key capability that can help your entire team be more effective. You need to provide a structure to identify environment dependencies and then work with your technical resources to implement tools to monitor them. If you get this right, then your organization will benefit from reliable systems and your development and operations