The importance of monitoring
“Monitoring” was traditionally the preserve of operations engineers. The term often evokes unpleasant memories for those who have been doing it long enough to remember when Nagios was state of the art.
Infrastructure monitoring has moved on from the tracking of temperatures, power consumption, neat cabling and racks, but capacity, computing resources, and security are still our responsibility.
While it’s true that a decade ago up/down checks might have been all a “monitoring” tool was capable of, in recent years these tools have evolved greatly, to the point where many people no longer think of monitoring as just external pings. They might still call it “monitoring”, but the methods and tools they use are more powerful and streamlined.
Your monitoring system should address two questions: what’s broken, and why? The “what’s broken” indicates the symptom; the “why” indicates a (possibly intermediate) cause.
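As a rough illustration of the symptom/cause split, here is a minimal Python sketch. The metric names (http_5xx_ratio, db_connections_refused, disk_free_ratio) and the thresholds are hypothetical, chosen only for the example; a real system would take them from its own telemetry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    symptom: str          # what's broken
    cause: Optional[str]  # why (possibly intermediate); None if unknown


def diagnose(metrics: dict) -> Optional[Finding]:
    """Return a Finding if the (hypothetical) error-rate symptom is present."""
    # Symptom: more than 5% of responses are HTTP 5xx (assumed threshold).
    if metrics.get("http_5xx_ratio", 0.0) <= 0.05:
        return None
    symptom = "serving HTTP 5xx errors"

    # Candidate causes, checked in a fixed order (assumed signals).
    if metrics.get("db_connections_refused", 0) > 0:
        return Finding(symptom, "database is refusing connections")
    if metrics.get("disk_free_ratio", 1.0) < 0.05:
        return Finding(symptom, "disk is nearly full")

    # The symptom is real even when no cause has been identified yet.
    return Finding(symptom, None)


finding = diagnose({"http_5xx_ratio": 0.2, "db_connections_refused": 14})
print(finding)
```

The point of the design is that the symptom is reported even when no cause is found, so the “what” never depends on already knowing the “why”.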
From a security perspective, it’s never been more important to have the tools and practices in place to know your environments. Data and security breaches consistently make the headlines, with brute-force attacks among the techniques attackers use. The effects on reputation and corporate wallets can be immense.
Good monitoring practice also extends to efficiency and optimization: it helps you find out how well your stacks are running. Monitoring allows you to use resources to their fullest while keeping costs low, and to ensure you have enough capacity should incidents happen.
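The capacity side of this can be reduced to simple arithmetic. The sketch below assumes identical nodes and a single measured peak load; the function names and figures are hypothetical and serve only to show the kind of check monitoring data makes possible.

```python
def headroom_ratio(peak_load: float, node_capacity: float, nodes: int) -> float:
    """Fraction of total capacity left unused at peak load."""
    total = node_capacity * nodes
    return (total - peak_load) / total


def survives_node_loss(peak_load: float, node_capacity: float,
                       nodes: int, failures: int = 1) -> bool:
    """True if the remaining nodes can still carry the peak load."""
    remaining = nodes - failures
    if remaining <= 0:
        return False
    return peak_load <= remaining * node_capacity


# Hypothetical figures: 4 nodes at 300 requests/s each, peak of 900 requests/s.
print(headroom_ratio(900, 300, 4))      # 0.25
print(survives_node_loss(900, 300, 4))  # True
```

A low headroom ratio keeps costs down, but the second check shows the trade-off: in this example the cluster tolerates one node failure at peak with nothing to spare.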
Building “monitorable” systems requires proactively understanding the failure domains of the system’s critical components. That’s a tall order, especially for complex systems.