End-to-End Application Performance Management
by Jacob Ukelson
IT service levels are the critical component in how the business and its customers perceive the performance of IT. There are many service levels that need to be monitored (application development time, time to add new features, time to roll out an application), but none is more critical in today’s self-service, web-enabled business world than end-user satisfaction.
Of course, end-user satisfaction is a complex thing to measure, because it can be affected by application ease of use, the capabilities of the end user’s machine, network problems and load, as well as datacenter performance. The biggest component in end-user satisfaction is response time (anyone interested in user experience issues should take a look at Jakob Nielsen’s useit.com site). Response time means performance, so the most important service level for managing end-user satisfaction is application performance.
Looking at it from 100,000 feet, the main parts of application performance management (APM) are managing network components, managing system components and managing application components. The job of operations (or technical services) is to continually monitor the state of these different components, notice any problems and then figure out who needs to be involved to fix them.
The idea behind end-to-end performance management is to provide tools that enable operations to look at things from an application perspective. That makes sense with respect to end-user satisfaction, since the only thing users care about is the application; they don’t care about application servers, load balancers or databases. So the idea behind APM is that if operations could look at the world from an application perspective, and then drill down from an application to the various components that comprise it when a problem occurs, they could quickly find the problematic components and hand them off to the appropriate personnel for mending.
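To make the drill-down idea concrete, here is a minimal sketch in Python. It is only an illustration: the `Component` and `Application` classes, the `budget_ms` field, and the team names are all assumptions made for this example and do not describe any particular APM product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    name: str
    kind: str           # e.g. "load balancer", "application server", "database"
    owner: str          # hypothetical team that would receive the hand-off
    response_ms: float  # latest time spent in this component
    budget_ms: float    # assumed time this component should stay under


@dataclass
class Application:
    name: str
    components: List[Component] = field(default_factory=list)

    def response_ms(self) -> float:
        # The application-level view: total time across all components.
        return sum(c.response_ms for c in self.components)

    def drill_down(self) -> List[Component]:
        # Components over budget, worst offender first.
        over = [c for c in self.components if c.response_ms > c.budget_ms]
        return sorted(over, key=lambda c: c.response_ms - c.budget_ms, reverse=True)


shop = Application("web shop", [
    Component("lb-1", "load balancer", "network team", 5, 20),
    Component("app-1", "application server", "middleware team", 120, 150),
    Component("db-1", "database", "DBA team", 900, 200),
])

for c in shop.drill_down():
    print(f"{shop.name}: {c.name} ({c.kind}) is over budget, hand off to {c.owner}")
```

The operator starts from the application's total response time and only then looks at individual components, which is the reverse of watching every server and database separately.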
APM is a good idea, but the problem with today’s APM tools is that they are all arms and legs with no brain. What I mean by that is that they collect a vast amount of information from a wide variety of systems and display that flood of information to the operators. They try to add some smarts (or a brain) in the guise of alerting, thresholding (and sometimes correlation) capabilities, raising alerts when thresholds are crossed. The problem is that thresholds are notoriously hard to set (set one too high and you miss critical events, too low and you are flooded with alerts). Also, if you look more closely at the problem, it is clear that for thresholds to actually work they have to be dynamic, changing and adapting to a huge number of ever-changing real-world conditions. Even if an APM tool supports dynamic thresholding, staff would have to spend all their time monitoring and modifying thresholds. The result is that your APM is a brainless zombie, leaving everything up to the operators while swamping them with information. As systems become more complex (virtualization and cloud computing will only increase the complexity), operations needs alerts that can be trusted and root cause analysis that can be relied upon.
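The threshold dilemma is easy to reproduce with a few lines of Python. The response times below are invented for illustration, and `static_alerts` is a hypothetical helper, not part of any APM tool.

```python
def static_alerts(samples_ms, limit_ms):
    """Return the indices of samples that cross a fixed threshold."""
    return [i for i, value in enumerate(samples_ms) if value > limit_ms]


# Hypothetical response times in milliseconds.
quiet_night = [100, 102, 98, 300, 101, 99]    # 300 ms is abnormal for the night
busy_day = [420, 450, 430, 445, 440, 435]     # normal under peak load
samples = quiet_night + busy_day

too_high = static_alerts(samples, limit_ms=500)  # no alerts: the night spike is missed
too_low = static_alerts(samples, limit_ms=250)   # the spike is caught, but so is every daytime sample

print("limit 500 ms:", too_high)
print("limit 250 ms:", too_low)
```

No single fixed value works for both periods, which is why the threshold has to follow the conditions it is measuring.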
Given the state of APM today, there are lots of different vendors that can provide the arms and legs, but they leave the brains out. APM systems need brains to do a first-level filtering of data, turning it into trusted alerts and information that operators can process. There is only one way to build the brains for APM: behavioral performance analytics that can create an effectively unlimited number of thresholds, which learn and adapt to an application’s ever-changing environment.
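As a rough illustration of what a learned threshold could look like, the sketch below keeps one rolling baseline per context and flags values more than `k` standard deviations above the recent mean. This is a deliberately simple stand-in, not the analytics of any real product: the class names, the `(application, period)` keys, and the defaults for `window`, `k` and `min_samples` are all assumptions.

```python
from collections import defaultdict, deque
from statistics import mean, stdev


class LearnedBaseline:
    """Rolling baseline for one metric in one context."""

    def __init__(self, window=50, k=3.0, min_samples=5):
        self.history = deque(maxlen=window)  # only recent samples are kept
        self.k = k
        self.min_samples = min_samples       # must be at least 2 for stdev()

    def observe(self, value):
        """Return True if value is abnormal for this baseline, then learn from it."""
        alert = False
        if len(self.history) >= self.min_samples:
            limit = mean(self.history) + self.k * stdev(self.history)
            alert = value > limit
        # Simplification: every sample is learned, including anomalous ones.
        self.history.append(value)
        return alert


class BaselineBank:
    """Creates a separate baseline on demand for every key it sees."""

    def __init__(self, **settings):
        self._baselines = defaultdict(lambda: LearnedBaseline(**settings))

    def observe(self, key, value):
        return self._baselines[key].observe(value)


bank = BaselineBank()
for ms in [100, 102, 98, 101, 99, 100]:
    bank.observe(("checkout", "night"), ms)   # still learning, no alerts
for ms in [420, 450, 430, 445, 440, 435]:
    bank.observe(("checkout", "day"), ms)

print(bank.observe(("checkout", "night"), 300))  # abnormal for the night baseline
print(bank.observe(("checkout", "day"), 442))    # ordinary for the day baseline
```

Nobody sets a number by hand here: each key gets its own limit, and that limit moves as new samples arrive. A production system would need far more than this (seasonality, trend, and correlation across metrics, for example), but the principle of thresholds that maintain themselves is the same.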
