IT Service Metrics 101
What to Measure, and How, When, and Why to Measure It
Written by Mark Katsouros.
November 13, 2012
As IT service providers, we all know how important it is to develop good metrics. Without meaningful measurements, how are we to gauge our level of performance, and thus how we improve, at delivering the services most needed by our customers? And, yet, it seems that so few of us are doing this correctly, if at all. In order to deliver the most value to our institution, we must know which services will do that in the first place (something that is always shifting), how efficiently and effectively we are delivering those services in order to be competitive with alternatives, and how adequately we are managing/monitoring services in order to maximize our provisioning and support capacity, and respond to growth in demand.
Know the Vernacular: Metrics vs. Measurements
The Four Critical Categories of Metrics
While some might argue that there are more, I have found that most every meaningful metric falls into one of four high-level categories: (1) Capacity, (2) Performance, (3) Relevancy, and (4) Satisfaction. The first two types are typically gleaned via operational measurements—instrumentation, monitoring, and logging. The latter two types are best gleaned via customer feedback—subscription rates and satisfaction surveys. They are all truly critical, and are somewhat interwoven, but let us look at each one independently:
- Capacity generally refers to quantity: how much of a finite resource is being consumed. In the case of IT service metrics, this resource is a service (or an identifiable, finite piece of a service, such as infrastructure). How much bandwidth is being consumed? What is our CPU utilization? How much disk space or memory is being utilized? (And, of course, how fast are these numbers growing? Trends are discussed further below.)
- Performance, in the context of IT services, is mostly about uptime and reliability, but it is also often used to measure how “healthy” (fast, error-free, etc.) a service is. How many nines of uptime have we achieved? How quickly did the data arrive? How many packets were lost? What is the quality of the service? Are we meeting the performance expectations identified in our service level agreements?
- Relevancy gauges how dependent your customers are (or their work is) on a particular service. This metric is most critical in terms of distinguishing levels of importance, and thus prioritizing services—which ones you should continue to provide and, perhaps more importantly, which ones you should not (or which ones you should change to better meet your customers’ needs).
- Satisfaction should be pretty self-explanatory. This is how satisfied your customers are with a particular service: how they perceive your delivery of it, or their level of gratification. Without this metric, you are only guessing at how well your service delivery is received by customers. Of course, this metric is only applicable to the services that customers deem most relevant to them. (No need to ask, obviously, how satisfied someone is with a service having little importance to his/her job.)
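To make the "how many nines" question from the performance category concrete, here is a minimal sketch that turns recorded downtime into an availability figure and a count of nines. The function names, the 365-day year, and the 99.9% target are illustrative assumptions, not values taken from any particular agreement.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year; ignores leap years


def availability(downtime_minutes: float, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Fraction of the measurement period during which the service was up."""
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must be between 0 and the period length")
    return 1 - downtime_minutes / period_minutes


def nines(avail: float) -> float:
    """Number of 'nines' of uptime, e.g. 0.999 gives roughly 3."""
    if avail >= 1:
        return math.inf  # no downtime recorded in the period
    return -math.log10(1 - avail)


def meets_target(avail: float, target: float) -> bool:
    """True if measured availability meets a (hypothetical) SLA target."""
    return avail >= target


# Example: five hours of total downtime over one year.
a = availability(downtime_minutes=300)
print(f"availability={a:.4%}  nines={nines(a):.2f}  meets 99.9%: {meets_target(a, 0.999)}")
```

The same pattern applies to any period (a month, a quarter) by passing a different `period_minutes`.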
The Importance of Continuous Measurements towards Trending
Metrics are the means to an end, or, rather, the means to a trend. While each of these categories is critical, it is the slope of the lines plotted across time from which you can learn the most about your services. Is the service moving towards more importance or less? (Is it time to dedicate more resources to the service, or plan its end?) At what rate are the service’s resources being consumed? Are we getting better or worse at delivering it?
Taking a single point-in-time measurement of a satisfaction score, for example, might yield an average of seven on a scale of one to 10 (with one being poor and 10 being excellent). While seven might seem like a decent average on its own, it could be a sign of serious trouble if the past several averages have been nines and 10s, or a welcome sign of great improvement if the past several averages have been extremely low. Which kind of attention (or celebration) to give that average of seven depends entirely on the context supplied by the trend.
Knowing when to end services (one of the most significant challenges in IT organizations), how to manage just-in-time (optimal) resource growth, and how much to focus (resources) on improving service provisioning and/or customer service skills in general, as well as understanding the overall fitness of your service management (and service portfolio management) processes, all come down to metric-based trends.
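The point about the same score of seven meaning opposite things can be sketched with a simple least-squares slope over successive survey averages. This is only one way to quantify a trend, and the two score series below are made-up illustrations.

```python
def trend_slope(scores: list[float]) -> float:
    """Least-squares slope of scores plotted against their position (0, 1, 2, ...)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two measurements to see a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Both series end on an average score of 7.0 (hypothetical data).
declining = [9.5, 9.8, 9.2, 7.0]
improving = [2.5, 3.0, 4.5, 7.0]

for label, series in (("declining", declining), ("improving", improving)):
    slope = trend_slope(series)
    direction = "getting worse" if slope < 0 else "getting better"
    print(f"{label}: latest={series[-1]}, slope={slope:+.2f} per survey ({direction})")
```

A negative slope flags the seven as a warning; a positive slope flags it as progress.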
A Valuable Tool: The Service Survey
As mentioned previously, service subscription rates and customer surveys provide the best metrics for identifying trends to gauge service relevancy. And, essentially, the only surefire way to know what customers think about your organization’s ability to deliver services is to ask them. A simple, consistent, regularly scheduled survey that asks two critical questions for each service is all you need. Those two questions are:
Ask these two questions for each service that can be ordered from your service catalog. Additionally, depending on how centralized your IT organization is, you may want to ask some questions to segregate the results into appropriate demographics. For instance, if you are the central IT service provider for your institution, you will likely want to specifically differentiate between the feedback of your institution’s other IT service providers (who might serve as liaisons between you and their department’s end-users) and that of true end-users. You might have totally different staff serving these two very different constituencies, or a very different provisioning process for each of them, so looking at this data separately likely makes a lot of sense.
“What’s the Frequency, Kenneth?”
Dan Rather references aside, the frequency with which one takes measurements is critical to maximizing the meaning of the trends produced. If measurements fluctuate dramatically (i.e., have a high standard deviation), they need to be recorded more often to minimize the likelihood of drawing the wrong trending conclusions. More importantly, you want to ensure that you can react appropriately to trend changes in a timely manner. So, the frequency of your measurements needs to account for the time you might require to take sufficient action (e.g., increase capacity, add redundancy to deal with an availability issue, improve customer service training, or even kill a service).
For the relevancy and satisfaction survey described above, one should balance the need for frequent measures against the risk of asking too much of one’s customers (by over-surveying them). One solution is to divide your customer base into 12 groups and survey one group per month, thereby surveying every individual only once per year, but still gathering monthly points on the graph towards more rapid trending. Of course, this assumes that your customer base is sufficiently large: large enough that one-twelfth of it would still provide reasonably large sample sizes, and thus un-skewed results.
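As a rough illustration of tying measurement frequency to reaction time, the sketch below estimates how long a resource will last at its current growth rate, then checks whether a given measurement interval still leaves time to act. The linear-growth assumption, the function names, and the sample numbers are all hypothetical.

```python
import math


def growth_per_day(samples: list[tuple[float, float]]) -> float:
    """Least-squares growth rate from (day, amount_used) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    numerator = sum((day - mean_day) * (used - mean_used) for day, used in samples)
    denominator = sum((day - mean_day) ** 2 for day, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one day")
    return numerator / denominator


def days_until_full(samples: list[tuple[float, float]], capacity: float) -> float:
    """Days of headroom left after the latest sample, assuming linear growth."""
    rate = growth_per_day(samples)
    if rate <= 0:
        return math.inf  # usage is flat or shrinking
    latest_used = max(samples, key=lambda s: s[0])[1]
    return (capacity - latest_used) / rate


def interval_leaves_time_to_act(runway_days: float, interval_days: float,
                                lead_time_days: float) -> bool:
    """Worst case: a problem is noticed one full interval late, then takes
    lead_time_days to fix. Both must fit inside the remaining runway."""
    return interval_days + lead_time_days <= runway_days


# Hypothetical storage usage in TB, sampled every 30 days, on a 1000 TB system.
usage = [(0, 600), (30, 650), (60, 700), (90, 750)]
runway = days_until_full(usage, capacity=1000)
print(f"runway: {runway:.0f} days")
print("monthly sampling ok:", interval_leaves_time_to_act(runway, 30, lead_time_days=60))
print("sampling every 120 days ok:", interval_leaves_time_to_act(runway, 120, lead_time_days=60))
```

Noisier measurements would call for a shorter interval than this simple check suggests, since a single sample is then less trustworthy.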
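The twelve-group rotation can be sketched as follows. The shuffle seed, the function name, and the customer identifiers are illustrative assumptions; any reproducible random assignment would serve the same purpose.

```python
import random


def assign_survey_months(customer_ids: list[str], groups: int = 12,
                         seed: int = 2012) -> dict[int, list[str]]:
    """Randomly split customers into `groups` cohorts of near-equal size.

    Returns a mapping of month number (1..groups) to the customers surveyed
    that month, so each customer is surveyed exactly once per cycle.
    """
    shuffled = sorted(customer_ids)           # fixed starting order
    random.Random(seed).shuffle(shuffled)     # reproducible random order
    cohorts: dict[int, list[str]] = {month: [] for month in range(1, groups + 1)}
    for index, customer in enumerate(shuffled):
        cohorts[index % groups + 1].append(customer)
    return cohorts


# Hypothetical customer base of 100 people.
customers = [f"user{i:03d}" for i in range(100)]
cohorts = assign_survey_months(customers)
for month, members in cohorts.items():
    print(f"month {month:2d}: {len(members)} customers")
```

Random assignment matters here: grouping alphabetically or by department could skew an individual month's results toward one constituency.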
A Fifth Metric Category
Storing Measurements and the Grand Scheme of Things
Besides collecting, and gleaning meaning out of, metrics, one of the challenges many IT organizations seem to face is being able to summon and share those metrics when needed/requested. This is a critical part of utilizing metrics successfully, as IT organizations are often large and complex, with completely different employees or groups supporting different services.
It is important to recognize the hierarchical relationship among metrics and two other fundamental service resource layers: (1) the service catalog/portfolio and (2) Service Level Agreements (SLAs), often framed as commitments, and Operational Level Agreements (OLAs). To be clear, the service catalog is a public-facing menu of orderable services, and the service portfolio is a more internal superset of the service catalog that contains services (internal, retired, etc.), service attributes, and detailed data not typically present in the catalog. SLAs are the performance and response commitments made to customers of the service, and OLAs are agreements between IT service providers that address roles, responsibilities, and response commitments. Metrics should be leveraged to support SLAs/OLAs, and, of course, also support other service measurements more directly. The aforementioned hierarchical relationship is a one-to-many-to-many structure:
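One plausible reading of that structure is that each service in the catalog/portfolio carries many SLAs/OLAs, and each agreement is supported by many metrics. The sketch below models that reading with plain data classes; the class names, fields, and the example service are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str
    category: str  # e.g. "capacity", "performance", "relevancy", "satisfaction"


@dataclass
class Agreement:
    kind: str        # "SLA" (customer-facing) or "OLA" (between IT providers)
    commitment: str
    metrics: list[Metric] = field(default_factory=list)


@dataclass
class Service:
    name: str
    in_catalog: bool  # False for portfolio-only entries (internal, retired, ...)
    agreements: list[Agreement] = field(default_factory=list)

    def all_metrics(self) -> list[Metric]:
        """Every metric supporting any of this service's agreements."""
        return [m for agreement in self.agreements for m in agreement.metrics]


# Hypothetical example: one service, two agreements, three metrics.
wifi = Service(
    name="Campus Wi-Fi",
    in_catalog=True,
    agreements=[
        Agreement("SLA", "99.9% availability", [
            Metric("uptime", "performance"),
            Metric("packet loss", "performance"),
        ]),
        Agreement("OLA", "Respond to escalations within 4 hours", [
            Metric("escalation response time", "performance"),
        ]),
    ],
)
print([m.name for m in wifi.all_metrics()])
```

Storing metrics under the agreement they support keeps each measurement traceable to the commitment that justifies collecting it.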
A Few Words of Caution
There are two risks, even dangers, of which you must be cognizant:
First, developing metrics, towards enabling better, data-driven decisions around a service, needs to be approached in a top-down fashion, with service level agreements/commitments (and perhaps other service goals) driving what you collect. One of the major rookie sins of defining and gathering metrics is to start with those that are simply easy to gather, in which case you can certainly end up with voluminous amounts of data, very little of which may actually be useful. Instead, start with your service goals (SLAs/OLAs, provisioning efficiency, customer satisfaction, and so on), and then identify the metric data that will help you gauge how well you are meeting those goals.
Second, developing the processes and structures for gathering, storing, analyzing, and sharing metric data requires a fairly monumental effort. The very last thing you want to do with respect to metrics is to not respect the metrics, i.e., go through all of this effort, only to end up not utilizing the results. More than just a waste of time, this will create a huge morale suck for those employees involved in the effort, as well as those service subscribers/customers who would likely benefit from it. Act on the metrics and metric trends you expose.
Limited resources demand that our actions have a purpose, and that our purpose results in action.
Identifying appropriate metrics, taking reasonably frequent measurements, and monitoring for trends (particularly changes in trends) is the only truly reliable way not only to improve services towards meeting customer demands and expectations (and to ensure that you are meeting service level agreements/commitments), but also to know which services deserve your focus and resources, and which ones deserve to be placed on the chopping block (to free up focus and resources for more important things).
Capacity and performance metrics are typically gleaned via operational measurements—instrumentation, monitoring, and logging. Relevancy and satisfaction metrics are best gleaned via customer feedback—subscription rates and satisfaction surveys. Cost metrics, both capital and operational expenses, are obviously best gleaned from financial systems—general ledger, payroll, order/trouble tracking, and service and project portfolios.
And do not gather a bunch of metric data just because you can. Gather it in support of service level agreements/commitments and other specific service goals. Then deliberately and visibly act on it! In other words, start with your high-level service goals and then identify the metrics that can help gauge how well those goals are being met. Again, action without purpose is as detrimental to good service management (and employee morale) as purpose without action.
Finally, remember that those trends and trend changes tell the real stories, are what help us adequately grow (or shrink) services, and ultimately maximize our effectiveness and efficiency in delivering the important services on which most every business process of the modern enterprise so depends.
Mark Katsouros is the Director of Network Planning & Integration at the Pennsylvania State University. The opinions expressed in this article are his and are not necessarily shared by the University, but they just might be, as PSU is a pretty awesome institution.