IT Service Metrics 101
What to Measure, and How, When, and Why to Measure It
Written by Mark Katsouros
November 13, 2012
As IT service providers, we all know how important it is to develop good metrics. Without meaningful measurements, how are we to gauge, and thus improve, our performance at delivering the services our customers most need? And yet it seems that so few of us are doing this correctly, if at all. In order to deliver the most value to our institution, we must know which services will do that in the first place (something that is always shifting), how efficiently and effectively we are delivering those services in order to be competitive with alternatives, and how adequately we are managing and monitoring services in order to maximize our provisioning and support capacity and respond to growth in demand.
Know the Vernacular: Metrics vs. Measurements
The two terms are related but not interchangeable: a measurement is a single data point captured at a point in time (a CPU reading, a survey response), while a metric is the defined, repeatable quantity derived from those measurements and tracked against a service goal. Measurements feed metrics, and metrics, in turn, feed the trends discussed below.
The Four Critical Categories of
Metrics
While some might argue that there are more, I have found
that most every meaningful metric falls into one of four high-level
categories: (1) Capacity, (2)
Performance, (3) Relevancy, and (4) Satisfaction. The first two types are typically gleaned via
operational measurements—instrumentation, monitoring, and logging. The latter two types are best gleaned via
customer feedback—subscription rates and satisfaction surveys. They are all truly critical, and are somewhat
interwoven, but let us look at each one independently:
- Capacity generally refers to quantity—how much of a finite resource is being consumed. In the case of IT service metrics, this resource is a service (or an identifiable, finite piece of a service, such as infrastructure). How much bandwidth is being consumed? What is our CPU utilization? How much disk space or memory is being utilized? (And, of course, how fast are these numbers growing? Again, trends are discussed further below, and a brief sketch of collecting such operational measurements appears after this list.)
- Performance, in the context of IT services, is mostly about uptime and reliability, but it is also often used to measure how “healthy” (fast, error-free, etc.) a service is. How many nines of uptime have we achieved? How quickly did the data arrive? How many packets were lost? What is the quality of the service? Are we meeting the performance expectations identified in our service level agreements?
- Relevancy gauges how dependent your customers are (or their work is) on a particular service. This metric is most critical in terms of distinguishing levels of importance, and thus prioritizing services—which ones you should continue to provide and, perhaps more importantly, which ones you should not (or which ones you should change to better meet your customers’ needs).
- Satisfaction should be pretty self-explanatory. This is how satisfied your customers are with a particular service—how they perceive your delivery of it, or their level of gratification. Without this metric, you are only guessing at how well your service delivery is received by customers. Of course, this metric is only applicable to the services that customers deem most relevant to them. (No need to ask, obviously, how satisfied someone is with a service having little importance to his/her job.)
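To make the first two categories concrete, here is a minimal sketch of sampling capacity-style measurements on a single host. It assumes the third-party psutil package is available; the sampling interval and output format are illustrative placeholders, not a prescription.

```python
# A minimal sketch of periodic capacity/performance sampling on one host.
# Assumes the third-party "psutil" package is installed (pip install psutil);
# the interval and output format are illustrative placeholders.
import time
import psutil

def sample_host_measurements():
    """Return one timestamped set of capacity measurements for this host."""
    return {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),       # CPU utilization
        "memory_percent": psutil.virtual_memory().percent,   # memory in use
        "disk_percent": psutil.disk_usage("/").percent,      # disk space in use
    }

if __name__ == "__main__":
    # In practice these samples would be logged continuously so that the
    # trends discussed below can be plotted over time.
    for _ in range(3):
        print(sample_host_measurements())
```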
The Importance of Continuous Measurements towards Trending
Metrics are the means to an end, or, rather, the means to a trend.
While each of these categories is critical, it is the slope of the lines
plotted across time from which you can learn the most about your services. Is the service moving towards more importance
or less? (Is it time to dedicate more
resources to the service, or plan its end?)
At what rate are the service’s resources being consumed? Are we getting better or worse at delivering
it?
Taking a measurement in time, for example, of a satisfaction
score might yield an average score of seven on a scale of one to 10 (with one
being poor and 10 being excellent).
While seven might seem like a decent average score on its own, it could
be a sign of serious trouble if the past several average scores have been nines
and 10s, or a welcome sign of great improvement if the past several average scores have been extremely low. Which kind of attention (or celebration) to give that average score of seven depends entirely on the context the trend provides.
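As one way to expose that context, the following minimal sketch fits a least-squares slope to a series of periodic satisfaction averages; the monthly scores shown are hypothetical examples.

```python
# A minimal sketch of trend detection over periodic measurements.
# The monthly satisfaction averages below are hypothetical examples.

def slope(values):
    """Least-squares slope of equally spaced measurements (change per period)."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

declining = [9.4, 9.1, 8.6, 8.0, 7.0]   # a 7 after a run of 9s: trouble
improving = [3.1, 4.2, 5.0, 6.1, 7.0]   # the same 7 after low scores: progress

print(round(slope(declining), 2))  # negative slope -> worsening trend
print(round(slope(improving), 2))  # positive slope -> improving trend
```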
Knowing when to end services
(one of the most significant challenges in IT organizations), how to manage
just-in-time (optimal) resource growth, how much to focus (resources) on improving
service provisioning and/or customer service skills in general, and
understanding the overall fitness of your service management (and service
portfolio management) processes all come down to metric-based trends.
A Valuable Tool: The Service Survey
As mentioned previously, service subscription rates and customer surveys provide the best measurements for identifying trends in service relevancy and satisfaction. And, essentially, the only
surefire way to know what customers think about your organization’s ability to
deliver services is to ask them. A simple, consistent, regularly scheduled survey that asks two critical questions for each service is all you need. Those two questions are:
- How important is this service to you and your work? (This gauges relevancy.)
- How satisfied are you with this service? (This gauges satisfaction.)
Ask these two questions for each service that can be ordered from your service catalog. Additionally, depending on how centralized your IT organization is, you may want to ask some questions to segregate the results into appropriate demographics. For instance, if you are the central IT service provider for your institution, you will likely want to specifically differentiate between the feedback of your institution’s other IT service providers (who might serve as liaisons between you and their department’s end-users) and that of true end-users. You might have totally different staff serving these two very different constituencies, or a very different provisioning process for each of them, so looking at this data separately likely makes a lot of sense.
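As a concrete illustration, here is a minimal sketch of tallying those two per-service scores separately for each demographic segment; the response records, segment labels, and service names are hypothetical examples.

```python
# A minimal sketch of summarizing survey results per service and per segment
# (e.g., departmental IT liaisons vs. true end-users). The response records,
# segment labels, and service names are hypothetical examples.
from collections import defaultdict
from statistics import mean

responses = [
    # (service, segment, relevancy 1-10, satisfaction 1-10)
    ("Email", "end-user", 9, 7),
    ("Email", "IT liaison", 10, 6),
    ("Email", "end-user", 8, 8),
    ("Wiki", "end-user", 3, 9),
]

scores = defaultdict(lambda: {"relevancy": [], "satisfaction": []})
for service, segment, relevancy, satisfaction in responses:
    scores[(service, segment)]["relevancy"].append(relevancy)
    scores[(service, segment)]["satisfaction"].append(satisfaction)

for (service, segment), vals in sorted(scores.items()):
    print(service, segment,
          "relevancy:", round(mean(vals["relevancy"]), 1),
          "satisfaction:", round(mean(vals["satisfaction"]), 1))
```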
“What’s the Frequency, Kenneth?”
Dan Rather references aside, the frequency with which one takes measurements is critical to maximizing the meaning of the trends produced. If measurements fluctuate dramatically (i.e.,
have a high standard deviation), they need to be recorded more often to
minimize the likelihood of drawing the wrong trending conclusions. More importantly, you want to ensure that you
can appropriately react to trend changes in a timely manner. So, the frequency of your measurements needs
to account for the time you might require to sufficiently take action (e.g.,
increase capacity, add redundancy to deal with an availability issue, improve
customer service training, or even kill a service).
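As an illustration of that "fluctuation" test, here is a minimal sketch that flags a measurement series whose standard deviation is large relative to its mean, suggesting more frequent sampling; the 0.25 threshold and the sample values are arbitrary placeholders, not recommendations.

```python
# A minimal sketch of checking whether a measurement series is "noisy"
# enough to warrant more frequent sampling. The 0.25 threshold and the
# sample latency/uptime values are illustrative placeholders only.
from statistics import mean, stdev

def needs_more_frequent_sampling(values, threshold=0.25):
    """Flag a series whose coefficient of variation (stdev / mean) is high."""
    if len(values) < 2 or mean(values) == 0:
        return False
    return stdev(values) / mean(values) > threshold

weekly_latency_ms = [42, 180, 38, 250, 45, 60]   # wildly fluctuating
weekly_uptime_pct = [99.9, 99.8, 99.9, 99.95]    # stable

print(needs_more_frequent_sampling(weekly_latency_ms))  # True: sample more often
print(needs_more_frequent_sampling(weekly_uptime_pct))  # False: cadence is fine
```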
In terms of the above relevancy
and satisfaction survey, one should balance the need for frequent measures with
that of not asking too much of one’s customers (by over-surveying them). One solution is to divide your customer base
into 12 groups and survey one group per month, thereby surveying every
individual only once per year, but still gathering monthly points on the graph
towards more rapid trending. Of course,
this assumes that your customer base is substantially large—large enough so
that one twelfth of it would still provide reasonably large sample sizes, and thus unskewed results.
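One way to implement that rotation is to assign each customer to one of twelve stable cohorts, for example by hashing a customer identifier; the customer IDs and the cohort-assignment scheme below are hypothetical.

```python
# A minimal sketch of rotating a customer base through twelve survey cohorts,
# one cohort per month, so each person is surveyed only once a year.
# Customer IDs and the cohort-assignment scheme are hypothetical examples.
import hashlib

def survey_cohort(customer_id: str) -> int:
    """Deterministically map a customer to one of 12 monthly cohorts (0-11)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 12

def due_this_month(customers, month: int):
    """Return the customers whose cohort matches the current month (1-12)."""
    return [c for c in customers if survey_cohort(c) == (month - 1)]

customers = ["u1001", "u1002", "u1003", "u1004"]
print(due_this_month(customers, month=3))  # cohort to survey in March
```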
A Fifth Metric Category
There is arguably a fifth category worth tracking: cost. Capital and operational expenses are best gleaned from financial systems such as the general ledger, payroll, order/trouble tracking, and service and project portfolios, and they provide essential context for weighing the other four categories.
Storing Measurements and the Grand Scheme of Things
Besides collecting, and gleaning meaning out of, metrics,
one of the challenges many IT organizations seem to face is being able to
summon and share those metrics when needed/requested. This is a critical part of utilizing metrics
successfully, as IT organizations are often large and complex, with completely
different employees or groups supporting different services.
It is important to recognize the hierarchical relationship
among metrics and two other fundamental service resource layers: (1) the service catalog/portfolio and (2) service level agreements (or, often, commitments) and operational level agreements (SLAs/OLAs). To be clear, the service catalog is a
public-facing menu of orderable
services and the service portfolio is a more-internal superset of the service
catalog that contains services (internal, retired, etc.), service attributes,
and detailed data, not typically present in the catalog. SLAs are the performance and response
commitments made to customers of the service, and OLAs are agreements between
IT service providers that address roles, responsibilities, and response
commitments. Metrics should be leveraged
to support SLAs/OLAs, and, of course, also support other service measurements
more directly. The aforementioned hierarchical relationship is a one-to-many-to-many structure: each service in the catalog/portfolio is supported by one or more SLAs/OLAs, and each SLA/OLA is, in turn, supported by one or more metrics.
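One minimal way to model that one-to-many-to-many structure, independent of any particular service management tool, is sketched below; the service, agreement, and metric names are hypothetical.

```python
# A minimal sketch of the one-to-many-to-many hierarchy:
# one service -> many SLAs/OLAs -> many metrics.
# All names below are hypothetical examples, not a real catalog.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Metric:
    name: str            # e.g., "monthly uptime percentage"
    target: str          # e.g., ">= 99.9%"

@dataclass
class Agreement:         # an SLA or OLA supporting a service
    name: str
    metrics: List[Metric] = field(default_factory=list)

@dataclass
class Service:           # an orderable entry in the catalog/portfolio
    name: str
    agreements: List[Agreement] = field(default_factory=list)

email = Service(
    name="Email",
    agreements=[
        Agreement(
            name="Email availability SLA",
            metrics=[
                Metric("monthly uptime percentage", ">= 99.9%"),
                Metric("mean message delivery time", "< 60 seconds"),
            ],
        )
    ],
)

# Walking the hierarchy makes it easy to summon and share metrics per service.
for agreement in email.agreements:
    for metric in agreement.metrics:
        print(email.name, "->", agreement.name, "->", metric.name, metric.target)
```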
A Few Words of Caution
There are two risks, even dangers, of which you must be
cognizant:
First, developing metrics, towards enabling better, data-driven
decisions around a service, needs to be approached in a top-down fashion—with
service level agreements/commitments (and perhaps other service goals) driving
what you collect. One of the major rookie
sins of defining and gathering metrics is to start with those that are simply
easy to gather, in which case you can certainly end up with voluminous amounts
of data, very little of which may actually be useful. Instead, start with your service
goals (SLAs/OLAs, provisioning efficiency, customer satisfaction, and so on) and
then identify the metric data that will help you gauge how well you are meeting
those goals.
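To illustrate that top-down approach, here is a minimal sketch in which each service goal names the metrics that will be collected to gauge it; every goal, target, and metric name below is a hypothetical example.

```python
# A minimal sketch of top-down metric selection: each service goal (e.g., an
# SLA commitment) names the metrics to collect. Every goal, target, and metric
# name below is a hypothetical example.
service_goals = {
    "Email availability SLA (99.9% monthly uptime)": [
        "monthly uptime percentage",
        "count of unplanned outages",
    ],
    "Help desk responsiveness OLA (first response within 4 business hours)": [
        "median time to first response",
        "percentage of tickets breaching the response target",
    ],
    "Customer satisfaction (average survey score of 8 or higher)": [
        "average satisfaction score per service, per survey cycle",
    ],
}

# Anything not listed under a goal is a candidate for *not* collecting,
# no matter how easy it is to gather.
for goal, metrics in service_goals.items():
    print(goal)
    for metric in metrics:
        print("  -", metric)
```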
Second, developing the processes and structures for
gathering, storing, analyzing, and sharing metric data requires a fairly
monumental effort. The very last thing
you want to do with respect to metrics is to not respect the metrics, i.e., go through all of this effort, only
to end up not utilizing the results.
More than just a waste of time, this will create a huge morale suck for
those employees involved in the effort, as well as those service subscribers/customers
who would likely benefit from it. Act on
the metrics and metric trends you expose.
Limited resources demand that
our actions have a purpose, and that our purpose results in action.
Conclusion
Identifying appropriate metrics, taking reasonably frequent measurements, and monitoring for trends, particularly changes in trends, is the only truly reliable way not only to improve services toward meeting customer demands and expectations (and ensuring that you are meeting service level agreements/commitments), but also to know which services deserve your focus and resources, and which ones deserve to be placed on the chopping block (to free up focus and resources for more important things).
Capacity and performance metrics are typically gleaned via
operational measurements—instrumentation, monitoring, and logging. Relevancy and satisfaction metrics are best
gleaned via customer feedback—subscription rates and satisfaction surveys. Cost metrics, both capital and operational
expenses, are obviously best gleaned from financial systems—general ledger,
payroll, order/trouble tracking, and service and project portfolios.
And do not gather a bunch of metric data just because you
can. Gather it in support of service
level agreements/commitments and other specific service goals. Then deliberately and visibly act on them! In other words, start with your high-level
service goals and then identify the metrics that can help gauge how well those
goals are being met. Again, action
without purpose is as detrimental to good service management (and employee
morale) as purpose without action.
Finally, remember that those trends and trend changes tell
the real stories, are what help us adequately grow (or shrink) services, and
ultimately maximize our effectiveness and efficiency in delivering the important
services on which most every business process of the modern enterprise so
depends.
Mark Katsouros is the Director of Network Planning &
Integration at the Pennsylvania State University. The opinions expressed
in this article are his and are not necessarily shared by the University, but
they just might be, as PSU is a pretty awesome institution.