Why Software Sucks, part 2: Instrumentation

Continuing my discussion of this Technometria podcast on IT Conversations, in which Phil Windley and others chat with David Platt, author of a new book entitled “Why Software Sucks… and What You Can Do About It,” there was a great item on the importance of instrumentation.

David gave the example of a web site for a library. He described how a user will read a review of a book online and immediately go to the library’s website to request it. The UI is hard to use because there are two use cases: one where the user is in the physical library and one where the user is accessing the site from home. A user accessing it from home really doesn’t care whether the book is currently in the library, whereas a user in the library certainly does. As a result, a design suited for use in the library doesn’t work well for people accessing it from home. David asked the team what percentage of requests for the site came from within the library versus from outside it. They didn’t know. Not knowing this makes it very difficult to design the site properly. David points out that you need to design for the masses and not the edge cases, although designing for the edge cases is so often what we do.

The importance of instrumentation has always been a soap box of mine. Back in May of 2006 there was a Business Week article that discussed the natural advantage of web-based companies like Amazon and Salesforce.com which all revolved around instrumentation (my comments here). I just recently had a posting on the power of the feedback loop. None of this can happen without instrumentation, and this doesn’t apply solely to user interfaces. Do you know which operations of your services receive the most use? Do you know when they receive the most use? Do you know where that usage is coming from? This type of information needs to be captured and fed back into the development process to create better services, and make better use of resources. If 99% of a service’s operations aren’t used, why were they built in the first place? Without instrumentation, how will you know this?
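To make those questions concrete, here is a minimal sketch of the kind of per-operation usage capture I have in mind. The class and operation names (`UsageInstrumentation`, `searchCatalog`, and so on) are hypothetical illustrations, not anything from the podcast or book:

```python
from collections import Counter
from datetime import datetime

class UsageInstrumentation:
    """Minimal in-memory usage capture for service operations.

    Records which operation was invoked, when, and by which consumer,
    so the data can be fed back into the development process.
    """
    def __init__(self):
        self.by_operation = Counter()
        self.by_consumer = Counter()
        self.events = []  # (timestamp, operation, consumer)

    def record(self, operation, consumer, when=None):
        when = when or datetime.utcnow()
        self.by_operation[operation] += 1
        self.by_consumer[consumer] += 1
        self.events.append((when, operation, consumer))

    def unused_operations(self, all_operations):
        """Operations that were built but have never been invoked."""
        return sorted(set(all_operations) - set(self.by_operation))

# Hypothetical usage, echoing the library example above:
usage = UsageInstrumentation()
usage.record("searchCatalog", "library-kiosk")
usage.record("searchCatalog", "home-web")
usage.record("requestHold", "home-web")
print(usage.unused_operations(["searchCatalog", "requestHold", "renewLoan"]))
# prints ['renewLoan']
```

In practice this would write to a log or monitoring system rather than in-memory counters, but even this much answers the question "which operations aren't used at all?"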

Here are two examples that I’ve seen first hand:

  • A service that normally received about 10,000 requests a day would, every now and then, balloon to 100,000 requests or more for two days at a time. A consuming application associated with end-of-quarter reporting was hammering the service. While no problems were experienced, this could have been a disaster. It directly led to work to capture accurate usage profiles of new consumers before they went into production.
  • Another service showed spikes of usage first thing in the morning, over lunch, and at the end of the day: these were the times when users could sit down with the application amid the other activities of their day. Adjustments had to be made to the infrastructure to support this spiked access pattern rather than a steady rate over the whole day.
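The second pattern is easy to surface once timestamps are being captured. This is a sketch under assumed data, not the analysis actually used in that situation; the function names and the spike threshold (three times the hourly average) are my own illustrations:

```python
from collections import Counter
from datetime import datetime

def hourly_profile(request_times):
    """Bucket request timestamps by hour of day, exposing spikes
    (morning, lunch, end of day) versus a steady rate."""
    counts = Counter(t.hour for t in request_times)
    return {hour: counts.get(hour, 0) for hour in range(24)}

def spike_hours(profile, factor=3.0):
    """Hours whose volume exceeds `factor` times the daily average."""
    average = sum(profile.values()) / 24
    return [h for h, n in sorted(profile.items()) if n > factor * average]

# Hypothetical request log clustered at 8am, noon, and 5pm:
requests = [datetime(2007, 3, 1, h, m) for h, m in
            [(8, 5), (8, 10), (8, 40), (12, 15), (12, 30),
             (14, 0), (17, 50), (17, 55)]]
print(spike_hours(hourly_profile(requests)))
# prints [8, 12, 17]
```

A profile like this is exactly what you need when deciding whether to size infrastructure for peak load rather than average load.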

These examples opened the door to better instrumentation. One thing to look for as part of a continual improvement process is sequences of service interactions. I may see that two or more services are always called in sequence, by multiple applications. This may be an opportunity to create a composite service (or simply rewrite the first service) so it handles the entire sequence of interactions, improving performance and making life easier for the consumer. You can only get this information through instrumentation, however.
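Spotting such sequences can be as simple as counting adjacent pairs of calls that share a request identifier. The log format and service names below are assumptions for the sake of the sketch:

```python
from collections import Counter

def frequent_pairs(call_log):
    """Count adjacent service-call pairs across requests.

    call_log: list of (request_id, [service, service, ...]) tuples,
    e.g. reconstructed from access logs via a correlation id.
    A pair that dominates the counts is a candidate for a composite
    service that handles the whole interaction.
    """
    pairs = Counter()
    for _request_id, calls in call_log:
        for first, second in zip(calls, calls[1:]):
            pairs[(first, second)] += 1
    return pairs

# Hypothetical call log from three separate consumer requests:
log = [
    ("req-1", ["CustomerLookup", "OrderHistory"]),
    ("req-2", ["CustomerLookup", "OrderHistory", "ShippingStatus"]),
    ("req-3", ["CustomerLookup", "OrderHistory"]),
]
print(frequent_pairs(log).most_common(1))
# prints [(('CustomerLookup', 'OrderHistory'), 3)]
```

Here the dominant pair suggests a composite operation that returns a customer together with their order history in one call, saving a round trip for every consumer.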

2 Responses to “Why Software Sucks, part 2: Instrumentation”


  • Hi Todd:

    Back at Ford, I was given the assignment to convert a green-screen system to client/server (using Teradata on the back end). The first thing I did was apply usage metrics: I attached a log to the menu to get an idea of what parts of the system were used, and when. I used this data to prioritize my efforts.

    A more extreme example comes from my Fortran teacher. He took over the operation of a large data center. His first order of business was to stop sending all the reports that were printed each day; he waited until someone called asking for their report. This cut the volume down to about 10%!

    The key is knowing what to meter. If you were to meter everything, the only thing the meter would be reporting on would be itself, and it would not be doing a very good job (it’s like a map of the world at actual size)! It is the Heisenberg uncertainty principle in action: whenever you study a system, you change it.

Disclaimer
This blog represents my own personal views, and not those of my employer or any third party. Any use of the material in articles, whitepapers, blogs, etc. must be attributed to me alone without any reference to my employer. Use of my employer’s name is NOT authorized.