February 01, 2006 | Comments: (0)
How Hard is it to Troubleshoot IT Anyway?
So how hard is it to troubleshoot anyway? Last month BayLISA, LOPSA (League of Professional System Administrators), NaSPA (Network and Systems Professionals Association), Splunk, SysAdmin Magazine, and USENIX surveyed the attendees at Camp SysAdmin in San Francisco about their troubleshooting activities and found some things that surprised everyone.
Campers came from large Fortune 100 companies and small organizations, along with consultants, and had responsibility for all types of IT infrastructures and technologies, including enterprise applications (J2EE, .NET, LAMP, Web Services), email, on-demand services, VoIP, and more.
Here is what the campers had to say.
#1. Even when troubleshooting cutting-edge technologies, we are still doing things the old-fashioned way. Campers were asked to name their Top 3 Troubleshooting Tools in an open-ended question.
The most frequently mentioned tools were:
a. grep
b. manually scanning logs
c. tailing data sources
These were just a few of the command line utilities that were mentioned. Others included traceroute, ping, sort, awk, swatch, vim, tcpdump, strace, and more.
There were frequent mentions of open source or free tools, primarily Nagios, Ethereal, and MRTG. Surprisingly, given the amount of money corporations spend on them, very few commercial tools were mentioned. Keynote, HP OpenView, NetIQ AppManager, Tivoli, and BMC PATROL were each mentioned just once.
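For readers who have not lived at the command line, the top answers (grep and tailing) boil down to pattern matching and watching the newest lines. Here is a minimal Python sketch of both ideas. The function names and the sample log lines are hypothetical, made up for illustration; they are not from the survey.

```python
import re
from collections import deque
from typing import Iterable, Iterator, List


def grep_lines(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield lines that match a regular expression, like a bare-bones grep."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line.rstrip("\n")


def tail_lines(lines: Iterable[str], count: int = 10) -> List[str]:
    """Return the last `count` lines, like `tail -n count`."""
    return [line.rstrip("\n") for line in deque(lines, maxlen=count)]


# Hypothetical log lines, invented for this example.
sample_log = [
    "Feb  1 18:50:01 web1 sshd[311]: Accepted publickey for admin",
    "Feb  1 18:51:12 web1 httpd[402]: ERROR upstream timed out",
    "Feb  1 18:52:40 db1 mysqld[77]: Too many connections",
    "Feb  1 18:53:05 web1 httpd[402]: ERROR upstream timed out",
]

errors = list(grep_lines(sample_log, r"ERROR|Too many"))
latest = tail_lines(sample_log, count=2)
```

Both functions accept any iterable of lines, so an open file object works as well as a list. Multiply this by dozens of locations and formats and the complaints in the next sections start to make sense.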
#2. The hardest thing about troubleshooting is that there is too much data in too many places and in too many formats.
In an open-ended question, participants were asked to identify the hardest thing about troubleshooting. Of the respondents, 25% gave an answer related to the huge volume of data they use for troubleshooting, the inconsistency of its formatting, or the number of locations where the data resides. These data-related concerns were far and away the top answer, with lack of documentation and lack of expertise tied for number two.
#3. The most aggravating thing about troubleshooting is too much data.
In a follow-on open-ended question asking about the most aggravating aspect of troubleshooting, 30% of respondents replied similarly, saying that the amount and inconsistency of data was aggravating. In this case, however, the number two answer related to people issues: interrupting calls from management, vague user reports, or unresponsive members of other teams (developers in particular).
#4. The amount and types of data available for troubleshooting are overwhelming.
More than half (54%) of participants had data stored in more than 50 locations, with 19% having data in 1000+ locations. 60% of participants said that each of these locations generates more than 25 MB of data a day. 96% of participants said they deal with more than 3 formats of troubleshooting data, 38% said they have more than 15 formats and 23% said they deal with over 30 formats.
#5. System administrators are very proactive in their search for problems.
The good news in this survey is that system administrators are generally not so overwhelmed that they cannot be proactive. 60% proactively search through data either daily or weekly, and 35% do so occasionally.
The sponsors are planning to run a broader survey of troubleshooting issues, targeted at more specific categories, at the next Camp SysAdmin. Stay tuned and I'll let everyone know when it's announced. If you are interested in participating or have specific questions you'd like asked as part of the next survey, email me at thebaum@splunk.com.
Posted by Michael Baum on February 1, 2006 06:53 PM
- COMMENTS
As everyone seems to recognize, the problem is too much data available from too many places. So how about a couple of simple approaches that, together, make for a powerful approach to the problem? Warning: I'm with Integrien (www.integrien.com), and this is very much our point of view.
HOLISTIC, APPLICATION-ORIENTED VIEW
What if a tool subset the problem based on application? In other words, when you know you have an application problem, display information from just those elements of the infrastructure that could be involved, from the firewall to the network components to all the logical services (web, application, authentication, database, etc.). Provide the ability to walk cross-silo on the path the application takes and drill down into the metrics quickly. As a result, an expert might walk back from slow response time on a web server to too many sessions on an application server and find the bottleneck in the database server, which is showing too many open database cursors. Makes for quick troubleshooting.
DYNAMIC THRESHOLDS
What if you had a tool that didn't depend on you to tell it what constitutes a problem level in a server- or node-level metric? Instead, the tool learned what normal is for each metric and told you when things were going abnormal? (By the way, that's not a single level, but a variable that changes based on the overall system load). Two big benefits emerge. One, it eliminates the impossible task of setting one threshold level that's appropriate for every system load. Two, it lets the system pick up problems early and not just at the high-water mark. You learn when the levee is leaking and not just when the water is going over the top.
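To make the idea concrete, here is a minimal Python sketch of one way a threshold could be learned instead of configured. It is not a description of any vendor's product: it swaps in a plain rolling mean and standard deviation over recent readings, and the class name, window size, and sigma multiplier are all assumptions chosen for illustration. A real tool would also have to account for system load, which this sketch does not.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque


class DynamicThreshold:
    """Flag readings that stray too far from a rolling baseline (illustrative only)."""

    def __init__(self, window: int = 20, sigmas: float = 3.0,
                 min_samples: int = 5, min_spread: float = 1e-9):
        self.history: Deque[float] = deque(maxlen=window)  # recent normal readings
        self.sigmas = sigmas            # how many standard deviations count as abnormal
        self.min_samples = min_samples  # readings needed before judging anything
        self.min_spread = min_spread    # floor so a flat history does not divide the world in two

    def is_abnormal(self, value: float) -> bool:
        abnormal = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            spread = max(pstdev(self.history), self.min_spread)
            abnormal = abs(value - baseline) > self.sigmas * spread
        # Learn only from normal readings, so an incident does not become the new normal.
        if not abnormal:
            self.history.append(value)
        return abnormal


# Hypothetical CPU-percent readings; the last one is the leak in the levee.
detector = DynamicThreshold()
readings = [50, 52, 49, 51, 50, 48, 51, 95]
flags = [detector.is_abnormal(r) for r in readings]
```

The first five readings only build the baseline, the next two fall within three standard deviations of it, and the jump to 95 is flagged without anyone having typed in a limit.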
CORRELATE ABNORMAL READINGS WITH SERVICE LEVEL
What if that same system also checked to see if the abnormal reading (a violation of a dynamically set threshold) was having a negative impact on a business application? It would send a special alert to the responsible system administrator: "this abnormal metric coincides with degradation in business system xyz." As a result, the person with the expertise to fix the business problem gets responsibility for it and the information needed to act.
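A rough Python sketch of that correlation step follows, under stated assumptions: anomalies and service degradations are already available as timestamped records, and "coincides" simply means the anomaly falls inside the degradation window or shortly before it. The record types, field names, and the five-minute lead time are hypothetical, and overlap in time is only a hint, not proof of cause.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Anomaly:
    metric: str        # hypothetical metric name, e.g. "db1.open_cursors"
    timestamp: float   # seconds since an arbitrary epoch


@dataclass
class Degradation:
    service: str       # business system whose service level dropped
    start: float
    end: float


def correlate(anomalies: Iterable[Anomaly],
              degradations: Iterable[Degradation],
              lead_seconds: float = 300.0) -> List[str]:
    """Return alert messages for anomalies that overlap a service degradation."""
    degradations = list(degradations)  # allow more than one pass
    alerts = []
    for anomaly in anomalies:
        for deg in degradations:
            # Count anomalies up to `lead_seconds` before the degradation began.
            if deg.start - lead_seconds <= anomaly.timestamp <= deg.end:
                alerts.append(
                    f"abnormal {anomaly.metric} coincides with "
                    f"degradation in {deg.service}"
                )
    return alerts


alerts = correlate(
    [Anomaly("db1.open_cursors", 1000.0), Anomaly("web1.cpu", 5000.0)],
    [Degradation("order-entry", 1200.0, 1800.0)],
)
```

Here only the database anomaly produces an alert; the web server reading happened long after the degradation ended and is left out.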
CAPTURE WHAT YOU LEARN
Finally, what if the system took a Problem Fingerprint of the events leading up to system problems, both to give you material for your forensics and to recognize the problem should it recur? Not a perfect solution, but it would help post-mortem root cause analysis. Maybe more important, whatever you learned from the analysis could then be associated with the Fingerprint, so when that Problem Fingerprint was recognized you'd get an alert that carried with it the best post-mortem thinking you could do about the problem and how to fix it. As an added bonus, the warning might come well before the problem if the system had learned enough about the pattern of events that precede it. So you might be fixing the problem before it actually impacted the business.
The above may be a lot to think about. There's a two-minute Flash at http://www.integrien.com/sysadmin.cfm you can view and share to help you decide whether the approach makes sense. Altogether, it's a way to use the power of the computer to organize and correlate the problem indicators so you don't have to do it in needle-in-a-haystack fashion after the fact.
Posted by: Kevin Strehlo at March 26, 2006 10:09 AM