IT Troubleshooter | Harper Mann » February 2006

February 18, 2006 | Comments: (0)

Certified Open Source Stacks

Isn't the whole point of Open Source that you can make it what you want? I just finished reading Neil McAllister's column, "A slim market for certified open source?" I have to agree with Neil's assessment that certified Open Source stacks may appeal only to customers who don't really care about price. That raises the question of whether someone who does not care about price is a consumer of Open Source technologies and applications in the first place.

But I think Neil misses a number of other points about what makes Open Source so appealing to enterprises. It is not just price. I asked Patrick McGovern, who ran SourceForge.net for five and a half years, about the appeal of Open Source to corporate customers.


Cost is certainly one reason why corporations turn to Open Source, but I don't think it's an exclusive one. Often the products work better than their commercial competitors and, more often than not, the community-based support is better as well. But of course the beauty of Open Source for lots of enterprises is you can make it what you want, not what someone else wants.

Here the certification of Open Source stacks may not have much appeal. The likelihood that my stack will be similar enough to your stack could be pretty slim indeed. When I first heard about companies like SpikeSource and SourceLabs I was pretty jazzed. I figured they would build me the stack of my choice, built to my specs. Then they would test it and certify it -- all with a turnkey service -- sort of like Amazon.com for Open Source software. Unfortunately, for me anyway, it seems like their business has taken both companies in a different direction.

Of course, troubleshooting your own Open Source stack is a great topic for the IT Troubleshooter, and I'll be covering more of this soon.

Is there a market for a standard, certified Open Source stack or is that something I just get for free from the Open Source community?

Feel free to write me with your thoughts at thebaum@splunk.com.

Posted by Michael Baum on February 18, 2006 08:50 AM


February 17, 2006 | Comments: (0)

Higher Availability Future: Autonomic Computing or Recovery Oriented Computing?

It is fascinating to me that so many smart people can disagree on the best future approach to higher availability infrastructure. The autonomic computing crowd, led by IBM, is touting self-healing and self-regulating computing systems. On the other hand, the recovery oriented computing (ROC) folks, led by researchers at Berkeley and Stanford, declare that failures are inevitable. ROC proposes that the key to higher availability is helping humans recover infrastructure from failures faster.

I have written here previously about ROC, but it's time to start a dialog comparing and contrasting these two radically different views on the future of infrastructure availability.

Notice that I am talking about infrastructure availability, not individual system availability. As an industry we have focused for decades on building more reliable individual components and systems. But now the reliability problem has moved to a different level. Take all these highly reliable components and systems, put them together with software developed by multiple vendors or adopted from different open source projects, and the reality of complex systems settles in.

Can we build autonomic computing infrastructure that is self-healing and self-regulating beyond simple problems and single systems? Or will humans always be an important part of repairing and recovering IT infrastructure?

Our friends from Berkeley and Stanford offer an interesting perspective dubbed the Ironies of Automation. Their argument goes something like this.


Automation does not remove human influence, but instead reduces IT personnel understanding and can actually make their job harder. Automation increases complexity, reduces visibility and provides no day-to-day interaction and learning. ROC argues for better tools to help, not replace people.

So what do you think? Autonomic Computing or Recovery Oriented Computing? Which will lead us to higher availability infrastructure? Send your vote to thebaum@splunk.com.

Posted by Michael Baum on February 17, 2006 11:33 PM


February 14, 2006 | Comments: (0)

Mea Culpa: Quoted But Misunderstood

Alexandre Rafalovitch points out in his post "Quoted, but misunderstood: What's Missing from Production System Troubleshooting" that I sort of missed his point. He writes:

I do not ask for more data, I ask that the data format currently being used is reviewed to see whether it is actually useful for troubleshooting/monitoring and that concerned effort is made to change the format where it proves not to be useful.

For example, any messages produced by multithreaded services must include thread/transaction id in them. Trying to extract sequence out of the logs that just intermingle their log entries is next to impossible. Same problem with having timestamps in a format that does not allow to correlate to other log types.
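Alexandre's two requests (a thread/transaction id on every message, and timestamps that can be correlated across log types) can be sketched with Python's standard logging module. This is only an illustration under my own assumptions: the logger name, the `txn` field, the worker names and the message text are all made up, and UTC ISO 8601 is just one reasonable choice of timestamp format.

```python
import io
import logging
import threading
import time


def build_logger(stream):
    """Return a logger whose lines carry a UTC timestamp, thread name and transaction id."""
    formatter = logging.Formatter(
        fmt="%(asctime)s.%(msecs)03dZ %(levelname)s [%(threadName)s] txn=%(txn)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
    )
    # Render timestamps in UTC so they line up with logs from other hosts and services.
    formatter.converter = time.gmtime
    handler = logging.StreamHandler(stream)
    handler.setFormatter(formatter)
    logger = logging.getLogger("orders-demo")  # hypothetical service name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, txn_id):
    """Simulated unit of work; every message is tagged with the transaction id."""
    extra = {"txn": txn_id}  # "txn" is an assumed field name, not a logging built-in
    logger.info("request received", extra=extra)
    logger.info("request completed", extra=extra)


def run_demo():
    """Run three concurrent workers and return the interleaved log lines."""
    stream = io.StringIO()
    logger = build_logger(stream)
    threads = [
        threading.Thread(target=handle_request, args=(logger, f"t{i}"), name=f"worker-{i}")
        for i in range(3)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return stream.getvalue().splitlines()


def lines_for_transaction(lines, txn_id):
    """Recover one transaction's sequence from the intermingled log."""
    return [line for line in lines if f"txn={txn_id} " in line]
```

Calling `lines_for_transaction(run_demo(), "t1")` pulls one request's messages back out in order, however the three workers happened to interleave. Without the id in each line, that reconstruction is the "next to impossible" part.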

I couldn't agree with Alexandre more. If you want to hear more about his views on the topic, be sure to go see him speak at the JavaOne conference. Heck, maybe he'll even let me attend. I'm sure I have a lot to learn from him about J2EE logging and troubleshooting.

Posted by Michael Baum on February 14, 2006 01:36 PM


February 14, 2006 | Comments: (0)

Debugging Versus Troubleshooting

Often I hear developers say they don't understand system administrators and why they are so primitive when it comes to finding and fixing problems. What most developers do not understand is that there can be a significant difference between the tools appropriate for development environments and the tools safe for production environments. Here is a good example, from a recent post entitled Guerilla Debugging for Java by Russ Olsen.

I love Eclipse. I really do. Eclipse has made creating, and especially debugging, Java code so much easier that it is hard to imagine living without it. And that last bit is the problem. Sometimes we have to live without Eclipse. I have worked for customers who run very large and complicated J2EE applications and who, for security reasons, will not let Eclipse within 200 yards of their servers. Don't even ask. Then there are the times where I have been ssh-ed in to some machine, trying to figure out what is going wrong without the benefit of a GUI. Very little opportunity for an Eclipse session there. Finally, I have been up against Java applications with such an enormous memory footprint that Eclipse would have been the IDE that broke the camel's 2 Gig memory limit.

What do you do when cruel fate separates you from your favorite IDE? You resort to the kind of techniques that programmers used before Eclipse...


Do you have a good story of living in production environments without your development environment tools? Write me with your story at thebaum@splunk.com.

Posted by Michael Baum on February 14, 2006 07:27 AM


February 06, 2006 | Comments: (0)

What's Missing from Production System Troubleshooting

Alexandre Rafalovitch recently wrote me regarding our recent survey of system administrators and why production IT systems are so difficult to troubleshoot. Alexandre writes:

By now we pretty much established that until the developers themselves try to support/troubleshoot their own products in production (or get loud enough feedback), they will not understand how to make their products easier to manage post-deployment.

The surveys of the "how do you deal with it now" kind should always include questions on why commercial solutions are not suitable (usually due to installation/license difficulties) and also what the companies creating the products could do to make things easier in the long run.

I think this is sort of the whole point of the survey results. Attendees at Camp SysAdmin overwhelmingly stated that they have so much data already it's killing them. It's interesting to think about the amount of IT data being generated every day in a typical enterprise shop. Forget about the network, firewall and security data. I'm talking about just your basic web servers, application servers and databases. Hundreds of gigabytes to several terabytes of IT data a day is not atypical for a good size data center.

The notion that IT people need even more data generated by developers kinda misses the point. Troubleshooting production applications is a whole lot different than debugging code in development or staging environments. Production systems involve many technologies and systems that just don't appear in pre-production environments.

When I was at Yahoo, every day we were firefighting production problems that never manifested themselves in development or staging systems. At Splunk, my current company, we routinely see problems with our software that only occur with multi-terabyte data sets in very large production systems. Perhaps in another post I'll discuss how we build QA environments to deal with this.

Alexandre comments further that commercial solutions are not suitable for solving production troubleshooting problems. True, today's solutions most often require extensive amounts of code instrumentation. IT people generally don't want to and/or can't instrument code in production environments. For starters, generally we don't even own most of the code running our applications and services.

So we're left to deal with all the ever-changing evidence our machines generate. And boy, it sure is piling up quickly.

How much IT data do you have in your data center? Write me with your estimate at thebaum@splunk.com.

Posted by Michael Baum on February 6, 2006 10:10 PM


February 01, 2006 | Comments: (0)

How Hard is it to Troubleshoot IT Anyway?

So how hard is it to troubleshoot anyway? Last month BayLISA, LOPSA (League of Professional System Administrators), NaSPA (Network and System Professionals Association), Splunk, SysAdmin Magazine and USENIX surveyed the attendees at Camp SysAdmin in San Francisco about their troubleshooting activities and found some things that surprised everyone.

Campers represented large Fortune 100 companies, small organizations and consultants, with responsibilities for all types of IT infrastructures and technologies including enterprise applications (J2EE, .NET, LAMP, Web Services), email, on-demand services, VoIP, and more.

Here is what the campers had to say.

#1. Even when troubleshooting cutting edge technologies, we are still doing things the old-fashioned way. Campers were asked to name their Top 3 Troubleshooting Tools in an open ended question.

The most frequently mentioned tools were:

a. grep
b. manually scanning logs
c. tailing data sources

These were just a few of the command line utilities that were mentioned. Others included traceroute, ping, sort, awk, swatch, vim, tcpdump, strace, and more.
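For readers who have not lived on the command line, the top answers boil down to two operations: filter lines by a pattern, and look at the most recent lines. Here is a minimal Python sketch of both. The sample log lines are invented for illustration, and the functions only mimic the basic behavior of grep and tail, not their full option sets.

```python
import re
from collections import deque

# Made-up log lines, purely for illustration.
SAMPLE_LOG = [
    "2006-02-01T18:50:01Z INFO  web01 GET /index.html 200",
    "2006-02-01T18:50:02Z ERROR app01 connection pool exhausted",
    "2006-02-01T18:50:03Z INFO  web01 GET /cart 200",
    "2006-02-01T18:50:04Z ERROR db01 lock wait timeout",
]


def grep(lines, pattern):
    """Return the lines that match a regular expression, like a basic grep."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def tail(lines, count=10):
    """Return the last `count` lines, like a basic tail -n."""
    return list(deque(lines, maxlen=count))
```

Here `grep(SAMPLE_LOG, r"ERROR")` keeps the two error lines and `tail(SAMPLE_LOG, 2)` keeps the last two entries. The real tools stream from files and pipes, which is a large part of why they remain the first thing administrators reach for.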

There were frequent mentions of open source or free tools, primarily Nagios, Ethereal, and MRTG. Surprisingly, given the amount of money corporations spend on them, very few commercial tools were mentioned. Keynote, HP OpenView, NetIQ AppManager, Tivoli, and BMC PATROL were each mentioned just once.

#2. The hardest thing about troubleshooting is that there is too much data in too many places and in too many formats.

In an open-ended question, participants were asked to identify the hardest thing about troubleshooting. 25% of respondents gave an answer related to the huge volume of data they use for troubleshooting, the inconsistency of formatting, or the number of locations where the data resides. These data-related concerns were far and away the top answer, with lack of documentation and lack of expertise tied for number two.

#3. The most aggravating thing about troubleshooting is too much data.

In a follow-on open-ended question asking about the most aggravating aspect of troubleshooting, 30% of respondents replied similarly, saying that the amount and inconsistency of data was aggravating. In this case, however, the number two answer was related to people issues: interrupting calls from management, vague user reports, or unresponsive members of other teams (developers in particular).

#4. The amount and types of data available for troubleshooting are overwhelming.

More than half (54%) of participants had data stored in more than 50 locations, with 19% having data in 1000+ locations. 60% of participants said that each of these locations generates more than 25 MB of data a day. 96% of participants said they deal with more than 3 formats of troubleshooting data, 38% said they have more than 15 formats and 23% said they deal with over 30 formats.
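To get a feel for what those thresholds imply, here is a back-of-the-envelope calculation. It is only a sketch: it assumes the per-location figure applies to every location, which the survey did not report, and since both answers were "more than" thresholds the results are rough floors rather than measurements.

```python
MB_PER_GB = 1000  # decimal gigabytes, an assumption made for simplicity


def daily_volume_mb(locations, mb_per_location_per_day):
    """Rough daily data volume in MB for a given number of data locations."""
    return locations * mb_per_location_per_day


# Survey thresholds: more than 50 (or 1000+) locations, more than 25 MB a day each.
floor_50_locations_mb = daily_volume_mb(50, 25)        # 1250 MB, about 1.25 GB a day
floor_1000_locations_mb = daily_volume_mb(1000, 25)    # 25000 MB, about 25 GB a day
floor_1000_locations_gb = floor_1000_locations_mb / MB_PER_GB
```

Even at these minimums, a shop with 1000+ locations would be looking at roughly 25 GB of troubleshooting data a day, all of it spread across many formats.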

#5. System administrators are very proactive in their search for problems.

The good news in this survey is that system administrators are generally not so overwhelmed that they cannot be proactive. 60% proactively search through data either daily or weekly. 35% proactively search through data occasionally.

The sponsors are planning to run a broader survey of troubleshooting issues targeted at more specific categories at the next Camp SysAdmin. Stay tuned and I'll let everyone know when it's announced. If you are interested in participating or have specific questions you'd like asked as part of the next survey, email me at thebaum@splunk.com.

Posted by Michael Baum on February 1, 2006 06:53 PM

