Troubleshooting Software Systems

Module by: Warren Myers.

Summary: Troubleshooting systems and software is an art and a science - what hatchets can you put in your "bag o' hatchets" to help eliminate non-problems while diagnosing symptoms of failure?


Introduction

One of the best trainers I ever had taught the incoming crop of support engineers at Opsware (of which I was a member) that Support is all about applying hatchets to problems to make them easier to handle. When someone is calling for help, they [typically] have a major problem that is impacting their job, and need a solution to it last week. The product we were being trained on came on two full DVD ISO images (it has since grown to three dual-layer DVD ISO images). That's a lot of potential area for errors to occur, whether from bugs in the application or from user mistakes. After a while, you start to see patterns in incoming issues, which allows for quicker resolution of customer complaints: when you've seen the same problem pop up at a dozen locations, as soon as a fix is found for one of them, you can, most likely, apply that same solution to the other 11 and solve all of those problems "at once".

You will learn about a host of hatchets you can use to narrow down problems from the initial symptom of "it doesn't work" or "it broke" to the root cause, or to viable workarounds.

The techniques described can be applied to other areas as well, but the focus will be on software systems.

Overview

We will cover an array of hatchets:

  • Stop, Drop, and Roll
  • What Changed
  • Logging Output / Log Files
  • Effective Searching
  • Debugging Tools
  • User Error
  • Post-Mortem Data Collection
  • Pro-Active, Preventative Measures

Stop, Drop, and Roll

The first thing to do when encountering software issues, whether in the smallest of scripts or in enterprise tools, is to Stop, Drop, and Roll. Yes, those same three words you learned from the fireman as a child for what to do if your clothes ever catch fire.

Famous last words in most cases are, "I know what I'm doing". You may very well know, but always start by assuming you don't. This is not to insult your intelligence, but rather to remind you that everyone makes mistakes!

Things to do before blindly going on:

  • Take note of all error messages returned from the failed process
  • See if the error is something you have seen before (such as "Permission denied")
  • Make sure you are running as the correct user / with proper privileges
  • Make sure you have enough space to continue the task
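The user and free-space checks above can be sketched as a few shell commands. This is a minimal sketch, assuming a POSIX shell with the standard id, df, and awk utilities; the current directory (".") is only a placeholder for wherever the failing task actually writes.

```shell
# Who am I running as? Print the user name, then the full ID and group list.
current_user="$(id -un)"
echo "Running as: $current_user"
id

# How much space is left where the task writes? "." is a placeholder.
# -P keeps each filesystem on one line; -k reports sizes in kilobytes.
free_kb="$(df -Pk . | awk 'NR==2 {print $4}')"
echo "Free space here: ${free_kb} KB"
```

If the user is not the one you expected, or the free space is near zero, you may have found the problem before reading a single log file.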

What Changed

Were you able to successfully accomplish the task at hand before? Did the script run successfully yesterday, but not today? Can someone else run this correctly when you can't?

If the answer to these, and similar questions, is "yes", then you need to find out what changed between the last time you did this and now.

Things that may have changed:

  • Your user's permissions
  • The contents of the script/tool
  • Free space you have access to (i.e., maybe you're nearing your quota)
  • System changes (patches, updates, etc)
  • Availability of remote resources (maybe the tool relies on a file server that is down for maintenance)

If you can undo any of the changes, does the tool work again? For example, if you are nearing your quota space but you delete some files, will it then run? If so, maybe you need your quota expanded. When the network file server is back up, does it run correctly?
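One way to check whether the contents of a script changed is to compare it with a known-good copy. This is a minimal sketch, assuming you kept such a copy; both files here are throwaway stand-ins created in a temporary directory, not real tools.

```shell
workdir="$(mktemp -d)"

# Stand-ins for yesterday's working script and today's failing one.
printf 'echo "backing up to /mnt/fileserver"\n' > "$workdir/backup.sh.known-good"
printf 'echo "backing up to /mnt/fileservr"\n'  > "$workdir/backup.sh"

# diff exits 0 when the files match and 1 when they differ.
if diff "$workdir/backup.sh.known-good" "$workdir/backup.sh"; then
    changed="no"
else
    changed="yes"
fi
echo "Script changed since last good run: $changed"
```

Running ls -l on the script is a quicker first look: it shows when the file was last modified and who owns it.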

Logging Output / Log Files

If the first basic checks don't yield any useful data, it's time to start looking at the output of the program. Does it create any log files when it runs? If so, do any error messages appear in them when you run it? Can the log level be turned up?

If the script does not generate its own log files, it's time to start generating them on your own. Many tools have command-line options that will increase verbosity. Use them in conjunction with trapping the output using tools like script and tee.

script

One of the most useful tools to use when troubleshooting is script. From the man page for script: "Script makes a typescript of everything printed on your terminal". So, while it is running, everything you type and everything that is displayed will also get recorded into a file that you name when you start the script session. Such an output file could end up being invaluable to the support team for the product if you need to open a case, or to the developers of the tool if you need to create a Bug Report.
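A minimal sketch of recording a single command non-interactively. It assumes the util-linux version of script found on most Linux distributions, whose -c option runs one command and whose -q option suppresses the start and end banners; other variants (BSD, macOS) take different arguments. The log path is a temporary file chosen for illustration, and the echo stands in for the tool being investigated.

```shell
session_log="$(mktemp)"

if command -v script >/dev/null 2>&1; then
    # Record everything the command prints into $session_log.
    script -q -c 'echo "output from the failing tool"' "$session_log" \
        || echo "this variant of script does not accept -q/-c" >&2
else
    echo "script is not installed on this system" >&2
fi

echo "Session saved to: $session_log"
```

Run script with only a file name to record an interactive session instead; type exit to stop recording.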

tee

Another very useful tool is tee. The man page for tee is short, but gets right to the point: "read from standard input and write to standard output and files". In other words, if you pipe the tool you are running into tee, not only will it display on the screen, but it will be captured in the output file you have supplied as well.

For example, the following lists the entire contents of the filesystem (that your user has permission to see) on screen, and also traps that output in a file called "filesystem.out" in your home directory: ls -R / | tee ~/filesystem.out
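Error messages are usually written to standard error, which a plain pipe does not pass to tee. This is a minimal sketch that captures both streams; the two echo commands stand in for a real tool, and the output file is a temporary one chosen for illustration.

```shell
out_file="$(mktemp)"

# 2>&1 merges standard error into standard output before the pipe,
# so the error message reaches both the screen and the file.
{ echo "normal output"; echo "an error message" >&2; } 2>&1 | tee "$out_file"

# Use "tee -a" instead to append to an existing file rather than overwrite it.
```

Without the 2>&1, the error message would still appear on screen but would be missing from the saved file.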

Effective Searching

Google (or Bing, Yahoo, DuckDuckGo, etc) is your friend when trying to find answers to technical problems. After checking the man pages for a given tool (if they exist), take any error messages you found to Google and see what turns up. For example, an error emitted by Apache Tomcat is likely to be documented, and can often be found just by searching for the error string, probably in the official documentation, known issues, or publicly viewable bug reports (e.g. http://tomcat.apache.org/tomcat-6.0-doc/api/org/apache/tomcat/jni/Error.html).

Effective searching is far more than "let me Google that for you" - it is knowing both what to look for and how to evaluate the results. Weak searching is easy: type in what you're looking for and hope. It is like going to eBay and searching for "baseball card": you'll get millions of results on just that query, which is not the most helpful thing in the world. Try "roger maris rookie baseball card", and your result set drops to a manageable few dozen - those are results you can pick through effectively. Similarly, searching for "OutOfMemory" will yield millions of hits, whereas "bea weblogic outofmemory java error jvm 1.4" may start to yield useful results in the top 2-5 links.

Becoming an Effective Searcher will translate over into a host of other arenas beyond mere troubleshooting: from finding photographs of an obscure HVAC part, to locating the best deals on travel, searching effectively is a skill that anyone can learn, hone, and benefit from in their daily lives.

Effective Searchers look at what they are trying to find, pick out the important parts (if it's an error message, the generic portion of the message, not the specific-to-the-instance portion), and start looking, refining as they go.

To become an Effective Searcher does take some practice, but there is no better time than the present to start. Learn your favorite search engine's advanced features. For example, with Google I can search just on apache.org by adding site:apache.org to my search. Quoting the text of what you want to find will also tend to push references of it higher.

A drawback to using search engines, though, is that since they try to index everything, you can get a lot of results that are other people asking more-or-less the same question you are trying to find the answer to. That's great if an answer was posted, but when the responses are a lot of "me too"-types, it can be frustrating.

Part of becoming an Effective Searcher is learning to identify authority in resources found: "I had a paper to write several years ago on comparing AMD’s x86-64 architecture and Intel’s IA32 architecture for the companies’ CPUs. Sources like Tom’s Hardware Guide were helpful to see real-world comparisons between the competing products, but the true sources of authority on the products were AMD and Intel themselves. I printed large chunks of the manufacturer’s technical documentation to back up conclusions I made in my paper... Authority of sources isn’t assured by just one factor – author, publisher, host, length, etc – but rather by directly linking to the data used to produce the conclusions made by that source. No resource stands on its own as an authority on any topic. In order to establish credibility, any resource must cite where their data came from – either through some kind of bibliography in the case of a paper, or experimental results, or that the resource is maintained by the people who designed and built what they’re writing about... The means of determining authority needs to come down to the following factors: 1) is the article written in an intelligent form? 2) are the sources cited of an authoritative nature? 3) has the author written anything previously that can be considered authoritative? and 4) would someone who is a known expert in the field (perhaps a professor of the topic) agree that the source is not some crackpot?"

Another component in becoming an Effective Searcher is to learn the skill of skimming for important details - and ignoring everything else. If you don't, you'll end up like the person described in this XKCD comic!

Lastly, to become an Effective Searcher, don't be afraid to ask for help: there is likely someone sitting near you or available via instant message who can help you out, and would be happy to if you only ask them.

Debugging Tools

Linux typically has several useful debugging tools available out of the box. These include gdb (for the advanced user), strace, ps, lsof, netstat, iostat, uptime, and top. Many others exist as well, but these are the most commonly used.

top

One of the quickest-to-use tools for a picture of the current state of a Linux system is top, which displays the busiest running processes, system uptime, load average, CPU utilization, memory usage (real and swap), and other items.

uptime

For a quick view of the system uptime and load averages, run uptime.

iostat

iostat displays information about the current state of the disk I/O on the system.

netstat

To see what network ports are currently open and listening, use netstat. For example, netstat -an | grep 80 will display what is using port 80 (and 8080, and any other line that has '80' anywhere in it, including in an address).
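The over-matching can be seen, and tightened, on a few canned lines. The sample lines below are made up to resemble Linux netstat -an output and are an assumption about its layout; the tighter pattern requires a colon before 80 and whitespace after it.

```shell
sample='tcp  0  0 0.0.0.0:80     0.0.0.0:*  LISTEN
tcp  0  0 0.0.0.0:8080   0.0.0.0:*  LISTEN
tcp  0  0 10.0.0.80:22   0.0.0.0:*  LISTEN'

# Loose match: every sample line contains "80" somewhere.
loose_count="$(printf '%s\n' "$sample" | grep -c '80')"

# Tighter match: only a port that is exactly 80.
tight_count="$(printf '%s\n' "$sample" | grep -c ':80[[:space:]]')"

echo "loose: $loose_count, tight: $tight_count"
# prints "loose: 3, tight: 1"
```

On a live Linux system the same filter would be netstat -an | grep ':80[[:space:]]'. Other platforms may separate address and port with a dot instead of a colon, so adjust the pattern to match what your netstat prints.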

lsof

lsof ("list open files") will show what process is holding open a network port or file. To see what process is holding port 80, run lsof -i:80

ps

To list the process table, run ps. My favorite argument sequence is aux, which gives lots of information back: ps aux

The similar call on a Solaris machine is: ps -ef
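Once you know which process you care about, ps can show just that one instead of the whole table. This is a minimal sketch that starts a throwaway sleep process to inspect; -p and -o are standard ps options, though a minimal system may ship a reduced ps or none at all.

```shell
# Start a harmless background process to inspect.
sleep 30 &
target_pid=$!

if command -v ps >/dev/null 2>&1; then
    # Show only the chosen process: its PID, owner, and command line.
    ps -o pid,user,args -p "$target_pid" \
        || echo "this ps does not support -o/-p" >&2
else
    echo "ps is not installed on this system" >&2
fi

# Clean up the throwaway process.
kill "$target_pid"
```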

strace

For a fuller diagnosis of what a given process is doing, strace can be a lifesaver. It essentially wraps around the process in question, either by running strace <program-name> or by attaching to a running process with strace -p <pid>, and reports the system calls that process makes.

On Solaris, the similar tool is truss.
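A minimal sketch of tracing a short command and saving the trace to a file. It assumes strace is installed and that the environment permits tracing (some containers do not); -f follows child processes and -o writes the trace to the named file, here a temporary one. The traced command, ls /tmp, is only a stand-in for the program under investigation.

```shell
trace_file="$(mktemp)"

if command -v strace >/dev/null 2>&1; then
    # Trace every system call made by "ls /tmp" and any children.
    strace -f -o "$trace_file" ls /tmp >/dev/null \
        || echo "tracing is not permitted in this environment" >&2
    # Failed calls show a return value of -1; these are often the clue.
    grep -- '= -1' "$trace_file" | head -n 5 || true
else
    echo "strace is not installed on this system" >&2
fi
```

Lines ending in errors such as ENOENT or EACCES point at missing files and permission problems, which ties back to the basic checks earlier in this module.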

gdb

The GNU debugger, gdb, is a massively useful tool in the right hands for tracking individual calls inside a program, setting breakpoints, and so on. It should be learned by every developer, and known to advanced users.

User Error

"User error" is among the most commonly cited errors with software and systems: the operator did something the creators did not expect. To use a ubiquitous car analogy, it's "user error" if the driver hits the gas instead of the brake. One interesting article makes the claim that there is [almost] no such thing as "user error", and that the fault instead lies with developers whose tools are not resilient enough to handle any user. No, a car manufacturer can't make the gas act like the brake when you "meant to stop", but maybe software developers can make their products less error-prone, or at least have them give better errors when they do have a problem.

A spectrum of user-initiated errors:

  • Typos (misspellings, fat-fingering, generally mistyping something)
  • External environmental problems (e.g., unplugging a network cable)
  • Clickos (i.e., misclicks - akin to mistyping)
  • Forgetfulness
  • Etc

From personal observation, I would guess user error accounts for 70-80% of all errors seen.

Post-Mortem Data Collection

When something has gone so awry that it has violently crashed, or even taken out its host system, it's time for some post-mortem data collection - maybe even forensic analysis.

Core dumps, log files, and even images of whole drives can be investigated during a post-mortem analysis. As your technical acumen grows, you'll be able to investigate more of these yourself before escalating to the tool's support or development teams.
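A minimal sketch of bundling evidence for a support case before it gets rotated or overwritten. The log directory and file below are stand-ins created for the example; on a real system you would point tar at the tool's actual log and core file locations.

```shell
workdir="$(mktemp -d)"
mkdir "$workdir/logs"
echo "FATAL: tool exited unexpectedly" > "$workdir/logs/tool.log"

# Name the bundle with a timestamp so repeated collections do not collide.
bundle="$workdir/postmortem-$(date +%Y%m%d-%H%M%S).tar.gz"

# -C switches to the parent directory so paths in the archive stay relative.
tar -czf "$bundle" -C "$workdir" logs

# List the archive contents to confirm what was captured.
tar -tzf "$bundle"
```

Collecting a copy first means the originals can stay untouched while you, or the vendor's support team, dig through the bundle.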

Pro-Active, Preventative Measures

Ideally, we would all live and work in a world where nothing ever failed, and everyone acted the way they are "supposed" to. Sadly, that world does not exist. So what can we do to help prevent issues in the first place, or respond more adeptly when they [inevitably] occur?

Some solutions are simple: add more memory to the system; increase swap space; verify storage quotas; make sure all the resources you need are available; etc. Many can be more complex.
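Simple checks like verifying storage can be run on a schedule so that a warning arrives before the failure does. This is a minimal sketch, assuming a POSIX df and awk; the 90% threshold and the checked path are arbitrary illustrations.

```shell
threshold=90      # percent used that should trigger a warning (arbitrary)
check_path="/"    # filesystem to watch (illustrative)

# Column 5 of "df -P" is the capacity in use, e.g. "42%"; strip the "%".
used_pct="$(df -P "$check_path" | awk 'NR==2 {sub(/%/, "", $5); print $5}')"

if [ "$used_pct" -ge "$threshold" ]; then
    echo "WARNING: $check_path is ${used_pct}% full"
else
    echo "OK: $check_path is ${used_pct}% full"
fi
```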

If there is a set of "Known Issues" or release notes that come with a particular product, make sure you read and are aware of them: there is almost nothing more frustrating than finding out there is a known issue, but you didn't check the manuals first!

Asking "Why"

If you're on the administrative side of the technical world, and not just the end-user side, the other big thing to remember is to always ask "why". Why did it fail? Why did we miss the known issue? Why were we not notified that a necessary resource was going to be down? Why was there no alert sent about resources nearing their limits? If you can ask (and answer) those, then you should be able to reduce the number of "why" questions you need to ask in the future - because hopefully you're solving problems before they arise.

"Future-proofing" - is it possible?

The idea of "Future-Proofing" is to create an environment that can survive future developments without needing to be changed itself. A common example of this would be to look at the current and expected growth needs of the email infrastructure of an organization, and then size the mail servers to handle 15-25% more than the expected growth (i.e., with 100 users today and 20% growth per year, size the environment today for 200 users in three years: 173 expected, plus ~15%). Or it could mean ensuring that data you are working with today in version 4.3 of some tool will be accessible when upgrading to 7.2 in 4 years.
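The sizing arithmetic in that example can be checked with a couple of lines of awk. This only reproduces the numbers from the text (100 users, 20% yearly growth, three years, roughly 15% headroom); the variable names are illustrative.

```shell
# Compound growth: users * (1 + rate)^years
expected="$(awk 'BEGIN { printf "%.0f", 100 * 1.20^3 }')"

# Add 15% headroom on top of the expected figure.
sized="$(awk 'BEGIN { printf "%.0f", 100 * 1.20^3 * 1.15 }')"

echo "expected users: $expected, with headroom: $sized"
# prints "expected users: 173, with headroom: 199"
```

The result of 199 is what the text rounds up to 200 users.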

When relying on external vendors, guaranteeing your environment is future-proof may not be possible - they could decide to change database schemas, file formats, etc. Likewise, when relying on expected growth patterns, you may exceed those expectations (requiring additional licenses, hardware, etc), or you may not meet those plans, and have an unnecessarily oversized environment. Several mitigating strategies exist for these eventualities, but are beyond the scope of this lesson.

Closing Thoughts

You've completed this module, and so now you're ready to troubleshoot the most ornery problems in the most obscure corners of your system, right? Don't let me discourage you from that lofty goal: but the reality is that becoming a good troubleshooter takes time, practice, lots of exposure, practice, skimming skills, practice, and patience. Oh, and did I mention: practice!

Lots of professions require troubleshooting skills, and each has its own tricks and tips to follow: auto mechanics will check the OBD-II codes and listen to a rattle; electricians look for wiring faults; doctors look at symptoms to come up with a diagnosis. Skills learned in one field may not always translate into another, but if you can learn the basics (which DO all transfer), then gleaning insights from others can only improve your own personal Bag O' Hatchets.
