Why troubleshooting?
Today's networks are built with high availability in mind more than
ever, and downtime means lost revenue through:
-Lost employee productivity
-Customer SLA violations
-Regulatory fines
-Etc.
One key way expert-level engineers set themselves apart
from average engineers is troubleshooting methodology.
-The average engineer runs around like a chicken with its head
cut off.
-The expert engineer keeps a cool head and follows a structured
approach.
Structured Troubleshooting Approach
-Defines a logical and systematic method of
troubleshooting that can be applied to any case.
--E.g. troubleshooting VoIP call quality and troubleshooting an OSPF
neighbor adjacency involve different discrete steps, but the logical approach is the same.
-Structured troubleshooting is closely analogous to the
scientific method of conducting experiments.
[Figure: Scientific Method Workflow]
[Figure: Structured Troubleshooting Workflow]
Defining the Problem
-Network problems are generally discovered in two ways:
--Reactive (e-ticketing help desk system)
---e.g. users submit tickets to the help desk saying that web
browsing is slow
--Proactive (monitoring systems such as PRTG, Nagios, CiscoWorks
and HP OpenView)
---e.g. SNMP reports a linkDown event
-In either case, more investigation is needed to find the
root cause.
Gathering Information
-Apart from asking users for more information on the tickets they
submitted, information gathering takes the form of:
--show commands
--debug commands
---Typically not used in the real world except in a network-down
emergency
--Misc. testing tools
---ping
---traceroute
---telnet
---Etc.
Ultimate goal is to isolate the issue as closely as possible
by eliminating unrelated variables
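As a rough illustration of the telnet-style test mentioned above, the sketch below checks whether a TCP port answers. It is a minimal example using only Python's standard library; the function name `tcp_port_open` and the host in the usage comment are assumptions made for illustration, not part of any particular toolset.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


# Hypothetical usage (mail.example.com is a placeholder host):
# tcp_port_open("mail.example.com", 25)
```

A successful connection suggests that everything up to the transport layer is working along that path, which helps eliminate the lower layers as variables.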
How to Gather Information?
Structured troubleshooting involves breaking the operational
network down into functional layers
-E.g. OSI Model or TCP/IP Model
Where to actually start isolating is a personal preference
-Common approaches are:
--Top-Down
--Bottom-Up
--Divide and Conquer
Key to remember is that layers have a cascading effect
-E.g. if the physical layer (i.e. Layer 1) is down, all layers
above it are broken
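The cascading effect can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: the layer names follow a five-layer TCP/IP-style stack, and `effective_status` is a hypothetical helper that marks a layer as usable only when it and every layer beneath it are healthy.

```python
from typing import Dict

# Layers ordered from the bottom of the stack to the top (assumed naming)
LAYERS = ["physical", "data link", "network", "transport", "application"]


def effective_status(layer_ok: Dict[str, bool]) -> Dict[str, bool]:
    """A layer is usable only if it and all layers below it are healthy.

    Layers missing from layer_ok are treated as healthy.
    """
    result = {}
    below_ok = True
    for layer in LAYERS:
        below_ok = below_ok and layer_ok.get(layer, True)
        result[layer] = below_ok
    return result


# A Layer 1 failure takes every layer above it down as well
status = effective_status({"physical": False})
print(status["application"])  # False
```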
Top-Down Troubleshooting
Most useful for application-related issues
-E.g. user can’t send email – start by checking their email
settings
Potentially very time consuming if the problem resides in a lower
layer
-E.g. physical switchport is bad (Layer 1)
Bottom-Up Troubleshooting
-Verify each layer starting with physical and proceed to the
next
--Is the link up/up?
--Are the Layer 2 options correct?
--Is IP properly configured?
--Etc.
Like top-down, can be very time consuming depending on where
the problem actually lies.
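The bottom-up walk can be sketched as a loop over ordered checks that stops at the first failure. This is a minimal illustration: `bottom_up` is a hypothetical helper, and the lambdas are stubs with hard-coded results standing in for real show-command or ping checks.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (description, test function) pair; the test returns True on success
Check = Tuple[str, Callable[[], bool]]


def bottom_up(checks: List[Check]) -> Optional[str]:
    """Run checks from Layer 1 upward; return the first one that fails, or None."""
    for name, check in checks:
        if not check():
            return name
    return None


# Stubbed checks, ordered from the physical layer upward (results are made up)
checks = [
    ("link up/up", lambda: True),
    ("Layer 2 options", lambda: True),
    ("IP configuration", lambda: False),
    ("application", lambda: False),
]
print(bottom_up(checks))  # IP configuration
```

The same function performs a top-down walk if the list is passed in reverse order, although the first failure found from the top is then a symptom rather than the lowest broken layer.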
Divide and Conquer
Goal is to reduce search time by picking a layer to start at
Based on results of testing, further verification goes
either up or down the stack.
E.g. when troubleshooting an email problem:
--Can I ping the mail server?
---If yes, go up the stack
---If no, go down the stack
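The ping decision above can be sketched as follows. This is an illustrative sketch, not a definitive procedure: `divide_and_conquer` is a hypothetical helper, the checks are stubs ordered from the bottom of the stack upward, and it assumes that a passing check implies the layers beneath it are working.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def divide_and_conquer(checks: List[Check], start: int) -> Optional[str]:
    """Start at checks[start]; go up on success, down on failure.

    Returns the name of the lowest failing check found, or None if all pass.
    """
    if checks[start][1]():
        # Starting layer works, so lower layers are assumed fine: go up the stack
        for name, check in checks[start + 1:]:
            if not check():
                return name
        return None
    # Starting layer fails: go down the stack until a layer passes
    faulty = checks[start][0]
    for name, check in reversed(checks[:start]):
        if check():
            break
        faulty = name
    return faulty


# Stubbed results: the ping succeeds, so the search moves up the stack
checks = [
    ("link up/up", lambda: True),
    ("Layer 2 options", lambda: True),
    ("ping mail server", lambda: True),
    ("TCP port reachable", lambda: True),
    ("mail client settings", lambda: False),
]
print(divide_and_conquer(checks, start=2))  # mail client settings
```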
Defining and Implementing the Fix
Ideally up to this point the issue is sufficiently isolated
to make an educated guess as to how the problem can be fixed.
Proper “Change Control” at this stage is key.
--Clearly define the proposed fix
--Implement the proposed fix
--Did it work?
---If yes, proceed.
---If no, roll back.
Changing too many variables at once can compound the problem
even further.
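The define, implement, verify, and roll back cycle can be sketched as below. This is a minimal illustration under stated assumptions: `apply_with_rollback` is a hypothetical helper, and the dictionary stands in for a device configuration with a made-up duplex setting.

```python
from typing import Callable


def apply_with_rollback(apply_fix: Callable[[], None],
                        verify: Callable[[], bool],
                        roll_back: Callable[[], None]) -> bool:
    """Apply one change, verify it, and undo it if verification fails."""
    apply_fix()
    if verify():
        return True
    roll_back()
    return False


# Hypothetical example: a dict standing in for an interface configuration
config = {"duplex": "half"}
previous = dict(config)  # snapshot taken before the change


def set_full_duplex() -> None:
    config["duplex"] = "full"


def link_is_clean() -> bool:
    return config["duplex"] == "full"


def restore() -> None:
    config.clear()
    config.update(previous)


fixed = apply_with_rollback(set_full_duplex, link_is_clean, restore)
print(fixed)  # True
```

Wrapping a single change per call keeps the number of variables changed at once to one, so a failed verification always has one obvious thing to undo.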
Observing the Result
Depending on the nature of the problem, verification of the
solution can be either straightforward or complicated
-E.g. users said they couldn’t email, now they can; the problem is
straightforward and solved
-E.g. users experienced low VoIP quality, quality is now
good, but only time will tell
Within the TSHOOT exam, the final observation marks the end of your
scope.
Reiteration
If the problem was not solved, a further dilemma occurs
-Did I misdiagnose the problem in the first place?
-Are there significant variables that were overlooked?
-Was my fix not appropriate?
Before making further changes, more information should be gathered:
-Did the situation change since I implemented a fix?
--If yes, for the better or for the worse?
--If not, why not?
Document the Fix
-All good change control policies should require documentation
for all fixes.
-Documentation allows the development of a “Knowledge Base” for a
particular network topology.
-The KB can be referenced in the future to solve similar
problems, or to retrace your steps if the same problem recurs.
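One possible shape for a knowledge base entry is sketched below. The field names, the `find_similar` helper, and the sample entry are all assumptions made for illustration; a real KB would follow whatever the local change control policy requires.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KBEntry:
    """One documented fix (hypothetical field layout)."""
    symptom: str
    root_cause: str
    fix: str
    devices: List[str] = field(default_factory=list)


def find_similar(kb: List[KBEntry], keyword: str) -> List[KBEntry]:
    """Return entries whose symptom mentions the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [entry for entry in kb if keyword in entry.symptom.lower()]


# Made-up sample entry
kb = [
    KBEntry(
        symptom="Users report slow web browsing",
        root_cause="Duplex mismatch on an uplink",
        fix="Set both ends of the link to the same duplex setting",
        devices=["access-switch-1"],
    ),
]
print(len(find_similar(kb, "slow web")))  # 1
```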
ALAA HEGGA
Senior System Engineer Saudi Bin Laden Group