RHCSA RHCE VCP550 VCAP-DCA VCAP-DCD HP MASTER ASE STORAGE WORKS DATA CENTER DESIGN: Why troubleshooting? TSHOOT 300-135 INE Session Summary

Why troubleshooting?

Today's networks are more high-availability minded than ever, and downtime means loss of revenue through:

-Lost employee productivity

-Customer SLA violations

-Regulatory fines

-Etc.

One key way expert-level engineers set themselves apart from average engineers is their troubleshooting methodology.

-The average engineer runs around like a chicken with its head cut off.

-The expert engineer keeps a cool head and follows a structured approach.

Structured Troubleshooting Approach

-Defines a logical and systematic method of troubleshooting that can be applied to any case.

--E.g. troubleshooting VoIP call quality and troubleshooting an OSPF neighbor adjacency involve different discrete steps, but the logical approach is the same.

-Structured troubleshooting is closely analogous to the scientific method of conducting experiments.

[Figure: Scientific Method Workflow]

[Figure: Structured Troubleshooting Workflow]
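The workflow steps covered in the rest of this post (define the problem, gather information, implement a fix, observe the result, reiterate, document) can be sketched as a loop. This is a minimal Python sketch, not a real tool: every callback name below is hypothetical and stands in for a manual step the engineer performs.

```python
from typing import Callable


def structured_troubleshoot(
    define_problem: Callable[[], str],
    gather_info: Callable[[str], dict],
    propose_fix: Callable[[dict], str],
    implement_fix: Callable[[str], None],
    rollback_fix: Callable[[str], None],
    problem_solved: Callable[[], bool],
    document_fix: Callable[[str, str], None],
    max_iterations: int = 5,
) -> bool:
    """Hypothetical driver: each callback stands in for a manual step."""
    problem = define_problem()              # reactive ticket or proactive alert
    for _ in range(max_iterations):
        facts = gather_info(problem)        # show/debug output, ping, traceroute...
        fix = propose_fix(facts)            # educated guess from the isolated facts
        implement_fix(fix)                  # one change at a time
        if problem_solved():                # observe the result
            document_fix(problem, fix)      # feed the knowledge base
            return True
        rollback_fix(fix)                   # not solved: roll back and reiterate
    return False                            # give up after max_iterations attempts
```

Passing the steps in as functions keeps the loop identical whether the case is VoIP call quality or an OSPF adjacency; only the callbacks change, which mirrors the point that the logical approach stays the same.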

Defining the Problem

-Network problems are generally discovered in one of two ways:

--Reactive (e.g. a help desk ticketing system)

---e.g. users submit tickets to the help desk reporting that web browsing is slow

--Proactive (monitoring systems such as PRTG, Nagios, CiscoWorks and HP OpenView)

---e.g. an SNMP trap reports a linkDown event

-In either case, more investigation is needed to find the root cause.
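As a rough illustration of proactive detection, the sketch below turns device log messages into problem tickets. The post's example is an SNMP linkDown trap; receiving traps needs an SNMP library, so this sketch swaps in parsing of Cisco IOS `%LINK-3-UPDOWN` and `%LINEPROTO-5-UPDOWN` syslog lines instead. The `ProblemTicket` record and the layer labels are assumptions for illustration.

```python
import re
from dataclasses import dataclass

# Cisco IOS interface state messages, e.g.
#   %LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down
#   %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/1, changed state to down
UPDOWN_RE = re.compile(
    r"%(?P<facility>LINK|LINEPROTO)-\d-UPDOWN: "
    r"(?:Line protocol on )?Interface (?P<iface>[^,\s]+), "
    r"changed state to (?P<state>up|down)"
)


@dataclass
class ProblemTicket:  # hypothetical ticket record
    device: str
    interface: str
    detail: str


def tickets_from_syslog(device: str, lines: list[str]) -> list[ProblemTicket]:
    """Open one ticket per 'changed state to down' message."""
    tickets = []
    for line in lines:
        m = UPDOWN_RE.search(line)
        if m and m.group("state") == "down":
            layer = ("layer 1 (link)" if m.group("facility") == "LINK"
                     else "layer 2 (line protocol)")
            tickets.append(ProblemTicket(device, m.group("iface"),
                                         f"{layer} went down"))
    return tickets
```

A ticket like this only records the symptom; more investigation is still needed to find the root cause.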

Gathering Information

-Apart from asking users for more details on the tickets they submitted, information is gathered using:

--show commands

--debug commands

---Typically not used in the real world except in a network-down emergency, since debugging can heavily load the device CPU

--Miscellaneous testing tools

---ping

---traceroute

---telnet

---Etc.

The ultimate goal is to isolate the issue as closely as possible by eliminating unrelated variables.
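The telnet test from the list above mostly answers one question: can a TCP session be opened to a given port? A minimal Python equivalent, assuming only the standard `socket` module, is sketched below.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Rough equivalent of 'telnet host port': can a TCP session be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True             # three-way handshake completed
    except OSError:                 # refused, unreachable, or timed out
        return False
```

For example, `tcp_port_open("mail.example.com", 25)` (hypothetical host) returning False while the server still answers ping narrows the search to the port itself (a stopped service or a filter in the path) rather than basic IP reachability.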

How to Gather Information

Structured troubleshooting involves breaking the operation of the network into functional layers.

-E.g. OSI Model or TCP/IP Model

Where to actually start isolating is a personal preference.

-Common approaches are:

--Top-Down

--Bottom-Up

--Divide and Conquer

The key thing to remember is that layers have a cascading effect.

-E.g. if the physical layer (layer 1) is down, all layers above it are broken.

Top-Down Troubleshooting

Most useful for application-related issues.

-E.g. a user can’t send email: start by checking their email client settings.

Potentially very time-consuming if the problem resides in a lower layer.

-E.g. the physical switchport is bad (layer 1).
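To make the cost of top-down concrete, here is a small Python model, entirely hypothetical: a fault is placed at one OSI layer and, reflecting the cascading effect, a check at layer n passes only if every layer from 1 through n works. Top-down starts at layer 7 and moves down until a check passes.

```python
from typing import Callable

LayerTest = Callable[[int], bool]


def make_layer_test(faulty_layer: int) -> LayerTest:
    """Hypothetical harness: a check at OSI layer n passes only if every
    layer from 1 to n works (the cascading effect)."""
    def layer_test(n: int) -> bool:
        return n < faulty_layer
    return layer_test


def top_down(layer_test: LayerTest) -> tuple[int, int]:
    """Start at layer 7 and work down; return (faulty layer, checks run).
    Assumes some layer is actually faulty."""
    checks = 0
    for n in range(7, 0, -1):
        checks += 1
        if layer_test(n):          # first layer that works...
            return n + 1, checks   # ...so the fault is just above it
    return 1, checks               # nothing worked: layer 1 is at fault


print(top_down(make_layer_test(1)))  # bad switchport -> (1, 7)
```

With a bad switchport the search returns `(1, 7)`: all seven layers are checked before the fault is found.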

Bottom-Up Troubleshooting

-Verify each layer, starting with the physical layer, then proceed to the next:

--Is the link up/up?

--Are the layer 2 options correct?

--Is IP properly configured?

--Etc.

Like top-down, bottom-up can be very time-consuming depending on where the problem actually lies.
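The same hypothetical layer model, searched bottom-up: start at layer 1 and stop at the first check that fails. As before, it assumes a check at layer n passes only if layers 1 through n all work.

```python
from typing import Callable, Optional

LayerTest = Callable[[int], bool]


def make_layer_test(faulty_layer: int) -> LayerTest:
    """Hypothetical harness: a check at layer n passes only if layers 1..n work."""
    return lambda n: n < faulty_layer


def bottom_up(layer_test: LayerTest) -> tuple[Optional[int], int]:
    """Start at layer 1 and work up; return (faulty layer, checks run)."""
    checks = 0
    for n in range(1, 8):
        checks += 1
        if not layer_test(n):      # first failing check pinpoints the layer
            return n, checks
    return None, checks            # every layer checked out
```

Here a bad switchport costs one check, `(1, 1)`, while an application-layer fault costs seven, `(7, 7)`: the mirror image of top-down.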

Divide and Conquer

The goal is to reduce search time by picking a layer to start at, typically one in the middle of the stack.

Based on the test results, further verification proceeds either up or down the stack.

E.g. when troubleshooting an email problem:

--Can I ping the mail server?

---If yes, go up the stack

---If no, go down the stack
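Because a check at one layer passes only when every layer below it works, divide and conquer behaves like a binary search over the stack. The Python sketch below reuses the same hypothetical layer model as a test harness; starting at layer 3 corresponds to the ping test in the email example.

```python
from typing import Callable

LayerTest = Callable[[int], bool]


def make_layer_test(faulty_layer: int) -> LayerTest:
    """Hypothetical harness: a check at layer n passes only if layers 1..n work."""
    return lambda n: n < faulty_layer


def divide_and_conquer(layer_test: LayerTest, start: int = 3) -> tuple[int, int]:
    """Binary search for the lowest failing layer (1-7), first probing `start`.
    Assumes some layer is actually faulty."""
    lo, hi = 1, 7                  # the fault lies somewhere in [lo, hi]
    probe, checks = start, 0
    while lo < hi:
        checks += 1
        if layer_test(probe):      # works up to here: go up the stack
            lo = probe + 1
        else:                      # broken at or below here: go down the stack
            hi = probe
        probe = (lo + hi) // 2
    return lo, checks
```

With the first probe at layer 3, any single faulty layer in this model is found in at most three checks, compared with up to seven for a pure top-down or bottom-up pass.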

Defining and Implementing the Fix

Ideally, by this point the issue is sufficiently isolated to make an educated guess as to how the problem can be fixed.

Proper “change control” at this stage is key:

--Clearly define the proposed fix

--Implement the proposed fix

--Did it work?

---If yes, proceed.

---If no, roll back.

Changing too many variables at once can compound the problem even further.
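The change-control steps above (define, implement, verify, roll back) can be sketched in Python against a hypothetical configuration held in a dictionary. The function changes exactly one setting per call, in line with the warning about changing too many variables at once; `verify` is an assumed callback standing in for whatever test proves the fix.

```python
import copy


def apply_change(config: dict, key: str, new_value, verify) -> bool:
    """Apply ONE change; keep it if verify(config) passes, otherwise roll back."""
    backup = copy.deepcopy(config)      # snapshot taken before the change
    config[key] = new_value             # implement the proposed fix
    if verify(config):                  # did it work?
        return True                     # yes: proceed
    config.clear()
    config.update(backup)               # no: restore the snapshot exactly
    return False
```

On a real device the equivalent of `backup` would be a saved copy of the configuration taken before the change.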

Observing the Result

Depending on the nature of the problem, verifying the solution can be either straightforward or complicated.

-E.g. users said they couldn’t send email and now they can: the problem is straightforward and solved.

-E.g. users experienced poor VoIP call quality and quality is now good, but only time will tell whether it stays that way.

Within the scope of the TSHOOT exam, your final observation after implementing the fix is as far as verification goes.
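For problems like VoIP quality, where only time will tell, one way to judge a fix is to keep sampling a metric and only declare success once a full window of samples stays within a threshold. The Python sketch below assumes per-interval jitter samples in milliseconds and a 30 ms limit; both the metric and the numbers are illustrative assumptions.

```python
from typing import Optional


def fix_holds(samples: list[float], threshold: float, window: int) -> Optional[bool]:
    """None until `window` samples exist; then True only if every sample in
    the most recent window is at or below `threshold`."""
    if len(samples) < window:
        return None                               # too early to tell
    return all(s <= threshold for s in samples[-window:])


jitter_ms = [45.0, 12.0, 15.0, 20.0]  # illustrative: the first sample predates the fix
print(fix_holds(jitter_ms, threshold=30.0, window=3))  # True
```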

Reiteration

If the problem was not solved, further questions arise:

-Did I misdiagnose the problem in the first place?

-Are there significant variables that were overlooked?

-Was my fix not appropriate?

Before making further changes, more information should be gathered:

-Did the situation change since I implemented the fix?

--If yes, for better or for worse?

--If not, why not?
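The “for better or for worse?” question is easier to answer against a baseline measured before the fix. A minimal Python sketch, assuming a single numeric metric (for example packet loss in percent) and a hypothetical 5% relative band treated as measurement noise:

```python
def change_since_fix(before: float, after: float, tolerance: float = 0.05,
                     lower_is_better: bool = True) -> str:
    """Classify a metric as 'better', 'worse' or 'unchanged' relative to a
    pre-fix baseline; changes within `tolerance` (relative) count as noise."""
    if abs(after - before) <= tolerance * abs(before):
        return "unchanged"
    improved = after < before if lower_is_better else after > before
    return "better" if improved else "worse"
```

Set `lower_is_better=False` for metrics such as throughput. An "unchanged" result is the cue for the "why not?" question.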

Document the Fix

-All good change control policies should require documentation for all fixes.

-Documentation allows the development of a “knowledge base” (KB) for a particular network topology.

-KB can be referenced in the future to solve similar problems, or to trace your steps if the same problem is recurring.
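A knowledge base can be as simple as a searchable list of records. The sketch below is hypothetical: the field names (`symptom`, `root_cause`, `fix`, `devices`) are assumptions, and a real KB would normally live in a wiki or ticketing system rather than in memory. `occurrences` helps spot a problem that keeps recurring.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class KBEntry:  # hypothetical record layout
    symptom: str
    root_cause: str
    fix: str
    devices: list[str] = field(default_factory=list)
    recorded: date = field(default_factory=date.today)


class KnowledgeBase:
    def __init__(self) -> None:
        self.entries: list[KBEntry] = []

    def record(self, entry: KBEntry) -> None:
        self.entries.append(entry)

    def search(self, keyword: str) -> list[KBEntry]:
        """Case-insensitive match against symptom and root cause."""
        k = keyword.lower()
        return [e for e in self.entries
                if k in e.symptom.lower() or k in e.root_cause.lower()]

    def occurrences(self, symptom: str) -> int:
        """How many times this exact symptom has been recorded (recurrence)."""
        return sum(1 for e in self.entries if e.symptom == symptom)
```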

ALAA HEGGA

Senior System Engineer, Saudi Bin Laden Group

sa.linkedin.com/pub/alaa-hegga/16/862/423


Saturday, May 9, 2015
