In the 1st Part of this series, I’ve described the most common steps that you should follow to troubleshoot a total lack of communication between a Layer 2 device (Cisco switch) and an end user connected device. As I promised here is the second part, in which I’ll try to show you what you can check when you have no problem with connection, but still you encounter a degradation in service. By this degraded service, I understand a scenario when you have packet loss for example, or intermitent connection which will affect communication and more than sure will make user users not very happy.
We will stick with the same scenario when a end user device is connected to a Cisco switch. Remember that until now, we just troubleshoot at the Layer 1 and Layer 2. Today we will stick in the same area, so nothing directly related to IP, routing protocols or complex networking environment.
Scenario 2: You have an end device connected to a switch and you have degraded communication
a) Check for errors on the interface:
In this example there is no errors, but if you find something there, you may want to keep an eye on this port. Try to issue the above command couple of times to see if the errors are increasing in real time, as this is the worst case possible and you should take action immediately. Error on the interface can be caused by faulty interface on the switch or on the other end, ethernet cable issue or wrong configuration
b) Check the interface queue and drop packets
Interface queue is very important and you should check it during your troubleshooting process. With the above command you can see how many packets are in the input / output queue, which is the transmit and receive rate and very important if you have packets dropped from input and output queue. Usually this happens when there is a lot of load on the switch and it cannot process as quick as it’s needed all the packets. This lead us to the next step.
c) Check the CPU load on the switch
The command output is longer but most interesting for this example are the first 2 rows which show load in 60 seconds and in 60 minutes. If you have there peaks up to 100, then it’s bad and the device is having some issues that need to be fixed.
d) Identify what process is keeping the CPU busy
Most of the time, this is easy to read and to see what process is taking all your CPU power. When you see there Fifo Error Detection with 100% than you have to think that maybe there is something wrong with the queue on one of the interfaces and try to find which one is having problem. This is not straighforward and you have to check a lot of things, but can be helpful. To be honest, I see a lot of engineers just reloading the device and then problem is solved (if it was due to a hardware issue and not a configuration mistake).
e) Check for memory issue on the switch
Again, if you run out of memory, bad things can happend to your device and as well to the communication with device connected to the switch. Reloading of the device solved about 90% of this kind of problems. I don’t recommend just unplug the power cable as soon as you see a memory problem. First have a look, maybe there is something you can fix without reloading the device.
f) Check for problems with storm-control implementation
In one of previous posts I have explained how you can use storm-control to limit the available bandwidth on a Cisco switch interface. In the example above I set this bandwidth to 1 % from the available one gigabit (I know is stupid, but imagine a typo mistake). Imagine what effect will this have on the traffic. Everything above 1 % is keeped in the queue until this is full and then silent discarded.
e) As a general rule, have a look into the logs (maybe this should be first step!)
If there are a lot of Spanning-tree reconfiguration, interface flapping or anything else that looks suspicious, be sure to check on this as you can find there the root cause for your problems.
Do you have any other tips in regard to this topic? Anything else you check and can be added here? Be sure to comment below and your suggestion will be taken into consideration.