Troubleshooting ARP/IGMP/Router CPU
We recently starting having issues with a building reporting that icmp stopped responding on a distribution router and some access switches behind the router. Some routing interfaces would respond, but the management VLAN interface wouldn’t. Further troubleshooting showed that the CPU processes on the router comprised of two Extreme Networks 7100 series switching running OSPF climbed up to 80/100% utilization. The “show logging buffer” revealed massive amounts of host-dos ARP attack events. The first thought was that a possible infected machine was creating an ARP storm. This would happen about twice a day for about a minute, but not at the exact same time. We tracked down the MAC addresses and removed the PC’s from the network. This didn’t seem to help, as another set of MAC’s addresses would show up in the logging buffer for host-dos ARP attack events the next day. We decided to start running a Wireshark packet capture. We could see the ARP storm along with some other IGMP traffic that would easily consume a Wireshark session, but we couldn’t identify the root cause.
After further investigation of the host-dos ARP logs, we noticed that the source interface should have been in STP blocking mode due to it being a redundant link to the access switch. My thought was that the massive ARP flooding could have been caused by a loop. Why would a loop occur? I then caught the CPU process table during the outage and I found that the IGMP process was consuming the router CPU. Could it be that the issue wasn’t due to an ARP storm, but the ARP storm was a secondary issue to something else going on? We decided to disable the redundant interfaces. This would take the possibility of a loop being created out of the picture. My thought was that the high CPU was causing dropped bpdu’s and the secondary link would go into forwarding on the access switch, but the router being CPU bound was still using the original link which caused the ARP storm/loop.
The issue continued with the redundant links disconnected, but now we weren’t seeing the host-dos ARP logs. Ok, so we knew we had high CPU utilization. We also knew it was the IGMP process. There was a slight traffic correlation on the routing interface before the CPU spike, so I enabled netflow on the upstream core router. Netflow started forwarding data to PRTG (Network monitoring utility). PRTG showed that the top talker was a newly built Landesk server. Now we were getting somewhere. Further research into Landesk revealed that the product uses multicast. My team decided to run another packet capture while booting up a lab and presto, the CPU started to spike on the router. The packet capture revealed a large number of multicast traffic classified to be used by Landesk. The multicast address was 18.104.22.168 along with UDP port destination of 33355 which was defined as Landesk “software distribution”. The flooding of multicast traffic seemed to be the culprit of the high router CPU utilization.