Troubleshooting ARP/IGMP/Router CPU

We recently started having issues with a building reporting that ICMP stopped responding on a distribution router and on some access switches behind the router. Some routing interfaces would respond, but the management VLAN interface wouldn’t. Further troubleshooting showed that CPU utilization on the router, which is made up of two Extreme Networks 7100-Series switches running OSPF, climbed to 80-100%. The “show logging buffer” output revealed massive numbers of host-dos ARP attack events. Our first thought was that an infected machine was creating an ARP storm. This would happen about twice a day for about a minute, but not at exactly the same time. We tracked down the MAC addresses and removed the PCs from the network. This didn’t seem to help, as another set of MAC addresses would show up in the logging buffer for host-dos ARP attack events the next day. We decided to start running a Wireshark packet capture. We could see the ARP storm along with some IGMP traffic, enough to easily overwhelm a Wireshark session, but we couldn’t identify the root cause.

After further investigation of the host-dos ARP logs, we noticed that the source interface should have been in STP blocking mode, since it was a redundant link to the access switch. My thought was that the massive ARP flooding could have been caused by a loop. But why would a loop occur? I then caught the CPU process table during an outage and found that the IGMP process was consuming the router CPU. Could it be that the ARP storm wasn’t the root issue, but a secondary symptom of something else going on? We decided to disable the redundant interfaces, which would take the possibility of a loop out of the picture. My theory was that the high CPU was causing dropped BPDUs, so the secondary link would go into forwarding on the access switch while the CPU-bound router was still using the original link, and that this created the loop and the ARP storm.

The issue continued with the redundant links disconnected, but now we weren’t seeing the host-dos ARP logs. OK, so we knew we had high CPU utilization. We also knew it was the IGMP process. There was a slight traffic increase on the routing interface just before each CPU spike, so I enabled NetFlow on the upstream core router and had it export flow data to PRTG (a network monitoring utility). PRTG showed that the top talker was a newly built Landesk server. Now we were getting somewhere. Further research into Landesk revealed that the product uses multicast. My team decided to run another packet capture while booting up a lab and, presto, the CPU started to spike on the router. The packet capture revealed a large amount of multicast traffic identified as Landesk traffic. The multicast address was 239.83.100.109 with UDP destination port 33355, which is defined as Landesk “software distribution”. The flood of multicast traffic seemed to be the culprit behind the high router CPU utilization.
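For anyone repeating this kind of hunt, the two checks above (finding the top talker in flow data and recognizing a multicast destination) can be sketched in a few lines of Python. This is only a minimal illustration and not how PRTG works: the flow records, source addresses, and byte counts are hypothetical, and only the group address 239.83.100.109 and port 33355 come from our capture. The MAC mapping uses the standard IPv4 multicast rule (01:00:5e plus the low 23 bits of the group address), which is handy when filtering a capture on the Ethernet destination.

```python
import ipaddress
from collections import Counter

# Hypothetical flow records: (source IP, destination IP, destination port, bytes).
# Only 239.83.100.109 and port 33355 come from the capture described above;
# the source addresses and byte counts are invented for illustration.
FLOWS = [
    ("10.0.0.50", "239.83.100.109", 33355, 900_000),
    ("10.0.0.50", "239.83.100.109", 33355, 750_000),
    ("10.0.0.21", "10.0.1.5", 443, 120_000),
    ("10.0.0.33", "10.0.1.9", 53, 4_000),
]


def top_talkers(flows, n=3):
    """Sum bytes per source address and return the n largest senders."""
    totals = Counter()
    for src, _dst, _port, nbytes in flows:
        totals[src] += nbytes
    return totals.most_common(n)


def multicast_flows(flows):
    """Keep only the flows whose destination is an IPv4 multicast group."""
    return [f for f in flows if ipaddress.IPv4Address(f[1]).is_multicast]


def multicast_mac(group):
    """Map an IPv4 multicast group to its Ethernet destination MAC.

    The MAC is 01:00:5e followed by the low 23 bits of the group address.
    """
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    low23 = int(addr) & 0x7FFFFF
    octets = [0x01, 0x00, 0x5E,
              (low23 >> 16) & 0xFF, (low23 >> 8) & 0xFF, low23 & 0xFF]
    return ":".join(f"{o:02x}" for o in octets)


if __name__ == "__main__":
    print(top_talkers(FLOWS))
    print(multicast_flows(FLOWS))
    print(multicast_mac("239.83.100.109"))
```

With real data, the flow list would come from whatever your collector exports, and the resulting MAC could be used in a capture filter to isolate the multicast stream.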

Happy troubleshooting,

@javi_isolis

2 comments

  • Great post on tracking down difficult problems in the network.

    I recently experienced similar performance issues with timeouts and disconnects between various subnets. It turned out that a monitor session configured weeks earlier was gobbling up system resources on a Brocade VDX switch stack. This caused intermittent slowness and disconnects throughout the day. Very hard to pin down. Once I cleared the monitor session, traffic returned to normal and our problems disappeared.

    The process was similar to yours. Is it the LAG between the access layer and the core? Is it a loop? Is it the VMware environment? Narrowing down the many questions at the start is a key skill that we as engineers need to possess in spades!

    Great work on fixing your problem. It’s always useful to detail the fix and have a searchable registry in case something like that happens again. We use a simple [CHG] subject email to let our team know of any additions or network changes. When a problem comes up, a quick review of the [CHG] emails lets us see any recent changes at a glance.

  • Thanks for the comments, Evan. We currently use a SharePoint list to track our changes. However, this issue was escalated to us in the form of a ticket. I like your idea of receiving an email with changes. I’ll have to see if we can have SharePoint send out an email to the team when updates are made to the list.

    We ended up blocking the multicast address temporarily using Extreme Networks’ policy manager, which is basically ACLs applied at the port level. We contacted both Landesk and Extreme Networks tech support. Extreme Networks had us apply a firmware fix, and Landesk suggested that we disable the ARP discovery method within the application. Both tasks were completed, the multicast address block has been removed, and we are no longer running into any issues.
