Ran into a very strange issue that ultimately turned out to be a basic CCNP R&S topic. It involved EIGRP, BGP, Netflow, and redistribution….
AKA I can’t get to Facebook. :)
An essential tool for finding out what exactly is going on in a network is some kind of monitoring tool — like Netflow. Netflow is Cisco’s way to “categorize” traffic down to TCP, UDP, port #s, type of traffic, source and destination IP addresses, etc. Other vendors have their own “Netflow” tool as well: Juniper has J-Flow, and Arista has sFlow. Basically, you’ll want something to view this information. Without it you’re shooting in the dark.
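For context, turning on classic NetFlow on an IOS router only takes a few lines. This is a generic sketch, not our actual config — the interface name, collector address, and port are placeholders:

```
! Capture flow records on the WAN-facing interface
interface GigabitEthernet0/0
 ip flow ingress
 ip flow egress
!
! Export v9 flow records to the collector (Solarwinds, in our case)
ip flow-export version 9
ip flow-export source Loopback0
ip flow-export destination 192.0.2.10 2055
```

From there, the collector does the heavy lifting of breaking everything down by source/destination, protocol, and port.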
Once I pulled up the Solarwinds report on the Netflow stats, I noticed we were maxing out the link. This, of course, would cause some slowness. My manager said “open a ticket with Verizon to check the circuit!” I knew from the Netflow report that we were hitting our 100Mbps limit, so I highly doubted it was a carrier issue.
I took a deeper look at the “Top Talkers” section of the Netflow report. I noticed a bunch of subnets OUTSIDE of our LAN at the office. Huh??
Our office had only a 6500 switch and two 2951 routers. All of our datacenter equipment was in the datacenter. We had only one data VLAN, which was the 10.10.0.0/16 subnet. So why was I seeing IPs from the 10.20.0.0/16 subnet here?
Was it DNS? Was there a bad route? Did the intern at Verizon mess with BGP routes? Was Netflow even working properly? Too many questions were up in the air.
We took a look at one of the 10.20.0.0 boxes and checked the basic stuff: traceroutes, static IP, DNS entries, even the Netflow results at the 10.20.0.0 site. But nothing!
After spinning our wheels in the mud for a bit, I decided to try a tracert from the server that all these boxes were talking to. It just so happened to be an Exchange server. What was so special about this Exchange box? For one, it was in our datacenter. But why was traffic going through one site to reach another? After RDPing to the box in the datacenter, I saw that a traceroute from it indeed DID hit our local LAN, the 10.10.0.0 subnet.
So the path from the MPLS spoke site (10.20.0.0) to the datacenter (10.1.0.0) took the MPLS path, but the return traffic made a pit stop at our NYC office (10.10.0.0). We had two different paths: classic asymmetric routing.
Looking at the tracerts from the Windows box in the datacenter, I saw the routing table was pointing to the NYC office. Why? EIGRP was pointing at it for some reason, which would explain why we were seeing traffic from another MPLS spoke site appear on our NYC router, and thus in the NYC Netflow, when it should have appeared at a completely different site altogether!
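A few “show” commands on the datacenter router make this kind of mismatch obvious. These are standard IOS commands (exact syntax varies slightly by version), and the prefixes below are just illustrative:

```
! What the routing table actually installed for the spoke subnet
show ip route 10.20.0.0
!
! What each protocol offered for that prefix
show ip bgp 10.20.0.0/16
show ip eigrp topology 10.20.0.0/16
```

Comparing what BGP learned over the MPLS against what EIGRP was offering is the quickest way to see which protocol won the route, and why.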
So to solve this missing Netflow case and put the traffic back where it should be going — we had to make some BGP changes on the NYC router so the MPLS path would be preferred over the detour through our office. The reason BGP was winning: eBGP carries an administrative distance of 20, while external (redistributed) EIGRP routes carry 170, so the BGP routes took precedence. Once we did some route-maps and tagging to deny this traffic from coming into the NYC router, the tracert from the datacenter Exchange box went through the MPLS network instead of touching our NYC office.
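The general shape of the fix is the standard tag-and-filter pattern for mutual redistribution, sketched below. To be clear: the tag value, AS number, EIGRP process number, and metrics here are made up, and our actual route-maps were specific to our topology.

```
! Tag anything learned from BGP as it is redistributed into EIGRP
route-map BGP-TO-EIGRP permit 10
 set tag 100
!
router eigrp 10
 redistribute bgp 65001 metric 100000 100 255 1 1500 route-map BGP-TO-EIGRP
!
! On the NYC router: refuse to feed those tagged routes back into BGP
route-map EIGRP-TO-BGP deny 10
 match tag 100
route-map EIGRP-TO-BGP permit 20
!
router bgp 65001
 redistribute eigrp 10 route-map EIGRP-TO-BGP
```

With the NYC router no longer re-advertising those prefixes, the datacenter stopped seeing a “better” path through our office, and return traffic stayed on the MPLS.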
^ A reminder for anyone wondering what Administrative Distance (AD) is: it’s how a router ranks competing route sources, and lowest wins — connected 0, static 1, eBGP 20, internal EIGRP 90, OSPF 110, external EIGRP 170, iBGP 200.
Moral of the story: verify BOTH sides. Don’t just run one trace from one side and assume the return traffic takes the same path. Do it from both ends to ensure you’ve got the full path of traffic mapped out!