In modern cloud-centric applications, the connection can be split into three main segments:
- Local LAN
- Public Internet
- Cloud network
Learning that the fault lies in the public Internet is never good news for a network engineer. A LAN problem can usually be pinpointed and solved fairly quickly. Similarly, if we realize that the issue is in the cloud network, we can contact the provider and get it fixed.
An outage on the public Internet, which means one affecting an Internet Service Provider (ISP), is a different story. In this case, a few questions need to be answered as soon as possible:
- In which ISP is the outage occurring?
- Is it just me or is everyone else affected?
- How can I reroute my traffic in order to bypass the faulty paths?
ThousandEyes Internet Outage Detection aims to solve this problem by anonymously aggregating real-time data from all ThousandEyes users and tests, in order to generate insights about large-scale issues occurring across the Internet. For anyone not familiar with the ThousandEyes monitoring solution, we recommend reading this previous review.
Almost all tests in ThousandEyes include a Path Visualization, which traces the hop-by-hop path from each probing agent to the test target. Outages are identified by aggregating events where path traces terminate at any of the hops along the route. Analyzing a larger set of tests also makes it possible to gauge the magnitude of the outage.
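To make the aggregation idea more concrete, here is a minimal Python sketch. It is not ThousandEyes' actual algorithm, which is not described in detail here; the record layout, the `find_outage_candidates` function, and the `min_tests` threshold are all our own assumptions for illustration. It simply counts how many distinct tests have a trace that terminates at the same hop.

```python
from collections import defaultdict


def find_outage_candidates(traces, min_tests=2):
    """Count distinct tests whose path trace terminated at each hop.

    `traces` is a hypothetical list of (test_id, hops, reached_target)
    tuples, where `hops` lists the interfaces that responded, in order.
    A hop is reported only if at least `min_tests` tests stopped there.
    """
    tests_by_hop = defaultdict(set)
    for test_id, hops, reached_target in traces:
        if reached_target or not hops:
            continue  # trace completed, or there is nothing to attribute
        tests_by_hop[hops[-1]].add(test_id)  # last hop that responded
    return {
        hop: len(tests)
        for hop, tests in tests_by_hop.items()
        if len(tests) >= min_tests
    }


# Illustrative data only (documentation address ranges).
traces = [
    ("test-a", ["10.0.0.1", "192.0.2.1", "198.51.100.7"], False),
    ("test-b", ["10.1.0.1", "198.51.100.7"], False),
    ("test-c", ["10.2.0.1", "203.0.113.9"], True),
    ("test-d", ["10.3.0.1", "203.0.113.20"], False),
]
candidates = find_outage_candidates(traces)
print(candidates)
```

In this toy data set, two independent tests stop at the same interface, so it is flagged, while the single terminated trace of `test-d` is treated as a local issue. The more tests that terminate at one hop, the larger the likely outage.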
With Internet Outage Detection, ThousandEyes promises to promptly answer the three aforementioned questions. Let’s see this in action!
Disclaimer: This is a sponsored review. However, the content and opinions stated here are independently written by RouterFreak’s team. As always, we provide honest, unedited product reviews.
Traffic Outage Detection (Data Plane)
In the networking world, the data plane inspects the headers of each packet in transit and forwards it toward its destination. This is done hop by hop, so it’s something we can clearly depict in the ThousandEyes Path Visualization.
In the Path Visualization, when packet forwarding fails, the terminated traces are highlighted in red on the nodes (see screenshot below). This could be due to a specific issue affecting our agent's traffic, or it could be part of a larger outage. This is what Outage Detection should be able to reveal.
In the previous image, we can see Outage Detection kicking in: it is flagging an outage in Autonomous System 174, a.k.a. the ISP Cogent Communications. When an outage is occurring, the timeline is marked with purple blocks, and the underlying Path Visualization shows the nodes where the packet loss is happening.
In the Path Visualization, we find an Outage Detected section that lists the affected locations and interfaces, as also seen in the next picture. The interfaces can be easily identified using the provided IP address and domain name (if available).
ThousandEyes ensures that this data is collected anonymously from all ThousandEyes customer tests, so it’s not possible to infer which user, test, or customer it belongs to.
When we mouse over an affected node, we get information on how many tests were affected by this outage across the ThousandEyes customer base. This is illustrated in the following picture.
Another interesting metric shown here is the Loss Frequency, which tells how often an interface experiences loss and so gives an indication of how noisy that interface is. A ‘low’ loss frequency is obviously preferable.
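As a rough illustration of what a metric like this could look like, the following Python sketch computes the fraction of measurement rounds in which an interface showed any loss. The exact formula ThousandEyes uses is not stated here, so the `loss_frequency` and `label` functions, as well as the 10% threshold, are purely hypothetical.

```python
def loss_frequency(loss_samples):
    """Fraction of measurement rounds in which the interface showed loss.

    `loss_samples` is a hypothetical list of per-round packet loss
    percentages observed on one interface.
    """
    if not loss_samples:
        return 0.0
    rounds_with_loss = sum(1 for loss_pct in loss_samples if loss_pct > 0)
    return rounds_with_loss / len(loss_samples)


def label(frequency, low_threshold=0.1):
    """Map a frequency to a coarse label; the threshold is an assumption."""
    return "low" if frequency <= low_threshold else "high"


quiet_interface = [0, 0, 100, 0, 0, 0, 0, 0, 0, 0]
noisy_interface = [100, 50, 0, 100]

quiet_freq = loss_frequency(quiet_interface)
noisy_freq = loss_frequency(noisy_interface)
print(label(quiet_freq), label(noisy_freq))
```

The intuition is that loss on an interface that rarely drops packets is a stronger signal than loss on one that drops packets all the time.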
The Traffic Outage Detection feature is pretty straightforward to use; in fact, it is much easier to try than to explain. Please use this test link to access an interactive page where you can explore the collected data yourself.
Routing Outage Detection (Control Plane)
The control plane is the component of a router that handles the exchange of routing information: in the case of the public Internet, we are obviously talking about BGP (Border Gateway Protocol).
Routing Outage Detection is used to detect outages in the routing layer (i.e., the control plane), using data aggregated from 300 route monitors located around the globe. The feature groups BGP prefixes that have reachability issues in the same geographic location. Outages across the Internet are detected by evaluating the BGP reachability of the Autonomous Systems that originate, or are transited by, the prefixes found in the tests.
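The grouping step can be pictured with a short Python sketch. This is only our reading of the description above, not the product's implementation: the observation tuples, the location codes, and the `group_unreachable_prefixes` function are assumptions made for illustration.

```python
from collections import defaultdict


def group_unreachable_prefixes(observations):
    """Group prefixes with reachability problems by monitor location.

    `observations` is a hypothetical list of
    (monitor_location, prefix, reachable) tuples, one per route monitor
    and prefix. Returns {location: sorted list of affected prefixes}.
    """
    grouped = defaultdict(set)
    for location, prefix, reachable in observations:
        if not reachable:
            grouped[location].add(prefix)
    return {location: sorted(prefixes) for location, prefixes in grouped.items()}


# Illustrative data only (documentation prefixes).
observations = [
    ("US", "192.0.2.0/24", False),
    ("US", "198.51.100.0/24", False),
    ("US", "203.0.113.0/24", True),
    ("DE", "192.0.2.0/24", True),
    ("US", "192.0.2.0/24", False),  # a second US monitor, same prefix
]
affected = group_unreachable_prefixes(observations)
print(affected)
```

Here, two prefixes are unreachable from US monitors while the German monitor is unaffected, which would point to an issue localized to the United States.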
In the next picture, we can see Routing Outage Detection in action: note the purple color in the timeline, as with Traffic Outage Detection. In this case, it’s indicating a problem in the United States, specifically in AS 2914, also known as NTT America.
When there are BGP path changes, prefix reachability is often affected as well, as we see in the next picture. During the BGP outage, there is a drop in reachability for some of the monitored BGP prefixes.
The Routing Outage Detection provides a panel where we find information on:
- geolocation of the outage
- affected ISPs
- number of affected prefixes
- number of origin networks
As we can see in the next picture, several of the BGP monitors are marked yellow and red due to the ongoing outage, which is lowering the availability score; under normal conditions it would be 100%.
In the center, the green node is the BGP origin, while the solid and dashed red lines stand for prefix injection and withdrawal, respectively. The outage causes a lot of BGP activity, which has a negative impact on the carried traffic; this is why the outage is detected.
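One simple way to think about such a score is as the share of monitors that currently have a route to the prefix. The Python sketch below works under that assumption; the real score may be computed differently, and the `reachability_percent` function and monitor names are hypothetical.

```python
def reachability_percent(monitor_has_route):
    """Percentage of monitors that currently see a route to a prefix.

    `monitor_has_route` is a hypothetical {monitor_name: bool} mapping.
    """
    if not monitor_has_route:
        raise ValueError("at least one monitor is required")
    reachable = sum(1 for has_route in monitor_has_route.values() if has_route)
    return 100.0 * reachable / len(monitor_has_route)


monitors = {
    "monitor-1": True,
    "monitor-2": True,
    "monitor-3": False,  # lost its route during the outage
    "monitor-4": True,
}
score = reachability_percent(monitors)
print(f"{score:.0f}%")
```

With every monitor holding a route, the result is 100%; each monitor that loses its route during an outage pulls the figure down.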
Once again, the Routing Outage Detection feature can be tried using this test link to access an interactive page where you can explore the collected data yourself.
The documentation for Outage Detection can be found here. It provides a step-by-step explanation of each aspect of the feature, which is simple to view and use but has some cool technology underneath.
What we liked about ThousandEyes Outage Detection is the visibility created by anonymously analyzing all the customer tests. This creates a pretty solid body of data that can help provide both global and local troubleshooting information. It’s nice to see that ThousandEyes is taking security seriously, allowing sensitive businesses such as financial institutions to use this tool safely.
It is interesting that Outage Detection can answer the frequent question, “Is it just me having this problem, or is everyone else affected?”, along with providing a method to quickly pinpoint our affected tests. The public Internet is a bit like the Wild West, so it’s really nice to be pointed toward the faulty ISP in a timely manner. This makes it possible to contact that ISP to ask for a resolution, or even to take stronger action such as rerouting traffic to avoid the outage. This is how Outage Detection empowers the network engineers who use it for troubleshooting.
Something we missed in the tool is historical data on faulty ISPs. When Outage Detection kicks in, it would be good to have access to the history of the affected ISPs. If we had proof that a specific ISP often has issues with traffic forwarding or routing, we could try to divert the traffic through another network. Maybe this is something that ThousandEyes could add in the future.
All in all, we were pretty happy with the combination of data plane and control plane failure detection targeting the public Internet and pinpointing the faulty ISPs. In the end, this provides enhanced visibility into the Internet portion of any end-to-end connection flowing through the public infrastructure.
We tested ThousandEyes Outage Detection and once again the outcome was positive. Adding more visibility to this well-known tool is always welcome, and in this case ThousandEyes provided some additional insights into the public Internet segment, which is never easy to troubleshoot.
You can use ThousandEyes Internet Outage Detection for free by creating a trial account. This will give you access to the full feature set for 14 days. We strongly recommend you give this feature a shot. You will not regret it.