The authors measure wide area network (WAN) latency from the viewpoint of a large cloud provider, Azure, by tracking the round-trip time (RTT) of transmission control protocol (TCP) connections. With their tool, BlameIt, they aim to detect latency faults and diagnose where in the WAN they occur.
Pinpointing where a problem occurs in a large WAN is a pressing challenge in networking today. As networks grow and become more complex, it becomes harder to find where and why problems occur, such as data not reaching its destination or packets being lost along the way. This paper presents a measurement tool, based primarily on passive data, to help localize certain problems in a WAN.
The paper first presents a measurement analysis of various aspects of the Azure network. It describes the datasets collected and how the authors use them to deduce (1) the countries in which bad RTT is most commonly recorded, (2) how long these bad connections last, and (3) how they affect clients. It then goes on to present BlameIt. The tool passively records RTT-relevant data to understand where problems are happening: in the cloud segment, in the middle segment, or on the client side. A number of issues are recognized; for example, middle-segment problems dominate in India, China, and Brazil. The authors also found that the US has more directly related high RTTs than the rest of the world.
Where latency degrades between client and cloud locations, the tool localizes the issue at the granularity of autonomous systems (ASes) and border gateway protocol (BGP) paths, using a combination of passive measurements (TCP handshake RTTs) and selective active measurements (traceroutes).
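To make the idea concrete, here is a minimal Python sketch of coarse segment-level blame assignment from passive RTT samples, with traceroutes reserved for paths where the middle segment is suspected. It is an illustration only, not the authors' implementation: the `RttSample` fields, the 100 ms threshold, the 0.8 majority fraction, and the function names are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RttSample:
    cloud: str          # cloud location that served the connection
    as_path: str        # hypothetical label for the middle AS-level path
    client_prefix: str  # hypothetical label for the client network
    rtt_ms: float       # TCP handshake RTT observed passively


def bad_fraction(samples, threshold_ms):
    """Fraction of samples whose RTT exceeds the threshold."""
    if not samples:
        return 0.0
    return sum(s.rtt_ms > threshold_ms for s in samples) / len(samples)


def assign_blame(samples, threshold_ms=100.0, majority=0.8):
    """Blame cloud, middle, or client for each bad sample (illustrative rule).

    Assumed rule: if most connections to a cloud location are bad, blame the
    cloud; else if most connections over the same path are bad, blame the
    middle; otherwise blame the client.
    """
    by_cloud = defaultdict(list)
    by_path = defaultdict(list)
    for s in samples:
        by_cloud[s.cloud].append(s)
        by_path[(s.cloud, s.as_path)].append(s)

    blame = {}
    for s in samples:
        if s.rtt_ms <= threshold_ms:
            continue  # only bad samples need a diagnosis
        key = (s.cloud, s.as_path, s.client_prefix)
        if bad_fraction(by_cloud[s.cloud], threshold_ms) >= majority:
            blame[key] = "cloud"
        elif bad_fraction(by_path[(s.cloud, s.as_path)], threshold_ms) >= majority:
            blame[key] = "middle"
        else:
            blame[key] = "client"
    return blame


def needs_traceroute(blame):
    """Paths worth an active traceroute: only those blamed on the middle."""
    return sorted({(cloud, path)
                   for (cloud, path, _), segment in blame.items()
                   if segment == "middle"})
```

In this sketch, passive data does the coarse triage and `needs_traceroute` limits active probing to the few paths where finer, per-AS localization is still required.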
The paper is easy to read, and it’s exciting to see how Azure measures and determines where bad performance is happening on its network. Other networks rely on tools such as perfSONAR and on loss measurements, and it would be interesting to see how Google Cloud Platform (GCP) and Amazon Web Services (AWS) measure their network performance. This paper is a good read for those working to improve network performance using machine learning.