Causes and Correlation of Network Impairments
We have passed through the halcyon days of the Internet's childhood. During that period we have seen the growth of applications that fit well into the Internet's basic communications parameters -- applications such as e-mail, the World Wide Web, and even highly buffered non-conversational (one-way) streaming audio and video. These applications fit well within the kinds of packet transit delays and packet loss rates that are found on the Internet.

However, we are now beginning to deploy network-based applications that place greater demands on the underlying network. Conversational applications, such as voice-over-IP (VOIP) and storage area networks (SANs, iSCSI), are sensitive to the time it takes for packets to transit the network, how much that time varies (jitter), and how often packets are lost, corrupted, reordered, or replicated. It will be increasingly necessary to engineer these new applications with full recognition of the characteristics of the underlying network. These new applications may, and frequently will, require that the underlying network itself be engineered, tuned, and operated to meet defined service level agreements (SLAs).

Few tools exist for application designers and their customers to discover and explore the boundaries of network behavior within which their applications work or do not work. In 1492 Columbus looked west across the Atlantic Ocean and said "there is land out there" but he was unable to say where or how far. To find the answers, Columbus had to take on the risk and expense of actually deploying his ships and sailors. Fortunately we are in a somewhat better position than Columbus was - there are new tools that let us explore the boundaries of our networks and the limitations of our new applications without undertaking the risk and expense of an exploratory deployment. The purpose of this paper is to discuss the ways in which networks may be imperfect and how we evaluate and deal with those limitations.
In addition, it has become common practice for network equipment vendors to make sweeping claims that new classes of applications require the deployment of that vendor's latest equipment. We find that such claims are either unwarranted or made without knowledge of the actual requirements of the applications in question. This paper advocates that customers create testbeds in which they may ascertain the actual service requirements of their present and planned applications. The resulting data will give the customer the information needed to understand whether the existing network infrastructure is adequate, whether it needs to be upgraded, and what service level agreements should be established with network providers. This approach could result in enormous cost savings and reduced deployment times.

What do we mean by the "imperfect network"?

There is no such thing as a perfect network. The laws of physics and mathematics impose limitations that are random (such as noise on copper or fiber optic cables, or hardware or software failures in routers and switches) or predictable (such as the speed of electrical pulses on wires or light on fiber optics).

There are other sources of imperfection: Packets may be lost or delayed by transient congestion in switching elements of the net. Packets may be lost, replicated, or reordered by changes in routing or transitions between slow-path and fast-path routing mechanisms. Packet reordering may also be caused by "load balancing" of traffic between a pair of routers using a set of parallel links.

These conditions tend to occur in bursts that span periods of time ranging from a few milliseconds to a few minutes. However, longer periods are not atypical - congestive losses can last as long as the competing packet flows fight over some scarce resource, typically buffers, in a switching device. Instability in internet routing can cause bursts of lost or reordered packets as route tables are adjusted.
There are times when there is no usable route for packets to flow from some point A to some other point B on the net. And reordering caused by parallel telecommunications links can last for as long as those links are in place.

Many people tend to think of these as unusual or rare conditions. In the core of the Internet, a place of large data pipes, high-powered switches and routers, and (usually) good traffic engineering, these conditions are infrequent, but they do occur. However, if one considers packet paths that pass through the periphery of the net - as most packet paths do - then one encounters overloaded exchanges and links, older and underprovisioned equipment, and a lack of 24x7 monitoring and operational coverage.

At the time of this writing -- early 2003 -- we happen to be in an Internet era of excess network capacity. Because traffic is low, packets flow across the internet with few delays and relatively few points of trouble-causing congestion. These halcyon days will not last forever; networks that appear near perfect today will show flaws as traffic levels rise. Products that work well today may not do so well in the future when network conditions are less favorable and there are more demands on limited network resources. And new applications will make increasing demands on the net for reliable and timely packet transport.

The Ways That Networks Err

There are many sources of the imperfect network. The following table defines various ways in which networks err.
Causes of the Network Errors
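The kinds of impairment named in this section (loss, replication, reordering, delay, and jitter) can be illustrated with a small simulation. What follows is a minimal sketch in Python, not a model of any real device or product: the names `Packet`, `impair`, and `summarize`, and every parameter value, are assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int           # sender-assigned sequence number
    arrival_ms: float  # simulated arrival time at the receiver


def impair(count, *, loss=0.0, duplicate=0.0, base_delay_ms=20.0,
           jitter_ms=0.0, interval_ms=10.0, seed=0):
    """Send `count` packets, one every `interval_ms`, through a toy network.

    loss and duplicate are per-packet probabilities; each delivered copy is
    delayed by base_delay_ms plus a uniform random amount up to jitter_ms.
    All of these knobs are hypothetical, not taken from any real tool.
    """
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    delivered = []
    for seq in range(count):
        if rng.random() < loss:
            continue  # packet lost
        copies = 2 if rng.random() < duplicate else 1  # packet replicated
        for _ in range(copies):
            delay = base_delay_ms + rng.uniform(0.0, jitter_ms)
            delivered.append(Packet(seq, seq * interval_ms + delay))
    # Sorting by arrival time means that jitter larger than the send
    # interval shows up at the receiver as reordering.
    delivered.sort(key=lambda p: p.arrival_ms)
    return delivered


def summarize(count, delivered):
    """Count lost, duplicated, and out-of-order packets as a receiver would."""
    seen = set()
    duplicated = 0
    reordered = 0
    highest = -1
    for p in delivered:
        if p.seq in seen:
            duplicated += 1
        else:
            seen.add(p.seq)
        if p.seq < highest:
            reordered += 1  # arrived after a higher sequence number
        highest = max(highest, p.seq)
    return {"lost": count - len(seen),
            "duplicated": duplicated,
            "reordered": reordered}


if __name__ == "__main__":
    packets = impair(1000, loss=0.02, duplicate=0.01, jitter_ms=50.0)
    print(summarize(1000, packets))
```

Because the random generator is seeded, the same impairment pattern can be replayed, which is the property a testbed needs in order to compare two versions of an application under identical conditions.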
How do impairments manifest themselves?

There is no single way in which network impairments make themselves visible. For example, in applications that move a lot of data over a long-distance TCP connection, packet loss, jitter, and reordering tend to trigger TCP's congestion avoidance algorithms and thus considerably reduce throughput. In voice-over-IP (VOIP) applications, jitter and delay combine, with the result that the people trying to speak end up speaking over one another. VOIP voice quality can degrade in the face of any impairment. Even the perceived responsiveness of web browsing can significantly degrade if DNS query packets are lost or delayed.

What about your own network?

Even small networks can be impaired. Any net with more than a few switches and routers, and particularly any net with out-of-campus connections, is likely to experience impaired service. In many cases, smaller networks may be more subject to impairments than larger networks that are monitored by 24x7 Network Operations Centers (NOCs). It isn't that the larger networks are more immune; it's just that on the larger networks somebody (other than the users) is watching and might be able to notice problems, isolate the causes, and initiate a repair.

How can I tell if my network has a problem with these things?

As we mentioned earlier, the sensitivity of applications to network impairments varies widely with the nature of the application and, to a lesser degree, with the quality of the implementation of the protocol stacks used by the application. So, there are really two questions:
Since nearly every network has some level of impaired service, perhaps the pragmatic approach is to inventory the applications running on your network in order to create some kind of service level definition. Impairments that don't rise to the level where they erode that level of service are impairments that may be safe to ignore. It would, however, be necessary to review that service level as new applications are added, old ones removed, and as the overall traffic demands and patterns change. Even the introduction of a new router or switch may change the behavior of the net.

Let's assume for the moment that you do come up with a service level definition. For example, for voice-over-IP applications you might come up with the following:
How would you measure whether your network meets these service levels? But, more importantly, how would you even know in the first place whether these numbers are actually useful and whether they represent the kinds of services your applications actually need? It can cost a great deal of time and money to over-engineer a network to provide service levels that your applications do not need. And it can be more than simply embarrassing to discover after the fact that your shiny new (and expensive) network, even though it meets the service definitions, doesn't do the trick.

The key is to be able to build a testbed so that you can evaluate how applications behave in the presence of controllable and known degrees of impaired traffic. With such a testbed you could have more confidence that your service level definitions are, in fact, representative of what you need from your production networks.

There are a number of ways one could go about building a testbed. One can build a miniature version of a proposed network and hope that it adequately reflects the behavior of the full-scale network. This approach is expensive and inflexible. Another approach is to use mathematical simulations and models. These methods take a great deal of expertise to design, implement, and evaluate. And in many instances the results may be rather detached from reality. The software found inside network devices frequently, indeed almost always, does not act with mathematical precision, or with anything that even approximates that kind of precision.

The approach that we advocate is to use tools that actually produce, under controlled and repeatable conditions, a variety of network impairments so that the proposed applications can be tested and evaluated under near real-life conditions. There are several tools available for inducing impairments into networks. Most share an ability to create some or all of the types of impairments described above.
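To make the measurement question raised at the start of this discussion concrete, the following is a minimal sketch in Python of how timing samples collected in a testbed might be compared against a service level definition. The names `ServiceLevel`, `measure`, and `meets`, and all threshold values, are hypothetical and are not taken from this paper. The jitter figure uses the running interarrival jitter estimator defined for RTP in RFC 3550; other definitions of jitter exist and may suit a given application better.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    # Hypothetical limits; real values must come from testing the application.
    max_loss_fraction: float
    max_mean_delay_ms: float
    max_jitter_ms: float


def measure(sent_count, samples):
    """Summarize one test run.

    sent_count: number of packets the sender transmitted.
    samples: (send_ms, recv_ms) pairs for packets that arrived, in arrival
             order, with duplicates already removed (an assumption of this
             sketch).
    """
    received = len(samples)
    loss = 1.0 - received / sent_count if sent_count else 0.0
    delays = [recv - send for send, recv in samples]
    mean_delay = sum(delays) / received if received else float("inf")

    # RFC 3550 interarrival jitter: J += (|D| - J) / 16, where D is the
    # change in transit time between consecutive packets.
    jitter = 0.0
    for prev, cur in zip(delays, delays[1:]):
        jitter += (abs(cur - prev) - jitter) / 16.0

    return {"loss_fraction": loss,
            "mean_delay_ms": mean_delay,
            "jitter_ms": jitter}


def meets(level, measured):
    """True when every measured figure is within the service level."""
    return (measured["loss_fraction"] <= level.max_loss_fraction
            and measured["mean_delay_ms"] <= level.max_mean_delay_ms
            and measured["jitter_ms"] <= level.max_jitter_ms)


if __name__ == "__main__":
    level = ServiceLevel(max_loss_fraction=0.01,
                         max_mean_delay_ms=150.0,
                         max_jitter_ms=30.0)  # illustrative numbers only
    run = measure(4, [(0.0, 20.0), (10.0, 30.0), (20.0, 40.0)])
    print(run, meets(level, run))
```

A check of this kind only says whether a run stayed within the stated numbers. Whether those numbers are the right ones is the separate question, and it can only be answered by watching how the application itself behaves as the impairments are varied.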
Most of these tools either manipulate all packets or classify packets into flows, most frequently defined by a simple 5-tuple scheme (source IP, destination IP, IP protocol type (UDP, TCP), source port, destination port). Those kinds of tools are often adequate when one is testing general classes of equipment. However, experience with networks has taught us that many network applications are sensitive to certain patterns of impairments. In fact it is this kind of sensitivity that frequently allows crackers to break into equipment or to create denial of service attacks. However, there is only one tool that is able to dig more deeply into flows and allow one to create impairments that might expose these pattern-based flaws. That tool is InterWorking Labs' Maxwell, The Network Impairment System.

What can be done about network impairments?

We can deal with network impairments in two ways - we can make the network better or we can make the applications better. David Isenberg's now famous paper, "Rise of the Stupid Network" (the actual paper is at http://www.rageboy.com/stupidnet.html and http://www.hyperorg.com/misc/stupidnet.html), could be construed as an argument for pushing the burden onto the applications while leaving the underlying network as simple as possible. There is much merit in this approach - in fact it is only in the applications where most of us, including application vendors, have any ability to control the quality of our internet experience.

Even so, there are times when one simply cannot build an application without getting better quality network services. There are many time- and distance-sensitive applications that will not work effectively, regardless of the application engineering, without guaranteed service level agreements. At the same time, it is possible to design devices and applications at the edge of the network so that they compensate for and properly handle impaired network conditions.
The best approach with either the stupid network or a network engineered for high quality packet delivery is to create pre-deployment testbeds. A pre-deployment testbed allows us to evaluate the range of impairments that our existing and future applications can tolerate. With that knowledge we can better understand the engineering and investment tradeoffs between building more sophisticated applications versus demanding (and obtaining) improved service levels from our network infrastructures.


