Chapter 4. Routing Protocol BasicsAlthough this book is primarily concerned with
exterior gateway protocols—routing between different
autonomous systems—it makes sense to look at internal gateway
protocols as a first step because, conceptually and in practice, the
two will affect each other's behavior. Thus, Chapter 4 begins with a
consideration of protocols intended for routing within an
autonomous system before moving into exterior gateway protocols.
Chapter 4 concludes with an overview of the particular exterior
gateway protocol, BGP, which we will focus on. Chapter 5 contains a
more in-depth exploration of BGP's attribute manipulation and the
use of route filtering in influencing the BGP decision process.
Understanding the basics of BGP, as described in Part 2, is
necessary before we can put the protocols's capabilities to use in
practical routing design problems throughout the rest of the book.
A brief consideration of interior gateway protocols as a point of contrast for this chapter's more in-depth consideration of exterior gateway protocols. An autonomous system is a set of routers sharing the same routing policies. Various configurations for autonomous systems are possible, depending on how many exit points to outside networks are desired and whether the system should permit through traffic. An overview of how the Border Gateway Protocol (version 4) operates, including its message header format, and how and what it negotiates with neighboring routers. The formats and purposes of BGP's four main message types—OPEN, NOTIFICATION, KEEPALIVE, and UPDATE—are covered. Interdomain Routing BasicsThe Internet is a collection of autonomous systems that define the administrative authority and the routing policies of different organizations. Autonomous systems run Interior Gateway Protocols (IGPs), such as RIP, IGRP, EIGRP, OSPF, and ISIS, within their boundaries and interconnect via an Exterior Gateway Protocol (EGP) called the Border Gateway Protocol (BGP). Routers are devices that direct traffic between hosts. Routers build routing tables that contain their collected information on all the best paths to all the destinations they know how to reach. They both announce and receive route information to and from other routers. This information goes into the routing tables. Overview of Routers and RoutingRouters develop a hop-by-hop mechanism by keeping track of "next hop" information that enables a data packet to find its destination through the network. A router that does not have a direct physical connection to the destination checks its routing table and forwards the packet to another next hop router that is closer to that destination. The process repeats until the traffic finds its way through the network to its final destination. EGPs, such as BGP, were introduced because IGPs do not scale in networks that go beyond the enterprise level. IGPs were never designed for the purpose of global internetworking because they do not have the necessary hooks to segregate enterprises into different administrations that are technically and politically independent from one another. This chapter touches upon basic IGP functionality and then explains the specifics of BGP. Figure 4-1 describes three routers, RTA, RTB, and RTC, connecting three local area networks, 192.10.1.0, 192.10.5.0, and 192.10.6.0, via serial links. Each serial link is repesented by its own network number, which results in three additional networks, 192.10.2.0, 192.10.3.0, and 192.10.4.0. Each network has a metric associated with it indicating the level of overhead (cost) of transmitting traffic on that particular link. The link between RTA and RTB, for example, has a cost of 2,000, much higher than the cost of 60 of the link between RTA and RTC. In practice, the link between RTA and RTB is a 56 Kbps link with much bigger delays than the T1 link between RTA and RTC and the T1 link between RTC and RTB combined. Routers RTA, RTB, and RTC would exchange network information via some interior gateway protocol and build their respective IP routing tables. Figure 4-1 shows examples of RTA's IP routing table for two different scenarios; the routers are exchanging routing information via RIP in one scenario and OSPF in another. As an example of how traffic is passed between end stations, if host 192.10.1.2 is trying to reach host 192.10.6.2, it will first send the traffic to RTA. RTA will look in its IP routing table for any network that matches this destination and would find that network 192.10.6.0 is reachable via next hop 192.10.3.2 (RTC) out on Serial line 2 (S2). RTC would receive the traffic and would try to look for the destination in its IP routing table (not shown). RTC would discover that the host is directly connected to its Ethernet 0 interface (E0) and would send the traffic to 192.10.6.2. In the preceding example, the routing is the same whether RTA is using the RIP or OSPF scenario. RIP and OSPF, however, fall into different categories of IGP protocols, namely distance vector protocols and link state protocols, respectively. For a different routing example in figure 4-1, the results might be different depending on whether you are looking at the RIP or OSPF scenario. It is useful at this point to consider characteristics of both IGP protocol categories, to see how protocols generally have evolved to meet increasingly sophisticated routing demands. Distance Vector Protocols Distance vector protocols such as RIP version 1 were mainly designed for small network topologies. The term distance vector derives from the fact that the protocol includes in its routing updates a vector of distances (hop counts). By using hop counts, distance vector protocols do not factor into the routing equation the overhead of sending information over a particular link. Low-speed links are treated equally or sometimes preferred over a high-speed link, depending on the calculated hop count in reaching a destination. This would lead to suboptimal and inefficient routing behaviors. Consider, for example, the RTA routing tables shown in figure 4-1. In the RIP case, RTA has listed the direct link between RTA and RTB to reach network 192.10.5.0. RTA prefers this link because it requires just one hop via RTB versus two hops via RTC and then RTB. But the preferred route is inefficient because the total cost of the routing path via RTC and then RTB (60 + 60 = 120) is much less than the cost of crossing the RTA-RTB link (2,000). Another issue with hop counts is the count to infinity restriction: distance vector protocols have a finite limit of hops (15) after which a route is considered unreachable. This would restrict the propagation of routing updates and would cause problems for large networks. The reliance on hop counts is one deficiency of distance vector protocols; another deficiency is the way that the routing information gets exchanged. Distance vector algorithms work on the concept that routers exchange all the network numbers they can reach via periodic broadcasts of the entire routing table. In large networks, the routing table exchanged between routers becomes very large and very hard to maintain, leading to slower convergence. Convergence refers to the point in time at which the entire network becomes updated to the fact that a particular route has appeared or disappeared. Distance vector protocols work on the basis of periodic updates and hold-down timers: If a route is not received in a certain amount of time, the route goes into a hold-down state and gets aged out of the routing table. The hold-down and aging process translates into minutes in convergence time before the whole network detects that a route has disappeared. The delay between a route's becoming unavailable and its aging out of the routing tables can result in routing loops and black holes. Another major drawback of distance vector protocols is their classfull nature and their lack of support of Variable Length Subnet Masks or CIDR. Distance vector protocols do not exchange mask information in their routing updates. A router that receives a routing update on a certain interface will apply to this update its locally defined subnet mask. This would lead to confusion, in case the interface belongs to a network that is variably subnetted, and a misinterpretation of the received routing update. Finally, distance vector networks are considered to be flat. They present a lack of hierarchy, which translates into a lack of aggregation. This flat nature has made distance vector protocols incapable of scaling to larger and more efficient enterprise networks. RIP version 2 has added support for VLSM and CIDR, but it still carries most of the other deficiencies that its predecessor, RIP version 1, has. Link State Protocols Link state protocols, such as the Open Shortest Path First (OSPF) [1] and Intermediate System-to-Intermediate System (ISIS) [2], are more advanced routing protocols that have addressed the deficiencies of distance vector protocols. Link state protocols work on the basis that routers exchange information elements, called link states, which carry information about links and nodes. This means that routers running link state protocols do not exchange routing tables. Each router inside a domain will have enough bits and pieces of the big puzzle that it can run a shortest path algorithm and build its own routing table. Following are some of the benefits that link state protocols provide over distance vector protocols:
Even though link state algorithms have provided better routing scalability, which enables them to be used in bigger and more complex topologies, they still should be restricted to interior routing. Link state protocols, by themselves, cannot provide a global connectivity solution required for Internet interdomain routing. In very large networks and in case of route fluctuation caused by link instabilities, link state retransmission and recomputation will become too large for any router to handle. Multihomed Nontransit AS An AS is multihomed if it has more than one exit point to the outside world. An AS can be multihomed to a single provider or multiple providers. A nontransit AS does not allow transit traffic to go through it. Transit traffic is any traffic that has a source and destination outside the AS. Figure 4-5 illustrates an AS (AS1) that is nontransit and multihomed to two providers, ISP1 and ISP2. A nontransit AS would only advertise its own routes and would not advertise routes that it learned from other ASs. This ensures that traffic for any destination that does not belong to the AS would not be directed to the AS. In figure 4-5, AS1 is learning about routes n3 and n4 via ISP1 and routes n5 and n6 via ISP2. AS1 is only advertising its local routes (n1,n2) and is not passing to ISP2 routes it learned from ISP1 or to IPS1 routes it learned from ISP2. This way, AS1 will not open itself to outside traffic, such as ISP1 trying to reach n5 or n6 and ISP2 trying to reach n3 and n4 via AS1. Of course, ISP1 or ISP2 can force their traffic to be directed to AS1 via default or static routing. As a precaution against this, AS1 could filter any traffic coming toward it with a destination not belonging to AS1. Multihomed nontransit ASs do not really need to run BGP4 with their providers, although it is recommended and most of the time required by the provider. As you will see later on in this book, running BGP4 with the providers has many advantages in controlling route propagation and filtering. Multihomed Transit AS A multihomed transit AS has more than one connection to the outside world and can still be used for transit traffic by other ASs (see figure 4-6). Transit traffic (relative to the multihomed AS) is any traffic with an origin and destination that does not belong to the AS. Even though BGP4 is an exterior gateway protocol, it can still be used inside an AS as a pipe to exchange BGP updates. BGP connections inside an autonomous system are called Internal BGP (IBGP), whereas BGP connections between autonomous systems are called External BGP (EBGP). Routers that are running IBGP are called transit routers when they carry the transit traffic going through the AS. Routers that run EBGP with other ASs are usually called border routers. A transit AS would advertise to one AS routes that it learned from another AS. This way, the transit AS would open itself to traffic that does not belong to it. Multihomed transit ASs are advised to use BGP4 for their connections to other ASs and also internally to shield their internal nontransit routers from Internet routes. Not all routers inside a domain need to run BGP; internal nontransit routers could run default routing to the BGP routers, which alleviates the number of routes the internal nontransit routers must carry. Figure 4-6 illustrates a multihomed transit autonomous system, AS1, connected to two different providers, ISP1 and ISP2. AS1 is learning routes n3, n4, n5, and n6 from both ISP1 and ISP2 and in turn advertising all that it learned, including its local routes, to ISP1 and ISP2. In this case, ISP1 could use AS1 as a transit AS to reach networks n5 and n6, and ISP2 could use AS1 to reach networks n3 and n4. Border Gateway Protocol Version 4BGP went through different phases and improvements from its earlier version, BGP1, in 1989 to today's version, BGP4, deployment of which started in 1993. BGP4 is the first version that handles aggregation (CIDR) and supernetting, as discussed earlier in this book. BGP imposes no restrictions on the underlying Internet topology. It assumes that routing within an autonomous system is done via an intra-autonomous system routing protocol. (For the purposes of this book, intra means routing within an entity, and inter means between entities.) BGP constructs a graph of autonomous systems based on the information exchanged between BGP neighbors. This directed graph environment is sometimes referred to as a tree. As far as BGP is concerned, the whole Internet is a graph of ASs, with each AS identified by an AS number. Connections between two ASs together form a path, and the collection of path information forms a route to reach a specific destination. BGP ensures that loop-free interdomain routing is maintained. Figure 4-7 illustrates this general path tree concept. How BGP Works BGP is a path vector protocol used to carry routing information between autonomous systems. The term path vector comes from the fact that BGP routing information carries a sequence of AS numbers, which indicates the path a route has traversed. BGP uses TCP as its transport protocol (port 179). This ensures that all the transport reliability such as retransmission is taken care of by TCP and does not need to be implemented in BGP itself. Two BGP routers form a transport protocol connection between each other. These routers are called neighbors or peers. Figure 4-8 illustrates this relationship. Peer routers exchange multiple messages to open and confirm the connection parameters, such as the BGP version running between the two peers (for example, version 3 for BGP3 and version 4 for BGP4). In case of any disagreement between the peers, notification errors are sent, and the peer connection does not get established. Initially, all candidate BGP routes are exchanged, as illustrated in figure 4-9. Incremental updates are sent as network information changes. The incremental update approach has shown an enormous improvement as far as CPU overhead and bandwidth allocation compared with complete periodic updates used by previous protocols, such as EGP. Routes are advertised between a pair of BGP routers in UPDATE messages. The UPDATE message contains, among other things, a list of <length, prefix> tuples that indicate the list of destinations reachable via each system. The UPDATE message also contains the path attributes, which include such information as the degree of preference for a particular route. In case of information changes, such as a route being unreachable or having a better path, BGP informs its neighbors by withdrawing the invalid routes and injecting new routing information. As illustrated in figure 4-10, withdrawn routes are part of the UPDATE message. These are the routes no longer available for use. Figure 4-11 illustrates a steady state situation: if no routing changes occur, the routers exchange only KEEPALIVE packets.
KEEPALIVE messages are sent periodically between BGP neighbors to ensure that the connection is kept alive. KEEPALIVE packets (19 bytes each) should not cause any strain on the router CPU or link bandwidth as they consume a minimal bandwidth (about 2.5 bits/sec for a periodic rate of 60 sec). BGP keeps a table version number to keep track of the instance of the BGP routing table. If the table changes, BGP will increment the table version. A table version that is incrementing rapidly is usually an indication of instabilities in the network. BGP Message Header Format The BGP message header format is a 16-byte marker field, followed by a 2-byte length field and a 1-byte type field. Figure 4-12 illustrates the basic format of the BGP message header. There may or may not be a data portion following the header, depending on the message type. KEEPALIVE messages, for example, consist of the message header only, with no following data. The marker field is used to either authenticate incoming BGP messages or to detect loss of synchronization between two BGP peers. The marker field can have two formats:
The length indicates the total BGP message length including the header. The smallest BGP message is no less than 19 bytes (16+2+1) and no greater than 4,096. The type indicates the message type, from the following possibilities:
The following sections examine the purpose and format of each of the four message types in more detail. BGP Neighbor Negotiation One of the basic steps of the BGP protocol is establishing neighbors between BGP peers. Without successful completion of this step, no exchange of updates will ever take effect. Neighbor negotiation is based on the successful completion of a TCP transport connection, the successful processing of the OPEN message, and periodic detection of the KEEPALIVE messages. OPEN Message Format Figure 4-13 illustrates the format of the OPEN message. The descriptions that follow summarize each of its fields:
Finite State Machine Perspective BGP neighbor negotiation proceeds through different stages before the connection is fully established. Figure 4-14 illustrates a simplified finite state machine (FSM) that highlights the major events in the process with an indication of messages (OPEN, KEEPALIVE, NOTIFICATION) sent to the peer in the transition from one state to the other. The following discussions summarize the key states in the FSM example illustrated in Figure 4-14:
NOTIFICATION Message From the preceding examination of the finite state machine, it should be apparent that many opportunities exist among the various states, for errors to be detected. A NOTIFICATION message is always sent whenever an error is detected, after which the peer connection is closed. Network administrators will need to evaluate these NOTIFICATION messages to determine the specific nature of errors that emerge in the routing protocol. Figure 4-15 illustrates the general message format. The NOTIFICATION message is composed of the Error code (1-byte), Error subcode (1-byte), and a Data field (variable). The Error code indicates the type of the notification, the Error subcode provides more specific information about the nature of the error, and the Data field contains data relevant to the error such as a bad header, an illegal AS number, and so on. Table 4-1 lists possible errors and their subcodes.
KEEPALIVE Message KEEPALIVE messages are periodic messages exchanged between peers to determine whether peers are reachable. As discussed previously, the hold time is the maximum amount of time that may elapse between the receipt of successive KEEPALIVE or UPDATE messages. The KEEPALIVE messages are sent at a rate that ensures that the hold time will not expire (the session is considered alive). A recommended rate is one-third of the hold time interval. If the hold time interval is zero, periodic KEEPALIVE messages will not be sent. The KEEPALIVE message is a 19-byte BGP message header with no data following it. UPDATE Message and Routing InformationCentral to the BGP protocol is the concept of routing updates. Routing updates contain all the necessary information that BGP uses to construct a loop-free picture of the Internet. The following are the basic blocks of an UPDATE message:
Figure 4-16 illustrates these components in the context of an UPDATE message format. The NLRI is an indication, in the form of an IP prefix route, of the networks being advertised. The path attribute list provides BGP with the capabilities of detecting routing loops and the flexibility to enforce local and global routing policies. An example of the BGP path attributes is the AS_path attribute, which is a sequence of AS numbers a route has traversed before reaching the BGP router. AS3 in figure 4-17, for example, is receiving BGP updates from AS2 indicating that network 10.10.1.0/24 (NLRI) is reachable via two hops, first AS2 and then AS1. Based on this information, AS3 will be able to direct its traffic to 10.10.1.0/24. The third part of the UPDATE message, is a list of routes that have become unreachable—or in BGP terminology, WITHDRAWN. With the example illustrated in figure 4-17, if 10.10.1.0/24 is no longer reachable or experiences a change in its attribute information, BGP can withdraw the route that it advertised by sending an UPDATE message that lists the new network information or the network being unreachable. Network Layer Reachability Information BGP4 provides a new set of mechanisms for supporting classless interdomain routing (CIDR). As discussed in Chapter 3, "Handling IP Address Depletion," the concept of CIDR is a move from the traditional IP classes (A, B, C) toward a concept of IP prefixes. The IP prefix is an IP network address with an indication of the number of bits (left to right) that constitute the network number. The Network Layer Reachability Information (NLRI) is the mechanism by which BGP supports classless routing. The NLRI is the part of the BGP routing update that lists the set of destinations about which BGP is trying to inform its other BGP neighbors. The NLRI consists of multiple instances of the 2-tuples <length, prefix>, where length is the number of masking bits that a particular prefix has. Figure 4-18 illustrates the NLRI <19, 198.24.160.0>. The prefix is 198.24.160.0, and the length is a 19-bit mask (counting from the far left of the prefix). Looking AheadThe Border Gateway Protocol has defined the basis of routing architectures in the Internet. The segregation of networks into autonomous systems has logically defined the administrative and political borders between organizations. Interior Gateway Protocols can now run independently of each other, but still interconnect via BGP to provide global routing. BGP as a protocol presents some basic elements of routing that are flexible enough to allow total control from the administrator's perspective. The power of BGP lies in its attributes and its route filtering techniques. Attributes are simply parameters that can be modified to affect the BGP decision process. Route filtering can be done on a prefix level or a path level. A combination of filtering and attribute manipulation can acheive the optimal routing behavior. Because traffic follows a road map laid out by routing updates, modifying the routing behavior would eventually modify the traffic trajectories. The next chapter, "Tuning BGP Capabilities," gives you a hands-on approach to understanding the basics of setting routing policies with BGP. References[1] RFC 1583 OSPF Version 2 [2] ISO 10589 Intermediate System-to-Intermediate System [3] RFC 904 Exterior Gateway Protocol formal specification [4] RFC 1771 A Border Gateway Protocol 4 (BGP-4) |