4.5. Route Selection

Crucial to the ability of hosts to exchange IP packets is the correct selection of a route to the destination. Traditionally, routes are selected on a hop-by-hop basis [17], based solely upon the destination address of the packet. Linux behaves as a conventional routing device in this way, but it can also provide a more flexible capability: routes can be chosen and prioritized based on other packet characteristics.

The route selection algorithm under linux has been generalized to enable the latter, more powerful scenario without complicating the overwhelmingly common former case.

4.5.1. The Common Case

The above sections on routing to a local network and the default gateway expose the importance of the destination address for route selection. In this simplified model, the kernel need only know the destination address of the packet, which it compares against the routing tables to determine the route by which to send the packet.

The kernel searches for a matching entry for the destination first in the routing cache and then in the main routing table. If the machine has recently transmitted a packet to the destination address, the routing cache will contain an entry for that destination; the kernel will select the same route and transmit the packet accordingly.

If the linux machine has not recently transmitted a packet to this destination address, it will look up the destination in its routing table using a technique known as longest prefix match [18]. In practical terms, the concept of longest prefix match means that the most specific route to the destination will be chosen.
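
As a rough illustration only (a toy route table with invented addresses, written in Python with the standard ipaddress module, not the kernel's implementation), longest prefix match can be sketched like this: every route whose network contains the destination is a candidate, and the candidate with the longest prefix wins.

# A minimal longest prefix match sketch over an invented route table.
import ipaddress

routes = {
    "0.0.0.0/0":      "via 192.168.1.1 dev eth0",    # default route
    "192.168.0.0/16": "via 192.168.1.254 dev eth0",  # large network route
    "192.168.7.0/24": "dev eth1",                    # more specific network route
    "192.168.7.5/32": "via 192.168.7.1 dev eth1",    # most specific: a host route
}

def longest_prefix_match(destination, routes):
    dest = ipaddress.ip_address(destination)
    # Keep only the routes whose network contains the destination,
    # then pick the one with the longest (most specific) prefix.
    candidates = [(ipaddress.ip_network(net), hop) for net, hop in routes.items()]
    matching = [c for c in candidates if dest in c[0]]
    return max(matching, key=lambda c: c[0].prefixlen, default=None)

print(longest_prefix_match("192.168.7.5", routes))    # the /32 host route wins
print(longest_prefix_match("192.168.7.99", routes))   # the /24 network route wins
print(longest_prefix_match("10.0.0.1", routes))       # falls back to the default route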

The use of the longest prefix match allows routes for large networks to be overridden by more specific host or network routes, as required in Example 1.10. Conversely, it is this same property of longest prefix match which allows routes to individual destinations to be aggregated into larger network addresses. Instead of entering individual routes for each host, large numbers of contiguous network addresses can be aggregated. This is the realized promise of CIDR networking. See Section I.1.3, “General IP Networking Resources” for further details.
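
The aggregation can also be illustrated with a short sketch (invented addresses again): four contiguous /24 networks collapse into a single /22, so one route entry can stand in for four.

# Aggregating contiguous networks into a single, larger prefix.
import ipaddress

specific = [ipaddress.ip_network(n) for n in
            ("10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24")]

aggregated = list(ipaddress.collapse_addresses(specific))
print(aggregated)    # [IPv4Network('10.1.0.0/22')]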

In the common case, route selection is based completely on the destination address. Conventional (as opposed to policy-based) IP networking relies on only the destination address to select a route for a packet.

Because the majority of linux systems have no need of policy-based routing features, they use the conventional routing technique of longest prefix match. While this meets a large subset of linux networking needs, the policy routing features of a machine operating in this fashion remain unrealized.

4.5.2. The Whole Story

With the prevalence of low cost bandwidth, easily configured VPN tunnels, and increasing reliance on networks, the technique of selecting a route based solely on the destination IP address range no longer suffices for all situations. The discussion of the common case of route selection under linux neglects one of the most powerful features in the linux IP stack. Since kernel 2.2, linux has supported policy-based routing through the use of multiple routing tables and the routing policy database (RPDB). Together, they allow a network administrator to configure a machine to select different routing tables and routes based on a number of criteria.

Selectors available for use in policy-based routing are attributes of a packet passing through the linux routing code. The source address of a packet, the ToS flags, an fwmark (a mark carried through the kernel in the data structure representing the packet), and the interface name on which the packet was received are attributes which can be used as selectors. By selecting a routing table based on packet attributes, an administrator can have granular control over the network path of any packet.
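
As a rough model of these selectors (hypothetical rule names, tables, and addresses; this is not the kernel's internal representation), a rule matches a packet only when every selector it sets agrees with the packet, and the rules are consulted in priority order:

# Hypothetical model of RPDB rules and their selectors.
import ipaddress
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    priority: int
    table: str
    src: Optional[str] = None      # source prefix, e.g. "192.168.100.0/24"
    tos: Optional[int] = None      # ToS flags
    fwmark: Optional[int] = None   # mark carried with the packet in the kernel
    iif: Optional[str] = None      # interface the packet was received on

    def matches(self, packet) -> bool:
        # A selector that is unset (None) matches every packet.
        if self.src is not None and packet["src"] not in ipaddress.ip_network(self.src):
            return False
        if self.tos is not None and packet["tos"] != self.tos:
            return False
        if self.fwmark is not None and packet["fwmark"] != self.fwmark:
            return False
        if self.iif is not None and packet["iif"] != self.iif:
            return False
        return True

rpdb = sorted([
    Rule(priority=0,     table="local"),
    Rule(priority=100,   table="vpn", fwmark=1),
    Rule(priority=200,   table="dsl", src="192.168.100.0/24"),
    Rule(priority=32766, table="main"),
], key=lambda r: r.priority)

packet = {"src": ipaddress.ip_address("192.168.100.17"),
          "tos": 0, "fwmark": 0, "iif": "eth0"}

# Print each table that would be consulted, in priority order, until one
# of them yields a route for the destination.
for rule in rpdb:
    if rule.matches(packet):
        print(rule.priority, "->", rule.table)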

With this knowledge of the RPDB and multiple routing tables, let's revisit in detail the method by which the kernel selects the proper route for a packet. Understanding the series of steps the kernel takes for route selection should demystify advanced routing. In fact, advanced routing could more accurately be called policy-based networking.

When determining the route by which to send a packet, the kernel always consults the routing cache first. The routing cache is a hash table used for quick access to recently used routes. If the kernel finds an entry in the routing cache, the corresponding entry will be used. If there is no entry in the routing cache, the kernel begins the process of route selection. For details on the method of matching a route in the routing cache, see Section 4.7, “Routing Cache”.

The kernel begins iterating by priority through the routing policy database. For each matching entry in the RPDB, the kernel will try to find a matching route to the destination IP address in the specified routing table using the aforementioned longest prefix match selection algorithm. When a matching destination is found, the kernel will select the matching route, and forward the packet. If no matching entry is found in the specified routing table, the kernel will pass to the next rule in the RPDB, until it finds a match or falls through the end of the RPDB and all consulted routing tables.

Here is a snippet of python-esque pseudocode to illustrate the kernel's route selection process again. Each of the lookups below occurs in kernel hash tables which are accessible to the user through the use of various iproute2 tools.

Example 4.4. Routing Selection Algorithm in Pseudo-code

# Consult the routing cache (the fast path) before the RPDB and routing tables.
if packet.routeCacheLookupKey in routeCache:
    route = routeCache[packet.routeCacheLookupKey]
else:
    # Walk the RPDB in priority order; each matching rule names a routing table.
    for rule in rpdb:
        if packet.rpdbLookupKey in rule:
            routeTable = rule[lookupTable]
            # Longest prefix match lookup in the selected routing table.
            if packet.routeLookupKey in routeTable:
                route = routeTable[packet.routeLookupKey]
                break

This pseudocode provides some explanation of the decisions required to find a route. The final piece of information required to understand the decision making process is the lookup process for each of the three hash table lookups. In Table 4.1, the keys used for each lookup are listed in order of importance. Optional keys will be matched only if they are present.

Table 4.1. Keys used for hash table lookups during route selection

route cache      RPDB             route table
-----------      ----             -----------
destination      source           destination
source           destination      ToS
ToS              ToS              scope
fwmark           fwmark           oif
iif              iif
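
Purely as an illustration of Table 4.1 (these are not kernel data structures), the three lookup keys could be modeled like this, with fields in the table's order of importance:

# Illustrative key layouts for the three lookups, mirroring Table 4.1.
from collections import namedtuple

RouteCacheKey = namedtuple("RouteCacheKey", "destination source tos fwmark iif")
RpdbKey       = namedtuple("RpdbKey",       "source destination tos fwmark iif")
RouteTableKey = namedtuple("RouteTableKey", "destination tos scope oif")

# e.g. a hypothetical cache key for a packet received on eth0
key = RouteCacheKey(destination="192.0.2.10", source="192.168.100.17",
                    tos=0, fwmark=0, iif="eth0")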

The route cache (also known as the forwarding information base) can be displayed using ip route show cache. The routing policy database (RPDB) can be manipulated with the ip rule utility. Individual route tables can be manipulated and displayed with the ip route command line tool.

Example 4.5. Listing the Routing Policy Database (RPDB)

[root@isolde]# ip rule show
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup 253
        

Observation of the output of ip rule show in Example 4.5 on a box whose RPDB has not been changed should reveal a high priority rule, rule 0. This rule, created at RPDB initialization, instructs the kernel to try to find a match for the destination in the local routing table. If there is no match for the packet in the local routing table, then, per rule 32766, the kernel will perform a route lookup in the main routing table. Normally, the main routing table will contain a default route, if not a more specific route. Failing a route lookup in the main routing table, the final rule (32767) instructs the kernel to perform a route lookup in table 253.

A common mistake when working with multiple routing tables involves forgetting about the statelessness of IP routing. This manifests when the user configuring the policy routing machine accounts for outbound packets (via fwmark or other ip rule selectors), but forgets to account for the return packets.

4.5.3. Summary

For more ideas on how to use policy routing, how to work with multiple routing tables, and how to troubleshoot, see Section 10.3, “Using the Routing Policy Database and Multiple Routing Tables”.

Yeah. That's it. So there.



[17] This document could stand to allude to MPLS implementations under linux, for those who want to look at traffic engineering and packet tagging on backbones. This is certainly not in the scope of this chapter, and should be in a separate chapter, which covers developing technologies.

[18] Refer to RFC 3222 for further details.