IPv4 route lookup on Linux
Vincent Bernat
TL;DR
With its implementation of IPv4 routing tables using LPC-tries, Linux offers good lookup performance (50 ns for a full view) and low memory usage (64 MiB for a full view).
During the lifetime of an IPv4 datagram inside the Linux kernel, one
important step is the route lookup for the destination address
through the fib_lookup()
function. From essential
information about the datagram (source and destination IP addresses,
interfaces, firewall mark, …), this function should quickly provide
a decision. Some possible options are:
- local delivery (RTN_LOCAL);
- forwarding to a supplied next hop (RTN_UNICAST); and
- silent discard (RTN_BLACKHOLE).
Since 2.6.39, Linux stores routes in a compressed prefix tree (commit 3630b7c050d9). In the past, a route cache was maintained, but it was removed1 in Linux 3.6.
Route lookup in a trie#
Looking up a route in a routing table is to find the most specific prefix matching the requested destination. Let’s assume the following routing table:
```
$ ip route show scope global table 100
default via 203.0.113.5 dev out2
192.0.2.0/25
        nexthop via 203.0.113.7 dev out3 weight 1
        nexthop via 203.0.113.9 dev out4 weight 1
192.0.2.47 via 203.0.113.3 dev out1
192.0.2.48 via 203.0.113.3 dev out1
192.0.2.49 via 203.0.113.3 dev out1
192.0.2.50 via 203.0.113.3 dev out1
```
Here are some examples of lookups and the associated results:
| Destination IP | Next hop |
|---|---|
| 192.0.2.49 | 203.0.113.3 via out1 |
| 192.0.2.50 | 203.0.113.3 via out1 |
| 192.0.2.51 | 203.0.113.7 via out3 or 203.0.113.9 via out4 (ECMP) |
| 192.0.2.200 | 203.0.113.5 via out2 |
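To make the goal concrete, here is a minimal C sketch of the naive approach: scan every route and keep the most specific match. The route_entry structure and the sample entries are illustrative only, not kernel code; the data structures discussed below exist precisely to avoid this linear scan.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative routing entry: prefix, prefix length, opaque next-hop label. */
struct route_entry {
    uint32_t prefix;       /* IPv4 prefix, host byte order for clarity */
    int plen;              /* prefix length, 0..32 */
    const char *nexthop;   /* label describing the next hop */
};

/* Return 1 if addr falls inside prefix/plen. */
static int prefix_match(uint32_t addr, uint32_t prefix, int plen)
{
    if (plen == 0)
        return 1;
    uint32_t mask = ~0u << (32 - plen);
    return (addr & mask) == (prefix & mask);
}

/* Naive longest-prefix match: scan all entries, keep the most specific one. */
static const struct route_entry *lookup(const struct route_entry *table,
                                        size_t n, uint32_t addr)
{
    const struct route_entry *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (prefix_match(addr, table[i].prefix, table[i].plen)
            && (!best || table[i].plen > best->plen))
            best = &table[i];
    return best;
}

int main(void)
{
    /* A subset of the example table: 192.0.2.0/25, 192.0.2.50/32, default. */
    struct route_entry table[] = {
        { 0xC0000200, 25, "203.0.113.7 via out3 or 203.0.113.9 via out4 (ECMP)" },
        { 0xC0000232, 32, "203.0.113.3 via out1" },
        { 0x00000000,  0, "203.0.113.5 via out2" },
    };
    /* 192.0.2.51 matches the /25 and the default; the /25 wins. */
    const struct route_entry *r = lookup(table, 3, 0xC0000233);
    printf("%s\n", r ? r->nexthop : "no route");
    return 0;
}
```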
A common structure for route lookup is the trie, a tree structure where the key of each node is prefixed by the key of its parent.
Lookup with a simple trie#
The following trie encodes the previous routing table:
For each node, the prefix is given by the path from the root node and the prefix length is the node's depth.
A lookup in such a trie is quite simple: at each step, fetch the nth bit of the IP address, where n is the current depth. If it is 0, continue with the first child. Otherwise, continue with the second. If a child is missing, backtrack until a routing entry is found. For example, when looking for 192.0.2.50, we will find the result in the corresponding leaf (at depth 32). However, for 192.0.2.51, we will reach 192.0.2.50/31 but there is no second child. Therefore, we backtrack until the 192.0.2.0/25 routing entry.
Adding and removing routes is quite easy. From a performance point of view, the lookup is done in constant time relative to the number of routes (since the maximum depth is capped at 32).
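As a hedged illustration (the node layout is hypothetical, not the kernel's), the lookup can be sketched in C as follows. Instead of literally walking back up on a dead end, it remembers the best candidate seen on the way down, which is also how Linux avoids real backtracking (see footnote 3).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node of a simple binary trie: one child per bit value. */
struct trie_node {
    struct trie_node *child[2];
    const void *entry;            /* routing entry for this prefix, or NULL */
};

/* Walk the trie bit by bit; keep the last routing entry seen on the path so
 * that "backtracking" is just returning the remembered candidate. */
static const void *trie_lookup(const struct trie_node *root, uint32_t addr)
{
    const void *best = NULL;
    const struct trie_node *node = root;

    for (int depth = 0; node && depth <= 32; depth++) {
        if (node->entry)
            best = node->entry;           /* candidate prefix of length depth */
        if (depth == 32)
            break;                        /* full /32 reached */
        int bit = (addr >> (31 - depth)) & 1;  /* the nth bit of the address */
        node = node->child[bit];
    }
    return best;                          /* most specific match, or NULL */
}
```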
Quagga is an example of routing software still using this simple approach.
Lookup with a path-compressed trie#
In the previous example, most nodes only have one child. This leads to a lot of unneeded bitwise comparisons and memory is also wasted on many nodes. To overcome this problem, we can use path compression: each node with only one child is removed (except if it also contains a routing entry). Each remaining node gets a new property telling how many input bits should be skipped. Such a trie is also known as a Patricia trie or a radix tree. Here is the path-compressed version of the previous trie:
Since some bits have been ignored, on a match, a final check is executed to ensure all bits from the found entry match the input IP address. If not, we must act as if the entry wasn't found (and backtrack to find a matching prefix). The following figure shows two IP addresses matching the same leaf:
The reduction in the average depth of the tree compensates for the need to handle these false positives. The insertion and deletion of a routing entry are still easy enough.
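Here is a hedged C sketch of that final check, with a hypothetical node layout; the skip field only records how many bits were not examined while descending.

```c
#include <stdint.h>

/* Hypothetical path-compressed node: `skip` bits of the key are not examined
 * when descending into this node, so a final comparison is required. */
struct pc_node {
    struct pc_node *child[2];
    uint32_t key;        /* full prefix stored at this node */
    int plen;            /* prefix length of the attached routing entry */
    int skip;            /* input bits skipped before testing the next bit */
    int has_entry;       /* whether a routing entry is attached */
};

/* Because some bits were skipped on the way down, a candidate entry only
 * matches if its whole prefix agrees with the destination address. */
static int candidate_matches(const struct pc_node *node, uint32_t addr)
{
    if (!node->has_entry)
        return 0;
    if (node->plen == 0)
        return 1;                         /* default route matches anything */
    uint32_t mask = ~0u << (32 - node->plen);
    return ((addr ^ node->key) & mask) == 0;
}
```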
Many routing systems use Patricia trees.
Lookup with a level-compressed trie#
In addition to path compression, level compression2 detects parts of the trie that are densely populated and replaces them with a single node and an associated vector of 2^k children. This node will handle k input bits instead of just one. For example, here is a level-compressed version of our previous trie:
Such a trie is called LC-trie or LPC-trie and offers higher lookup performance compared to a radix tree.
A heuristic is used to decide how many bits a node should handle. On Linux, if the ratio of non-empty children to all children would be above 50% when the node handles an additional bit, the node gets this additional bit. On the other hand, if the current ratio is below 25%, the node loses the responsibility of one bit. These values are not tunable.
Insertion and deletion become more complex, but lookup times are also improved.
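The core lookup step in a level-compressed trie is to extract bits bits of the key starting at offset pos and use them as an index into the child vector. The sketch below is a simplified illustration with hypothetical field names; it is not the kernel's exact computation (which counts bit positions differently).

```c
#include <stdint.h>

/* Simplified level-compressed node: it consumes `bits` bits of the key,
 * starting at offset `pos` from the most significant bit, and therefore has
 * 2^bits children. The number of bits per node is chosen at insertion time
 * with the 50 %/25 % fill-ratio heuristic described above. */
struct lc_node {
    uint8_t pos;               /* offset of the first bit handled by the node */
    uint8_t bits;              /* number of bits handled; 0 for a leaf */
    struct lc_node *child[];   /* vector of 2^bits children */
};

/* Index of the child to follow for a destination address. Only meaningful
 * for internal nodes (bits != 0). With a densely packed table, `bits` can be
 * large and the whole lookup boils down to a single vector access. */
static unsigned int child_index(const struct lc_node *node, uint32_t addr)
{
    /* Drop the `pos` bits already consumed, keep the next `bits` bits. */
    return (addr << node->pos) >> (32 - node->bits);
}
```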
Implementation in Linux#
The IPv4 implementation in Linux has existed since 2.6.13 (commit 19baf839ff4a) and has been enabled by default since 2.6.39 (commit 3630b7c050d9).
Here is the representation of our example routing table in memory:3
There are several structures involved:
- struct fib_table represents a routing table;
- struct trie represents a complete trie;
- struct key_vector represents either an internal node (when bits is not zero) or a leaf;
- struct fib_info contains the characteristics shared by several routes (like a next-hop gateway and an output interface); and
- struct fib_alias is the glue between the leaves and the fib_info structures.
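For orientation, here is a heavily simplified C sketch of how these structures relate to each other; the real kernel definitions contain many more fields and a different memory layout.

```c
#include <stdint.h>

/* Simplified sketch of the relations between the structures described above;
 * not the actual kernel definitions. */

struct fib_info {                   /* characteristics shared by several routes */
    uint32_t gateway;               /* next-hop gateway */
    int oif;                        /* output interface index */
};

struct fib_alias {                  /* glue between a leaf and a fib_info */
    struct fib_alias *next;         /* one leaf can carry several routes */
    struct fib_info *fi;
    uint8_t plen;                   /* prefix length of this route */
};

struct key_vector {                 /* internal node (bits != 0) or leaf */
    uint32_t key;                   /* prefix (or address, for a leaf) */
    uint8_t pos;                    /* bit offset handled by this node */
    uint8_t bits;                   /* number of bits handled; 0 for a leaf */
    union {
        struct fib_alias *leaf;        /* routes attached to a leaf */
        struct key_vector *tnode[1];   /* child vector of an internal node */
    };
};

struct trie {                       /* a complete trie */
    struct key_vector kv[1];        /* root node */
};

struct fib_table {                  /* a routing table */
    uint32_t tb_id;                 /* table identifier, e.g. 100 */
    struct trie *tb_data;           /* the trie holding the routes */
};
```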
The trie can be retrieved through /proc/net/fib_trie:
```
$ cat /proc/net/fib_trie
Id 100:
  +-- 0.0.0.0/0 2 0 2
     |-- 0.0.0.0
        /0 universe UNICAST
     +-- 192.0.2.0/26 2 0 1
        |-- 192.0.2.0
           /25 universe UNICAST
        |-- 192.0.2.47
           /32 universe UNICAST
        +-- 192.0.2.48/30 2 0 1
           |-- 192.0.2.48
              /32 universe UNICAST
           |-- 192.0.2.49
              /32 universe UNICAST
           |-- 192.0.2.50
              /32 universe UNICAST
[…]
```
For internal nodes, the numbers after the prefix are:
- the number of bits handled by the node,
- the number of full children (they only handle one bit), and
- the number of empty children.
For example, +-- 192.0.2.48/30 2 0 1 is an internal node handling 2 bits, with no full child and one empty child.
Moreover, if the kernel was compiled with CONFIG_IP_FIB_TRIE_STATS, some interesting statistics are available in /proc/net/fib_triestat:4
```
$ cat /proc/net/fib_triestat
Basic info: size of leaf: 48 bytes, size of tnode: 40 bytes.

Id 100:
        Aver depth:     2.33
        Max depth:      3
        Leaves:         6
        Prefixes:       6
        Internal nodes: 3
          2: 3
        Pointers: 12
Null ptrs: 4
Total size: 1 kB
[…]
```
When a routing table is very dense, a node can handle many bits. For example, a densely populated routing table with 1 million entries packed in a /12 can have one internal node handling 20 bits. In this case, route lookup is essentially reduced to a lookup in a vector.
The following graph shows the number of internal nodes used relative to the number of routes for different scenarios (routes extracted from an Internet full view, /32 routes spread over 4 different subnets with various densities). When routes are densely packed, the number of internal nodes is quite limited.
Performance#
So how performant is a route lookup? The maximum depth stays low (about 6 for a full view), so a lookup should be quite fast. With the help of a small kernel module, we can accurately benchmark5 the fib_lookup() function:
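The benchmark module itself is not reproduced here, but a minimal sketch could look like the following. It assumes the Linux 4.x prototype of fib_lookup() and that the symbols it relies on are reachable from an out-of-tree module; unlike the actual benchmark described in footnote 5 (TSC, median over many iterations), it merely averages ktime_get_ns() over a batch of identical lookups.

```c
/* Hedged sketch only, not the module used for the measurements in this
 * article: it times a batch of fib_lookup() calls for one destination. */
#include <linux/module.h>
#include <linux/inet.h>
#include <linux/math64.h>
#include <linux/rcupdate.h>
#include <linux/timekeeping.h>
#include <net/net_namespace.h>
#include <net/ip_fib.h>
#include <net/route.h>

#define ITERATIONS 100000

static int __init kbench_init(void)
{
	struct flowi4 fl4 = {};
	struct fib_result res;
	u64 begin, end;
	int i, err = 0;

	fl4.daddr = in_aton("192.0.2.50");	/* destination to look up */

	rcu_read_lock();
	begin = ktime_get_ns();
	for (i = 0; i < ITERATIONS; i++)
		err = fib_lookup(&init_net, &fl4, &res, 0);
	end = ktime_get_ns();
	rcu_read_unlock();

	pr_info("fib_lookup: last err=%d, ~%llu ns per lookup\n", err,
		(unsigned long long)div_u64(end - begin, ITERATIONS));
	return -ENODEV;		/* refuse to stay loaded: we only wanted the numbers */
}
module_init(kbench_init);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Naive fib_lookup() micro-benchmark sketch");
```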
The lookup time is loosely tied to the maximum depth. When the routing table is densely populated, the maximum depth is low and the lookup times are fast.
When forwarding at 10 Gbps, the time budget for a packet would be about 50 ns (a minimum-sized 64-byte packet is 512 bits, and 512 bits ÷ 10 Gbps ≈ 51 ns). Since this is also the time needed for the route lookup alone in some cases, we wouldn't be able to forward at line rate with only one core. Nonetheless, the results are pretty good and they are expected to scale linearly with the number of cores.
The measurements are done with a Linux kernel 4.11 from Debian unstable. I have gathered performance metrics across kernel versions in “Performance progression of IPv4 route lookup on Linux.”
Another interesting figure is the time it takes to insert all these routes into the kernel. Linux is also quite efficient in this area since you can insert 2 million routes in less than 10 seconds:
Memory usage#
The memory usage is available directly in /proc/net/fib_triestat. The statistic provided doesn't account for the fib_info structures, but you should only have a handful of them (one for each possible next hop). As you can see on the graph below, the memory use is linear with the number of routes inserted, whatever the shape of the routes is.
The results are quite good. With only 256 MiB, about 2 million routes can be stored!
Routing rules#
Unless the kernel is compiled without CONFIG_IP_MULTIPLE_TABLES, Linux supports several routing tables and has a system of configurable rules to select the table to use. These rules can be configured with ip rule. By default, there are three of them:
```
$ ip rule show
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default
```
Linux will first look for a match in the local table. If it doesn't find one, it will look in the main table and, as a last resort, in the default table.
Builtin tables#
The local table contains routes for local delivery:
```
$ ip route show table local
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
broadcast 192.168.117.0 dev eno1 proto kernel scope link src 192.168.117.55
local 192.168.117.55 dev eno1 proto kernel scope host src 192.168.117.55
broadcast 192.168.117.63 dev eno1 proto kernel scope link src 192.168.117.55
```
This table is populated automatically by the kernel when addresses are configured. Let's look at the last three lines. When the IP address 192.168.117.55 was configured on the eno1 interface, the kernel automatically added the appropriate routes:
- a route for 192.168.117.55 for local unicast delivery to the IP address;
- a route for 192.168.117.63 for broadcast delivery to the broadcast address; and
- a route for 192.168.117.0 for broadcast delivery to the network address.
When 127.0.0.1 was configured on the loopback interface, the same kind of routes were added to the local table. However, a loopback address receives special treatment and the kernel also adds the whole subnet to the local table. As a result, you can ping any IP in 127.0.0.0/8:
```
$ ping -c1 127.42.42.42
PING 127.42.42.42 (127.42.42.42) 56(84) bytes of data.
64 bytes from 127.42.42.42: icmp_seq=1 ttl=64 time=0.039 ms

--- 127.42.42.42 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.039/0.039/0.039/0.000 ms
```
The main table usually contains all the other routes:
```
$ ip route show table main
default via 192.168.117.1 dev eno1 proto static metric 100
192.168.117.0/26 dev eno1 proto kernel scope link src 192.168.117.55 metric 100
```
The default route has been configured by some DHCP daemon. The connected route (scope link) has been automatically added by the kernel (proto kernel) when configuring an IP address on the eno1 interface.
The default table is empty and has little use. It was kept when the current incarnation of advanced routing was introduced in Linux 2.1.68, after a first attempt using "classes" in Linux 2.1.15.6
Performance#
Since Linux 4.1 (commit 0ddcf43d5d4a), when the set of rules is left unmodified, the main and local tables are merged and the lookup is done with this single table (and the default table if not empty). Moreover, since Linux 3.0 (commit f4530fa574df), without specific rules, there is no performance hit when enabling support for multiple routing tables. However, as soon as you add new rules, some CPU cycles will be spent for each datagram to evaluate them. Here are a couple of graphs demonstrating the impact of routing rules on lookup times:
For some reason, the relation is linear when the number of rules is between 1 and 100 but the slope increases noticeably past this threshold. The second graph highlights the negative impact of the first rule (about 30 ns).
A common use of rules is to create virtual routers: interfaces are segregated into domains and when a datagram enters through an interface from domain A, it should use routing table A:
```
# ip rule add iif vlan457 table 10
# ip rule add iif vlan457 blackhole
# ip rule add iif vlan458 table 20
# ip rule add iif vlan458 blackhole
```
The blackhole rules may be removed if you are sure there is a default route in each routing table. For example, we can add a blackhole default route with a high metric, so that it doesn't override a regular default route:
```
# ip route add blackhole default metric 9999 table 10
# ip route add blackhole default metric 9999 table 20
# ip rule add iif vlan457 table 10
# ip rule add iif vlan458 table 20
```
To reduce the impact on performance when many interface-specific rules are used, interfaces can be attached to VRF instances and a single rule can be used to select the appropriate table:
```
# ip link add vrf-A type vrf table 10
# ip link set dev vrf-A up
# ip link add vrf-B type vrf table 20
# ip link set dev vrf-B up
# ip link set dev vlan457 master vrf-A
# ip link set dev vlan458 master vrf-B
# ip rule show
0:      from all lookup local
1000:   from all lookup [l3mdev-table]
32766:  from all lookup main
32767:  from all lookup default
```
The special l3mdev-table rule was automatically added when configuring the first VRF interface. This rule selects the routing table associated with the VRF owning the input (or output) interface.
VRF was introduced in Linux 4.3 (commit 193125dbd8eb), its performance was greatly enhanced in Linux 4.8 (commit 7889681f4a6c), and the special routing rule was also introduced in Linux 4.8 (commit 96c63fa7393d, commit 1aa6c4f6b8cd). You can find more details about it in the kernel documentation.
The takeaways from this article are:
- route lookup times hardly increase with the number of routes;
- densely packed /32 routes lead to amazingly fast route lookups;
- memory use is low (128 MiB per million routes); and
- no optimization is done on routing rules.
For IPv6, have a look at “IPv6 route lookup on Linux.”
1. The routing cache was subject to reasonably easy to launch denial-of-service attacks. It was also believed to be inefficient for high-volume sites like Google, but I have first-hand experience that it was not the case for moderately high-volume sites.
2. "IP-address lookup using LC-tries," IEEE Journal on Selected Areas in Communications, 17(6):1083–1092, June 1999.
3. For internal nodes, the key_vector structure is embedded into a tnode structure. This structure contains information rarely used during lookup, notably the reference to the parent, which is usually not needed for backtracking as Linux keeps the nearest candidate in a variable.
4. One leaf can contain several routes (struct fib_alias is a list). The number of "prefixes" can therefore be greater than the number of leaves. The system also keeps statistics about the distribution of the internal nodes relative to the number of bits they handle. In our example, all three internal nodes handle 2 bits.
5. The measurements are done in a virtual machine with one vCPU. The host is an Intel Core i5-4670K running at 3.7 GHz during the experiment (the CPU governor was set to performance). The benchmark is single-threaded: it runs a warm-up phase, then executes about 100,000 timed iterations and keeps the median. Timings of individual runs are computed from the TSC.
6. Fun fact: the documentation of this first attempt at more flexible routing is still available in today's kernel tree and explains the usage of the "default class".