Performance progression of IPv4 route lookup on Linux

Vincent Bernat July 3, 2017

TL;DR

Each of Linux 2.6.39, 3.6 and 4.0 brings notable performance improvements for the IPv4 route lookup process.

In a previous article, I explained how Linux implements an IPv4 routing table with compressed tries to offer excellent lookup times. The following graph shows the performance progression of Linux through history:

IPv4 route lookup performance — Route lookup times for various Linux kernel versions. Lookup is done on a table with 500,000 routes. Notable performance improvements are highlighted.

Two scenarios are tested:

500,000 routes extracted from an Internet router (half of them are /24); and
500,000 host routes (/32) tightly packed in 4 distinct subnets.

All kernels are compiled with GCC 4.9 (from Debian Jessie). This version is able to compile older kernels¹ as well as current ones. The kernel configuration used is the default one with CONFIG_SMP and CONFIG_IP_MULTIPLE_TABLES options enabled (however, no IP rules are used). Some other unrelated options are enabled to be able to boot them in a virtual machine and run the benchmark.

The measurements are done in a virtual machine with one vCPU.² The host is an Intel Core i5-4670K and the CPU governor was set to “performance.” The benchmark is single-threaded. Implemented as a kernel module, it calls fib_lookup() with various destinations in 100,000 timed iterations and keeps the median. Timings of individual runs are computed from the TSC (and converted to nanoseconds by assuming a constant clock).

The following kernel versions bring a notable performance improvement:

In Linux 2.6.39, commit 3630b7c050d9, David Miller removes the hash-based routing table implementation to switch to the LPC-trie implementation (available since Linux 2.6.13 as a compile-time option). This brings a small regression for the scenario with many host routes but boosts the performance for the general case.
In Linux 3.0, commit 281dc5c5ec0f, the improvement is not related to the network subsystem. Linus Torvalds disables the compiler size-optimization from the default configuration. It was believed that optimizing for size would help keeping the instruction cache efficient. However, compilers generated under-performing code on x86 when this option was enabled.
In Linux 3.6, commit f4530fa574df, David Miller adds an optimization to not evaluate IP rules when they are left unconfigured. From this point, the use of the CONFIG_IP_MULTIPLE_TABLES option doesn’t impact the performance unless some IP rules are configured. This version also removes the route cache (commit 5e9965c15ba8). However, this has no effect on the benchmark as it directly calls fib_lookup() which doesn’t involve the cache.
In Linux 4.0, notably commit 9f9e636d4f89, Alexander Duyck adds several optimizations to the trie lookup algorithm. It really pays off!
In Linux 4.1, commit 0ddcf43d5d4a, Alexander Duyck collapses the local and main tables when no specific IP rules are configured. For non-local traffic, both these tables were looked up.

Mid-2018 update#

Here is a graph for more recent kernels. Performance is mostly stable. Measures are done in a batch of 5 successful lookups to reduce overhead. Small regressions in 4.2 and 4.16 may or may not be significant—a 5 ns variation matches an L2 cache lookup.

IPv4 route lookup performance with recent kernels — Route lookup times for more recent Linux kernel versions. Lookup is done on a table with 500,000 routes. The shaded surfaces represent the median absolute deviation.

The recent Meltdown and Spectre exploits have made headlines due the performance impact of the various mitigations. Because route lookup happens entirely in the kernel, the mitigations should have no effect, as confirmed in the graphs below:

IPv4 route lookup performance with kernels around the Meltdown/Spectre era — Route lookup times for more Linux kernel versions from late 2017 and early 2018. Lookup is done on a table with 500,000 routes. The kernels are ordered by their release date. The shaded area contains kernels without Spectre/Meltdown mitigations.

Measures are done in a virtual machine with one vCPU. The host is an Intel Core i5-4670K—Haswell microarchitecture—running at 3.4 GHz with an up-to-date microcode and a Linux 4.16 kernel from Debian. When protected, host is using PTI, full generic retpoline and IBPB. When unprotected, the host is running with an out-of-date microcode and mitigations are disabled on the command line (nopti nospectre_v2). Kernels for the guest are compiled with GCC 7.3—with retpoline support.

Compiling old kernels with an updated userland may still require some small patches. ↩︎
The kernels are compiled with the CONFIG_SMP option to use the hierarchical RCU and activate more of the same code paths as actual routers. However, progress on parallelism are left unnoticed. ↩︎