Performance progression of IPv4 route lookup on Linux
Vincent Bernat
TL;DR
Each of Linux 2.6.39, 3.6 and 4.0 brings notable performance improvements for the IPv4 route lookup process.
In a previous article, I explained how Linux implements an IPv4 routing table with compressed tries to offer excellent lookup times. The following graph shows the performance progression of Linux through history:
Two scenarios are tested:
- 500,000 routes extracted from an Internet router (half of them are /24); and
- 500,000 host routes (/32) tightly packed in 4 distinct subnets.
All kernels are compiled with GCC 4.9 (from Debian Jessie). This
version is able to compile older kernels1 as well as current
ones. The kernel configuration used is the default one with
CONFIG_SMP
and CONFIG_IP_MULTIPLE_TABLES
options enabled (however,
no IP rules are used). Some other unrelated options are enabled to be
able to boot them in a virtual machine and run the benchmark.
The measurements are done in a virtual machine with one
vCPU.2 The host is an Intel Core i5-4670K and the CPU
governor was set to “performance.” The benchmark is
single-threaded. Implemented as a kernel module, it calls
fib_lookup()
with various destinations in 100,000 timed iterations
and keeps the median. Timings of individual runs are computed from the
TSC (and converted to nanoseconds by assuming a constant clock).
The following kernel versions bring a notable performance improvement:
-
In Linux 2.6.39, commit 3630b7c050d9, David Miller removes the hash-based routing table implementation to switch to the LPC-trie implementation (available since Linux 2.6.13 as a compile-time option). This brings a small regression for the scenario with many host routes but boosts the performance for the general case.
-
In Linux 3.0, commit 281dc5c5ec0f, the improvement is not related to the network subsystem. Linus Torvalds disables the compiler size-optimization from the default configuration. It was believed that optimizing for size would help keeping the instruction cache efficient. However, compilers generated under-performing code on x86 when this option was enabled.
-
In Linux 3.6, commit f4530fa574df, David Miller adds an optimization to not evaluate IP rules when they are left unconfigured. From this point, the use of the
CONFIG_IP_MULTIPLE_TABLES
option doesn’t impact the performance unless some IP rules are configured. This version also removes the route cache (commit 5e9965c15ba8). However, this has no effect on the benchmark as it directly callsfib_lookup()
which doesn’t involve the cache. -
In Linux 4.0, notably commit 9f9e636d4f89, Alexander Duyck adds several optimizations to the trie lookup algorithm. It really pays off!
-
In Linux 4.1, commit 0ddcf43d5d4a, Alexander Duyck collapses the local and main tables when no specific IP rules are configured. For non-local traffic, both these tables were looked up.
Mid-2018 update#
Here is a graph for more recent kernels. Performance is mostly stable. Measures are done in a batch of 5 successful lookups to reduce overhead. Small regressions in 4.2 and 4.16 may or may not be significant—a 5 ns variation matches an L2 cache lookup.
The recent Meltdown and Spectre exploits have made headlines due the performance impact of the various mitigations. Because route lookup happens entirely in the kernel, the mitigations should have no effect, as confirmed in the graphs below:
Measures are done in a virtual machine with one vCPU. The host is an
Intel Core i5-4670K—Haswell microarchitecture—running at 3.4 GHz
with an up-to-date microcode and a Linux 4.16 kernel from Debian. When
protected, host is using PTI, full generic retpoline and IBPB. When
unprotected, the host is running with an out-of-date microcode and
mitigations are disabled on the command line (nopti nospectre_v2
).
Kernels for the guest are compiled with GCC 7.3—with retpoline
support.
-
Compiling old kernels with an updated userland may still require some small patches. ↩︎
-
The kernels are compiled with the
CONFIG_SMP
option to use the hierarchical RCU and activate more of the same code paths as actual routers. However, progress on parallelism are left unnoticed. ↩︎