L3 routing to the hypervisor with BGP

Vincent Bernat January 29, 2018

On layer 2 networks, high availability can be achieved by:

antique protocols, like the Spanning Tree Protocol;
proprietary protocols, like Juniper’s virtual chassis, Cisco’s virtual port channel, and other MC-LAG implementations;¹
standardized protocols, like Shortest Path Bridging Protocol or TRILL; or
the underlying network (in the case of an overlay network, like VXLAN).

Layer 2 networks need very little configuration but come with a major drawback in highly available scenarios: an incident is likely to bring the whole network down.² Therefore, it is safer to limit the scope of a single layer 2 network by, for example, using one distinct network in each rack and connecting them with layer 3 routing. Incidents are unlikely to impact a whole IP network.

In the illustration below, top of the rack switches provide a default gateway for hosts. To provide redundancy, they use an MC-LAG implementation. Layer 2 fault domains are scoped to a rack. Each IP subnet is bound to a specific rack and routing information is shared between top of the rack switches and core routers using a routing protocol like OSPF.

Legacy L2 design — Minimal layer 3 (routed) access design: each pair of top of the racks act as a gateway. Layer 2 fault domains are limited to a single rack. Hypervisors provide a bridge to virtual guests, extending the layer 2 domain. Each rack owns a set of subnets.

There are two main issues with this design:

The L2 domains are still large. A rack could host several dozen hypervisors and several thousand virtual guests. Therefore, a network incident will have a large impact.
IP subnets are pinned to each rack. A virtual guest cannot move to another rack and unused IP addresses in a rack cannot be used in another one.

To solve both these problems, it is possible to push L3 routing further to the south, turning each hypervisor into an L3 router. However, we need to ensure customer virtual guests are blind to this change: they should keep getting their configuration from DHCP (IP, subnet, and gateway).

Hypervisor as a router
Keeping guests in the dark
Conclusion and future work

Hypervisor as a router#

In a nutshell, for a guest with an IPv4 address:

the hosting hypervisor configures a /32 route with the virtual interface as next-hop; and
this route is distributed to other hypervisors (and routers) using BGP.

Our design also handles two routing domains: a public one (hosting virtual guests from multiple tenants with direct Internet access) and a private one (used by our own infrastructure, hypervisors included). Each hypervisor uses two routing tables for this purpose.

The following illustration shows the configuration of a hypervisor with 5 guests. No bridge is needed.

L3 routing inside a hypervisor — Layer 3 routing setup for a hypervisor. Each hypervisor is connected to a “public” network and a “private” one. Guests are attached to either of these networks. Each routing table contains connected routes for local guests, host routes for remote guests, and a default route for other destinations.

The complete configuration described below is also available on GitHub. In real life, a piece of software is needed to update the hypervisor configuration when an instance is added or removed. It would listen to notifications from your cloud orchestrator.

Calico is a project fulfilling the same objective (L3 routing to the hypervisor) with mostly the same ideas (except it heavily relies on Netfilter to ensure separation between administrative domains). It provides an agent (Felix) to serve as an interface with orchestrators like OpenStack or Kubernetes. Check it if you want a turnkey solution.

Routing setup#

Using IP rules, each interface is “attached” to a routing table:

$ ip rule show
0:  from all lookup local
20: from all iif lo lookup main
21: from all iif lo lookup local-out
30: from all iif eth0.private lookup private
30: from all iif eth1.private lookup private
30: from all iif vnet8 lookup private
30: from all iif vnet9 lookup private
40: from all lookup public

The most important rules are the highlighted ones (priority 30 and 40): any traffic coming from a “private” interface uses the private table. Any remaining traffic uses the public table.

The two iif lo rules manage routing for packets originated from the hypervisor itself. The local-out table is a mix of the private and public tables. The hypervisor mostly needs the routes from the private table but also needs to contact local virtual guests (for example, to answer a ping request) using the public table. Both tables contain a default route (no chaining possible), so we build a third table, local-out, by copying all routes from the private table and directly connected routes from the public table.

To avoid an accidental leak of traffic, public, private, and local-out routing tables contain a default last-resort route with a large metric.³ On normal operations, these routes should be shadowed by a regular default route:

ip route add blackhole default metric 4294967294 table public
ip route add blackhole default metric 4294967294 table private
ip route add blackhole default metric 4294967294 table local-out

IPv6 is far simpler as we have only one routing domain. While we keep a public table, there is no need for a local-out table:

$ ip -6 rule show
0:  from all lookup local
20: from all lookup main
40: from all lookup public

As the last step, forwarding is enabled and the maximum size for IPv6 route cache is increased (default is only 4096):⁴

sysctl -qw net.ipv4.conf.all.forwarding=1
sysctl -qw net.ipv6.conf.all.forwarding=1
sysctl -qw net.ipv6.route.max_size=524288

Update (2019-12)

Starting from Linux 4.2, entries for the route cache are only created when there is a PMTU exception. Therefore, it is best to keep the default value and watch /proc/net/rt6_stats: the next to last value is the current cache size in hexadecimal.

Guest routes#

The second step is to configure routes for each guest. For IPv6, we use the link-local address, derived from the remote MAC address, as next-hop:⁵

ip -6 route add 2001:db8:cb00:7100:5254:33ff:fe00:f/128 \
    via fe80::5254:33ff:fe00:f dev vnet6 \
    table public

Assigning several IP addresses (or subnets) to each guest can be done by adding more routes:

ip -6 route add 2001:db8:cb00:7107::/64 \
    via fe80::5254:33ff:fe00:f dev vnet6 \
    table public

For IPv4, the route uses the guest interface as a next-hop. Linux will issue an ARP request before being able to forward the packet:⁶

ip route add 192.0.2.15/32 dev vnet6 \
  table public

Additional IP addresses and subnets can be configured the same way but each IP address would have to answer to ARP requests. To avoid this, it is possible to route additional subnets through the first IP address:⁷

ip route add 203.0.113.128/28 \
  via 192.0.2.15 dev vnet6 onlink \
  table public

BGP setup#

The third step is to share routes between hypervisors, through BGP. This part is dependent on how hypervisors are connected to each other.

Fabric design#

Several designs are possible to connect hypervisors. The most obvious one is to use a full L3 leaf-spine fabric:

Each hypervisor establishes an eBGP session with each of the leaf top-of-the-rack routers. These routers establish an eBGP session with each spine router. This solution can be expensive because the spine routers need to handle all routes. With the current generation of switches/routers, this puts a limit around the maximum number of routes for the expected density.⁸ IP and BGP configuration can also be tedious unless some uncommon autoconfiguration mechanisms are used. On the other hand, leaf routers (and hypervisors) may optionally learn fewer routes as they can push non-local traffic north.

Another potential design is to use an L2 fabric. This may sound surprising after bad-mouthing L2 networks for their unreliability but we don’t need them to be highly available. They can provide a very scalable and cost-efficient design:⁹

L2 fabric design — L2 fabric leaf-spine fabric design with 4 distinct L2 networks. Each hypervisor establishes a BGP session to at least one route reflector for each L2 network.

Each hypervisor is connected to 4 distinct L2 networks, limiting the scope of a single failure to a quarter of the available bandwidth. In this design, only iBGP is used. To avoid a full-mesh topology between all hypervisors, route reflectors are used. Each hypervisor has an iBGP session with one or several route reflectors from each of the L2 networks. Route reflectors on the same L2 network share their routes using iBGP. Calico documents this design in more detail.

This is the solution described below. Public and private domains share the same infrastructure but use distinct VLANs.

Update (2022-07)

While such a design made sense a few years back, recent hardware can scale better and I recommend you opt for a full L3 design. In this case, you do not need the route reflectors. If you swap BIRD with FRR, you can even use a single IPv6 eBGP session with VPNv4 (public and private), VPNv6 (public), and EVPN (for VXLAN, not described here). Depending on the devices involved, you can configure unnumbered BGP peers. This does not require any IP addressing plan and the configuration is therefore minimized.

Route reflectors#

Route reflectors are BGP-speaking boxes acting as a hub for all routes on a given L2 network but not routing any traffic. We need at least one of them on each L2 network. We can use more for redundancy.

Here is an example of configuration for Junos:¹⁰

protocols {
    bgp {
        group public-v4 {
            family inet {
                unicast {
                    no-install; # ❶
                }
            }
            type internal;
            cluster 198.51.100.126; # ❷
            allow 198.51.100.0/25; # ❸
            neighbor 198.51.100.127;
        }
        group public-v6 {
            family inet6 {
                unicast {
                    no-install;
                }
            }
            type internal;
            cluster 198.51.100.126;
            allow 2001:db8:c633:6401::/64;
            neighbor 2001:db8:c633:6401::198.51.100.127;
        }
        ttl 255;
        bfd-liveness-detection { # ❹
            minimum-interval 100;
            multiplier 5;
        }
    }
}
routing-options {
    router-id 198.51.100.126;
    autonomous-system 65000;
}

This route reflector accepts and redistributes IPv4 and IPv6 routes. In ❶, we ensure received routes are not installed in FIB: route reflectors are not routers.

Each route reflector needs to be assigned a cluster identifier, which is used in loop detection. In our case, we use the IPv4 address for this purpose (in ❷). Having a different cluster identifier for each route reflector on the same network ensure they share the routes they receive—increasing resiliency.

Instead of explicitly declaring all hypervisors allowed to connect to this route reflector, a whole subnet is authorized in ❸.¹¹ We also declare the second route reflector for the same network as a neighbor to ensure they connect to each other.

Another important point of this setup is how to quickly react to unavailable paths. With directly connected BGP sessions, a faulty link may be detected immediately and the associated BGP sessions will be brought down. This may not always be reliable. Moreover, in our case, BGP sessions are established over several switches: a link down on a path may be left undetected until the hold timer expires. Therefore, in ❹, we enable BFD, a protocol to quickly¹² detect faults in path between two BGP peers (RFC 5880).

The last point to consider is whether you want to allow anycast on your network: if an IP is advertised from more than one hypervisor, you may want to:

send all flows to only one hypervisor; or
load-balance flows between hypervisors.

The second choice provides a scalable L3 load-balancer. With the above configuration, for each prefix, route reflectors choose one path and distribute it. Therefore, only one hypervisor will receive packets. To get load-balancing, you need to enable advertisement of multiple paths in BGP (RFC 7911):¹³

set protocols bgp group public-v4 family inet  unicast add-path send path-count 4
set protocols bgp group public-v6 family inet6 unicast add-path send path-count 4

Here is an excerpt of show route exhibiting “simple” routes as well as an anycast route:

> show route protocol bgp
inet.0: 6 destinations, 7 routes (7 active, 1 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

0.0.0.0/0        *[BGP/170] 00:09:01, localpref 100
                    AS path: I, validation-state: unverified
                  > to 198.51.100.1 via em1.90
192.0.2.15/32    *[BGP/170] 00:09:00, localpref 100
                    AS path: I, validation-state: unverified
                  > to 198.51.100.101 via em1.90
203.0.113.1/32   *[BGP/170] 00:09:00, localpref 100
                    AS path: I, validation-state: unverified
                  > to 198.51.100.101 via em1.90
203.0.113.6/32   *[BGP/170] 00:09:00, localpref 100
                    AS path: I, validation-state: unverified
                  > to 198.51.100.102 via em1.90
203.0.113.18/32  *[BGP/170] 00:09:00, localpref 100
                    AS path: I, validation-state: unverified
                  > to 198.51.100.103 via em1.90
203.0.113.10/32  *[BGP/170] 00:09:00, localpref 100
                    AS path: I, validation-state: unverified
                  > to 198.51.100.101 via em1.90
                  [BGP/170] 00:09:00, localpref 100
                    AS path: I, validation-state: unverified
                  > to 198.51.100.102 via em1.90

Complete configuration is available on GitHub. Configurations for GoBGP, BIRD or FRR (running on Cumulus Linux) are also available.¹⁴ The configuration for the private routing domain is similar. To avoid dedicated boxes, a solution is to run the route reflectors on some of the top of the rack switches.

Hypervisor configuration#

Let’s tackle the last step: the hypervisor configuration. We use BIRD (1.6.x) as a BGP daemon. It maintains three internal routing tables (public, private, and local-out). We use a template with the common properties to connect to a route reflector:

template bgp rr_client {
  local as 65000;   # Local ASN matches route reflector ASN
  import all;       # Accept all received routes
  export all;       # Send all routes to remote peer
  next hop self;    # Modify next-hop with the IP used for this BGP session
  bfd yes;          # Enable BFD
  direct;           # Not a multi-hop BGP session
  ttl security yes; # GTSM is enabled
  add paths rx;     # Enable ADD-PATH reception (for anycast)

  # Low timers to establish sessions faster
  connect delay time 1;
  connect retry time 5;
  error wait time 1,5;
  error forget time 10;
}

table public;
protocol bgp RR1_public from rr_client {
  neighbor 198.51.100.126 as 65000;
  table public;
}
# […]

With the above configuration, all routes in BIRD’s public table are sent to the route reflector 198.51.100.126. Any route from the route reflector is accepted. We also need to connect BIRD’s public table to the kernel’s one:¹⁵

protocol kernel kernel_public {
  persist;
  scan time 10;
  import filter {
    # Take any route from kernel,
    # except our last-resort default route
    if krt_metric < 4294967294 then accept;
    reject;
  };
  export all;      # Put all routes into kernel
  learn;           # Learn routes not added by BIRD
  merge paths yes; # Build ECMP routes if possible
  table public;    # BIRD's table name
  kernel table 90; # Kernel table number
}

We also need to enable BFD on all interfaces:

protocol bfd {
  interface "*" {
    interval 100ms;
    multiplier 5;
  };
}

To avoid losing BFD packets when the conntrack table is full, it is safer to disable connection tracking for these datagrams:

ip46tables -t raw -A PREROUTING -p udp --dport 3784 \
  -m addrtype --dst-type LOCAL -j CT --notrack
ip46tables -t raw -A OUTPUT -p udp --dport 3784 \
  -m addrtype --src-type LOCAL -j CT --notrack

Some missing bits are:

Once the BGP sessions have been established, we can query the kernel for the installed routes:

$ ip route show table public proto bird
default
        nexthop via 198.51.100.1 dev eth0.public weight 1
        nexthop via 198.51.100.254 dev eth1.public weight 1
203.0.113.6
        nexthop via 198.51.100.102 dev eth0.public weight 1
        nexthop via 198.51.100.202 dev eth1.public weight 1
203.0.113.18
        nexthop via 198.51.100.103 dev eth0.public weight 1
        nexthop via 198.51.100.203 dev eth1.public weight 1

Performance#

You may be worried about how much memory Linux may use when handling many routes. Well, don’t:

128 MiB can fit 1 million IPv4 routes; and
512 MiB can fit 1 million IPv6 routes.

BIRD uses about the same amount of memory for its own usage. As for lookup times, performance is also excellent with IPv4 and still quite good with IPv6:

30 ns per lookup with 1 million IPv4 routes; and
1.25 µs per lookup with 1 million IPv6 routes.

Therefore, the impact of letting Linux handle many routes is very low. For more details, see “IPv4 route lookup on Linux” and “IPv6 route lookup on Linux.”

Reverse path filtering#

To avoid spoofing, reverse path filtering is enabled on virtual guest interfaces: Linux will verify the source address is legit by checking the answer would use the incoming interface as an outgoing interface. This effectively prevents any possible spoofing from guests.

For IPv4, reverse path filtering can be enabled either through a per-interface sysctl¹⁷ or through the rpfilter match of Netfilter. For IPv6, only the second method is available.

# For IPv6, use NetFilter
ip6tables -t raw -N RPFILTER
ip6tables -t raw -A RPFILTER -m rpfilter -j RETURN
ip6tables -t raw -A RPFILTER -m rpfilter --accept-local \
  -m addrtype --dst-type MULTICAST -j DROP
ip6tables -t raw -A RPFILTER -m limit --limit 5/s --limit-burst 5 \
  -j LOG --log-prefix "NF: rpfilter: " --log-level warning
ip6tables -t raw -A RPFILTER -j DROP
ip6tables -t raw -A PREROUTING -i vnet+ -j RPFILTER

# For IPv4, use sysctls
sysctl -qw net.ipv4.conf.all.rp_filter=0
for iface in /sys/class/net/vnet*; do
    sysctl -qw net.ipv4.conf.${iface##*/}.rp_filter=1
done

There is no need to prevent L2 spoofing as there is no gain for the attacker.

Keeping guests in the dark#

An important aspect of the solution is to ensure guests believe they are attached to a classic L2 network (with an IP in a subnet).

The first step is to provide them with a working default gateway. On the hypervisor, this can be done by assigning the default gateway IP directly to the guest interface:

ip addr add 203.0.113.254/32 dev vnet5 scope link

Our main goal is to ensure Linux will answer ARP requests for the gateway IP. Configuring a /32 is enough for this and we do not want to configure a larger subnet as, by default, Linux would install a route for the subnet to this interface, which would be incorrect.¹⁸

For IPv6, this is not needed as we rely on link-local addresses instead.

A guest may also try to speak with other guests on the same subnet. The hypervisor will answer ARP requests on their behalf. Once it starts receiving IP traffic, it will route it to the appropriate interface. This can be done by enabling ARP proxying on the guest interface:

sysctl -qw net.ipv4.conf.vnet5.proxy_arp=1
sysctl -qw net.ipv4.neigh.vnet5.proxy_delay=0

For IPv6, Linux NDP proxying is far less convenient. Instead, ndppd can handle this task. For each interface, we use the following configuration snippet:

proxy vnet5 {
  rule 2001:db8:cb00:7100::/64 {
    static
  }
}

ndppd does not spoof the source address and some systems are pickier about it. Therefore, an appropriate address needs to be configured on each interface:

ip addr add 2001:db8:cb00:7100::1/64 dev vnet5 noprefixroute nodad

For DHCP, some daemons may have difficulties to handle this odd configuration (with the /32 IP address on the interface), but dnsmasq accepts such an oddity. For IPv6, assuming the assigned IP address is a EUI-64 one, radvd works with the following configuration on each interface:

interface vnet5 {
  AdvSendAdvert on;
  prefix 2001:db8:cb00:7100::/64 {
    AdvOnLink on;
    AdvAutonomous on;
    AdvRouterAddr on;
  };
};

Conclusion and future work#

This setup should work with BIRD 1.6.3 and a Linux 3.15+ kernel. Compared to legacy L2 networks, it brings flexibility and resiliency while keeping guests unaware of the change. By handing over routing to Linux, this design is also cheap as existing equipment can be reused. Still, exploitation of such solution is simple enough once the basic concepts are understood—IP rules, routing tables, and BGP sessions.

There are several potential improvements:

Using VRF: Starting from Linux 4.3, L3 VRF domains enable binding interfaces to routing tables. We could have three VRFs: public, private, and local-out. This would improve performance by removing most IP rules (but until Linux 4.8, performance is crippled due to offloading features not enabled, see commit 7889681f4a6c). For more information, have a look at the kernel documentation.
Full L3 routing: The BGP setup can be enhanced and simplified by using an L3 fabric and using some autoconfiguration features. Cumulus published a nice book, “BGP in the datacenter,” on this topic. However, this requires all BGP speakers to support these features. On the hypervisors, this would mean using FRR while the various network equipment would need to run Cumulus Linux.
BGP resiliency with BGP LLGR: Using short BFD timers make our network react fast to any disruption by quickly invalidating faulty paths without relying on link status. However, under load or congestion, BFD packets may be lost, making the whole hypervisor unreachable until BGP sessions can be brought up again. Some BGP implementations support Long-Lived BGP Graceful Restart, an extension allowing stale routes to be retained with a lower priority (see draft-uttaro-idr-bgp-persistence-03). This is an ideal solution to our problem: these stale routes are used only as last resort, after all links have failed. ~~Currently, no open-source implementation supports this draft.~~ See “BGP LLGR: robust and reactive BGP sessions” for more details.

Update (2019-08)

For better security, it is possible to reuse features around the RPKI to validate announces from each hypervisor. See “Securing BGP on the host with origin validation.”

Update (2021-05)

Facebook published “Running BGP in Data Centers at Scale.” This paper contains some insights on how Facebook manages BGP in its datacenters. This is an orthogonal issue to extending BGP to the the server but it is an interesting read, notably the use of policies to help operability.

Update (2021-10)

We have published the configuration used at Blade in our San Francisco and Seoul data centers that are using BGP-to-the-hosts with an L3 fabric. This includes the configuration for Juniper top-of-the-rack switches and Cumulus top-of-the-rack switches. These feature a DHCP server for provisioning. The BGP configuration on the servers is done via an API served by nginx.

MC-LAG has been standardized in IEEE 802.1AX-2014. However, most vendors are likely to stick with their implementations. With MC-LAG, control planes remain independent. ↩︎
An incident can stem from an operator mistake, but also from software bugs, which are more likely to happen in complex implementations during infrequent operations, like a software upgrade. ↩︎
These routes should not be distributed over BGP. Hypervisors should receive a default route with a lower metric from edge routers. ↩︎
Do not increase this value too much. An external attacker can fill the cache easily. The value proposed here corresponds to 256 MiB of memory. See « IPv6 route lookup on Linux » for more details. ↩︎
This example assumes the guest does not use privacy extensions (RFC 3041). A workaround is to assign a whole /64 for each guest. This also removes the need to use NDP proxying. ↩︎
A static ARP entry can also be added with the remote MAC address. ↩︎
The onlink keyword is needed for the BGP daemon, BIRD, to accept the next-hop. Otherwise, it would consider there isn’t a directly connected (non-host) route covering it. ↩︎
For example, a Juniper QFX5100 supports about 200k IPv4 routes (about US$10,000, with Broadcom Trident II chipset). On the other hand, an Arista 7208SR supports 1.2M IPv4 routes (about US$20,000, with Broadcom Jericho chipset), through the use of an external TCAM. A Juniper MX240 would support more than 2M IPv4 routes (about US$30,000 for an empty chassis with two routing engines, with Juniper Trio chipset) with a lower density. ↩︎
From a scalability point of view, with switches able to handle 32k MAC addresses, the fabric can host more than 8,000 hypervisors (more than 5 million virtual guests). Therefore, cost-effective switches can be used as both leaves and spines. Each hypervisor has to handle all routes, an easy task for Linux. ↩︎
Using routing instances would enable hosting several route reflectors on the same box. This is not used in this example but should be considered to reduce costs. ↩︎
This prevents the use of any authentication mechanism: BGP usually relies on TCP MD5 signature (RFC 2385) to authenticate BGP sessions. On most OS, this requires to know allowed peers. To tighten a bit the security in absence of authentication, we use the Generalized TTL Security Mechanism (RFC 5082). For Junos, the configuration presented here (with ttl 255) is incomplete. A firewall filter is also needed. ↩︎
On the QFX5000 Series switches, the minimum interval is officially one second. This is quite unfortunate as it makes BFD mostly useless. ↩︎
Unfortunately, for no good reason, Junos doesn’t support the BGP add-path extension in a routing instance. Such a configuration is possible with Cumulus Linux. ↩︎
Only BIRD comes with BFD support out of the box but it does not support implicit peers. FRR needs Cumulus’s PTMD. If you don’t care about BFD, GoBGP is really nice as a route reflector. ↩︎
Kernel tables are numbered. ip can use names declared in /etc/iproute2/rt_tables. ↩︎
It should be noted that ECMP with IPv6 only works from BIRD 1.6.1. Moreover, when using Linux 4.11 or more recent, you need to apply commit 98bb80a243b5. ↩︎
For a given interface, Linux uses the maximum value between the sysctl for all and the one for the interface. ↩︎
It is possible to prevent Linux to install a connected route by using the noprefixroute flag. However, this flag is only available since Linux 4.4 for IPv4. Only use this flag if your DHCP server is giving you a hard time as it may trigger other issues (related to the promotion of secondary addresses). ↩︎