BGP LLGR: robust and reactive BGP sessions
Vincent Bernat
On a BGP-routed network with multiple redundant paths, we seek to achieve two goals concerning reliability:
- A failure on a path should quickly bring down the related BGP sessions. A common expectation is to recover in less than a second by diverting the traffic to the remaining paths.
- As long as a path is operational, the related BGP sessions should stay up, even under duress.
Detecting failures fast: BFD#
To quickly detect a failure, BGP can be associated with BFD, a protocol to detect faults in bidirectional paths,1 defined in RFC 5880 and RFC 5882. BFD can use very low timers, like 100 ms.
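As an illustration, here is a minimal sketch of such timers for BIRD 1.6, which is used later in this article; the interface pattern and the exact values are assumptions, not taken from the setup described below:

protocol bfd {
  interface "eth*" {
    min rx interval 100 ms;  # do not let the peer send faster than every 100 ms
    min tx interval 100 ms;  # desired transmit interval towards the peer
    multiplier 3;            # detection time is roughly 3 × 100 ms = 300 ms
  };
}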
However, when BFD runs in a process on top of a generic kernel,2 notably when running BGP on the host, it is not unexpected to lose a few BFD packets under adverse conditions: the daemon handling the BFD sessions may not get enough CPU to answer in a timely manner. In this scenario, it is not unlikely for all the BGP sessions to go down at the same time, creating an outage, as depicted in the last case in the diagram below.
So far, we have two conflicting requirements:
- lower the BFD timers to quickly detect a failure along the path; or
- raise the BFD timers to ensure BGP sessions remain operational.
Fixing false positives: BGP LLGR#
Long-Lived BGP Graceful Restart is a new BGP capability to retain stale routes for a longer period after a session failure, while treating them as least-preferred. It also defines a well-known community to share this information with other routers. It is defined in RFC 9494 and several implementations already exist:
- Juniper Junos (since 15.1, see the documentation);
- Cisco IOS XR (tested on 7.3.2);
- BIRD (since 1.6.5 and 2.0.3);
- GoBGP (since 1.33); and
- FRR (since 8.2).
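The well-known community also means a router without native LLGR support can still act on stale paths received from an LLGR-capable neighbor. As a hedged sketch, unrelated to the configurations shown later, a BIRD import filter could make such paths least-preferred:

filter depref_llgr_stale {
  # LLGR_STALE is the well-known community (65535:6)
  if (65535, 6) ~ bgp_community then bgp_local_pref = 0;
  accept;
}

Keeping such routes only as a last resort is what LLGR-aware implementations do natively.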
The following illustration shows what happens during two failure scenarios. As without LLGR, in ❷, a link failure is detected by BFD and the failed path is removed from the route since two other paths remain with a higher preference. A couple of minutes later, the stale timer of the faulty path expires and it will not be used anymore. Shortly after, in ❸, the bottom router experiences high CPU usage, preventing BFD packets from being processed in a timely manner. The BGP sessions are closed and the remaining paths become stale, but as there is no better path left, they are still used until the LLGR timer expires. In the meantime, we expect the BGP sessions to resume.
From the point of view of the top router, the first failed path was considered as stale because the BGP session with R1 was down. However, during the second failure, the two remaining paths were considered as stale because they were tagged with the well-known community LLGR_STALE (65535:6) by R2 and R3.
Another interesting point of BGP LLGR is the ability to restart the BGP daemon without any impact—as long as all paths keep a steady state shortly before and during restart. This is quite interesting when running BGP on the host.3
BIRD#
Let’s see how to configure BIRD 1.6. As BGP LLGR is built on top of the regular BGP graceful restart (BGP GR) capability, we need to enable both. The timer for BGP LLGR starts after the timer for BGP GR. During a regular graceful restart, routes are kept with the same preference. Therefore it is important to set this timer to 0.
template bgp BGP_LLGR {
  bfd graceful;
  graceful restart yes;
  graceful restart time 0;
  long lived graceful restart yes;
  long lived stale time 120;
}
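A neighbor can then inherit from this template. As a sketch, using R1's address and AS number as they appear in the outputs below, and assuming the local AS is not already set elsewhere:

protocol bgp R1_1 from BGP_LLGR {
  local as 65000;                     # assumption: could also be declared in the template
  neighbor 2001:db8:104::1 as 65000;  # R1, as seen in the outputs below
}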
When a problem appears on the path, the BGP session goes down and the LLGR timer starts:
$ birdc show protocol R1_1 all
name     proto    table    state  since       info
R1_1     BGP      master   start  11:20:17    Connect
  Preference:     100
  Input filter:   ACCEPT
  Output filter:  ACCEPT
  Routes:         1 imported, 0 exported, 0 preferred
  Route change stats:     received   rejected   filtered    ignored   accepted
    Import updates:              2          0          0          0          4
    Import withdraws:            0          0        ---          0          0
    Export updates:             12         10          0        ---          2
    Export withdraws:            1        ---        ---        ---          0
  BGP state:          Connect
    Neighbor address: 2001:db8:104::1
    Neighbor AS:      65000
    Neighbor graceful restart active
    LL stale timer:   112/-
The related paths are marked as stale (as reported by the s in 100s) and tagged with the well-known community LLGR_STALE:
$ birdc show route 2001:db8:10::1/128 all
2001:db8:10::1/128 via 2001:db8:204::1 on eth0.204 [R1_2 10:35:01] * (100) [i]
    Type: BGP unicast univ
    BGP.origin: IGP
    BGP.as_path:
    BGP.next_hop: 2001:db8:204::1 fe80::5254:3300:cc00:5
    BGP.local_pref: 100
                   via 2001:db8:104::1 on eth0.104 [R1_1 11:22:51] (100s) [i]
    Type: BGP unicast univ
    BGP.origin: IGP
    BGP.as_path:
    BGP.next_hop: 2001:db8:104::1 fe80::5254:3300:6800:5
    BGP.local_pref: 100
    BGP.community: (65535,6)
We are left with only one path for the route in the kernel:
$ ip route show 2001:db8:10::1
2001:db8:10::1 via 2001:db8:204::1 dev eth0.204 proto bird metric 1024 pref medium
To upgrade BIRD without impact, it needs to run with the -R flag and the graceful restart yes directive should be present in the kernel protocols. Then, before upgrading, stop it using SIGKILL instead of SIGTERM to avoid a clean close of the BGP sessions.
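As a minimal sketch of the second requirement, the kernel protocol could look like this; only graceful restart yes comes from the text above, the other options are assumptions for a typical ECMP setup:

protocol kernel {
  graceful restart yes;  # keep kernel routes while BIRD restarts
  merge paths yes;       # assumption: install ECMP routes into the kernel
  persist;               # assumption: do not flush routes when BIRD exits
  import none;
  export all;
}

With this in place, killing the daemon with SIGKILL and starting the new version with -R should leave the kernel routes untouched during the restart.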
Juniper Junos#
With Junos, we only have to enable BGP LLGR for each family—assuming BFD is already configured:
# Enable BGP LLGR
edit protocols bgp group peers family inet6 unicast
set graceful-restart long-lived restarter stale-time 2m
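Assuming the same hierarchy, this is equivalent to a single statement from the top level of the configuration:

set protocols bgp group peers family inet6 unicast graceful-restart long-lived restarter stale-time 2m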
Once a path is failing, the associated BGP session goes down and the BGP LLGR timer starts:
> show bgp neighbor 2001:db8:104::4
Peer: 2001:db8:104::4+179 AS 65000 Local: 2001:db8:104::1+57667 AS 65000
  Group: peers                 Routing-Instance: master
  Forwarding routing-instance: master
  Type: Internal    State: Connect        Flags: <>
  Last State: Active            Last Event: ConnectRetry
  Last Error: None
  Export: [ LOOPBACK NOTHING ]
  Options: <Preference HoldTime Ttl AddressFamily Multipath Refresh>
  Options: <BfdEnabled LLGR>
  Address families configured: inet6-unicast
  Holdtime: 6 Preference: 170
  NLRI inet6-unicast:
  Number of flaps: 2
  Last flap event: Restart
  Time until long-lived stale routes deleted: inet6-unicast 00:01:05
  Table inet6.0 Bit: 20000
    RIB State: BGP restart is complete
    Send state: not advertising
    Active prefixes:              0
    Received prefixes:            1
    Accepted prefixes:            1
    Suppressed due to damping:    0
    LLGR-stale prefixes:          1
The associated path is marked as stale and is therefore inactive as there are better paths available:
> show route 2001:db8:10::4 extensive
[…]
        BGP    Preference: 170/-101
                Source: 2001:db8:104::4
                Next hop type: Router, Next hop index: 778
                Next hop: 2001:db8:104::4 via em1.104, selected
                Protocol next hop: 2001:db8:104::4
                Indirect next hop: 0xb1d27c0 1048578 INH Session ID: 0x15c
                State: <Int Ext>
                Inactive reason: LLGR stale
                Local AS: 65000 Peer AS: 65000
                Age: 4   Metric2: 0
                Communities: llgr-stale
                Accepted LongLivedStale
                Localpref: 100
                Router ID: 1.0.0.4
[…]
Have a look at the Git repository for the complete configurations as well as the expected outputs during normal operations. There is also a variant with the configurations of BIRD and Junos when acting as a BGP route reflector. Now that FRR got BFD support, I hope it will get LLGR support as well.
Update (2022-01)
Donatas Abraitis recently added support for LLGR helper mode in FRR! It will be present in version 8.2.
Update (2024-01)
RFC 9494 was published in November 2023!
- With point-to-point links, BGP can immediately detect a failure without BFD. However, with a pair of fibers, the failure may be unidirectional, leaving it undetected by the other end until the expiration of the hold timer. ↩︎
- On a Juniper MX, BFD is usually handled directly by the real-time microkernel running on the packet forwarding engine. The BFD control packet contains a bit indicating if BFD is implemented by the forwarding plane or by the control plane. Therefore, you can check with tcpdump how a router implements BFD. Here is an example where 10.71.7.1, a Linux host running BIRD, implements BFD in the control plane, while 10.71.0.3, a Juniper MX, does not:

  $ sudo tcpdump -pni vlan181 port 3784
  IP 10.71.7.1 > 10.71.0.3: BFDv1, Control, State Up, Flags: [none]
  IP 10.71.0.3 > 10.71.7.1: BFDv1, Control, State Up, Flags: [Control Plane Independent]

  On Juniper, you can also check at which level BFD is handled using the show ppm adjacencies protocol bfd details command. If the BFD session is marked as not distributed, it is handled by the routing engine CPU. Otherwise, you need to use the show ppm adjacencies command on the corresponding FPC. If the BFD session is “inline,” the session is handled at the lowest level by the NPU. Otherwise, it is handled by the linecard CPU (running a realtime Linux kernel). ↩︎
- Such a feature is the selling point of BGP graceful restart. However, without LLGR, non-functional paths are kept with the same preference and are not removed from ECMP routes. ↩︎