Integration of a Go service with systemd: readiness & liveness
Vincent Bernat
Unlike other programming languages, Go’s runtime doesn’t provide a way to reliably daemonize a service. A system daemon has to supply this functionality. Most distributions ship systemd, which fits the bill. A correct integration with systemd is quite straightforward. There are two interesting aspects: readiness and liveness.
As an example, we will daemonize this service whose goal is to answer requests with nifty 404 errors:
    package main

    import (
        "log"
        "net"
        "net/http"
    )

    func main() {
        l, err := net.Listen("tcp", ":8081")
        if err != nil {
            log.Panicf("cannot listen: %s", err)
        }
        http.Serve(l, nil)
    }
You can build it with go build 404.go.
Here is the service file, 404.service:[1]
    [Unit]
    Description=404 micro-service

    [Service]
    Type=notify
    ExecStart=/usr/bin/404
    WatchdogSec=30s
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
Readiness
The classic way for a Unix daemon to signal its readiness is to daemonize. Technically, this is done by calling fork(2) twice (which also serves other intents). This is a very common task and the BSD systems, as well as some other C libraries, supply a daemon(3) function for this purpose. Services are expected to daemonize only when they are ready (after reading configuration files and setting up a listening socket, for example). Then, a system can reliably initialize its services with a simple linear script:
syslogd
unbound
ntpd -s
Each daemon can rely on the previous one being ready to do its work. The sequence of actions is the following:
- syslogd reads its configuration, activates /dev/log, daemonizes.
- unbound reads its configuration, listens on 127.0.0.1:53, daemonizes.
- ntpd reads its configuration, connects to NTP peers, waits for clock to be synchronized,[2] daemonizes.
With systemd, we would use Type=forking in the service file. However, Go’s runtime does not support that. Instead, we use Type=notify. In this case, systemd expects the daemon to signal its readiness with a message to a Unix socket. The go-systemd package handles the details for us:
    package main

    import (
        "log"
        "net"
        "net/http"

        "github.com/coreos/go-systemd/daemon"
    )

    func main() {
        l, err := net.Listen("tcp", ":8081")
        if err != nil {
            log.Panicf("cannot listen: %s", err)
        }
        daemon.SdNotify(false, daemon.SdNotifyReady) // ❶
        http.Serve(l, nil)                           // ❷
    }
It’s important to place the notification after net.Listen() (in ❶): if the notification were sent earlier, a client could get “connection refused” when trying to use the service. Once a daemon listens on a socket, connections are queued by the kernel until the daemon is able to accept them (in ❷).
If the service is not run through systemd, the added line is a no-op.
Liveness
Another interesting feature of systemd is its ability to watch the service and restart it if it happens to crash (thanks to the Restart=on-failure directive). It’s also possible to use a watchdog: the service sends watchdog keep-alives at regular intervals. If it fails to do so, systemd will restart it.
We could insert the following code just before the http.Serve() call:
    go func() {
        interval, err := daemon.SdWatchdogEnabled(false)
        if err != nil || interval == 0 {
            return
        }
        for {
            daemon.SdNotify(false, daemon.SdNotifyWatchdog)
            time.Sleep(interval / 3)
        }
    }()
However, this doesn’t add much value: the goroutine is unrelated to the core business of the service. If for some reason, the HTTP part gets stuck, the goroutine will happily continue to send keep-alives to systemd.
In our example, we can just do an HTTP query before sending the keep-alive. The inner loop can be replaced with this code:
    for {
        resp, err := http.Get("http://127.0.0.1:8081") // ❸
        if err == nil {
            resp.Body.Close() // release the connection
            daemon.SdNotify(false, daemon.SdNotifyWatchdog)
        }
        time.Sleep(interval / 3)
    }
In ❸, we connect to the service to check if it’s still working. If we get some kind of answer, we send a watchdog keep-alive. If the service is unavailable or if http.Get() gets stuck, systemd will trigger a restart.
There is no universal recipe. However, checks can be split into two groups:
- Before sending a keep-alive, you execute an active check on the components of your service. The keep-alive is sent only if all checks are successful. The checks can be internal (like in the above example) or external (for example, check with a query to the database).
- Each component reports its status, telling if it’s alive or not. Before sending a keep-alive, you check the reported status of all components (passive check). If some components are late or reported fatal errors, don’t send the keep-alive.
If possible, recovery from errors (for example, with a backoff retry) and self-healing (for example, by reestablishing a network connection) are always better, but the watchdog is a good tool to handle the worst cases and avoid overly complex recovery logic.
For example, if a component doesn’t know how to recover from an exceptional condition,[3] instead of using panic(), it could signal its situation before dying. Another dedicated component could try to resolve the situation by restarting the faulty component. If it fails to reach a healthy state in time, the watchdog timer will trigger and the whole service will be restarted.
Update (2018-03)
Have a look at “Integration of a Go service with systemd: socket activation” for a follow-up to this article.
[1] Depending on the distribution, this should be installed in /lib/systemd/system or /usr/lib/systemd/system. Check with the output of the command pkg-config systemd --variable=systemdsystemunitdir.
[2] This highly depends on the NTP daemon used. OpenNTPD doesn’t wait unless you use the -s option. ISC NTP doesn’t either unless you use the --wait-sync option.
[3] An example of an exceptional condition is to reach the limit on the number of file descriptors. Self-healing from this situation is difficult and it’s easy to get stuck in a loop.