Integration of a Go service with systemd: readiness & liveness

Vincent Bernat February 9, 2017

Unlike other programming languages, Go’s runtime doesn’t provide a way to reliably daemonize a service. A system daemon has to supply this functionality. Most distributions ship systemd which would fit the bill. A correct integration with systemd is quite straightforward. There are two interesting aspects: readiness & liveness.

As an example, we will daemonize this service whose goal is to answer requests with nifty 404 errors:

package main

import (
    "log"
    "net"
    "net/http"
)

func main() {
    l, err := net.Listen("tcp", ":8081")
    if err != nil {
        log.Panicf("cannot listen: %s", err)
    }
    http.Serve(l, nil)
}

You can build it with go build 404.go.

Here is the service file, 404.service:¹

[Unit]
Description=404 micro-service

[Service]
Type=notify
ExecStart=/usr/bin/404
WatchdogSec=30s
Restart=on-failure

[Install]
WantedBy=multi-user.target

Readiness#

The classic way for a Unix daemon to signal its readiness is to daemonize. Technically, this is done by calling fork(2) twice (which also serves other intents). This is a very common task and the BSD systems, as well as some other C libraries, supply a daemon(3) function for this purpose. Services are expected to daemonize only when they are ready (after reading configuration files and setting up a listening socket, for example). Then, a system can reliably initialize its services with a simple linear script:

syslogd
unbound
ntpd -s

Each daemon can rely on the previous one being ready to do its work. The sequence of actions is the following:

syslogd reads its configuration, activates /dev/log, daemonizes.
unbound reads its configuration, listens on 127.0.0.1:53, daemonizes.
ntpd reads its configuration, connects to NTP peers, waits for clock to be synchronized,² daemonizes.

With systemd, we would use Type=fork in the service file. However, Go’s runtime does not support that. Instead, we use Type=notify. In this case, systemd expects the daemon to signal its readiness with a message to a Unix socket. go-systemd package handles the details for us:

package main

import (
    "log"
    "net"
    "net/http"

    "github.com/coreos/go-systemd/daemon"
)

func main() {
    l, err := net.Listen("tcp", ":8081")
    if err != nil {
        log.Panicf("cannot listen: %s", err)
    }
    daemon.SdNotify(false, daemon.SdNotifyReady) // ❶
    http.Serve(l, nil)                           // ❷
}

It’s important to place the notification after net.Listen() (in ❶): if the notification was sent earlier, a client would get “connection refused” when trying to use the service. When a daemon listens to a socket, connections are queued by the kernel until the daemon is able to accept them (in ❷).

If the service is not run through systemd, the added line is a no-op.

Liveness#

Another interesting feature of systemd is to watch the service and restart it if it happens to crash (thanks to the Restart=on-failure directive). It’s also possible to use a watchdog: the service sends watchdog keep-alives at regular interval. If it fails to do so, systemd will restart it.

We could insert the following code just before http.Serve() call:

go func() {
    interval, err := daemon.SdWatchdogEnabled(false)
    if err != nil || interval == 0 {
        return
    }
    for {
        daemon.SdNotify(false, daemon.SdNotifyWatchdog)
        time.Sleep(interval / 3)
    }
}()

However, this doesn’t add much value: the goroutine is unrelated to the core business of the service. If for some reason, the HTTP part gets stuck, the goroutine will happily continue to send keep-alives to systemd.

In our example, we can just do a HTTP query before sending the keep-alive. The internal loop can be replaced with this code:

for {
    _, err := http.Get("http://127.0.0.1:8081") // ❸
    if err == nil {
        daemon.SdNotify(false, daemon.SdNotifyWatchdog)
    }
    time.Sleep(interval / 3)
}

In ❸, we connect to the service to check if it’s still working. If we get some kind of answer, we send a watchdog keep-alive. If the service is unavailable or if http.Get() gets stuck, systemd will trigger a restart.

There is no universal recipe. However, checks can be split into two groups:

Before sending a keep-alive, you execute an active check on the components of your service. The keep-alive is sent only if all checks are successful. The checks can be internal (like in the above example) or external (for example, check with a query to the database).
Each component reports its status, telling if it’s alive or not. Before sending a keep-alive, you check the reported status of all components (passive check). If some components are late or reported fatal errors, don’t send the keep-alive.

If possible, recovery from errors (for example, with a backoff retry) and self-healing (for example, by reestablishing a network connection) is always better, but the watchdog is a good tool to handle the worst cases and avoid too complex recovery logic.

For example, if a component doesn’t know how to recover from an exceptional condition,³ instead of using panic(), it could signal its situation before dying. Another dedicated component could try to resolve the situation by restarting the faulty component. If it fails to reach a healthy state in time, the watchdog timer will trigger and the whole service will be restarted.

Update (2018-03)

Have a look at “Integration of a Go service with systemd: socket activation” for a followup of this article.

Depending on the distribution, this should be installed in /lib/systemd/system or /usr/lib/systemd/system. Check with the output of the command pkg-config systemd --variable=systemdsystemunitdir. ↩︎
This highly depends on the NTP daemon used. OpenNTPD doesn’t wait unless you use the -s option. ISC NTP doesn’t either unless you use the --wait-sync option. ↩︎
An example of an exceptional condition is to reach the limit on the number of file descriptors. Self-healing from this situation is difficult and it’s easy to get stuck in a loop. ↩︎