Fast and dynamic encoding of Protocol Buffers in Go
Vincent Bernat
Protocol Buffers are a popular choice for serializing structured data due to their compact size, fast processing speed, language independence, and compatibility. Alternatives exist, such as Cap’n Proto, CBOR, and Avro.
Usually, data structures are described in a proto definition file (.proto). The protoc compiler and a language-specific plugin convert it into code:
$ head flow-4.proto
syntax = "proto3";
package decoder;
option go_package = "akvorado/inlet/flow/decoder";

message FlowMessagev4 {
  uint64 TimeReceived = 2;
  uint32 SequenceNum = 3;
  uint64 SamplingRate = 4;
  uint32 FlowDirection = 5;
$ protoc -I=. --plugin=protoc-gen-go --go_out=module=akvorado:. flow-4.proto
$ head inlet/flow/decoder/flow-4.pb.go
// Code generated by protoc-gen-go. DO NOT EDIT.
// versions:
//      protoc-gen-go v1.28.0
//      protoc        v3.21.12
// source: inlet/flow/data/schemas/flow-4.proto

package decoder

import (
        protoreflect "google.golang.org/protobuf/reflect/protoreflect"
Akvorado collects network flows using IPFIX or sFlow, decodes them with GoFlow2, encodes them to Protocol Buffers, and sends them to Kafka to be stored in a ClickHouse database. Collecting new fields, such as the source and destination MAC addresses, requires modifications in multiple places, including the proto definition file and the ClickHouse migration code. Moreover, the cost is paid by all users.1 It would be nice to have an application-wide schema and let users enable or disable the fields they need.
While the main goal is flexibility, we do not want to sacrifice performance. On this front, this is quite a success: when upgrading from 1.6.4 to 1.7.1, the decoding and encoding performance almost doubled! 🤗
goos: linux
goarch: amd64
pkg: akvorado/inlet/flow
cpu: AMD Ryzen 5 5600X 6-Core Processor
                         │ initial.txt  │            final.txt             │
                         │    sec/op    │    sec/op    vs base             │
Netflow/with_encoding-12   12.963µ ± 2%    7.836µ ± 1%  -39.55% (p=0.000 n=10)
Sflow/with_encoding-12      19.37µ ± 1%    10.15µ ± 2%  -47.63% (p=0.000 n=10)
Faster Protocol Buffers encoding
I use the following code to benchmark both the decoding and encoding process. Initially, the Decode() method is a thin layer above the GoFlow2 producer and stores the decoded data into the in-memory structure generated by protoc. Later, some of the data will be encoded directly during flow decoding. This is why we measure both the decoding and the encoding.2
func BenchmarkDecodeEncodeSflow(b *testing.B) {
    r := reporter.NewMock(b)
    sdecoder := sflow.New(r)
    data := helpers.ReadPcapPayload(b,
        filepath.Join("decoder", "sflow", "testdata", "data-1140.pcap"))

    for _, withEncoding := range []bool{true, false} {
        title := map[bool]string{
            true:  "with encoding",
            false: "without encoding",
        }[withEncoding]
        var got []*decoder.FlowMessage
        b.Run(title, func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                got = sdecoder.Decode(decoder.RawFlow{
                    Payload: data,
                    Source:  net.ParseIP("127.0.0.1"),
                })
                if withEncoding {
                    for _, flow := range got {
                        buf := []byte{}
                        buf = protowire.AppendVarint(buf, uint64(proto.Size(flow)))
                        proto.MarshalOptions{}.MarshalAppend(buf, flow)
                    }
                }
            }
        })
    }
}
The canonical Go implementation for Protocol Buffers, google.golang.org/protobuf, is not the most efficient one. For a long time, people were relying on gogoprotobuf. However, the project is now deprecated. A good replacement is vtprotobuf.3
goos: linux
goarch: amd64
pkg: akvorado/inlet/flow
cpu: AMD Ryzen 5 5600X 6-Core Processor
                            │ initial.txt │           bench-2.txt           │
                            │   sec/op    │   sec/op    vs base             │
Netflow/with_encoding-12      12.96µ ± 2%   10.28µ ± 2%  -20.67% (p=0.000 n=10)
Netflow/without_encoding-12   8.935µ ± 2%   8.975µ ± 2%        ~ (p=0.143 n=10)
Sflow/with_encoding-12        19.37µ ± 1%   16.67µ ± 2%  -13.93% (p=0.000 n=10)
Sflow/without_encoding-12     14.62µ ± 3%   14.87µ ± 1%   +1.66% (p=0.007 n=10)
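Adopting it is mostly a matter of running the additional protoc-gen-go-vtproto plugin and calling the generated methods instead of the generic ones. The sketch below shows what the length-prefixed encoding step could look like with it; it assumes the generated type gained the SizeVT() and MarshalVT() methods and is not the exact code used in Akvorado:

package example

import (
    "google.golang.org/protobuf/encoding/protowire"

    "akvorado/inlet/flow/decoder"
)

// encodeFlow is a sketch of length-prefixed encoding with the helpers
// generated by protoc-gen-go-vtproto (marshal and size features).
func encodeFlow(flow *decoder.FlowMessagev4) ([]byte, error) {
    buf := protowire.AppendVarint(nil, uint64(flow.SizeVT()))
    data, err := flow.MarshalVT()
    if err != nil {
        return nil, err
    }
    return append(buf, data...), nil
}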
Dynamic Protocol Buffers encoding
We have our baseline. Let’s see how to encode our Protocol Buffers without a .proto file. The wire format is simple and relies heavily on variable-width integers.
Variable-width integers, or varints, are an efficient way of encoding unsigned integers using a variable number of bytes, from one to ten, with small values using fewer bytes. They work by splitting integers into 7-bit payloads and using the 8th bit as a continuation indicator, set to 1 for all payloads except the last.
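As an illustration, 300 is 100101100 in binary; split into 7-bit groups starting from the least significant bits, it becomes 0101100 followed by 0000010, which gives the bytes 0xac and 0x02. The protowire package can confirm this:

package main

import (
    "fmt"

    "google.golang.org/protobuf/encoding/protowire"
)

func main() {
    // 300 → low 7 bits (0x2c) with the continuation bit set (0xac),
    // followed by the remaining bits (0x02).
    fmt.Printf("% x\n", protowire.AppendVarint(nil, 300)) // ac 02
}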
For our usage, we only need two types: variable-width integers and byte sequences. A byte sequence is encoded by prefixing it with its length as a varint. When a message is encoded, each key-value pair is turned into a record consisting of a field number, a wire type, and a payload. The field number and the wire type are encoded as a single variable-width integer, called a tag.
We use the following low-level functions to build the output buffer:

- protowire.AppendTag() encodes a tag,
- protowire.AppendVarint() encodes a variable-width integer, and
- protowire.AppendBytes() appends bytes as-is.
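For example, here is how a record with a varint field numbered 3 and a byte-sequence field numbered 4 can be encoded by hand with these functions (a standalone illustration, unrelated to Akvorado’s schema):

package main

import (
    "fmt"

    "google.golang.org/protobuf/encoding/protowire"
)

func main() {
    buf := []byte{}
    // Field 3, wire type "varint": the tag is (3 << 3) | 0 = 0x18.
    buf = protowire.AppendTag(buf, 3, protowire.VarintType)
    buf = protowire.AppendVarint(buf, 1000)
    // Field 4, wire type "bytes": the payload is prefixed by its length.
    buf = protowire.AppendTag(buf, 4, protowire.BytesType)
    buf = protowire.AppendBytes(buf, []byte("eth0"))
    fmt.Printf("% x\n", buf) // 18 e8 07 22 04 65 74 68 30
}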
Our schema abstraction contains the appropriate information to encode a message (ProtobufIndex) and to generate a proto definition file (fields starting with Protobuf):
type Column struct {
    Key      ColumnKey
    Name     string
    Disabled bool
    // […]
    // For protobuf.
    ProtobufIndex    protowire.Number
    ProtobufType     protoreflect.Kind // Uint64Kind, Uint32Kind, …
    ProtobufEnum     map[int]string
    ProtobufEnumName string
    ProtobufRepeated bool
}
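For example, a column for the source port could be declared as follows; the index and kind are made up for illustration and are not copied from Akvorado’s actual schema:

// Hypothetical declaration reusing the Column struct above; the index and
// type are illustrative, not Akvorado's real values.
var columnSrcPort = Column{
    Key:           ColumnSrcPort,
    Name:          "SrcPort",
    ProtobufIndex: 15,
    ProtobufType:  protoreflect.Uint32Kind,
}

When generating the proto definition file, such a column would translate to a line like uint32 SrcPort = 15;.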
We have a few helper methods around the protowire functions to directly encode the fields while decoding the flows. They skip disabled fields or non-repeated fields that are already encoded. Here is an excerpt of the sFlow decoder:
sch.ProtobufAppendVarint(bf, schema.ColumnBytes, uint64(recordData.Base.Length))
sch.ProtobufAppendVarint(bf, schema.ColumnProto, uint64(recordData.Base.Protocol))
sch.ProtobufAppendVarint(bf, schema.ColumnSrcPort, uint64(recordData.Base.SrcPort))
sch.ProtobufAppendVarint(bf, schema.ColumnDstPort, uint64(recordData.Base.DstPort))
sch.ProtobufAppendVarint(bf, schema.ColumnEType, helpers.ETypeIPv4)
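The helpers themselves stay small. Here is a possible shape for ProtobufAppendVarint, assuming the Column and FlowMessage definitions shown in this article; the lookup helper is hypothetical and the real code may differ:

// Possible shape for ProtobufAppendVarint: skip disabled columns and
// non-repeated columns that were already encoded, then append the tag and the
// value. LookupColumnByKey is a hypothetical helper.
func (schema *Schema) ProtobufAppendVarint(bf *FlowMessage, key ColumnKey, value uint64) {
    column := schema.LookupColumnByKey(key)
    if column.Disabled || value == 0 { // proto3 does not serialize zero values
        return
    }
    if !column.ProtobufRepeated {
        if bf.protobufSet.Test(uint(column.ProtobufIndex)) {
            return // already encoded
        }
        bf.protobufSet.Set(uint(column.ProtobufIndex))
    }
    bf.protobuf = protowire.AppendTag(bf.protobuf, column.ProtobufIndex, protowire.VarintType)
    bf.protobuf = protowire.AppendVarint(bf.protobuf, value)
}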
Fields that are required later in the pipeline, like the source and destination addresses, are stored unencoded in a separate structure:
type FlowMessage struct {
    TimeReceived uint64
    SamplingRate uint32

    // For exporter classifier
    ExporterAddress netip.Addr

    // For interface classifier
    InIf  uint32
    OutIf uint32

    // For geolocation or BMP
    SrcAddr netip.Addr
    DstAddr netip.Addr
    NextHop netip.Addr

    // Core component may override them
    SrcAS     uint32
    DstAS     uint32
    GotASPath bool

    // protobuf is the protobuf representation for the information not contained above.
    protobuf    []byte
    protobufSet bitset.BitSet
}
The protobuf slice holds the encoded data. It is initialized with a capacity of 500 bytes to avoid resizing during encoding. There is also some reserved room at the beginning to be able to encode the total size as a variable-width integer. Upon finalizing the encoding, the remaining fields are added and the message length is prefixed:
func (schema *Schema) ProtobufMarshal(bf *FlowMessage) []byte {
    schema.ProtobufAppendVarint(bf, ColumnTimeReceived, bf.TimeReceived)
    schema.ProtobufAppendVarint(bf, ColumnSamplingRate, uint64(bf.SamplingRate))
    schema.ProtobufAppendIP(bf, ColumnExporterAddress, bf.ExporterAddress)
    schema.ProtobufAppendVarint(bf, ColumnSrcAS, uint64(bf.SrcAS))
    schema.ProtobufAppendVarint(bf, ColumnDstAS, uint64(bf.DstAS))
    schema.ProtobufAppendIP(bf, ColumnSrcAddr, bf.SrcAddr)
    schema.ProtobufAppendIP(bf, ColumnDstAddr, bf.DstAddr)

    // Add length and move it as a prefix
    end := len(bf.protobuf)
    payloadLen := end - maxSizeVarint
    bf.protobuf = protowire.AppendVarint(bf.protobuf, uint64(payloadLen))
    sizeLen := len(bf.protobuf) - end
    result := bf.protobuf[maxSizeVarint-sizeLen : end]
    copy(result, bf.protobuf[end:end+sizeLen])

    return result
}
Minimizing allocations is critical for maintaining encoding performance. The benchmark tests should be run with the -benchmem flag to monitor allocation numbers. Each allocation incurs an additional cost to the garbage collector. The Go profiler is a valuable tool for identifying areas of code that can be optimized:
$ go test -run=__nothing__ -bench=Netflow/with_encoding \
>         -benchmem -cpuprofile profile.out \
>         akvorado/inlet/flow
goos: linux
goarch: amd64
pkg: akvorado/inlet/flow
cpu: AMD Ryzen 5 5600X 6-Core Processor
Netflow/with_encoding-12    143953     7955 ns/op    8256 B/op    134 allocs/op
PASS
ok      akvorado/inlet/flow     1.418s
$ go tool pprof profile.out
File: flow.test
Type: cpu
Time: Feb 4, 2023 at 8:12pm (CET)
Duration: 1.41s, Total samples = 2.08s (147.96%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) web
After switching to the internal schema instead of the code generated from the proto definition file, performance improved. However, this comparison is not entirely fair: less information is being decoded, and previously GoFlow2 was decoding to its own structure, which was then copied to our own version.
goos: linux
goarch: amd64
pkg: akvorado/inlet/flow
cpu: AMD Ryzen 5 5600X 6-Core Processor
                            │ bench-2.txt │           bench-3.txt           │
                            │   sec/op    │   sec/op    vs base             │
Netflow/with_encoding-12     10.284µ ± 2%   7.758µ ± 3%  -24.56% (p=0.000 n=10)
Netflow/without_encoding-12   8.975µ ± 2%   7.304µ ± 2%  -18.61% (p=0.000 n=10)
Sflow/with_encoding-12        16.67µ ± 2%   14.26µ ± 1%  -14.50% (p=0.000 n=10)
Sflow/without_encoding-12     14.87µ ± 1%   13.56µ ± 2%   -8.80% (p=0.000 n=10)
As for testing, we use github.com/jhump/protoreflect: the protoparse package parses the proto definition file we generate and the dynamic package decodes the messages. Check the ProtobufDecode() method for more details.4
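Roughly, the decoding side of a test looks like the sketch below; the file and message names are placeholders and the real ProtobufDecode() handles more details:

package decoder_test

import (
    "testing"

    "github.com/jhump/protoreflect/desc/protoparse"
    "github.com/jhump/protoreflect/dynamic"
)

// decodeFlow parses the generated proto definition file and decodes one
// encoded message with it. File and message names are placeholders.
func decodeFlow(t *testing.T, protoFile string, payload []byte) *dynamic.Message {
    t.Helper()
    parser := protoparse.Parser{}
    fds, err := parser.ParseFiles(protoFile)
    if err != nil {
        t.Fatalf("ParseFiles(%q) error:\n%+v", protoFile, err)
    }
    md := fds[0].FindMessage("decoder.FlowMessage") // placeholder message name
    if md == nil {
        t.Fatal("message descriptor not found")
    }
    msg := dynamic.NewMessage(md)
    if err := msg.Unmarshal(payload); err != nil {
        t.Fatalf("Unmarshal() error:\n%+v", err)
    }
    return msg
}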
To get the final figures, I have also optimized the decoding in GoFlow2. It was relying heavily on binary.Read(). This function may use reflection in certain cases and each call allocates a byte array to read data. Replacing it with a more efficient version provides the following improvement:
goos: linux
goarch: amd64
pkg: akvorado/inlet/flow
cpu: AMD Ryzen 5 5600X 6-Core Processor
                            │ bench-3.txt │           bench-4.txt           │
                            │   sec/op    │   sec/op    vs base             │
Netflow/with_encoding-12      7.758µ ± 3%   7.365µ ± 2%   -5.07% (p=0.000 n=10)
Netflow/without_encoding-12   7.304µ ± 2%   6.931µ ± 3%   -5.11% (p=0.000 n=10)
Sflow/with_encoding-12       14.256µ ± 1%   9.834µ ± 2%  -31.02% (p=0.000 n=10)
Sflow/without_encoding-12    13.559µ ± 2%   9.353µ ± 2%  -31.02% (p=0.000 n=10)
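The change boils down to the pattern below: read a fixed-size chunk once and decode the fields directly with the encoding/binary helpers instead of letting binary.Read() go through its reflection-based path. This is a simplified illustration with a made-up header type, not GoFlow2’s actual code:

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
    "io"
)

type sampleHeader struct {
    Format uint32
    Length uint32
}

// Before: binary.Read() may fall back to reflection for structs and
// allocates a buffer on each call.
func readHeaderSlow(r io.Reader) (sampleHeader, error) {
    var h sampleHeader
    err := binary.Read(r, binary.BigEndian, &h)
    return h, err
}

// After: read a fixed-size buffer once and decode the fields directly.
func readHeaderFast(r io.Reader) (sampleHeader, error) {
    var buf [8]byte
    if _, err := io.ReadFull(r, buf[:]); err != nil {
        return sampleHeader{}, err
    }
    return sampleHeader{
        Format: binary.BigEndian.Uint32(buf[0:4]),
        Length: binary.BigEndian.Uint32(buf[4:8]),
    }, nil
}

func main() {
    data := []byte{0, 0, 0, 1, 0, 0, 0, 42}
    h, _ := readHeaderFast(bytes.NewReader(data))
    fmt.Println(h) // {1 42}
}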
It is now easier to collect new data and the inlet component is faster! 🚅
Notice
Some paragraphs were editorialized by ChatGPT, using “editorialize and keep it short” as a prompt. The result was proofread by a human for correctness. The main idea is that ChatGPT should be better at English than me.
1. While empty fields are not serialized to Protocol Buffers, empty columns in ClickHouse take some space, even if they compress well. Moreover, unused fields are still decoded and they may clutter the interface.

2. There is a similar function using NetFlow. The NetFlow and IPFIX protocols are less complex to decode than sFlow as they use a simpler TLV structure.

3. vtprotobuf generates more optimized Go code by removing an abstraction layer. It directly generates the code encoding each field to bytes:

   if m.OutIfSpeed != 0 {
       i = encodeVarint(dAtA, i, uint64(m.OutIfSpeed))
       i--
       dAtA[i] = 0x6
       i--
       dAtA[i] = 0xd8
   }

4. There is also a protoprint package to generate a proto definition file. I did not use it.