Building a SQL-like language to filter flows
Vincent Bernat
Akvorado collects network flows using IPFIX or sFlow. It stores them in a ClickHouse database. A web console allows a user to query the data and plot some graphs. A nice aspect of this console is how we can filter flows with a SQL-like language.
Often, web interfaces expose a query builder to build such filters. I think combining a SQL-like language with an editor supporting completion, syntax highlighting, and linting is a better approach.[1]
The language parser is built with pigeon (Go) from a parsing expression grammar—or PEG. The editor component is CodeMirror (TypeScript).
Language parser
PEG grammars are relatively recent[2] and are an alternative to context-free grammars. They are easier to write and they can generate better error messages. Python switched from an LL(1)-based parser to a PEG-based parser in Python 3.9.
pigeon generates a parser for Go. A grammar is a set of rules. Each rule is an identifier, with an optional user-friendly label for error messages, an expression, and an action in Go to be executed on match. You can find the complete grammar in `parser.peg`. Here is a simplified rule:

```
ConditionIPExpr "condition on IP" ←
  column:("ExporterAddress"i { return "ExporterAddress", nil }
        / "SrcAddr"i { return "SrcAddr", nil }
        / "DstAddr"i { return "DstAddr", nil }) _
  operator:("=" / "!=") _
  ip:IP {
    return fmt.Sprintf("%s %s IPv6StringToNum(%s)",
      toString(column), toString(operator), quote(ip)), nil
  }
```
The rule identifier is `ConditionIPExpr`. It case-insensitively matches `ExporterAddress`, `SrcAddr`, or `DstAddr`. For each of them, it returns the column name with the proper case. That’s what is stored in the `column` variable. Then, it matches one of the possible operators. As there is no code block, it stores the matched string directly in the `operator` variable. Then, it tries to match the `IP` rule, which is defined elsewhere in the grammar. If it succeeds, it stores the result of the match in the `ip` variable and executes the final action. The action turns the column, operator, and IP into a proper expression for ClickHouse. For example, if we have `ExporterAddress = 203.0.113.15`, we get `ExporterAddress = IPv6StringToNum('203.0.113.15')`.
The `IP` rule uses a rudimentary regular expression but checks that the matched address is valid in the action block, thanks to `netip.ParseAddr()`:

```
IP "IP address" ← [0-9A-Fa-f:.]+ {
  ip, err := netip.ParseAddr(string(c.text))
  if err != nil {
    return "", errors.New("expecting an IP address")
  }
  return ip.String(), nil
}
```
Our parser safely turns a filter into a `WHERE` clause accepted by ClickHouse:[3]

```sql
WHERE InIfBoundary = 'external'
  AND ExporterRegion = 'france'
  AND InIfConnectivity = 'transit'
  AND SrcAS = 15169
  AND DstAddr BETWEEN toIPv6('2a01:e0f:ffff::')
                  AND toIPv6('2a01:e0f:ffff:ffff:ffff:ffff:ffff:ffff')
```
Integration in CodeMirror
CodeMirror is a versatile code editor that can be easily integrated into JavaScript projects. In Akvorado, the Vue.js component `InputFilter` uses CodeMirror as its foundation and leverages features such as syntax highlighting, linting, and completion. The source code for these capabilities can be found in the `codemirror/lang-filter/` directory.
Syntax highlighting
The Go PEG grammar cannot be used directly,[4] and parsers for editors have distinct requirements: they should be error-tolerant and operate incrementally, as code is typically updated character by character. CodeMirror offers a solution through its own parser generator, Lezer.
This additional parser does not need to fully understand the filter language. Only the basic structure is required: column names, comparison and logic operators, and quoted and unquoted values. The grammar is therefore quite short and does not need to be updated often:
```
@top Filter { expression }

expression {
  Not expression |
  "(" expression ")" |
  "(" expression ")" And expression |
  "(" expression ")" Or expression |
  comparisonExpression And expression |
  comparisonExpression Or expression |
  comparisonExpression
}
comparisonExpression { Column Operator Value }
Value { String | Literal | ValueLParen ListOfValues ValueRParen }
ListOfValues {
  ListOfValues ValueComma (String | Literal) |
  String |
  Literal
}

// […]

@tokens {
  // […]
  Column { std.asciiLetter (std.asciiLetter|std.digit)* }
  Operator { $[a-zA-Z!=><]+ }
  String {
    '"' (![\\\n"] | "\\" _)* '"'? |
    "'" (![\\\n'] | "\\" _)* "'"?
  }
  Literal { (std.digit | std.asciiLetter | $[.:/])+ }
  // […]
}
```
The expression `SrcAS = 12322 AND (DstAS = 1299 OR SrcAS = 29447)` is parsed to:

```
Filter(Column, Operator, Value(Literal),
       And, Column, Operator, Value(Literal),
       Or, Column, Operator, Value(Literal))
```
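As a sanity check, this tree can be printed directly from the generated Lezer parser with `Tree.toString()`. A minimal sketch, assuming the grammar is compiled to a `parser` export by `@lezer/generator` (the import path is hypothetical):

```typescript
// The import path is an assumption; the parser is generated from the grammar
// above by @lezer/generator, usually through a build-time plugin.
import { parser } from "./syntax.grammar";

const tree = parser.parse("SrcAS = 12322 AND (DstAS = 1299 OR SrcAS = 29447)");
console.log(tree.toString());
// Filter(Column,Operator,Value(Literal),And,Column,Operator,Value(Literal),…)
```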
The last step is to teach CodeMirror how to map each token to a highlighting tag:
```typescript
export const FilterLanguage = LRLanguage.define({
  parser: parser.configure({
    props: [
      styleTags({
        Column: t.propertyName,
        String: t.string,
        Literal: t.literal,
        LineComment: t.lineComment,
        BlockComment: t.blockComment,
        Or: t.logicOperator,
        And: t.logicOperator,
        Not: t.logicOperator,
        Operator: t.compareOperator,
        "( )": t.paren,
      }),
    ],
  }),
});
```
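To plug the language into an editor, it is typically wrapped in a `LanguageSupport` and passed as an extension. A minimal sketch, assuming the `FilterLanguage` definition above (the `filterLanguage()` helper name is my own):

```typescript
import { LanguageSupport } from "@codemirror/language";
import { EditorView, basicSetup } from "codemirror";

// Hypothetical helper bundling the language with future extensions
// (completion, linting, …).
function filterLanguage(): LanguageSupport {
  return new LanguageSupport(FilterLanguage);
}

const view = new EditorView({
  doc: "InIfBoundary = external AND SrcAS = 15169",
  extensions: [basicSetup, filterLanguage()],
  parent: document.body,
});
```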
Linting
We offload linting to the original parser in Go. The `/api/v0/console/filter/validate` endpoint accepts a filter and returns a JSON structure with the errors that were found:

```json
{
  "message": "at line 1, position 12: string literal not terminated",
  "errors": [{
    "line": 1,
    "column": 12,
    "offset": 11,
    "message": "string literal not terminated"
  }]
}
```
The linter source for CodeMirror queries the API and turns each error into a diagnostic.
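A minimal sketch of such a linter source, assuming the endpoint accepts the filter as a JSON payload (the exact request schema is an assumption):

```typescript
import { linter, Diagnostic } from "@codemirror/lint";

const filterLinter = linter(async (view): Promise<Diagnostic[]> => {
  const response = await fetch("/api/v0/console/filter/validate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // The request payload shape is an assumption.
    body: JSON.stringify({ filter: view.state.doc.toString() }),
  });
  const result = await response.json();
  // The API conveniently reports offsets, which map directly to positions.
  return (result.errors ?? []).map(
    (err: { offset: number; message: string }): Diagnostic => ({
      from: err.offset,
      to: err.offset,
      severity: "error",
      message: err.message,
    })
  );
});
```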
Completion
The completion system takes a hybrid approach. It splits the work between the frontend and the backend to offer useful suggestions for completing filters.
The frontend uses the parser built with Lezer to determine the context of the completion: do we complete a column name, an operator, or a value? It also extracts the column name if we are completing something else. It forwards the result to the backend through the `/api/v0/console/filter/complete` endpoint. Walking the syntax tree was not as easy as I thought, but unit tests helped a lot.
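Here is a rough sketch of that context detection, assuming the node names from the grammar above; the request and response schemas are my own guesses:

```typescript
import {
  autocompletion,
  CompletionContext,
  CompletionResult,
} from "@codemirror/autocomplete";
import { syntaxTree } from "@codemirror/language";

async function completeFilter(
  context: CompletionContext
): Promise<CompletionResult | null> {
  const word = context.matchBefore(/[\w.:\/]*/);
  if (word === null) return null;
  // Inspect the syntax node just before the cursor to infer what to complete.
  const node = syntaxTree(context.state).resolveInner(context.pos, -1);
  let what = "column";
  let column = "";
  if (node.name === "Operator") what = "operator";
  if (["Value", "Literal", "String"].includes(node.name)) {
    what = "value";
    // The tree is flat: walk left to the Column of the current comparison.
    let sibling =
      (node.name === "Value" ? node : node.parent)?.prevSibling ?? null;
    while (sibling !== null && sibling.name !== "Column")
      sibling = sibling.prevSibling;
    if (sibling !== null)
      column = context.state.sliceDoc(sibling.from, sibling.to);
  }
  const response = await fetch("/api/v0/console/filter/complete", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // Hypothetical payload; the real schema may differ.
    body: JSON.stringify({ what, column, prefix: word.text }),
  });
  const { completions } = await response.json();
  return {
    from: word.from,
    options: completions.map((c: { label: string; detail: string }) => ({
      label: c.label,
      detail: c.detail,
    })),
  };
}

// Register the source, overriding the default completion sources.
const filterCompletion = autocompletion({ override: [completeFilter] });
```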
The backend uses the parser generated by pigeon to complete a column name or a comparison operator. For values, the completions are either static or extracted from the ClickHouse database. A user can complete an AS number from an organization name thanks to the following snippet:
```go
results := []struct {
	Label  string `ch:"label"`
	Detail string `ch:"detail"`
}{}
columnName := "DstAS"
sqlQuery := fmt.Sprintf(`
 SELECT concat('AS', toString(%s)) AS label,
        dictGet('asns', 'name', %s) AS detail
 FROM flows
 WHERE TimeReceived > date_sub(minute, 1, now())
 AND detail != ''
 AND positionCaseInsensitive(detail, $1) >= 1
 GROUP BY label, detail
 ORDER BY COUNT(*) DESC
 LIMIT 20
`, columnName, columnName)
if err := conn.Select(ctx, &results, sqlQuery, input.Prefix); err != nil {
	c.r.Err(err).Msg("unable to query database")
	break
}
for _, result := range results {
	completions = append(completions, filterCompletion{
		Label:  result.Label,
		Detail: result.Detail,
		Quoted: false,
	})
}
```
In my opinion, the completion system is a major factor in making the filter editor an efficient way to select flows. While a query builder may have been more beginner-friendly, the completion system’s ease of use and functionality make the editor more enjoyable to use once you become familiar with it.
1. Moreover, building a query builder did not seem like a fun task for me.
2. They were introduced in 2004 in “Parsing Expression Grammars: A Recognition-Based Syntactic Foundation.” LR parsers were introduced in 1965, LALR parsers in 1969, and LL parsers in the 1970s. Yacc, a popular parser generator, was written in 1975.
3. The parser returns a string. It does not generate an intermediate AST. This makes it simpler and it currently fits our needs.
4. It could be manually translated to JavaScript with Peggy.