Asynchronicity & Net-SNMP AgentX protocol
Vincent Bernat
One of the best way to add SNMP support to an application is the use of AgentX protocol: an application can act as a subagent and handle requests from a master agent using AgentX protocol. The most widespread implementation for such a task is the one from Net-SNMP.
TL;DR
An application asynchronously leveraging Net-SNMP AgentX implementation has many opportunities to get stuck when the master agent becomes temporarily unavailable. A mostly efficient workaround is to dedicate a master agent for AgentX and delegate MIB handling to a subagent.
Pinging the master agent#
Net-SNMP uses an event-based model. It also provides some synchronous functions but they are built on top of the asynchronous ones. It is shipped with its own event loop. Therefore, Net-SNMP AgentX protocol implementation seems a fine choice to add SNMP support to an event-based program, for example in Keepalived, a VRRP daemon and a monitoring daemon for LVS clusters.
Unfortunately, an AgentX subagent is expected to check, at a regular
interval, if the master agent is alive . This is handled automatically
with Net-SNMP implementation as long as you call run_alarms()
function which will call agentx_check_session()
:
/* * check a session validity for connectivity to the master agent. If * not functioning, close and start attempts to reopen the session */ void agentx_check_session(unsigned int clientreg, void *clientarg) { /* […] */ DEBUGMSGTL(("agentx/subagent", "checking status of session %p\n", ss)); if (!agentx_send_ping(ss)) { snmp_log(LOG_WARNING, "AgentX master agent failed to respond to ping. Attempting to re-register.\n"); /* * master agent disappeared? Try and re-register. * close first, just to be sure . */ agentx_unregister_callbacks(ss); agentx_close_session(ss, AGENTX_CLOSE_TIMEOUT); /* […] */ if (main_session != NULL) { /* […] */ main_session = NULL; agentx_reopen_session(0, NULL); } else { snmp_close(main_session); main_session = NULL; } } else { DEBUGMSGTL(("agentx/subagent", "session %p responded to ping\n", ss)); } }
This piece of code will end up calling (directly or indirectly) the
following functions1 which are using
agentx_synch_response()
:
agentx_send_ping()
agentx_open_session()
agentx_reopen_session()
agentx_close_session()
agentx_register()
agentx_unregister()
agentx_register_index()
agentx_unregister_index()
agentx_add_agentcaps()
agentx_remove_agentcaps()
agentx_synch_response()
will synchronously wait for an answer
from the master agent. Your event-based program will just sit here
doing nothing while this happens. For example, keepalived may be
unable to send VRRP probes resulting in some serious havoc in your
cluster.
Functions prefixed by agentx_
are not part of Net-SNMP API and could
therefore be rewritten to do their work asynchronously. Unfortunately,
talk is cheap. Moreover, there exists
other deep paths that will block a subagent. For example, the
low-level connection to the master agent (using TCP or a Unix socket,
but not with UDP) will use a blocking connect()
call. It seems quite
difficult to fix this.
Here are some (partial) workarounds:
-
Call
agentx_check_session()
less often. The default interval is 15 seconds. Disabling it completely or using a very high value is not advised because this function is the only way to reestablish a connection to the master agent when it is, for example, restarted. Here is how to change the delay to 120 seconds:netsnmp_ds_set_int(NETSNMP_DS_APPLICATION_ID, NETSNMP_DS_AGENT_AGENTX_PING_INTERVAL, 120);
-
Don’t call
agentx_check_session()
if we know the master agent is alive. For example, if it just sent us a new request, no need to ping it. I have written a patch implementing such a workaround. The ABI is left untouched, therefore, you only need to recompile Net-SNMP libraries. -
Use a lower timeout and no retry. By default, the timeout for synchronous requests is 1 second and a failed request will be retried 5 times. Because several synchronous functions are used in a row, you may be trapped for as long as 30 seconds with these settings. Here is how2 to disable the retry mechanism and to set the timeout to a very low value:
static int snmp_setup_session_cb(int majorID, int minorID, void *serverarg, void *clientarg) { netsnmp_session *sess = serverarg; sess->timeout = ONE_SEC / 3; sess->retries = 0; return 0; } void some_init_function() { /* […] */ snmp_register_callback(SNMP_CALLBACK_LIBRARY, SNMP_CALLBACK_SESSION_INIT, snmp_setup_session_cb, NULL); /* […] */ }
Unresponsive master agent#
Why is the master agent unresponsive? snmpd
is also an event-based
program and, trust me, there are a lot of places where it will become
unresponsive. Here are two examples:
-
When using the
pass
or thepass_persist
directive, Net-SNMP will block until the delegated command outputs a complete line./* var_extensible_pass() in agent/mibgroup/ucd-snmp/pass.c */ /* * valid call. Exec and get output */ if ((fd = get_exec_output(passthru)) != -1) { file = fdopen(fd, "r"); if (fgets(buf, sizeof(buf), file) == NULL) { fclose(file); wait_on_exec(passthru); /* […] */ continue; } /* […] */ fclose(file); wait_on_exec(passthru); /* […] */ }
-
DISMAN-PING-MIB
implementation will just hang while pinging the remote host. If the host is down and you have requested 5 probes with a timeout of 1 second,snmpd
will be unresponsive for 5 seconds. A similar problem exists withDISMAN-TRACEROUTE-MIB
.
Minimal master agent#
More and more code is running in the main agent. You cannot expect it to not block. But we could add a minimal master agent to the mix: it will handle only a few MIB modules by itself while most MIB modules will be delagated to a complete agent running as a subagent. Your own subagent will connect to the minimal master agent.
It is quite easy to turn snmpd
into a subagent: just use -X
flag. We should also disable unwanted MIB modules for each agent with
-I
flag. For the minimal master agent, we only enable some
modules. All other modules3 will be served by snmpd
acting
as a subagent.
$ MODS="snmp_mib,sysORTable,usmConf,usmStats,usmUser,vacm_conf,vacm_context,vacm_vars" $ snmpd -Lsd -Lf /dev/null -u snmp -g snmp \ > -C -c /etc/snmp/snmpd.master.conf \ > -p /var/run/snmpd.master.pid \ > -I $MODS $ snmpd -Lsd -Lf /dev/null -u snmp -g snmp \ > -p /var/run/snmpd.pid -X \ > -I -$MODS
Access control configuration should be placed in
/etc/snmp/snmpd.master.conf
, as well as master agentx
directive. Other directives (like pass
, pass_persist
, load
,
sysname
, …) in /etc/snmp/snmpd.conf
.
The minimal master agent is very unlikely to block and while the
ground cause is still here, it is now very much mitigated. You may
extend this solution by splicing snmpd
into several subagents: one
handling safe modules, one handling pass
and pass_persist
stuff
and a last one for unsafe modules (like DISMAN-PING-MIB
).
-
agentx_reopen_session()
will callagentx_close_session()
andagentx_open_session()
.agentx_register()
and the following functions will be called by these two functions. ↩︎ -
NETSNMP_DS_AGENT_AGENTX_TIMEOUT
andNETSNMP_DS_AGENT_AGENTX_RETRIES
variables are only used by the master agent, not by a subagent. ↩︎ -
Use
snmpd -Dmib_init -H
to get the list of modules. You may want to addsmux
module to the master agent. ↩︎