generate logging messages when a Cribl worker node goes off-line & also when it recovers operation

aren · September 2023

I'd like to generate logging messages when a Cribl worker node goes off-line and also when it recovers operation. The CriblLogs internal data source generates an event with a "channel" value of "MetricsStore", a "message" value of "active messages" and a "numMetrics" value of 1 when a worker begins to recover operation, and other numMetrics values such as 212, 193, 303 and 304 when the worker is running. Those events appear to be valid sources for generating one or more log entries for worker recovery. Where are the numMetrics codes documented? I tried searching the Cribl documentation site and the web but didn't find anything about them.

Finding an event generated when a worker goes off-line proved to be more difficult. There is a file on the leader node that records incidents in which a Cribl worker goes off-line. That file is: /opt/cribl/config-volume/state/kvstore/default/InactiveWorkerStore/state.json. I couldn't find any events similar to the entries in that file in the CriblLogs internal data source. Are there any events generated by Cribl internal data sources when a worker goes off-line? Thanks.

Brandon McCombs · September 2023

What is considered "goes offline"? Does that mean the application is down? The OS is down? The API process is available but the worker processes aren't processing events, or something else? The API log files show when the API shuts down. Each worker process has its own set of logs, which record slightly different events when the process stops and starts.

numMetrics is a field in the MetricsStore events, which are logged by the API and by each worker process. Those values aren't codes. They are the number of metrics being collected at that point in time by the process logging the event.

You can check the overall health of a node using the health endpoint if you want to do it that way. https://docs.cribl.io/stream/monitoring#endpoint

aren · September 2023

Thanks for explaining about numMetrics. In this context, "worker node goes offline" means that an API call to the leader node at http://<leader node>:api/v1/master/workers will not show that particular worker node. I like the idea of using the api/health call because it does not require authentication (master/workers does require authentication). Thanks for suggesting it. We're in a Docker swarm with a reverse proxy server. I'll work with the team to route the API call to the individual workers in that environment.

aren · September 2023

Another approach would be to create a user that can execute a limited number of API calls and do nothing else. Such a user would be able to obtain an authorization token to call api/v1/master/workers or api/v1/master/summary/workers, but would not be able to log in to the UI, or call any other API functions. I'm not familiar with the Cribl security model. Would creating such a user be difficult? Thanks.

Brandon McCombs · September 2023

You can set up a role that lists only a limited set of API endpoints it may access. With the way the RBAC model works, the user could still get into the UI, but would be unable to access most pages or perform most actions because of the limited access defined by the role.

The leader node currently doesn't keep state with respect to worker nodes, so if a node isn't returned in that specific API call, it means the node isn't talking to the leader node for whatever reason (we have a feature request to change this behavior so that the leader will list all nodes even if they disconnect). A node missing from the response of that endpoint doesn't necessarily mean the worker node is offline, though. Communication with the leader node is on a separate port, so the worker node could be humming along just fine but be unable to talk to the leader on that port.
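As a rough sketch of that distinction, the check below compares an inventory of workers you expect with the names the leader currently reports. Everything here is an assumption for illustration: the function name is hypothetical, and it expects the worker names to have already been extracted from the master/workers response, since the response layout isn't described in this thread.

```python
def missing_from_leader(expected, reported):
    """Return expected workers that the leader does not currently list.

    A worker in this list is not talking to the leader. As noted above,
    that alone does not prove the worker itself is offline.
    """
    return sorted(set(expected) - set(reported))
```

A worker that shows up in this list is a candidate for a direct health check rather than an immediate "offline" log entry.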

aren · September 2023

Thanks, Brandon. It sounds like calling api/health on each individual worker (and on the leader as well) is the best monitoring approach.
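A minimal sketch of that approach, assuming each node's health URL is reachable from wherever the script runs and that an HTTP 200 reply means the node is up. The logger name, function names, and the node-to-URL mapping are hypothetical, and a node seen for the first time is assumed to have been up before.

```python
import http.client
import logging
import urllib.error
import urllib.request

log = logging.getLogger("cribl_node_monitor")  # hypothetical logger name


def is_healthy(url, timeout=5.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # Refused connection, timeout, HTTP error status, or bad URL.
        return False


def log_transitions(previous, current):
    """Compare two {node: is_up} snapshots and log every state change.

    Returns the (node, event) pairs that were logged, where event is
    "offline" or "recovered".
    """
    events = []
    for node in sorted(current):
        was_up = previous.get(node, True)  # assumption: unseen nodes were up
        now_up = current[node]
        if was_up and not now_up:
            log.warning("worker %s went offline", node)
            events.append((node, "offline"))
        elif not was_up and now_up:
            log.info("worker %s recovered", node)
            events.append((node, "recovered"))
    return events
```

In a polling loop you would build `current` as `{name: is_healthy(url) for name, url in nodes.items()}`, call `log_transitions(previous, current)`, then keep `current` as the next `previous`. Only changes in state produce a log entry, so a worker that stays down is reported once rather than on every poll.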

aren · September 2023

Brandon, you stated that a Cribl worker API creates log entries if it fails. Can you point me to where the log entries are? Can they be routed by configuring rsyslogd? Can they be read using the Cribl internal logs data source? Thanks.

aren · September 2023

Also, my interest is not in the API. It is in detecting a scenario in which the worker is not operating. To test this, I go to the VM or container running the worker and execute "cribl stop".

Brandon McCombs · September 2023

API logs are stored in $CRIBL_HOME/logs/cribl.log. They aren't included in the Cribl Internal Logs input. I suppose you could collect them via rsyslogd if you wanted to. Cribl Edge can work too, and is probably going to be easier.
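If you want to inspect that file yourself before wiring up a collector, a small filter like the one below may help. It is only a sketch and assumes each line of cribl.log is a single JSON object with a "message" field, as the CriblLogs events described earlier suggest. The function name and the search text are hypothetical, and the exact wording of the stop/start messages isn't given in this thread, so look at your own log after a "cribl stop" to decide what to search for.

```python
import json


def iter_log_events(lines, needle):
    """Yield parsed events whose "message" contains needle (case-insensitive)."""
    wanted = needle.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON line
        if not isinstance(event, dict):
            continue
        if wanted in str(event.get("message", "")).lower():
            yield event
```

Because the function accepts any iterable of lines, you can pass it an open file handle for the log and loop over the matching events.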
