My journey into sourcePQ and delays in events getting indexed in Splunk for low-volume data sources
I am a long-time Cribl & Splunk user. I have been on these platforms for about five years now, and I have made my share of stupid mistakes, but I have learned a lot about both Cribl & Splunk along the way.
In my journey to build a more resilient Cribl + Splunk environment within the constraints I have ($$ and time), I am constantly trying to make things a bit better, and I introduced a change in my Splunk UF deployment: I set up routing entries in outputs.conf so that all index=_internal traffic is shipped directly to the Splunk indexers, bypassing Cribl. The thinking was that this would let me monitor my Splunk UFs and get ahead of any problems the Cribl workers might have processing the events we really care about.
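For anyone curious, here is a rough sketch of one way to set up that kind of split. The group names, hostnames, and ports are placeholders, and the exact stanzas in my deployment differ a bit, but the idea is the same: two tcpout groups, with the UF's internal logs pinned to the indexer group.

```
# outputs.conf (on the UF): two target groups, Cribl workers as the default
[tcpout]
defaultGroup = cribl_workers

[tcpout:cribl_workers]
server = cribl-worker-1.example.com:9997, cribl-worker-2.example.com:9997

[tcpout:splunk_indexers]
server = idx-1.example.com:9997, idx-2.example.com:9997

# inputs.conf (on the UF): send the UF's own internal logs straight to the indexers
[monitor://$SPLUNK_HOME/var/log/splunk]
_TCP_ROUTING = splunk_indexers
```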
This change introduced some unintentional side effects, which led me to understand sourcePQ (sPQ) much better. My nodes ship out metrics periodically, 24/7, while events come in on a much more activity-driven basis: some of the nodes get very chatty during certain hours of the day and go totally radio silent for hours on end, day and night.
Once I introduced the change on my Splunk UFs, I started noticing that metrics from some nodes, which flow through Cribl on their way to the Splunk indexers, were arriving delayed, sometimes by almost an hour. This was especially puzzling considering there was just not much load in my system, either in terms of event flow or compute utilization (props to the CriblVision team here, which helps me observe this from the Cribl side). Further, the problem would resolve itself every time I either bounced the Splunk UF, applied a change to my worker group, or bounced the worker group. I never lost an event, and everything would eventually catch up, but there was always a delay.
Upon further digging I think I was able to account for this, and I feel like I have it almost under control.
- When the Splunk UF comes up, it looks like it opens a socket to the Cribl worker it has been configured to connect to.
- If you have sPQ enabled, it has a default buffer of 1,000 received events before the batch gets written to disk (in either Smart or Always On mode). Also keep in mind that you will never see an event in a live capture until it has been written to disk.
- If you have a Worker Node with a lot of worker processes, this problem shows up even more for low-volume sources, since each worker process keeps its own queue and the incoming connections (and their events) are spread across more processes, so each individual buffer fills that much more slowly. If your nodes are constantly producing events, the problem just doesn't show up.
- Finally, I decided to test this: I created a fake log file of 1,001 unique entries and did a splunk add oneshot of that file (rough sketch below), and lo and behold, all the events that had been sitting in the queue but not drained showed up in the live view and then in Splunk.
- Basically, when I re-routed all my index=_internal traffic away from Cribl to go directly into Splunk, I brought the flow of events through Cribl for some nodes almost to a standstill. As long as all the index=_internal traffic was flowing, the plumbing never dried up, and sPQ was functioning across all workers and pushing data along 24/7/365.
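If you want to reproduce the test, this is roughly what I did. The file path, index, and sourcetype are placeholders for your environment; 1,001 events is just enough to tip a 1,000-event buffer over.

```
# generate 1,001 unique log lines (placeholder path)
for i in $(seq 1 1001); do
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) test_event_number=$i" >> /tmp/spq_test.log
done

# one-shot the file through the UF so it follows the same path into Cribl
$SPLUNK_HOME/bin/splunk add oneshot /tmp/spq_test.log -index main -sourcetype spq_test
```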
Here are the things I have now done to kinda get this under control.
First, I shrunk the number of worker processes in the worker group in question from 30 to 10 in my case. I also dialed the sPQ buffer down from the default 1,000 events to 250, so low-volume sources get written to disk (and become visible) sooner. Finally, I filed a feature request to expose a time-based flush setting in sPQ that would automatically force the worker to flush its queue on an interval rather than wait for it to completely fill up.
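To keep an eye on whether the delay is actually gone, a rough check I find useful is comparing _time to _indextime per host over the last day. The index name here is a placeholder; point it at whatever your low-volume sources write to.

```
index=my_metrics_index earliest=-24h
| eval lag_seconds=_indextime - _time
| stats max(lag_seconds) AS max_lag_s, perc95(lag_seconds) AS p95_lag_s BY host
| sort - max_lag_s
```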
I would love to get feedback from other users.