does cribl keep any sort of state regarding files it has seen before?

sondra@cribl · September 2023

Hi folks, new cribl user here. For the Google Cloud Storage collection (or any collection for that matter), does cribl keep any sort of state regarding files it has seen before? I'm coming from logstash/Splunk apps where they do actually keep track of files that have been indexed via checkpoints, but I didn't see anything like that in cribl - wanted to make sure I'm not missing it

sondra@cribl · September 2023

Another way to put it would be: if I set my collection to run once per hour over the last 3 hours' worth of data, is it going to re-index the files it has already seen and processed?

Joshua Napier · September 2023

Cribl does not store the state for collectors. You would need to set your time or search parameters when scheduling the search. For AWS S3 and Azure Blob, you have the option of using SQS or Queue to pull new data when it lands in the bucket/blob. I am not sure if that is available for Google Cloud Storage

Raanan Dagan · September 2023

The Cribl leader keeps track of what has been collected during the run itself. However, as Josh highlighted, it does not carry that state over from one run to the next.

Raanan Dagan · September 2023

In the Collection phase, the list of files to process is spread across 1..N workers, with the goal of distributing tasks as evenly as possible. Those workers then stream the files from the remote Google Cloud Storage location to themselves as an Input.
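To illustrate the idea of spreading a file list evenly across workers, here is a minimal Python sketch. It assumes a simple round-robin assignment; the thread does not say which strategy Cribl actually uses, and the function name and file paths are hypothetical.

```python
from typing import Dict, List


def assign_round_robin(files: List[str], worker_count: int) -> Dict[int, List[str]]:
    """Spread files across workers so task counts differ by at most one.

    Round-robin is an assumption for illustration only, not Cribl's
    documented scheduling behavior.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    assignments: Dict[int, List[str]] = {w: [] for w in range(worker_count)}
    for index, name in enumerate(files):
        assignments[index % worker_count].append(name)
    return assignments


# Hypothetical file names discovered for one collection run.
files = [f"logs/2023/09/14/10/part-{i}.json" for i in range(5)]
plan = assign_round_robin(files, 2)
for worker, assigned in plan.items():
    print(worker, assigned)
```

With five files and two workers, one worker gets three files and the other gets two.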

Raanan Dagan · September 2023

The best practice is to: 1. Partition the data by Year / Month / Day / Hour. 2. Schedule a run to collect once an hour. 3. Bring back 1 hour's worth of data. The same approach applies to any time range you want.
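The reason the window should match the schedule can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not of Cribl internals: the `hourly_prefixes` helper and the `%Y/%m/%d/%H` layout are assumptions. Since no state is kept between runs, any hour covered by two runs gets collected twice.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def hourly_prefixes(run_time: datetime, lookback_hours: int) -> List[str]:
    """Return Year/Month/Day/Hour prefixes for the complete hours before run_time."""
    top_of_hour = run_time.replace(minute=0, second=0, microsecond=0)
    hours = [top_of_hour - timedelta(hours=h) for h in range(lookback_hours, 0, -1)]
    return [h.strftime("%Y/%m/%d/%H") for h in hours]


# Two consecutive hourly runs (hypothetical times).
run_a = datetime(2023, 9, 14, 12, 0, tzinfo=timezone.utc)
run_b = run_a + timedelta(hours=1)

# A 1-hour window on an hourly schedule: no partition is read twice.
one_hour_overlap = sorted(set(hourly_prefixes(run_a, 1)) & set(hourly_prefixes(run_b, 1)))

# A 3-hour window on an hourly schedule: two partitions are read again.
three_hour_overlap = sorted(set(hourly_prefixes(run_a, 3)) & set(hourly_prefixes(run_b, 3)))

print(one_hour_overlap)
print(three_hour_overlap)
```

In this sketch the 1-hour window produces no overlap between consecutive runs, while the 3-hour window re-reads the `2023/09/14/10` and `2023/09/14/11` partitions on the second run.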

sondra@cribl · September 2023

Ah, got it, that's helpful, thank you! GCS appears to automatically handle the `Year / Month / Day / Hour` partitioning, so I take it Cribl's workers are smart enough to just look in the correct directory (as opposed to rescanning the entire bucket)?

sondra@cribl · September 2023

Google supports bucket notifications via Pub/Sub for newly created files. Could Cribl ingest those today and have them trigger GCS collections?

sondra@cribl · September 2023

https://cloud.google.com/storage/docs/pubsub-notifications

sondra@cribl · September 2023

It looks like the S3 source has an input for an associated SQS queue, whereas GCS doesn't have an input for an associated Pub/Sub topic.

Raanan Dagan · September 2023

In the GCS collector, you have the option to add a time partition to the Path, such as ${_time:%Y}/${_time:%m}/${_time:%d} and so on. There is also a 'Disable time filter' option (defaults to No).
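As a rough illustration of how a Path with `${_time:...}` tokens resolves to a concrete directory, here is a small Python sketch. It assumes each token holds a strftime-style format string, as the `%Y`, `%m`, `%d` codes suggest; the `expand_path` helper is hypothetical and is not Cribl's own implementation.

```python
import re
from datetime import datetime, timezone

# Matches tokens of the form ${_time:<format>} and captures the format part.
TOKEN = re.compile(r"\$\{_time:([^}]+)\}")


def expand_path(template: str, when: datetime) -> str:
    """Replace each ${_time:<strftime format>} token with the formatted time."""
    return TOKEN.sub(lambda match: when.strftime(match.group(1)), template)


template = "${_time:%Y}/${_time:%m}/${_time:%d}"
expanded = expand_path(template, datetime(2023, 9, 14, 10, 30, tzinfo=timezone.utc))
print(expanded)  # 2023/09/14
```

Because the time range of the run determines which values the tokens take, only the matching directories need to be listed, rather than the whole bucket.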
