
Does Cribl keep any sort of state regarding files it has seen before?

Hi folks, new Cribl user here. For the Google Cloud Storage collection (or any collection, for that matter), does Cribl keep any sort of state regarding files it has seen before? I'm coming from Logstash/Splunk apps, which do keep track of files that have been indexed via checkpoints, but I didn't see anything like that in Cribl. I wanted to make sure I'm not missing it.

Answers

  • Cribl does not store state for collectors. You would need to set your time or search parameters when scheduling the collection. For AWS S3 and Azure Blob, you have the option of using SQS or a queue to pull new data when it lands in the bucket/blob. I am not sure if that is available for Google Cloud Storage.

  • Another way to put it would be: if I set my collection to run once per hour over the last 3 hours' worth of data, is it going to re-index those previously seen/processed files?

  • Raanan Dagan (mod)

    The Cribl leader keeps the state of what has been collected during the run itself. However, as Josh highlighted, it does not carry that state over from one run to the next.

  • Raanan Dagan (mod)

    In the Collection phase, the list of files to process is spread across 1..N workers, with the goal of distributing tasks as evenly as possible. Those workers then stream the files from the remote Google Cloud Storage location to themselves as an input.
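To make the "as evenly as possible" idea concrete, here is a minimal sketch that spreads a file list across workers with round-robin assignment. The thread does not describe Cribl's actual scheduling logic, so round-robin is only an illustrative stand-in, and the function name `distribute` and the file paths are hypothetical.

```python
from typing import Dict, List


def distribute(files: List[str], worker_count: int) -> Dict[int, List[str]]:
    """Assign files to workers round-robin so batch sizes differ by at most one.

    Illustrative only: this is NOT Cribl's scheduler, just one simple way
    to spread a discovered file list across N workers.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    assignments: Dict[int, List[str]] = {w: [] for w in range(worker_count)}
    for index, path in enumerate(files):
        assignments[index % worker_count].append(path)
    return assignments


# Hypothetical object names, made up for the example.
files = [f"logs/2024/05/01/10/part-{i}.json.gz" for i in range(7)]
plan = distribute(files, 3)
for worker, batch in plan.items():
    print(worker, len(batch))
```

With 7 files and 3 workers, the batches come out as 3, 2, and 2 files.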

  • Raanan Dagan (mod)

    The best practice is to:

    1. Create a partition for Year / Month / Day / Hour.
    2. Schedule a run to collect once an hour.
    3. Bring back 1 hour's worth of data.

    The same approach applies to any time range you choose.
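Because no state is kept between runs, how often a file is collected depends only on how the schedule and the time range line up. The sketch below counts how many hourly runs would cover a given hour of data, comparing a 3-hour lookback with a 1-hour lookback. It assumes each run's range is aligned to the top of the hour; the helper names and dates are hypothetical and not Cribl settings.

```python
from datetime import datetime, timedelta, timezone


def window(run_time, lookback_hours):
    """Return the [earliest, latest) range a run covers, aligned to the hour."""
    latest = run_time.replace(minute=0, second=0, microsecond=0)
    return latest - timedelta(hours=lookback_hours), latest


def times_collected(hour_start, run_times, lookback_hours):
    """Count how many runs include the hour of data starting at hour_start."""
    count = 0
    for run_time in run_times:
        earliest, latest = window(run_time, lookback_hours)
        if earliest <= hour_start < latest:
            count += 1
    return count


# One run per hour for a day, each firing at five past the hour.
first_run = datetime(2024, 5, 1, 0, 5, tzinfo=timezone.utc)
runs = [first_run + timedelta(hours=h) for h in range(24)]
target = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)

print(times_collected(target, runs, lookback_hours=3))  # 3
print(times_collected(target, runs, lookback_hours=1))  # 1
```

With a 3-hour lookback, the 10:00 hour falls inside three consecutive runs, so its files would be collected three times. Matching the lookback to the schedule interval means each hour is covered exactly once.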

  • Ah, got it, that's helpful, thank you! GCS appears to automatically handle the `Year / Month / Day / Hour` partitioning, so I take it Cribl's workers are smart enough to just look in the correct directory (as opposed to rescanning the entire bucket)?

  • <@U03JJNGAXB6> Google supports bucket notifications via Pub/Sub for newly created files. Could Cribl ingest those today and have that trigger GCS collections?

  • Raanan Dagan (mod)

    In the GCS collector, you can add a time partition to the Path, such as `${_time:%Y}/${_time:%m}/${_time:%d}` and so on. There is also the option to 'Disable time filter' (defaults to No).
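As a rough illustration of why time tokens in the Path let a run look only in the matching directories, the sketch below expands a `${_time:...}` template over a time range. It is not Cribl's implementation: the `logs/` prefix, the trailing `%H` segment, and the helper names `expand` and `prefixes` are assumptions made for the example.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches tokens of the form ${_time:<strftime format>}.
TOKEN = re.compile(r"\$\{_time:([^}]+)\}")


def expand(template, moment):
    """Replace each ${_time:...} token with the formatted timestamp."""
    return TOKEN.sub(lambda m: moment.strftime(m.group(1)), template)


def prefixes(template, earliest, latest, step=timedelta(hours=1)):
    """List the distinct paths the template yields for [earliest, latest)."""
    seen = []
    moment = earliest
    while moment < latest:
        path = expand(template, moment)
        if path not in seen:
            seen.append(path)
        moment += step
    return seen


# Hypothetical template: the "logs/" prefix and the %H segment are assumed.
template = "logs/${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}"
earliest = datetime(2024, 5, 1, 22, tzinfo=timezone.utc)
latest = datetime(2024, 5, 2, 1, tzinfo=timezone.utc)
for p in prefixes(template, earliest, latest):
    print(p)
```

For this 3-hour range, only `logs/2024/05/01/22`, `logs/2024/05/01/23`, and `logs/2024/05/02/00` are produced, so nothing else in the bucket would need to be listed.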

  • It looks like the S3 source has an input for an associated SQS queue, whereas GCS doesn't have an input for an associated Pub/Sub topic.