Looking for insight on the *Time Range* selection when running against an Amazon S3 bucket
Hi Cribl community. Could anyone provide some additional insight on the Time Range selection when running a Collector against an Amazon S3 bucket? The S3 bucket is full of .gz CSV logs which are updated periodically (8 files per 10-minute period). See attached.

Beginner's naivety suggested that setting a Relative time (-30m, as an example) would only pull files in S3 last updated within that relative time. However, it pulls all files as if I had not set a time filter at all. Another thought is that this time range does not apply to the files themselves, but to the EVENT times contained within the files. In that case, would Cribl need to pull all files before it could filter on event time?

https://docs.cribl.io/stream/collectors-schedule-run

Thank you! (As an alternative, we could use the Amazon S3-specific collector, which uses event notifications/SQS, but we would like to at least understand the above.)
Answers
-
Jon, what does the Path look like to get to this folder?
-
`/s3_folder/dnslogs/YYYY-MM-DD/` (one folder per day)
-
So in your Collector, your Path is set to something like this: `/s3_folder/dnslogs/${_time:%Y}-${_time:%m}-${_time:%d}/`?
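For illustration only (this is not the Collector's actual internals): when the Path carries date tokens like the one above, a relative Time Range can be narrowed down to the handful of daily folders the range touches before any objects are listed. The dates in this sketch are made up for the example.

```python
from datetime import datetime, timedelta, timezone

def folders_for_range(earliest: datetime, latest: datetime) -> list[str]:
    """Expand a daily-folder path over a time range: a rough stand-in for
    what date tokens in a Collector Path make possible -- only the folders
    the range touches need to be listed."""
    folders, day = [], earliest.date()
    while day <= latest.date():
        folders.append(f"/s3_folder/dnslogs/{day.strftime('%Y-%m-%d')}/")
        day += timedelta(days=1)
    return folders

# A -30m relative range that happens to straddle midnight UTC (hypothetical dates)
latest = datetime(2024, 5, 2, 0, 10, tzinfo=timezone.utc)
earliest = latest - timedelta(minutes=30)
print(folders_for_range(earliest, latest))
# ['/s3_folder/dnslogs/2024-05-01/', '/s3_folder/dnslogs/2024-05-02/']
```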
-
It is not...yet :slightly_smiling_face: Currently working on the best way to collect only the most recent logs while learning Cribl along the way. I suspect this is the correct thing to do. But what, then, does the time range do?
-
I believe the path supplied in Cribl ended in `dnslogs/`
-
IIRC, it will go to the correct folder based on your relative time and start collecting the events in that folder. If, upon uncompressing the .gz files, it sees an event with a timestamp outside that range, it will not collect that specific event.
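A rough mental model of that two-stage behavior, sketched in Python against local files (assuming ISO-8601 timestamps with an explicit UTC offset in a hypothetical `timestamp` column; Cribl's event-time parsing is of course configurable): every file in the candidate folders gets read, but only in-range events are kept.

```python
import csv
import gzip
from datetime import datetime

def collect(paths, earliest, latest, timestamp_field="timestamp"):
    """Read every candidate .gz CSV file, emit only events inside the Time Range."""
    for path in paths:
        with gzip.open(path, mode="rt", newline="") as fh:
            for row in csv.DictReader(fh):
                ts = datetime.fromisoformat(row[timestamp_field])
                if earliest <= ts <= latest:
                    yield row  # inside the range: keep the event
                # outside the range: the file was still downloaded and
                # decompressed, but this event is dropped
```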
-
THAT makes more sense... which is why ALL files needed to be pulled down to be checked. Not that we need this, but is there a way to use tokens to only pull files based on last modified time?
-
Does the Destination path to the S3 bucket look like this: `/s3_folder/dnslogs/${C.Time.strftime(_time ? _time : Date.now()/1000, '%Y-%m-%d')}`?
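For anyone reading along, here is roughly what that `C.Time.strftime(...)` expression resolves to, shown as an equivalent Python snippet (the sample epoch value is made up):

```python
import time
from datetime import datetime, timezone

def daily_folder(_time=None):
    """Roughly what ${C.Time.strftime(_time ? _time : Date.now()/1000, '%Y-%m-%d')}
    produces: the event's epoch-seconds _time (falling back to "now"),
    formatted as a daily folder name."""
    epoch = _time if _time is not None else time.time()
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d")

print(f"/s3_folder/dnslogs/{daily_folder(1714608000)}")
# /s3_folder/dnslogs/2024-05-02
```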
-
We haven't specified anything with tokens, etc. This is our first connection to S3, and we just started using Cribl this month.
-
The examples above are not using tokens, just JavaScript expressions.
-
We haven't used JavaScript for any paths.
-
The Destination just writes out the event to the correct folder using the values extracted from `_time`, and the Collector does something similar.
-
Okay. We are only ingesting data from the S3 bucket. Do you know if there's a way to only pull in files based on last modified time?
-
I didn't see anything like that
-
Also wondering: assuming we configure Cribl to only pull in the current day's logs, and then narrow down to the current timeframe (hour, or 10-minute period), how would we go about ensuring that 1) duplicate events are not processed and 2) events are not missed?
-
(I think this is where the Amazon S3 connector comes into play...using event notifications/SQS instead of the standard S3 collector)
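A generic illustration (outside Cribl) of the trade-off raised in that question: overlapping scheduled windows help avoid missing late-arriving files, but then the same object can be picked up twice, so something has to deduplicate. The event-notification/SQS approach sidesteps the re-listing problem by notifying on each new object instead of re-scanning a time window. The function names below are placeholders.

```python
seen_keys: set[str] = set()

def run_window(list_candidate_keys, process_object):
    """One scheduled run: list candidate objects for the window,
    skip any key an earlier (overlapping) run already handled."""
    for key in list_candidate_keys():  # e.g. everything under today's folder
        if key in seen_keys:
            continue                   # duplicate from an overlapping window
        process_object(key)
        seen_keys.add(key)
```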
-
Thanks for the assistance, David! Hopping off Slack and will put in a ticket for the remaining Q's.