Looking for insight on the *Time Range* selection when running against an Amazon S3 bucket
Hi Cribl community. Could anyone provide some additional insight on the Time Range selection when running a Collector against an Amazon S3 bucket? The S3 bucket is full of .gz CSV logs which are updated periodically (8 files per 10-minute period). See attached.

Beginner's naivety suggested that setting a Relative time (-30m, as an example) would only pull files in S3 last updated within that relative time. However, it pulls all files as if I had not set a time filter at all. Another thought is that this time range does not apply to the files themselves, but to the EVENT times contained within the files. In that case, would Cribl need to pull all files before it could filter on event time?

https://docs.cribl.io/stream/collectors-schedule-run

Thank you! (As an alternative, we could use the Amazon S3-specific collector, which uses event notifications/SQS, but we would like to at least understand the above.)
Answers
-
Jon, what does the Path look like to get to this folder?
-
`/s3_folder/dnslogs/YYYY-MM-DD/` (one folder per day)
-
So in your Collector, your Path is set to something like this: `/s3_folder/dnslogs/${_time:%Y}-${_time:%m}-${_time:%d}/`?
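For illustration only (this is not the Collector's actual internals): when the Path carries date tokens like the one above, a relative Time Range can be narrowed down to the handful of daily folders the range touches before any objects are listed. The dates in this sketch are made up for the example.

```python
from datetime import datetime, timedelta, timezone

def folders_for_range(earliest: datetime, latest: datetime) -> list[str]:
    """Expand a daily-folder path over a time range: a rough stand-in for
    what date tokens in a Collector Path make possible -- only the folders
    the range touches need to be listed."""
    folders, day = [], earliest.date()
    while day <= latest.date():
        folders.append(f"/s3_folder/dnslogs/{day.strftime('%Y-%m-%d')}/")
        day += timedelta(days=1)
    return folders

# A -30m relative range that happens to straddle midnight UTC (hypothetical dates)
latest = datetime(2024, 5, 2, 0, 10, tzinfo=timezone.utc)
earliest = latest - timedelta(minutes=30)
print(folders_for_range(earliest, latest))
# ['/s3_folder/dnslogs/2024-05-01/', '/s3_folder/dnslogs/2024-05-02/']
```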
-
It is not...yet :slightly_smiling_face: Currently working on the best way to collect only the most recent logs while learning Cribl along the way. I suspect this is the correct thing to do. But what, then, does the time range do?
-
I believe the path supplied in Cribl ended in `dnslogs/`
-
IIRC, it will go to the correct folder based on your relative time and start collecting the events in that folder. If, upon uncompressing the .gz files, it sees an event with a timestamp outside that range, it will not collect that specific event.
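A rough mental model of that two-stage behavior, sketched in Python against local files (assuming ISO-8601 timestamps with an explicit UTC offset in a hypothetical `timestamp` column; Cribl's event-time parsing is of course configurable): every file in the candidate folders gets read, but only in-range events are kept.

```python
import csv
import gzip
from datetime import datetime

def collect(paths, earliest, latest, timestamp_field="timestamp"):
    """Read every candidate .gz CSV file, emit only events inside the Time Range."""
    for path in paths:
        with gzip.open(path, mode="rt", newline="") as fh:
            for row in csv.DictReader(fh):
                ts = datetime.fromisoformat(row[timestamp_field])
                if earliest <= ts <= latest:
                    yield row  # inside the range: keep the event
                # outside the range: the file was still downloaded and
                # decompressed, but this event is dropped
```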
-
THAT makes more sense... which is why ALL files needed to be pulled down to be checked. Not that we need this, but is there a way to use tokens to only pull files based on last modified time?
-
Does the Destination path to the S3 bucket look like this: `/s3_folder/dnslogs/${C.Time.strftime(_time ? _time : Date.now()/1000, '%Y-%m-%d')}`?
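For anyone reading along, here is roughly what that `C.Time.strftime(...)` expression resolves to, shown as an equivalent Python snippet (the sample epoch value is made up):

```python
import time
from datetime import datetime, timezone

def daily_folder(_time=None):
    """Roughly what ${C.Time.strftime(_time ? _time : Date.now()/1000, '%Y-%m-%d')}
    produces: the event's epoch-seconds _time (falling back to "now"),
    formatted as a daily folder name."""
    epoch = _time if _time is not None else time.time()
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d")

print(f"/s3_folder/dnslogs/{daily_folder(1714608000)}")
# /s3_folder/dnslogs/2024-05-02
```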
-
We haven't specified anything with tokens, etc. This is our first connection to S3, and we just started using Cribl this month.
-
The examples above are not using tokens, just JavaScript expressions.
-
We haven't used JavaScript for any paths.
-
The Destination just writes out the event to the correct folder using the values extracted from `_time`, and the Collector does something similar.
-
Okay. We are only ingesting data from the S3 bucket. Do you know if there's a way to only pull in files based on last modified time?
-
I didn't see anything like that
-
Also wondering: assuming we configure Cribl to only pull in the current day's logs, and then narrow down to the current timeframe (hour, or 10-minute period), how would we go about ensuring that 1) duplicate events are not processed and 2) events are not missed?
-
(I think this is where the Amazon S3 connector comes into play...using event notifications/SQS instead of the standard S3 collector)
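A generic illustration (outside Cribl) of the trade-off raised in that question: overlapping scheduled windows help avoid missing late-arriving files, but then the same object can be picked up twice, so something has to deduplicate. The event-notification/SQS approach sidesteps the re-listing problem by notifying on each new object instead of re-scanning a time window. The function names below are placeholders.

```python
seen_keys: set[str] = set()

def run_window(list_candidate_keys, process_object):
    """One scheduled run: list candidate objects for the window,
    skip any key an earlier (overlapping) run already handled."""
    for key in list_candidate_keys():  # e.g. everything under today's folder
        if key in seen_keys:
            continue                   # duplicate from an overlapping window
        process_object(key)
        seen_keys.add(key)
```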
-
Thanks for the assistance, David! Hopping off Slack and will put in a ticket for the remaining Q's.