Cribl S3 Collector collects all the log files from the S3 bucket every time instead of only new ones

Dinesh Raja · November 2023

Hi mates,
I'm using the Cribl S3 Collector to collect logs from an AWS S3 bucket. The bucket contains Akamai DataStream logs, and the file names follow the format below.

e.g. s3://BUCKETNAME/APPNAME/ENV/ak-913478-1701139960-008071-ds.gz

ak - Akamai file prefix

913478 - random string

170xxxxxxxx - epoch timestamp

008071 - random string

ds - file suffix

When I schedule the S3 Collector to run every 15 minutes, it collects all the log files from the bucket on every run.

I'm looking for a suggestion: how do I collect only the new files from S3 rather than all the files every time?

Brendan Dalpe · November 2023

Hi @Dinesh Raja, what does the epoch timestamp in the filename represent? When the file was created? The first event timestamp? If you can provide this info, we can help with a solution.

Dinesh Raja · November 2023

Hello @Brendan Dalpe,
Thanks for the response. Yes, the epoch timestamp represents the file creation time.

Paul Dott · December 2023

@Dinesh Raja, you can use a path extractor in the Collector to set _time from the timestamp in the filename.

https://docs.cribl.io/stream/collectors-s3/#path-extractors

Something like this should work for you based on this path:

s3://BUCKETNAME/APPNAME/ENV/ak-913478-1701139960-008071-ds.gz

Path: /${appname}/${env}/${file}
Token: file
Extractor Expression: {"_time": value.match(/ak-\d+-(\d+)-\w+/)[1]}

This assumes the file path is consistent, but it should get you going in the right direction. With _time set from the timestamp in the filename, the earliest/latest settings in your collection job have something to key off, so the job won't need to grab all files. Hope this helps.
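
If it helps to sanity-check the regex outside of Cribl, here is a rough sketch in plain JavaScript. The function names (extractTime, filesInWindow), the null guard, the Number() conversion, and the second and third sample file names are all assumptions added for illustration. The window filter is only a stand-in for the idea of an earliest/latest range, not how the Collector implements it internally.

```javascript
// Hypothetical helper mirroring the extractor expression from the post above.
// The null guard and Number() conversion are additions for this sketch.
function extractTime(value) {
  const m = value.match(/ak-\d+-(\d+)-\w+/);
  // Return an empty object when the file name does not follow the pattern,
  // instead of throwing on m[1].
  return m ? { _time: Number(m[1]) } : {};
}

// Hypothetical stand-in for a time-range filter: keep only files whose
// extracted _time falls in [earliest, latest).
function filesInWindow(fileNames, earliest, latest) {
  return fileNames.filter((name) => {
    const { _time } = extractTime(name);
    return _time !== undefined && _time >= earliest && _time < latest;
  });
}

const files = [
  "ak-913478-1701139960-008071-ds.gz", // file name from the question
  "ak-913478-1701140900-008072-ds.gz", // made-up newer file
  "unrelated.txt",                     // made-up non-matching file
];

// A 15-minute window ending at an arbitrary "now" chosen for the example.
const latest = 1701141000;
const earliest = latest - 15 * 60;

console.log(extractTime(files[0]));
console.log(filesInWindow(files, earliest, latest));
```

With a 15-minute window, only the file whose embedded timestamp falls inside that window is kept. That is the effect you want from pairing the path extractor with a time range on the scheduled job.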
