
Cribl S3 Collector collects all log files from the S3 bucket every time instead of only new ones

Dinesh Raja Posts: 5

Hi Mates,
I'm using the Cribl S3 Collector to collect logs from an AWS S3 bucket. The bucket contains Akamai DataStream logs, and the file names follow the format below.

eg: s3://BUCKETNAME/APPNAME/ENV/ak-913478-1701139960-008071-ds.gz

ak - Akamai file prefix

913478 - random string

170xxxxxxxx - epoch timestamp

008071 - random string

ds - file suffix

When I schedule the S3 Collector to run every 15 minutes, it collects all the log files from the bucket on every run.

I'm looking for suggestions on how to collect only the new files from S3, rather than all the files every time.

Answers

  • Brendan Dalpe
    Brendan Dalpe Posts: 201 mod

    Hi @Dinesh Raja, what does the epoch timestamp in the filename represent? When the file was created? The first event timestamp? If you can provide this info, we can help with a solution.

  • Dinesh Raja
    Dinesh Raja Posts: 5

    Hello @Brendan Dalpe
    Thanks for the response. Yes, the epoch timestamp represents the file creation time.

  • Paul Dott
    Paul Dott Posts: 33 ✭✭

    @Dinesh Raja, you can use a path extractor in the Collector to set _time from the epoch timestamp in the filename.

    https://docs.cribl.io/stream/collectors-s3/#path-extractors

    Something like this should work for you based on this path:

    s3://BUCKETNAME/APPNAME/ENV/ak-913478-1701139960-008071-ds.gz

    Path: /${appname}/${env}/${file}
    Token: file
    Extractor Expression: {"_time": value.match(/ak-\d+-(\d+)-\w+/)[1]}

    This assumes the file path is consistent, but it should get you going in the right direction. With _time set from the timestamp in the filename, the earliest/latest time range in your collection job has something to key off, so the job won't need to grab all files. Hope this helps.
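
To sanity-check the extractor expression outside of Cribl, here is a minimal JavaScript sketch of what it does to the `file` token, followed by a simple time-window filter. The helper names `extractTime` and `inWindow`, the sample file list, and the half-open window semantics are assumptions made for illustration only; they are not Cribl APIs, and the way the Collector applies earliest/latest internally may differ.

```javascript
// Sketch only: mimics the regex used in the extractor expression above.
// `extractTime` and `inWindow` are hypothetical helpers, not Cribl functions.
const FILE_RE = /ak-\d+-(\d+)-\w+/;

// Pull the epoch (in seconds) out of a file name such as
// "ak-913478-1701139960-008071-ds.gz"; return null if the name does not match.
function extractTime(value) {
  const m = value.match(FILE_RE);
  return m ? Number(m[1]) : null;
}

// Assumed semantics: keep a file when earliest <= epoch < latest.
function inWindow(value, earliest, latest) {
  const t = extractTime(value);
  return t !== null && t >= earliest && t < latest;
}

// Hypothetical listing: one recent file, one older file, one unrelated object.
const files = [
  "ak-913478-1701139960-008071-ds.gz",
  "ak-913478-1701100000-008072-ds.gz",
  "unrelated.txt",
];

const latest = 1701140400;        // end of the collection window (epoch seconds)
const earliest = latest - 15 * 60; // 15 minutes earlier

const selected = files.filter((f) => inWindow(f, earliest, latest));
console.log(selected); // only the first file name falls inside the window
```

Note that the expression as posted indexes `[1]` directly on the match result, which throws if a file name does not fit the pattern; the null check above is one way to guard against that when testing.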