Does the S3 collector pull every file from the s3 bucket on each run?

Good morning all. Does the S3 Collector pull every file from the S3 bucket on each run, or does it keep some state that tracks what has and hasn't been pulled?

Answers

  • Igor Gifrin
    Igor Gifrin Posts: 12 mod

    Looks like it grabs everything - but you can specify a prefix - and if you're lucky, your prefixes are timestamps: https://community.cribl.io/discussion/comment/158#Comment_158
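    To illustrate the timestamp-prefix idea: a minimal Python sketch, assuming the bucket keys are laid out as `<root>/YYYY-MM-DD/...`. The layout, the `exported-logs` root, and the `daily_prefixes` helper are hypothetical and not part of Cribl; the point is only that a time window maps to a small set of prefixes, so a run never has to list the whole bucket.

```python
from datetime import datetime, timedelta, timezone


def daily_prefixes(root: str, earliest: datetime, latest: datetime) -> list[str]:
    """Return one key prefix per UTC day between earliest and latest, inclusive.

    Assumes timezone-aware datetimes and a hypothetical <root>/YYYY-MM-DD/ layout.
    """
    if latest < earliest:
        raise ValueError("latest must not precede earliest")
    day = earliest.astimezone(timezone.utc).date()
    last = latest.astimezone(timezone.utc).date()
    prefixes = []
    while day <= last:
        # Each prefix narrows the listing to a single day's objects.
        prefixes.append(f"{root.rstrip('/')}/{day:%Y-%m-%d}/")
        day += timedelta(days=1)
    return prefixes


if __name__ == "__main__":
    start = datetime(2024, 1, 30, 22, tzinfo=timezone.utc)
    end = datetime(2024, 2, 1, 3, tzinfo=timezone.utc)
    for prefix in daily_prefixes("exported-logs", start, end):
        print(prefix)
```

    If the keys are not organized by time, this approach does not apply, and every run has to list everything under the configured path.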

  • Paul Dott
    Paul Dott Posts: 31 ✭✭

    Hi <@U02TBJ3P6CD>, the S3 Collector pulls every file that matches the criteria you enter when running an ad hoc collection or scheduling recurring collections. Are you looking for an option to "collect only new records since last job"?
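    For reference, "collect only new records since last job" would amount to something like the following Python sketch. This is a generic illustration, not a Cribl feature: `S3Object`, `select_new`, and the checkpoint handling are hypothetical names, and the object list is assumed to come from whatever bucket listing you already have.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class S3Object:
    """Hypothetical stand-in for one entry of a bucket listing."""
    key: str
    last_modified: datetime


def select_new(
    objects: Iterable[S3Object], checkpoint: Optional[datetime]
) -> tuple[list[S3Object], Optional[datetime]]:
    """Return objects modified after the checkpoint, plus the next checkpoint.

    A checkpoint of None means "first run": everything is considered new.
    """
    fresh = sorted(
        (o for o in objects if checkpoint is None or o.last_modified > checkpoint),
        key=lambda o: o.last_modified,
    )
    # Advance the checkpoint only if something new was found.
    next_checkpoint = fresh[-1].last_modified if fresh else checkpoint
    return fresh, next_checkpoint


if __name__ == "__main__":
    listing = [
        S3Object("a.csv.gz", datetime(2024, 1, 1, 10, tzinfo=timezone.utc)),
        S3Object("b.csv.gz", datetime(2024, 1, 1, 11, tzinfo=timezone.utc)),
    ]
    new, cp = select_new(listing, datetime(2024, 1, 1, 10, tzinfo=timezone.utc))
    print([o.key for o in new], cp)
```

    Note that the strict `>` comparison can miss an object that shares its timestamp with the saved checkpoint, so a real implementation would probably also remember the keys already seen at that instant.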

  • Igor Gifrin
    Igor Gifrin Posts: 12 mod

    Hey Ryan! We have a client looking to download Cisco Umbrella data from a Cisco-hosted S3 bucket, and to be honest, we're looking for reasons to tell them why it's a bad idea :smile:

  • Paul Dott
    Paul Dott Posts: 31 ✭✭

    The Cribl S3 Collector is certainly optimized for replaying data that you previously partitioned and sent to S3 yourself, since that gives you full control of the prefix and so on. That said, there is still a lot of flexibility for collecting and working with data that isn't ideally partitioned, but it takes a little trickery. Path Extractors should also help: https://docs.cribl.io/stream/collectors-s3/#path-extractors

  • Franky Laarits
    Franky Laarits Posts: 59 ✭✭

    We set up an SQS notification that fires whenever an object is added to the S3 bucket, so that only new data was pulled.
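    As a rough sketch of the notification approach in Python: the function below takes the body of one SQS message and returns the bucket and key of each newly created object. It assumes the bucket publishes S3 event notifications directly to the queue (not wrapped in an SNS envelope); `new_object_keys` is a hypothetical helper name, and the sample bucket and key are made up.

```python
import json
from urllib.parse import unquote_plus


def new_object_keys(sqs_body: str) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs for created objects from one SQS message body."""
    event = json.loads(sqs_body)
    found = []
    # Messages without "Records" (such as the initial s3:TestEvent) yield nothing.
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        # Object keys arrive URL-encoded in the notification, so decode them.
        found.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return found


if __name__ == "__main__":
    sample = json.dumps({
        "Records": [{
            "eventSource": "aws:s3",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "logs/2024-01-30/part+one.csv.gz"},
            },
        }]
    })
    print(new_object_keys(sample))
```

    This only works when you are allowed to configure event notifications on the bucket.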

  • Raanan Dagan
    Raanan Dagan Posts: 101 mod

    <@U02TBJ3P6CD>, is there a way to pass the 'earliest' and 'latest' times as part of the bucket structure? Last time we tried to access it, for example looking at <s3://cisco-managed-us-west-1/2069997_6ff2802af17337def701c2e7816cf14913zf848a/>, we created a Cribl S3 Collector with these settings:

    » Bucket name = cisco-managed-us-west-1
    » Path = 2069997_6ff2802af17337def701c2e7816cf14913zf848a/
    » Region = same region as the Cisco-managed bucket
    » Authentication = manual (Key / Secret)
    » Verify bucket permissions = turned off