Good morning all. Does the S3 Collector pull every file from the S3 bucket on each run? Or does it keep some state to track what has and hasn't been pulled?
Looks like it grabs everything - but you can specify a prefix - and if you're lucky, your prefixes are timestamps: https://community.cribl.io/discussion/comment/158#Comment_158
Hi <@U02TBJ3P6CD>, the S3 Collector will pull every file that matches the criteria you enter when running an ad hoc collection or scheduling recurring collections. Are you looking for an option to "collect only new records since last job"?
Hey Ryan! We have a client looking to download Cisco Umbrella data from a Cisco-hosted S3 bucket, and to be honest, we're looking for reasons to tell them why it's a bad idea
The Cribl S3 Collector is certainly optimized for replaying data that you previously partitioned and sent to S3 yourself, since in that case you have full control of the prefix, etc.
With that said, there is still a lot of flexibility for collecting and working with data that isn't ideally partitioned, but it takes a little trickery. Path Extractors should also help: https://docs.cribl.io/stream/collectors-s3/#path-extractors
We set up an SQS notification that fires whenever an object is added to the S3 bucket, and that way only new data was pulled.
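For anyone curious what the SQS route looks like on the consuming side, here is a minimal sketch of pulling the bucket and key out of an S3 event notification body. It assumes the standard S3 notification JSON layout (`Records[].s3.bucket.name`, `Records[].s3.object.key`) and that you can configure notifications on the bucket at all, which may not hold for a bucket hosted by a third party. The function name `new_object_keys` is made up for illustration.

```python
import json
from urllib.parse import unquote_plus


def new_object_keys(sqs_body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for object-creation records in one SQS message body.

    Hypothetical helper: assumes the body is a raw S3 event notification.
    """
    event = json.loads(sqs_body)
    pairs = []
    # Test events sent when notifications are first enabled have no "Records".
    for record in event.get("Records", []):
        # Skip deletes and other non-creation events.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        s3 = record["s3"]
        # Object keys arrive URL-encoded in notifications (a space becomes '+').
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs
```

Each returned pair is one newly added object, so a downstream job only has to fetch those keys instead of re-listing the whole bucket.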
<@U02TBJ3P6CD> is there a way to pass the 'earliest' and 'latest' time as part of the bucket structure?
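To make the earliest/latest idea concrete: if the keys happen to be laid out under one date folder per day, a time window maps to a short list of prefixes. This is not Cribl's own path syntax, just a rough Python sketch of the idea. The `logs/` base and the `%Y-%m-%d` folder format are assumptions, so check the real bucket layout first.

```python
from datetime import date, timedelta


def day_prefixes(base: str, earliest: date, latest: date, fmt: str = "%Y-%m-%d") -> list[str]:
    """List one key prefix per day from earliest to latest, inclusive.

    Hypothetical helper: `base` and `fmt` must match how the bucket is actually laid out.
    """
    if latest < earliest:
        raise ValueError("latest must not be before earliest")
    days = (latest - earliest).days
    # One prefix per calendar day, with a trailing slash so it matches a "folder".
    return [f"{base}{(earliest + timedelta(days=i)).strftime(fmt)}/" for i in range(days + 1)]


# Example window: the last two days of February 2024 plus March 1.
prefixes = day_prefixes("logs/", date(2024, 2, 28), date(2024, 3, 1))
```

Listing only under those prefixes keeps a run from walking the whole bucket, which is the same effect the timestamped-prefix approach gives you.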
Last time we tried to access it... for example, looking at this:
We created a Cribl S3 Collector with these settings:
» Bucket name = cisco-managed-us-west-1
» Path = 2069997_6ff2802af17337def701c2e7816cf14913zf848a/
» Region = same region as the Cisco-managed bucket
» Authentication = Manual (key / secret)
» Verify bucket permissions = off