Amazon S3

Amazon Simple Storage Service (Amazon S3) is an object storage service that allows you to store and retrieve data from anywhere.

In this guide, you’ll set up a dataset provider and a dataset to search objects in your Amazon S3 bucket.

Data transfer costs and other charges may apply. For details, see the Data Transfer Charges section below.

Storage Class Compatibility

Cribl Search supports data preview, collection, and replay from S3 Glacier Instant Retrieval when you’re using the S3 Intelligent-Tiering storage class.

Cribl Search does not support data preview, collection, or replay from the S3 Glacier or S3 Glacier Deep Archive storage classes, because their stated retrieval times (ranging from minutes to 48 hours) cannot guarantee data availability when Cribl Search needs it.

Create Dataset Provider

A dataset provider tells Cribl Search where to query and contains access credentials. Here, you will create an Amazon S3 dataset provider.

To add a new dataset provider, select Data from the top navigation bar and Dataset Providers from the left navigation pane, then click the Create Provider button on the right.

If a drop-down menu appears listing your Stream Worker Groups and Data Lake Amazon S3 Destinations, ignore it and click Create to create a new provider. That menu is used only in the setup for Data Lake Amazon S3.

Dataset providers selection

Set the following configurations in the New Dataset Provider modal:

  1. ID is a unique identifier for the dataset provider. This is how you’ll reference it when assigning datasets to it.
  2. Description is optional.
  3. Set Dataset Provider Type to Amazon S3.
  4. Authentication method provides two options, Assume Role and AWS keys. See how to grant access to AWS for details on each option.
    • Assume Role requires the IAM role’s ARN (AssumeRole ARN) and has options to define an External ID and Duration.
      • The External ID on the dataset provider must match the external ID defined in the IAM Role Trust Policy (see the example policy after these steps).
      • Duration defines the assumed role’s session length, in seconds. Minimum is 900 (15 minutes), default is 3600 (1 hour), and maximum is 43200 (12 hours).
    • AWS keys requires the IAM user’s Access key and Secret key.
  5. Advanced Settings provides the following optional configurations:
    • Endpoint: S3 service or compatible endpoint. If empty, defaults to AWS’ Region-specific endpoint.
    • Signature version: Signature version to use for signing S3 requests. Defaults to v4.
    • Reuse connections: Whether to reuse connections between requests. The default setting (Yes) can improve performance.
    • Reject unauthorized certificates: Whether to reject certificates that cannot be verified against a valid Certificate Authority (for example, self-signed certificates). Defaults to Yes.
  6. Click Save when finished.
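
For reference, here is a minimal sketch of an IAM Role Trust Policy that defines an external ID. This is a generic AWS example, not Cribl-specific: the account ID, principal, and external ID shown are placeholders you would replace with values from your own deployment.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "my-external-id" }
      }
    }
  ]
}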

Create Dataset

Now you’ll create a dataset that tells Cribl Search what data to search from the dataset provider.

To add a new dataset, select Data from the top navigation bar and Datasets from the left navigation pane, then click the Add Dataset button on the right.

Datasets selection

Set the following configurations in the New Dataset modal:

  1. ID is an identifier that must be unique across both Cribl Search and Cribl Lake. You’ll use it to specify the dataset in a query’s scope, telling Cribl Search to search the dataset. For example, dataset=ID.
  2. Description is optional.
  3. Set Dataset Provider to the ID of the Amazon S3 dataset provider you created in the Create Dataset Provider section above.
    1. Bucket path is the path to the S3 objects you’d like to search. Start the path with the bucket name, for example: my-bucket/data/*. Tokens and key-value pairs are supported; see Bucket Path for details.
    2. Toggle Auto-detect region to No if you want to manually select the region where the S3 bucket is located. When toggled to Yes, Cribl automatically detects the region.
    3. Path Filter is a JavaScript filter expression that is evaluated against the S3 bucket path. Defaults to true, which matches all data, but this value can be customized.
    4. In Advanced Settings, you can select the Partitioning scheme if your data comes from Splunk. See Partitioning Scheme for details.
  4. Processing uses Datatypes to break data into discrete events and define fields so they’re ready to search. Set the first rule to AWS Datatypes. Datatypes are rules applied to the data searched in your dataset. See Datatypes for details.
  5. In Acceleration, you can configure the dataset acceleration options. For details, see Enable Dataset Acceleration.
  6. Click Save when finished.

Bucket Path

The Bucket path defines the scope of the dataset: it specifies which data the dataset consists of. It is a JavaScript expression that supports tokens and key-value pairs. For example:

  • my-bucket/${data}/ – where data becomes a field for all events of that dataset.
  • my-bucket/${data}/${*} – where data and the wildcarded path segment become fields for all events of that dataset.
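
As a hypothetical illustration: with the Bucket path my-bucket/${data}/${*}, an object stored at my-bucket/web/2024/events.json yields events carrying data='web', which you could then scope in a query (the dataset ID here is a placeholder):

dataset="my_s3_dataset" data="web"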

Basic Tokens

Basic tokens’ syntax follows that of JS template literals: ${token_name} – where token_name is the field (name) of interest.

For example, if the path was set to /var/log/${hostname}/${sourcetype}/, you could use a filter such as hostname=='myHost' && sourcetype=='mySourcetype' to specify data only from the /var/log/myHost/mySourcetype/ subdirectory.

Time-based Tokens

Paths with time notation can be referenced with tokens, which directly affect the search’s earliest and latest time boundaries. The supported time fields are:

  • _time is the raw event’s timestamp.
  • __earliest is the search start time.
  • __latest is the search end time.

Time-based tokens are processed as follows:

  • For each path, times must be notated in descending order. So Year/Month/Day order is supported, but Day/Month/Year is not.
  • Paths may contain more than one time component. For example, /my/path/2020-04/20/.
  • In a given path, each time component can be used only once. So /my/path/${_time:%Y}/${_time:%m}/${_time:%d}/... is a valid expression format, but /my/path/${_time:%Y}/${_time:%m}/${host}/${_time:%Y}/... (with a repeated Y) is not supported.
  • For each path, all extracted dates/times are considered in UTC.

The following strptime format components are allowed:

  • Y, y for years
  • m, B, b, e for months
  • d, j for days
  • H, I for hours
  • M for minutes
  • S for seconds
  • s for Unix-style Epoch times (seconds since 1/1/1970)
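
For instance (illustrative), a path encoding Unix epoch seconds could be tokenized as /path/${_time:%s}/..., which would match /path/1589932800/... for 2020-05-20 00:00:00 UTC.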

Time-based token syntax follows that of a slightly modified JS template literal: ${_time: some_strptime_format_component}. Examples:

  • /path/${_time:%Y}/${_time:%m}/${_time:%d}/... matches /path/2020/04/20/...
  • /path/${_time:year=%Y}/${_time:month=%m}/${_time:day=%d}/... matches /path/year=2020/month=05/day=20/...
  • /path/${_time:%Y-%m-%d}/... matches /path/2020-05-20/...
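
To see why this matters, here is a minimal sketch (not Cribl’s actual implementation) of how a %Y/%m/%d path template lets a search enumerate only the daily prefixes between __earliest and __latest, in UTC, instead of listing the whole bucket:

// Hypothetical sketch: enumerate daily S3 prefixes for a /path/%Y/%m/%d template
function dailyPrefixes(earliest, latest) {
  const prefixes = [];
  // Truncate the start time to the beginning of its UTC day
  const day = new Date(Date.UTC(
    earliest.getUTCFullYear(), earliest.getUTCMonth(), earliest.getUTCDate()));
  while (day <= latest) {
    const y = day.getUTCFullYear();
    const m = String(day.getUTCMonth() + 1).padStart(2, '0');
    const d = String(day.getUTCDate()).padStart(2, '0');
    prefixes.push(`path/${y}/${m}/${d}/`);
    day.setUTCDate(day.getUTCDate() + 1); // advance one UTC day
  }
  return prefixes;
}

// dailyPrefixes(new Date('2020-04-19T22:00:00Z'), new Date('2020-04-20T02:00:00Z'))
// returns ['path/2020/04/19/', 'path/2020/04/20/']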

Path Filter

This is a JavaScript filter expression that is evaluated against the provided Dataset path. The Filter value defaults to true, which matches all data, but this value can be customized almost arbitrarily.

For example, if a dataset has this Filter:

source.endsWith('.log') || source.endsWith('.txt')

…then only files/objects with .log or .txt extensions will be searched.
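
Conceptually, the filter acts as a predicate over each candidate object key. A rough JavaScript sketch (the keys are illustrative, and this is not how Cribl evaluates filters internally):

// Illustrative only: apply a path filter expression to candidate object keys
const keys = ['logs/app.log', 'logs/notes.txt', 'logs/archive.gz'];
const matched = keys.filter((source) => source.endsWith('.log') || source.endsWith('.txt'));
// matched is ['logs/app.log', 'logs/notes.txt']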

At the Filter field’s right edge are a Copy button and an Expand button that opens a validation modal.

Partitioning Scheme

Advanced Settings > Partitioning scheme lets you select the scheme for partitioning data incoming from Splunk.

You can choose between Splunk DDSS and Splunk SmartStore.

In both cases, provide the path where indices are stored (parent folder of your index) as Bucket path.

For Splunk DDSS, the full path takes the form:

<parent_folder>/<indexName>/db/db_<latestTime>_<earliestTime>_<bucketId>/rawdata/journal.gz

For Splunk SmartStore, it is:

<parent_folder>/<indexName>/db/<2-letter-hash>/<2-letter-hash>/<bucket_id_number-origin_guid>/<"guidSplunk"-uploader_guid>/

In both cases, if your bucket is organized using this default file path, Cribl Search will automatically discover its content.
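
For example (bucket and index names hypothetical), if your frozen DDSS objects live at paths like:

my-ddss-bucket/frozen/web_index/db/db_1714003200_1713916800_42/rawdata/journal.gz

…then you would set Bucket path to my-ddss-bucket/frozen and Partitioning scheme to Splunk DDSS.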

Searching

Now that you have a dataset provider and dataset, you’re ready to start searching Amazon S3.
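
For example, a simple query scoped to the dataset created above might look like this (the dataset ID is a placeholder):

dataset="my_s3_dataset" | limit 100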

Data Transfer Charges

Data transfer to Cribl Search is free if your S3 bucket is in the same region as your Cribl.Cloud workspace.

Data transfer to Cribl Search is subject to AWS data transfer costs if your S3 bucket is not in the regions supported by Cribl.Cloud. See the Data transfer section in Amazon S3 Pricing.

Cribl Search minimizes the amount of data transferred wherever it can, using path filters, pruning, and other methods.

Other Charges

Other charges may apply. For example, LIST and GET requests to your S3 buckets will be charged per AWS S3. See the Requests & data retrievals section in Amazon S3 Pricing.
