Is replay actually a feature in itself or just a technique implemented via a Source with different

Brian Wasserman Posts: 2
edited October 2023 in General Discussions

https://docs.cribl.io/stream/usecase-replay-s3/#which-format

Starting in version 4.3, Cribl Stream supports replaying data that has been exported as Parquet, using either the S3 Collector or the Filesystem Collector.
Meanwhile, the Azure Blob Storage and Google Cloud Storage Collectors support ingesting data in Parquet format, but do not support replay.

I am glad to see that Parquet is now supported for S3 replay.  Is replay actually a feature in itself or just a technique implemented via a Source with different handling?
How can Azure Blob and Google Cloud Storage collectors support ingesting but not support replay?

Answers

  • Tony Reinke - Cribl

    If I understand the question correctly, Parquet replay is implemented by the source Collector.
    The source Collector just handles the data. Replay for GCS and Azure Blob should ship in the next release, normally this month.

  • Brian Wasserman

    Oh, are Azure and GCS only via notifications? I didn't realize GCS was just Pub/Sub. That makes more sense now. It is just confusing to read.

  • Maliha Balala Posts: 14 mod

    Does this mean I can now take Databricks "CIM" data out of their Gold storage tier and replay it elsewhere?

  • Tony Reinke - Cribl

    I have to admit that I don't know what format the Databricks "CIM" data is in.
    I cannot find an answer online. What kind of files are they?

  • Maliha Balala Posts: 14 mod

    They're just Parquet files.

  • Tony Reinke - Cribl Posts: 134 admin
    Answer ✓

    OK, so yes, they should be replayable, no matter what their schemas are.
    There are two cases in which Parquet files are not readable (not yet):
    1. The Parquet file is encrypted: https://github.com/apache/parquet-format/blob/master/Encryption.md
    2. The Parquet file links to external column data, as defined in the Parquet Thrift file: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L789 (a very rare feature)
    The best approach is to give it a go, and please let us know.
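
For anyone who wants to pre-screen files before trying a replay, here is a minimal sketch of a check for the first unreadable case mentioned above. It is not part of Cribl Stream; the function name `parquet_footer_kind` and its return labels are made up for illustration. It relies only on the Parquet format detail that a file ends with the 4-byte magic `PAR1`, or `PARE` when the footer itself is encrypted. It does not detect files that keep a plaintext footer but encrypt individual columns, and it does not detect external column data.

```python
import os
import sys

# Trailing magic numbers from the Parquet format and its encryption spec.
PLAINTEXT_FOOTER_MAGIC = b"PAR1"
ENCRYPTED_FOOTER_MAGIC = b"PARE"


def parquet_footer_kind(path):
    """Classify a file by the last 4 bytes (hypothetical helper, labels are illustrative).

    Returns "encrypted-footer", "plaintext-footer", or "not-parquet".
    "plaintext-footer" does NOT guarantee the columns are unencrypted.
    """
    # Too small to hold a leading and a trailing magic number.
    if os.path.getsize(path) < 8:
        return "not-parquet"
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)  # the footer magic is the final 4 bytes
        tail = f.read(4)
    if tail == ENCRYPTED_FOOTER_MAGIC:
        return "encrypted-footer"
    if tail == PLAINTEXT_FOOTER_MAGIC:
        return "plaintext-footer"
    return "not-parquet"


if __name__ == "__main__":
    # Usage: python check_parquet.py file1.parquet file2.parquet ...
    for p in sys.argv[1:]:
        print(p, parquet_footer_kind(p))
```

A file reported as `encrypted-footer` falls into the first unreadable case. Anything reported as `plaintext-footer` is only a candidate, so trying the replay is still the real test.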