Is replay actually a feature in itself or just a technique implemented via a Source with different

Brian Wasserman · October 2023

https://docs.cribl.io/stream/usecase-replay-s3/#which-format

Starting in version 4.3, Cribl Stream supports replaying data that has been exported as Parquet, using either the S3 Collector or the Filesystem Collector.
Meanwhile, the Azure Blob Storage and Google Cloud Storage Collectors support ingesting data in Parquet format, but do not support replay.

I am glad to see that Parquet is now supported for S3 replay. Is replay actually a feature in itself or just a technique implemented via a Source with different handling?
How can Azure Blob and Google Cloud Storage collectors support ingesting but not support replay?

Tony Reinke - Cribl · October 2023

If I understand the question correctly, Parquet replay is implemented by the Collector Source.
The Collector just handles the data. GCS and Azure Blob support should come in the next release, normally this month.

Brian Wasserman · October 2023

Oh, are Azure and GCS only via notifications? I didn't realize GCS was just Pub/Sub. Makes more sense now. It's just confusing to read.

Maliha Balala · October 2023

Does this mean I can now take Databricks "CIM" data out of their Gold storage tier and replay it elsewhere?

Tony Reinke - Cribl · October 2023

I have to admit that I don't know what format the Databricks "CIM" data is in.
I cannot find an answer online. What kind of files are they?

Maliha Balala · October 2023

They're just Parquet files.

Tony Reinke - Cribl · October 2023

OK, so yes, they should be replayable, no matter what their schemas are.
There are 2 cases, though, where Parquet files are not readable (not yet):
if the Parquet file is encrypted: https://github.com/apache/parquet-format/blob/master/Encryption.md
if the Parquet file links to external column data, as defined in the Parquet Thrift file: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L789 (it's a very rare feature)
The best thing is to give it a go and please let us know...
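
For anyone who wants to pre-check files before trying a replay, here is a rough Python sketch (not anything Cribl ships) that looks at the 4-byte magic at the start and end of a file. Per the Parquet encryption spec linked above, files written in encrypted-footer mode use the magic `PARE` instead of `PAR1`. The function name `parquet_footer_kind` and its return labels are made up for illustration. Note the limits: a file in plaintext-footer mode can still have encrypted columns and will show `PAR1`, and this check says nothing about the external column data case.

```python
import os

PLAIN_MAGIC = b"PAR1"      # regular Parquet, or plaintext-footer encryption mode
ENCRYPTED_MAGIC = b"PARE"  # encrypted-footer mode, per the Parquet encryption spec


def parquet_footer_kind(path):
    """Classify a file by its leading and trailing 4-byte magic.

    Hypothetical helper: returns "plain-footer", "encrypted-footer",
    or "not-parquet". It only inspects the magic bytes and does not
    parse the footer metadata.
    """
    if os.path.getsize(path) < 8:
        # Too small to hold both magic markers.
        return "not-parquet"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    if head == tail == PLAIN_MAGIC:
        return "plain-footer"
    if head == tail == ENCRYPTED_MAGIC:
        return "encrypted-footer"
    return "not-parquet"
```

Anything that comes back as "encrypted-footer" falls into the first unreadable case above. A "plain-footer" result only means the file is worth trying.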
