Distributed deployment planning with a gzip-compressed Destination?

Hello everyone,
I have the following problem.
I'm designing a distributed deployment. I have to build an environment for data reduction, but before that I need to send the data to a blob storage for compliance reasons, as you can imagine. All of this is working, but I have some doubts.
When I use Azure Blob Storage as the only Destination, 100% of the data goes there, but the Monitoring page tells me that the data coming in is the same as the data going out (sometimes the data going out is even more), even with compression enabled. As said, I have gzip compression enabled on the Azure Destination page, and I've checked the storage and confirmed that the data really is compressed.
So, my question is the following: when I try to estimate the number of workers and their capabilities, should I trust the Monitoring page and consider DATA IN = DATA OUT (since the data is not processed by any pipeline at all), or can I consider the number of bytes that actually get saved in the storage container?
Example: consider a VM with 16 vCPUs (x86_64, 3 GHz), giving a Worker Group throughput of 4 TB/day, with Data IN = 2 TB/day. What should I consider as Data OUT? If Data OUT = 2 TB/day, I need 2 Workers with 20 worker processes. But if the outgoing data is lower, I would need fewer processes and could save on infrastructure costs. Of course these are low values; with higher data volumes it probably matters even more (I hope, at least).
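To make the arithmetic behind my question explicit, here is a rough sketch of the two scenarios I'm comparing (Python used only as a calculator; the 200 GB/day-per-process figure and the 10:1 gzip ratio are assumptions of mine, not official numbers):

```python
import math

# ASSUMPTION: ~200 GB/day of total (in + out) throughput per worker process;
# this is just the rule of thumb I'm working from, please correct me if wrong.
GB_PER_PROCESS_PER_DAY = 200.0
VCPUS_PER_WORKER = 16  # one worker process per vCPU on the 16-vCPU VM above

def processes_needed(data_in_gb: float, data_out_gb: float) -> int:
    """Worker processes needed for a given daily in/out volume (raw GB/day)."""
    return math.ceil((data_in_gb + data_out_gb) / GB_PER_PROCESS_PER_DAY)

data_in = 2000.0  # 2 TB/day coming in

# Scenario A: Data OUT counted as the raw size (what the Monitoring page shows).
procs_raw = processes_needed(data_in, 2000.0)
print(procs_raw, math.ceil(procs_raw / VCPUS_PER_WORKER))   # 20 processes, 2 workers

# Scenario B: Data OUT counted as the compressed size actually written to the
# container (e.g. a ~10:1 gzip ratio, purely an illustrative guess).
procs_gz = processes_needed(data_in, 200.0)
print(procs_gz, math.ceil(procs_gz / VCPUS_PER_WORKER))      # 11 processes, 1 worker
```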
I hope everything is clear. Apologies in advance if something is grammatically incorrect or if my explanation misses the point. Thanks.
Comments
-
The Monitoring page (and the internal metrics it's based on) uses raw event sizes, before compression.
-
Thank you Jon, helpful as always! So this means that when calculating the data going out, in order to size the Worker Nodes, I can use the total size of the compressed files, right?
-
Worker Node requirements should be based on the raw data, before compression.