
AWS / ECS autoscaling configuration and guidance

G H Posts: 10

Hi,

We’re running Cribl in AWS, using ECS to manage Docker containers on EC2 instances. We’re trying to tune our instance size and auto scaling, and we’re wondering whether anyone has past experience with this.

For example, should we use memory reservations, and if so, what are the best settings for soft and hard limits?
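For reference, this is the soft/hard distinction I mean, as it shows up in an ECS task definition. A hypothetical boto3 sketch (the names and sizes are made up): memoryReservation is the soft limit, memory is the hard limit.

```python
# Hypothetical sketch: an ECS task definition with both a soft limit
# (memoryReservation) and a hard limit (memory). Names and sizes are
# placeholders, not a recommendation.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="cribl-worker",  # hypothetical family name
    containerDefinitions=[
        {
            "name": "cribl-worker",
            "image": "cribl/cribl:latest",
            # Soft limit: ECS reserves this much memory when placing the
            # task, but the container can burst above it if the host has room.
            "memoryReservation": 6144,  # MiB
            # Hard limit: the container is killed if it exceeds this.
            "memory": 7680,  # MiB
            "essential": True,
        }
    ],
)
```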

If we want to scale when under memory pressure, how do we get that information out?
I.e. to know when we’re under memory pressure, we need to see heap usage of the worker threads, not just memory usage for the box (since the threads pre-allocate heap). What’s the best way to get this?
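One approach we’ve considered is pushing heap usage up as a custom CloudWatch metric and scaling on that. A rough sketch below; the namespace, metric name, and the get_heap_used_bytes() helper are all hypothetical stand-ins for however you actually read heap usage (e.g. from Cribl’s internal metrics):

```python
# Hypothetical sketch: publish worker heap usage as a custom CloudWatch
# metric so auto scaling can alarm on it. Namespace/metric names and the
# heap-reading helper are placeholders, not a real Cribl API.
import boto3

cloudwatch = boto3.client("cloudwatch")

def get_heap_used_bytes() -> float:
    # Placeholder: wire this up to wherever you can actually read heap
    # usage; returning 0.0 just keeps the sketch runnable.
    return 0.0

cloudwatch.put_metric_data(
    Namespace="Cribl/Workers",  # hypothetical namespace
    MetricData=[
        {
            "MetricName": "HeapUsedBytes",
            "Value": get_heap_used_bytes(),
            "Unit": "Bytes",
        }
    ],
)
```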

If anyone has had any luck - good or bad - with this, I’d be keen to hear about your experiences!

Answers

  • Paul Dott Posts: 33 ✭✭

    Hi @GarethHumphriesGKC

    I have been tinkering with auto scaling a Cribl worker group (ECS on Fargate) to increase speed/throughput for ad-hoc S3 replay collector jobs, aiming for around 10+ MB/sec throughput so that ~10 GB uncompressed takes just a few minutes to replay.

    • When not in use, the collector worker group idles (or runs very small tasks) at 1 ECS task with 4 vCPU / 8 GB memory.
    • If average CPU reaches 50%, scale out to a maximum of 3 tasks (see the policy sketch after this list).
    • When things cool off, scale in to 1 task.
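
    Roughly, that setup corresponds to the sketch below (boto3 against Application Auto Scaling, target tracking on average service CPU; the cluster/service names and cooldowns are placeholders):

    ```python
    # Hypothetical sketch of the scaling setup described above: 1-3 tasks,
    # target tracking on average ECS service CPU at 50%. Cluster and
    # service names are placeholders.
    import boto3

    aas = boto3.client("application-autoscaling")

    resource_id = "service/my-cluster/cribl-worker-group"  # placeholder

    aas.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=1,  # idle baseline: 1 task
        MaxCapacity=3,  # scale out to at most 3 tasks
    )

    aas.put_scaling_policy(
        PolicyName="cpu-target-tracking",
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 50.0,  # average CPU %
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
            # Example cooldowns: how long to wait after a scaling
            # activity before allowing another one.
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 300,
        },
    )
    ```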

    One thing I'm not terribly satisfied with is the amount of time it takes to auto scale (in my chart, the red line marks when capacity is added), which was quite late into this particular replay. This is because the CloudWatch alarm threshold is CPUUtilization > 50 for 3 datapoints within 3 minutes, with a 1-minute period (which cannot be modified).

    Short of implementing some custom code to handle the scale trigger - e.g. a Lambda that could interrogate CPU at quicker intervals (say 10 secs) and increase capacity near-immediately - I am not sure what other options there are.
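
    If I went that route, it would look something like the sketch below: a Lambda on a short schedule reads recent service CPU and bumps DesiredCount directly. Cluster/service names and the threshold are placeholders; note that the stock AWS/ECS CPUUtilization metric only has 1-minute granularity, so true 10-second polling would need a high-resolution custom metric.

    ```python
    # Hypothetical Lambda sketch: poll recent ECS service CPU and scale
    # out immediately if it's hot. Names, thresholds, and the schedule
    # are placeholders.
    import datetime

    import boto3

    CLUSTER = "my-cluster"          # placeholder
    SERVICE = "cribl-worker-group"  # placeholder
    MAX_TASKS = 3

    cloudwatch = boto3.client("cloudwatch")
    ecs = boto3.client("ecs")

    def handler(event, context):
        now = datetime.datetime.now(datetime.timezone.utc)
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/ECS",
            MetricName="CPUUtilization",
            Dimensions=[
                {"Name": "ClusterName", "Value": CLUSTER},
                {"Name": "ServiceName", "Value": SERVICE},
            ],
            StartTime=now - datetime.timedelta(minutes=2),
            EndTime=now,
            Period=60,  # stock ECS metrics are 1-minute granularity
            Statistics=["Average"],
        )
        datapoints = stats["Datapoints"]
        if not datapoints:
            return
        latest = max(datapoints, key=lambda d: d["Timestamp"])
        if latest["Average"] > 50:
            svc = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])
            desired = svc["services"][0]["desiredCount"]
            if desired < MAX_TASKS:
                # Add one task straight away instead of waiting for the
                # CloudWatch alarm's 3-datapoint evaluation window.
                ecs.update_service(
                    cluster=CLUSTER, service=SERVICE, desiredCount=desired + 1
                )
    ```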