This English version was translated by Hermes Agent.
RUNNING does not mean the task is immediately in a healthy serving state.
If you set healthCheckGracePeriodSeconds based only on app startup time, the task can be terminated before the ALB finishes confirming consecutive successful checks.
Recently, I ran into a case where a staging server went down once and was redeployed even though there was no obvious application error.
At first, I wondered, “Did the server actually die?” But after tracing the logs, it turned out that the application itself had not crashed abnormally.
The real issue was the timing of the ECS and ALB health checks, together with the healthCheckGracePeriodSeconds setting.
In this post, I want to walk through the settings I checked at the time and explain why the server was treated as a termination target even after it had started up normally.
The environment where this happened looked roughly like this.
ECS works in units of tasks.
And the process of a task coming up is a little different from the process of it becoming “actually ready to receive traffic.”
Reference: Amazon ECS task lifecycle
When a task is created, the ECS agent moves it through several internal states.
Provisioning -> Pending -> Activating -> Running
Each stage roughly means the following.
If a task fails to start successfully, or is stopped, it winds down through the mirror-image states.
Running -> Deactivating -> Stopping -> Deprovisioning -> Stopped
The important point here was that RUNNING does not immediately mean the service is ready to serve traffic normally.
Even if the application process is up, it still needs to pass the health checks described next.
The ECS task definition had a container health check like this.
"healthCheck": {
  "command": [
    "CMD-SHELL",
    "wget -q -O /dev/null http://localhost:8080/api/health || exit 1"
  ],
  "interval": 30,
  "timeout": 5,
  "retries": 3
}
Each option means the following.
- interval: how often the check runs
- timeout: how many seconds it has to succeed
- retries: how many consecutive failures count as unhealthy
This check runs inside the container.
In other words, it checks whether the application is up from the perspective of localhost:8080.
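To see how interval and retries interact with a slow startup, here is a minimal Python sketch of the container health check. It is a simplified model, not how ECS implements it: it assumes the first check runs one interval after the container starts, that a check succeeds exactly when the app is already ready, and that failures inside an optional startPeriod are not counted. The function name and the ready_at parameter are made up for illustration.

```python
def container_unhealthy_at(ready_at, interval=30, retries=3, start_period=0, horizon=600):
    """Return the second at which the container would be marked unhealthy, or None.

    Simplified model (assumption): checks fire at interval, 2*interval, ...
    seconds after container start and succeed once the app is ready.
    """
    failures = 0
    for t in range(interval, horizon + 1, interval):
        if t >= ready_at:
            return None  # first successful check: startup passed
        if t > start_period:
            failures += 1  # a failure outside the start period counts toward retries
            if failures >= retries:
                return t
    return None


# App needs 100 s to boot, no startPeriod: failures at 30, 60 and 90 s.
print(container_unhealthy_at(ready_at=100))
# Same app with a hypothetical startPeriod of 45 s: the failure at 30 s is ignored.
print(container_unhealthy_at(ready_at=100, start_period=45))
```

With the settings above (interval 30, retries 3), an app that needs more than about 90 seconds to boot would already exhaust its retries in this model, which is why a start period matters for slow-starting services.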
The points that stood out again while reviewing this were these.
- Without startPeriod, failures during initial boot are still counted from the beginning.
Next is the health check setting on the ALB target group.
{
  "TargetGroupName": "staging",
  "TargetType": "instance",
  "Protocol": "HTTP",
  "Port": 8080,
  "ProtocolVersion": "HTTP1",
  "HealthCheckProtocol": "HTTP",
  "HealthCheckPath": "/api/health",
  "HealthCheckPort": "traffic-port",
  "HealthCheckIntervalSeconds": 30,
  "HealthCheckTimeoutSeconds": 5,
  "HealthyThresholdCount": 3,
  "UnhealthyThresholdCount": 4,
  "Matcher": {
    "HttpCode": "200"
  }
}
This is not an ECS task-level setting.
It is a target group-level setting on the ALB.
That means it should be understood as a request coming from outside the container, not from inside it.
The important parts to remember here are these.
- Because TargetType is instance, the target group checks the instance’s traffic port.
- Because HealthCheckPort is traffic-port, requests go to the actual port the service is bound to.
- With HealthyThresholdCount = 3, it takes three consecutive successes to become healthy.
- With UnhealthyThresholdCount = 4, the target becomes unhealthy after four consecutive failures.
For example, if the traffic port is 28080, the ALB sends requests to /api/health on that port every 30 seconds to check the target state.
And this ALB check runs in parallel with the Container Health Check.
That was the key part of this issue.
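The threshold logic can be sketched as a small state machine in Python. This is only a model of the thresholds as described in this post: it assumes a new target stays in an "initial" state until it collects HealthyThresholdCount consecutive successes, and the real ALB behavior for freshly registered targets may differ in detail. The function name and state labels are illustrative, not an AWS API.

```python
def target_states(results, healthy_threshold=3, unhealthy_threshold=4):
    """Map a sequence of check outcomes (True = HTTP 200) to target states."""
    state = "initial"
    successes = failures = 0
    states = []
    for ok in results:
        if ok:
            successes, failures = successes + 1, 0
            if successes >= healthy_threshold:
                state = "healthy"
        else:
            successes, failures = 0, failures + 1
            if failures >= unhealthy_threshold:
                state = "unhealthy"
        states.append(state)
    return states


# Two failed checks while the app boots, then three successes:
# the target only turns healthy on the fifth check.
print(target_states([False, False, True, True, True]))
```

At a 30-second interval, those five checks span two full minutes after the first one, even though the app itself was ready after the second.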
An ECS service has a setting called healthCheckGracePeriodSeconds.
It is basically a grace period where ECS waits for a newly started task even if health checks fail right away.
Reference: Amazon ECS service definition parameters
If the service uses a load balancer, ECS does not just look at whether the process is up.
It evaluates both of the following.
- the container health check defined in the task definition
- the ALB target group health check
In our staging ECS service, healthCheckGracePeriodSeconds was set to 150s.
The problem was that the ALB Target Health Check does not start when the app is ready. It starts when the task enters ACTIVATING.
So the check timing can look something like this.
10 -> 40 -> 70 -> 100
25 -> 55 -> 85 -> 115
For the ALB to consider the target healthy, you need three consecutive successful checks in that sequence.
Let’s assume the following.
- The app takes 82 seconds to actually become ready.
Then even if the app becomes ready at 82 seconds, all ALB checks before that point will fail.
For example, if the checks happened like this:
40 -> 70 -> 100 -> 130 -> 160
That means healthy status is only fully confirmed at 160 seconds.
But the ECS grace period was 150 seconds.
So even though the app had effectively come up normally,
ECS judged that “this task did not become healthy within the grace period” and marked it for termination before the ALB could finish confirming it as healthy.
In the end, it is not enough to look only at the app’s startup time.
You also have to include the time required for the ALB to accumulate consecutive successful checks.
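The arithmetic above can be replayed with a short Python sketch. It assumes all times are seconds measured from the moment the ALB checks start, that a check succeeds exactly when the app is already ready, and that three consecutive successes are required; alb_healthy_at and first_check_at are illustrative names, not ECS or ALB parameters.

```python
def alb_healthy_at(ready_at, first_check_at, interval=30, healthy_threshold=3):
    """Return the time of the check that completes the success streak."""
    streak = 0
    t = first_check_at
    while True:
        # A check succeeds only if the app is already ready at that moment.
        streak = streak + 1 if t >= ready_at else 0
        if streak == healthy_threshold:
            return t
        t += interval


GRACE_PERIOD = 150  # healthCheckGracePeriodSeconds from the staging service

for first_check_at in (10, 25):
    healthy_at = alb_healthy_at(ready_at=82, first_check_at=first_check_at)
    verdict = "in time" if healthy_at <= GRACE_PERIOD else "too late"
    print(f"first check at {first_check_at}s -> healthy at {healthy_at}s ({verdict})")
```

With the same 82-second startup, a schedule that begins at 10 seconds confirms health at 160 seconds and misses the 150-second grace period, while one that begins at 25 seconds confirms at 145 seconds and passes. Under this model, that offset is what makes the failure look intermittent.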
The logs at that time looked roughly like this.
At first, it was also a bit confusing why there was about a one-minute gap between “server marked for termination” and “server received a termination request.”
That part was caused by the ALB’s deregistration delay, in other words, the draining time.
Reference: Edit target group attributes for your Application Load Balancer
The target remains in the draining state until that time expires,
and only after that does the server receive the actual termination request and shut down.
The conclusion was simple.
healthCheckGracePeriodSeconds needs to be set with more margin.
More precisely, you should not set it by looking only at “application boot time.”
You should include the time it takes for the ALB health check to be fully confirmed as healthy.
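One rough way to size the value is a worst-case estimate, sketched below in Python. It assumes the unluckiest schedule, where a check fires just before the app becomes ready so the first success arrives almost one full interval later. The function name and the 30-second margin are hypothetical choices, not AWS recommendations.

```python
def min_grace_period(startup_seconds, interval=30, healthy_threshold=3, margin=30):
    """Estimate a healthCheckGracePeriodSeconds value that covers the worst case."""
    # Worst case: the first successful check lands one interval after readiness.
    worst_first_success = startup_seconds + interval
    # The remaining successes each need one more interval.
    confirmed_at = worst_first_success + (healthy_threshold - 1) * interval
    return confirmed_at + margin


# 82 s startup with the target group settings from this post.
print(min_grace_period(82))            # 82 + 30 + 60 + 30
print(min_grace_period(82, margin=0))  # bare worst case without margin
```

For the 82-second startup in this post, the bare worst case is already 172 seconds, so a 150-second grace period could not be guaranteed to pass under this model.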
The clearest way to think about it, at least for me, was like this.
- The ALB health check starts at ACTIVATING, regardless of whether the service is actually ready.
- healthCheckGracePeriodSeconds should be set with more margin than the combined time of application startup and ALB health confirmation.
This issue was less about the server being dead,
and more about not giving the system enough time to officially recognize the server as healthy.
When operating ECS, it is common to first focus only on application startup time.
But in an environment with a load balancer attached, you also need to look at the ALB health check interval and threshold.
While organizing this again, I strongly felt that
RUNNING and “actually healthy enough to receive traffic” can be quite different states.
If you run into a similar situation where a staging server looks fine but keeps getting terminated or redeployed during deployment,
it is worth checking more than just the application logs.
Look at these together as well.
- healthCheckGracePeriodSeconds
Looking at all four together can reveal the cause much faster than you might expect.