This English version was translated by Hermes Agent.

Three-line summary

Just because an ECS task reaches RUNNING does not mean it is immediately in a healthy serving state.
An ECS service behind a load balancer is only considered stably healthy after both the Container Health Check and the ALB Target Health Check pass.
If you size healthCheckGracePeriodSeconds only around app startup time, the task can be terminated before the ALB finishes confirming consecutive successful checks.

Why an ECS Staging Server Was Redeployed Even Though It Looked Fine

Recently, I ran into a case where a staging server went down once and was redeployed again even though there was no obvious application error.

At first, I wondered, “Did the server actually die?” But after tracing the logs, it turned out that the application itself had not crashed abnormally.
The real issue was the timing of the ECS and ALB health checks, together with the healthCheckGracePeriodSeconds setting.

In this post, I want to walk through the settings I checked at the time and explain why the server was treated as a termination target even after it had started up normally.

Infrastructure setup

The environment where this happened looked roughly like this.

ECS: rolling update deployment
Launch type: EC2
ALB: load balancer

ECS works in units of tasks.
A task coming up is not quite the same thing as a task being “actually ready to receive traffic.”

1. ECS Task Lifecycle

Reference: Amazon ECS task lifecycle

When a task is created, the ECS agent moves it through several internal states.

Provisioning -> Pending -> Activating -> Running

Each stage roughly means the following.

Provisioning: infrastructure setup before launch, such as creating and attaching an ENI (Elastic Network Interface)
Pending: waiting until enough resources are available to place the task, then letting the container agent pull the image and create the container
Activating: extra wiring after the container is launched and before the task counts as running
- registering with service discovery, if it is configured
- registering the target in the load balancer target group
Running: the process is up

When a task is stopped, whether because something failed or because it was shut down normally, it goes back down through the teardown states.

Running -> Deactivating -> Stopping -> Deprovisioning -> Stopped

The important point here was that RUNNING does not immediately mean the service is ready to serve traffic normally.
Even if the application process is up, it still needs to pass the health checks described next.

2. Container Health Check

The ECS task definition had a container health check like this.

"healthCheck": {
  "command": [
    "CMD-SHELL",
    "wget -q -O /dev/null http://localhost:8080/api/health || exit 1"
  ],
  "interval": 30,
  "timeout": 5,
  "retries": 3
}

Each option means the following.

interval: how often the check runs, in seconds
timeout: how many seconds a single check may take before it counts as a failure
retries: how many consecutive failures it takes for the container to be marked unhealthy

This check runs inside the container.
In other words, it checks whether the application is up from the perspective of localhost:8080.

Reviewing this again, three points stood out.

Because it is an in-container command, it does not guarantee network reachability from outside the container or through the ALB path.
It keeps failing until the app is fully up.
If there is no startPeriod, failures during initial boot count toward retries from the very first check.
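
To make the startPeriod point concrete, here is a minimal sketch of how the failure counting plays out. It is a simplified model, not the actual Docker or ECS implementation: it assumes the first check runs one interval after the container starts, ignores timeout, and assumes the app keeps passing once it is ready. The function name and the horizon parameter are my own.

```python
from typing import Optional


def first_unhealthy_time(ready_at: float, interval: int = 30, retries: int = 3,
                         start_period: int = 0, horizon: int = 600) -> Optional[int]:
    """Return the second at which the container would be marked unhealthy,
    or None if it passes a check before running out of retries."""
    consecutive_failures = 0
    for t in range(interval, horizon + 1, interval):
        if t >= ready_at:
            return None  # first passing check: the container is healthy
        if t <= start_period:
            continue  # assumption: failures inside the start period are not counted
        consecutive_failures += 1
        if consecutive_failures >= retries:
            return t
    return None


# Settings from the task definition above: interval=30, retries=3, no startPeriod.
print(first_unhealthy_time(ready_at=82))                    # app ready before the third check
print(first_unhealthy_time(ready_at=100))                   # three failures happen first
print(first_unhealthy_time(ready_at=100, start_period=60))  # early failures are ignored
```

With the 82-second boot from this incident the container check happens to pass on the third attempt, but an app that needs 100 seconds would already be marked unhealthy unless a startPeriod absorbs the early failures.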

3. ALB Target Health Check

Next is the health check setting on the ALB target group.

{
  "TargetGroupName": "staging",
  "TargetType": "instance",
  "Protocol": "HTTP",
  "Port": 8080,
  "ProtocolVersion": "HTTP1",
  "HealthCheckProtocol": "HTTP",
  "HealthCheckPath": "/api/health",
  "HealthCheckPort": "traffic-port",
  "HealthCheckIntervalSeconds": 30,
  "HealthCheckTimeoutSeconds": 5,
  "HealthyThresholdCount": 3,
  "UnhealthyThresholdCount": 4,
  "Matcher": {
    "HttpCode": "200"
  }
}

This is not an ECS task-level setting.
It is a target group-level setting on the ALB.
That means it should be understood as a request coming from outside the container, not from inside it.

The important parts to remember here are these.

If TargetType is instance, targets are registered by EC2 instance and port, so the check goes to a port on the instance rather than straight to the container.
If HealthCheckPort is traffic-port, the health check uses the same port on which the target receives traffic.
Since HealthyThresholdCount = 3, it takes three consecutive successes to become healthy.
Since UnhealthyThresholdCount = 4, it becomes unhealthy after four consecutive failures.

For example, if the target is registered on port 28080, the ALB sends a request to /api/health on that port every 30 seconds to check the target state.

And this ALB check runs in parallel with the Container Health Check.
That was the key part of this issue.

Where the problem happened

An ECS service has a setting called healthCheckGracePeriodSeconds.

It is the period, in seconds, during which the ECS service scheduler ignores unhealthy load balancer and container health check results after a task first starts.

Reference: Amazon ECS service definition parameters

If the service uses a load balancer, ECS does not just look at whether the process is up.
It evaluates both of the following.

Container Health Check
ALB Target Health Check

In our staging ECS service, healthCheckGracePeriodSeconds was set to 150s.

The problem was that the ALB Target Health Check does not start when the app is ready. It starts when the task enters ACTIVATING.

So the check timing can look something like this.

If the first check starts at 10 seconds: 10 -> 40 -> 70 -> 100
If the first check starts at 25 seconds: 25 -> 55 -> 85 -> 115

For the ALB to consider the target healthy, you need three consecutive successful checks in that sequence.
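
The schedule above can be sketched in a few lines. This is only an illustration with names I made up (check_times, earliest_healthy); the offsets of 10 and 25 seconds are the same example values as in the text, not measured values.

```python
from typing import List


def check_times(first_check: int, interval: int = 30, count: int = 4) -> List[int]:
    """Seconds at which the ALB health checks fire, starting from first_check."""
    return [first_check + i * interval for i in range(count)]


def earliest_healthy(first_check: int, interval: int = 30, healthy_threshold: int = 3) -> int:
    """Best case: every check succeeds, so the target turns healthy
    on the healthy_threshold-th check."""
    return first_check + (healthy_threshold - 1) * interval


for first in (10, 25):
    print(first, check_times(first), "earliest healthy at", earliest_healthy(first))
```

Even in the best case, where the app answers the very first check, the target is not confirmed healthy until the third check in the sequence.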

Why it failed within 150 seconds

Let’s assume the following.

The application takes 82 seconds to actually become ready.
The ALB health check runs every 30 seconds.
Healthy status requires three consecutive successful checks.

Then even if the app becomes ready at 82 seconds, all ALB checks before that point will fail.

For example, if the checks happened like this:

40 -> 70 -> 100 -> 130 -> 160

40s: fail
70s: fail
100s: success
130s: success
160s: success

That means healthy status is only fully confirmed at 160 seconds.
But the ECS grace period was 150 seconds.

So even though the app had effectively come up normally,
ECS judged that “this task did not become healthy within the grace period” and marked it for termination before the ALB could finish confirming it as healthy.

In the end, it is not enough to look only at the app’s startup time.
You also have to include the time required for the ALB to accumulate consecutive successful checks.
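
The walk-through above can be replayed with a small simulation. This is a sketch under the same assumptions as the example (app ready at 82 seconds, checks every 30 seconds, three consecutive successes required); the function name and the horizon cutoff are hypothetical, and it assumes a check succeeds whenever it fires at or after the readiness time.

```python
from typing import Optional


def healthy_confirmed_at(ready_at: float, first_check: int, interval: int = 30,
                         healthy_threshold: int = 3, horizon: int = 600) -> Optional[int]:
    """Return the second at which the ALB has seen enough consecutive successes."""
    successes = 0
    t = first_check
    while t <= horizon:
        if t >= ready_at:
            successes += 1
            if successes >= healthy_threshold:
                return t
        else:
            successes = 0  # a failed check resets the streak
        t += interval
    return None


GRACE_PERIOD = 150  # healthCheckGracePeriodSeconds from the staging service

for first_check in (40, 25):
    confirmed = healthy_confirmed_at(ready_at=82, first_check=first_check)
    verdict = "inside" if confirmed <= GRACE_PERIOD else "outside"
    print(f"first check at {first_check}s: healthy at {confirmed}s ({verdict} the grace period)")
```

With the first check at 40 seconds the target is confirmed at 160 seconds, just past the 150-second grace period. Shifting the first check to 25 seconds gives 145 seconds, which shows that the same settings can pass or fail depending on where the checks happen to land.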

Actual log flow

The logs at that time looked roughly like this.

40:38 deployment started
41:58 server finished booting
43:28 server marked for termination due to ALB health check failure
44:31 server received a termination request
44:35 server terminated

At first, I was also puzzled by the roughly one-minute gap between “server marked for termination” and “server received a termination request.”

That part was caused by the ALB’s deregistration delay, in other words, the draining time.

Reference: Edit target group attributes for your Application Load Balancer

The target remains in the draining state until that time expires,
and only after that does the server receive the actual termination request and shut down.
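
To see the gaps more clearly, the timestamps above can be turned into elapsed seconds. This is just a small helper I sketched for the mm:ss stamps quoted in this post; the function names are my own and it assumes all stamps fall within the same hour.

```python
from typing import List, Tuple


def to_seconds(stamp: str) -> int:
    """Convert an mm:ss stamp into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def gaps(events: List[Tuple[str, str]]) -> List[Tuple[str, int, int]]:
    """Return (label, seconds since previous event, seconds since first event)."""
    start = to_seconds(events[0][0])
    previous = start
    rows = []
    for stamp, label in events[1:]:
        now = to_seconds(stamp)
        rows.append((label, now - previous, now - start))
        previous = now
    return rows


EVENTS = [
    ("40:38", "deployment started"),
    ("41:58", "server finished booting"),
    ("43:28", "server marked for termination"),
    ("44:31", "server received a termination request"),
    ("44:35", "server terminated"),
]

for label, since_previous, since_start in gaps(EVENTS):
    print(f"{label}: +{since_previous}s (t={since_start}s)")
```

The 63-second step between being marked for termination and receiving the termination request is the draining time described above.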

What I took away from this

The conclusion was simple.

healthCheckGracePeriodSeconds needs to be set with more margin.

More precisely, you should not set it by looking only at “application boot time.”
You should include the time it takes for the ALB health check to be fully confirmed as healthy.

The clearest way to think about it, at least for me, was like this.

The ALB health check starts from ACTIVATING, regardless of whether the service is actually ready.
A normal successful rollout only finishes after app readiness time + ALB consecutive success time have both passed.
For example, if app boot takes 82 seconds and the ALB requires 3 successful checks at 30-second intervals, another 60 to 90 seconds may be needed before the target is stably confirmed.
So healthCheckGracePeriodSeconds should be set with more margin than that combined time.
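
As a rough way to turn that reasoning into a number, here is a minimal sketch. It assumes the worst case where the app becomes ready just after a check fired, so almost a full interval passes before the first success. The function names and the 30-second margin are my own choices, not an AWS recommendation.

```python
def worst_case_healthy_seconds(app_ready_seconds: int, interval: int = 30,
                               healthy_threshold: int = 3) -> int:
    """Upper bound on when the ALB confirms the target as healthy.

    Up to one interval can pass before the first successful check,
    then (healthy_threshold - 1) more intervals are needed.
    """
    return app_ready_seconds + interval * healthy_threshold


def suggested_grace_period(app_ready_seconds: int, interval: int = 30,
                           healthy_threshold: int = 3, margin: int = 30) -> int:
    """Worst case plus a safety margin (the margin value is an assumption)."""
    return worst_case_healthy_seconds(app_ready_seconds, interval, healthy_threshold) + margin


print(worst_case_healthy_seconds(82))  # app boot of 82 seconds from this incident
print(suggested_grace_period(82))
```

For the values in this post the worst case comes out at 172 seconds, so the 150-second grace period was too short even before adding any margin.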

Wrap-up

This issue was less about the server being dead,
and more about not giving the system enough time to officially recognize the server as healthy.

When operating ECS, it is common to first focus only on application startup time.
But in an environment with a load balancer attached, you also need to look at the ALB health check interval and threshold.

While organizing this again, I strongly felt that
RUNNING and “actually healthy enough to receive traffic” can be quite different states.

If you run into a similar situation where a staging server looks fine but keeps getting terminated or redeployed during deployment,
it is worth checking more than just the application logs.

Look at these together as well.

ECS healthCheckGracePeriodSeconds
Container Health Check
ALB Target Group Health Check interval / threshold
deregistration delay

Looking at all four together can reveal the cause much faster than you might expect.
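
If you want to line these four up side by side, something like the following sketch can help. It only works on plain dictionaries, and I am assuming they have the same shape as the responses from boto3's describe_services, describe_target_groups and describe_target_group_attributes, plus the healthCheck block of the task definition. The function name is hypothetical, and the 60-second deregistration delay in the sample is a guess based on the one-minute gap in the logs, not a value I confirmed.

```python
from typing import Dict, List


def rollout_timing_report(service: Dict, container_health: Dict,
                          target_group: Dict, tg_attributes: List[Dict]) -> Dict[str, int]:
    """Collect the four timing-related settings into one comparable view (seconds)."""
    attrs = {a["Key"]: a["Value"] for a in tg_attributes}
    return {
        "grace_period": service.get("healthCheckGracePeriodSeconds", 0),
        # consecutive container check failures needed to mark the container unhealthy
        "container_unhealthy_after": container_health["interval"] * container_health["retries"],
        # time the ALB needs for its consecutive successes, worst case
        "alb_confirmation_window": (target_group["HealthCheckIntervalSeconds"]
                                    * target_group["HealthyThresholdCount"]),
        # 300 seconds is the ELB default when the attribute is not set
        "deregistration_delay": int(attrs.get("deregistration_delay.timeout_seconds", 300)),
    }


report = rollout_timing_report(
    service={"healthCheckGracePeriodSeconds": 150},
    container_health={"interval": 30, "timeout": 5, "retries": 3},
    target_group={"HealthCheckIntervalSeconds": 30, "HealthyThresholdCount": 3},
    tg_attributes=[{"Key": "deregistration_delay.timeout_seconds", "Value": "60"}],  # assumed value
)
print(report)
```

Comparing grace_period against app boot time plus alb_confirmation_window is the check I wish I had done before this deployment.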
