Summary
When the Metaflow Terraform module is configured with two subnets (as required by the module interface), it creates a multi-AZ Application Load Balancer (ALB). However, the Metaflow ECS/Fargate services (UI backend, UI static, metadata service) are deployed with desired_count = 1, resulting in tasks running in only one Availability Zone.
This leads to a situation where:
- the ALB advertises one IP per AZ (via DNS),
- but only one AZ has healthy ECS targets,
- and requests routed to the other AZ IP hang or time out.
The result is intermittent availability issues depending on which ALB IP the client resolves.
Affected versions
- Module: outerbounds/metaflow/aws
- Tested with: v0.13.0
- AWS Region: any multi-AZ region (e.g. eu-west-3)
Environment / setup
- Metaflow deployed using the high-level module
- VPC spanning 2 Availability Zones
- Two subnets passed to the module:
subnet1_id = <subnet in AZ A>
subnet2_id = <subnet in AZ B>
- UI ALB enabled (internal or public)
- ECS services:
metaflow-*-metadata-service
metaflow-*-ui_backend
metaflow-*-ui_static
- All ECS services run with:
scheduling_strategy = REPLICA
desired_count = 1
Observed behavior
- The ALB is created across both subnets / AZs, as expected.
- DNS for the ALB resolves to two private IPs (one per AZ).
- ECS services run only one task total, placed in a single AZ.
- The ALB target group has:
- healthy targets in one AZ
- no healthy targets in the other AZ
- Requests routed to the “empty” AZ IP hang or time out.
- Clients experience intermittent failures depending on DNS resolution order.
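The behavior above can be checked from a host that can reach the ALB by resolving its DNS name and probing each returned IP separately. The following is a minimal sketch using only the Python standard library. The helper names (resolve_ips, probe, probe_all) are made up for illustration, and the port and timeout are assumptions that should match your listener.

```python
import socket


def resolve_ips(hostname, port=80):
    """Return the sorted set of IPv4 addresses the name currently resolves to."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def probe(ip, port=80, timeout=3.0):
    """Connect to one IP, send a minimal HTTP request, and classify the outcome."""
    request = (
        b"GET / HTTP/1.1\r\nHost: " + ip.encode() + b"\r\nConnection: close\r\n\r\n"
    )
    try:
        # The timeout applies to the connect and to the later recv call.
        with socket.create_connection((ip, port), timeout=timeout) as sock:
            sock.sendall(request)
            data = sock.recv(64)
            return "ok" if data else "empty response"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


def probe_all(hostname, port=80, timeout=3.0):
    """Probe every IP behind the name and return a mapping of ip -> outcome."""
    return {ip: probe(ip, port, timeout) for ip in resolve_ips(hostname, port)}
```

Calling probe_all with the ALB DNS name (for example probe_all("<your-alb-dns-name>", port=80)) returns one entry per resolved IP. If the issue is present, one IP would be expected to report "ok" and the other "timeout". Any HTTP response counts as "ok" here, since the sketch only distinguishes answering from hanging.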
Expected behavior
Option A — True HA
If two subnets / AZs are required, ECS services should run with:
desired_count >= number_of_AZs
so that each ALB node has at least one healthy target.
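The coverage condition in Option A can be expressed as a small check. This is an illustrative sketch only: the function names are hypothetical, and the AZ lists would have to come from your own inspection of the ALB and its target group.

```python
def uncovered_azs(alb_azs, healthy_target_azs):
    """Return the AZs where the ALB is present but no healthy target is running."""
    return sorted(set(alb_azs) - set(healthy_target_azs))


def min_desired_count(alb_azs):
    """Smallest desired_count that could place one task in every ALB AZ."""
    return len(set(alb_azs))


# Illustrative values matching the setup in this report: two AZs, one task.
alb_azs = ["eu-west-3a", "eu-west-3b"]
healthy_target_azs = ["eu-west-3a"]

print(uncovered_azs(alb_azs, healthy_target_azs))
print(min_desired_count(alb_azs))
```

Note that desired_count >= number_of_AZs is a necessary condition for one task per AZ, but it does not by itself guarantee that the scheduler places the tasks in different AZs.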
Option B — Safe single-AZ default
If ECS services are intentionally single-replica, the ALB should not be multi-AZ, or the module should clearly document that multi-AZ ALB + single-replica ECS is unsupported.
Proposed solutions
- Expose ECS replica counts as module inputs.
- Allow explicit single-AZ deployments (make subnet2_id optional or support passing the same subnet twice).
- Improve documentation and warnings.