Multi‑AZ ALB with single‑replica ECS services causes intermittent blackholing #121

@vladhor

Summary

When the Metaflow Terraform module is configured with two subnets (as required by the module interface), it creates a multi-AZ Application Load Balancer (ALB). However, the Metaflow ECS/Fargate services (UI backend, UI static, metadata service) are deployed with desired_count = 1, resulting in tasks running in only one Availability Zone.

This leads to a situation where:

  • the ALB advertises one IP per AZ (via DNS),
  • but only one AZ has healthy ECS targets,
  • and requests routed to the other AZ IP hang or time out.

The result is intermittent failures: a request succeeds or hangs depending on which ALB IP the client's resolver picks.
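One way to confirm this from a client is to probe every IP the ALB hostname resolves to, instead of letting the HTTP client pick one. The following is a minimal sketch using only the Python standard library. The port (80), the request path, and the 3 second timeout are assumptions and should be adjusted to the actual listener. The helper names are illustrative and not part of the module.

```python
import socket
from typing import Callable, Dict, Iterable, List


def resolve_ipv4(hostname: str, port: int = 80) -> List[str]:
    """Return the distinct IPv4 addresses the resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def http_probe(ip: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Send a bare HTTP request to one IP.

    Returns True only if some response bytes arrive before the timeout.
    Port and path are assumptions; adjust them to the real listener.
    """
    request = (
        b"GET / HTTP/1.1\r\nHost: " + ip.encode() + b"\r\nConnection: close\r\n\r\n"
    )
    try:
        with socket.create_connection((ip, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            conn.sendall(request)
            return bool(conn.recv(1))
    except OSError:  # covers refused connections and timeouts
        return False


def probe_all(ips: Iterable[str], probe: Callable[[str], bool]) -> Dict[str, bool]:
    """Map each IP to whether the given probe got a response from it."""
    return {ip: probe(ip) for ip in ips}
```

Calling `probe_all(resolve_ipv4(alb_dns_name), http_probe)` from inside the VPC, where `alb_dns_name` is the ALB's hostname, should show one IP answering and the other returning `False` if the behavior described above is present.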


Affected versions

  • Module: outerbounds/metaflow/aws
  • Tested with: v0.13.0
  • AWS Region: any multi-AZ region (e.g. eu-west-3)

Environment / setup

  • Metaflow deployed using the high-level module
  • VPC spanning 2 Availability Zones
  • Two subnets passed to the module:
    subnet1_id = <subnet in AZ A>
    subnet2_id = <subnet in AZ B>
  • UI ALB enabled (internal or public)
  • ECS services:
    • metaflow-*-metadata-service
    • metaflow-*-ui_backend
    • metaflow-*-ui_static
  • All ECS services run with:
    scheduling_strategy = REPLICA
    desired_count = 1
    

Observed behavior

  1. The ALB is created across both subnets / AZs, as expected.
  2. DNS for the ALB resolves to two private IPs (one per AZ).
  3. Each ECS service runs only one task, placed in a single AZ.
  4. The ALB target group has:
    • healthy targets in one AZ
    • no healthy targets in the other AZ
  5. Requests routed to the “empty” AZ IP hang or time out.
  6. Clients experience intermittent failures, depending on which of the two ALB IPs their resolver returns first.
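The condition in step 4 can be checked mechanically. The sketch below assumes the caller has already collected one `(availability_zone, state)` pair per registered target, for example from the target group's health view in the console or API. That input shape and the function names are assumptions for illustration. `"healthy"` is the ELB target health state string.

```python
from typing import Dict, Iterable, List, Tuple


def healthy_targets_per_az(
    targets: Iterable[Tuple[str, str]], alb_azs: Iterable[str]
) -> Dict[str, int]:
    """Count healthy targets per AZ, including ALB AZs that have none."""
    counts = {az: 0 for az in alb_azs}
    for az, state in targets:
        if state == "healthy":
            counts[az] = counts.get(az, 0) + 1
    return counts


def empty_azs(
    targets: Iterable[Tuple[str, str]], alb_azs: Iterable[str]
) -> List[str]:
    """Return the ALB AZs that have no healthy target at all."""
    alb_azs = list(alb_azs)
    counts = healthy_targets_per_az(targets, alb_azs)
    return sorted(az for az in alb_azs if counts[az] == 0)


# Example matching the setup in this issue: three services, one task each,
# all placed in the same AZ of a two-AZ ALB.
targets = [("eu-west-3a", "healthy")] * 3
print(empty_azs(targets, ["eu-west-3a", "eu-west-3b"]))
```

Any AZ this reports is one where the ALB has a node but no local healthy target.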

Expected behavior

Option A — True HA

If two subnets / AZs are required, ECS services should run with:

desired_count >= number_of_AZs

so that each ALB node has at least one healthy target.
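As a rough illustration of that rule, a pre-deployment check could compare each service's desired count against the number of distinct AZs behind the subnets. This is a sketch with hypothetical names and inputs, not something the module provides. Meeting the count is necessary but does not by itself guarantee that tasks land in every AZ, since placement is up to the ECS scheduler.

```python
from typing import Dict, Iterable


def under_replicated(
    desired_counts: Dict[str, int], subnet_azs: Iterable[str]
) -> Dict[str, int]:
    """Return, per service, how many replicas are missing to reach
    desired_count >= number of distinct AZs."""
    az_count = len(set(subnet_azs))
    return {
        name: az_count - count
        for name, count in desired_counts.items()
        if count < az_count
    }


# The setup described in this issue: two subnets in two AZs, one task per service.
services = {
    "metadata-service": 1,
    "ui_backend": 1,
    "ui_static": 1,
}
print(under_replicated(services, ["eu-west-3a", "eu-west-3b"]))
```

An empty result means every service has at least as many desired tasks as there are AZs.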

Option B — Safe single-AZ default

If ECS services are intentionally single-replica, the ALB should not be multi-AZ, or the module should clearly document that multi-AZ ALB + single-replica ECS is unsupported.


Proposed solutions

  1. Expose ECS replica counts as module inputs.
  2. Allow explicit single-AZ deployments (make subnet2_id optional or support passing the same subnet twice).
  3. Improve documentation and warnings.
