Replies: 3 comments
- Thanks for opening your first issue here! Be sure to follow the issue template. If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.
- Two possible issues (both with your deployment):
Can you please try that, increasing the number of replicas while decreasing the number of workers to 1, as explained in the FastAPI docs: https://fastapi.tiangolo.com/deployment/docker/#one-load-balancer-multiple-worker-containers
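The FastAPI guidance linked above (run one worker process per container and scale out with replicas) would translate to something like the following Deployment fragment. This is an illustrative sketch only; the names and values are not taken from the report, and the exact worker flag depends on your image and Helm chart:

```yaml
# Illustrative sketch: scale with container replicas rather than
# in-container worker processes, per the linked FastAPI deployment docs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-api-server   # hypothetical name
spec:
  replicas: 4                # increased replica count
  template:
    spec:
      containers:
        - name: api-server
          image: apache/airflow:3.1.2
          # Configure a single worker process per container here; the exact
          # flag or chart value depends on your image / Helm chart.
```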
- Converting to a discussion - please let us know in the discussion if the fixes in your deployment helped.
-
Apache Airflow version
Other Airflow 2/3 version (please specify below)
If "Other Airflow 2/3 version" selected, which one?
3.1.2
What happened?
After upgrading from Airflow 2.11 to Airflow 3.1.2 (the same issue occurred on all 3.x versions), we are experiencing intermittent task failures when the system is under high load. The failures are caused by timeouts and name-resolution errors when contacting the Airflow execution API, making tasks fail during log setup (_remote_logging_conn → client.connections.get()).
When load is low or when we re-run the same DAGs individually, everything succeeds. The problem appears only when many DAGs run concurrently.
We are running Airflow on Kubernetes with the KubernetesExecutor.
Cluster Setup
Scheduler: 2 replicas
Webserver / API: 2 replicas
Dag-Processor: 1 replica
Workers: KubernetesExecutor (pods)
Worker pods fail early in task execution with repeated retries from the Airflow SDK client, eventually raising:
httpx.ConnectError: [Errno -3] Temporary failure in name resolution
The worker attempts to contact the Airflow Webserver API at:
http://airflow-web.ws-nav-8662-pr.svc.cluster.local/execution/
During high DAG concurrency, the request repeatedly times out, then fails.
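Not part of the report, but one way to check whether cluster DNS itself is degrading under load is to probe the service name repeatedly from inside a worker pod. A minimal sketch; the hostname below is the one from the report, so adjust it to your namespace:

```python
import socket

def count_dns_failures(host: str, port: int = 80, attempts: int = 20) -> int:
    """Resolve `host` repeatedly and count failures.

    Under load, CoreDNS throttling or drops show up as intermittent
    socket.gaierror ("Temporary failure in name resolution"), which is
    the same underlying error surfaced by httpx.ConnectError above.
    """
    failures = 0
    for _ in range(attempts):
        try:
            socket.getaddrinfo(host, port)
        except socket.gaierror:
            failures += 1
    return failures

if __name__ == "__main__":
    # Service name taken from the report; adjust to your namespace.
    host = "airflow-web.ws-nav-8662-pr.svc.cluster.local"
    print(f"{count_dns_failures(host)} of 20 lookups failed")
```

If the failure count is nonzero only while many task pods are starting, that points at cluster DNS capacity rather than the Airflow API server itself.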
Airflow config:
What you think should happen instead?
Tasks should run as expected without timeout error.
How to reproduce
When more than 10 task pods are created, a few of them fail with the above error message.
Operating System
Linux
Versions of Apache Airflow Providers
No response
Deployment
Other 3rd-party Helm chart
Deployment details
Helm chart on an EKS cluster
Anything else?
No response
Are you willing to submit PR?
Code of Conduct