Replies: 3 comments
- Thanks for opening your first issue here! Be sure to follow the issue template. If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.
- Two possible issues (both with your deployment):
Can you please try that, increasing the number of replicas while decreasing the number of workers to 1, as explained in the FastAPI docs: https://fastapi.tiangolo.com/deployment/docker/#one-load-balancer-multiple-worker-containers
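The FastAPI guidance linked above (run one worker process per container and scale out with replicas) would translate to something like the following Deployment fragment. This is an illustrative sketch only; the names and values are not taken from the report, and the exact worker flag depends on your image and Helm chart:

```yaml
# Illustrative sketch: scale with container replicas rather than
# in-container worker processes, per the linked FastAPI deployment docs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-api-server   # hypothetical name
spec:
  replicas: 4                # increased replica count
  template:
    spec:
      containers:
        - name: api-server
          image: apache/airflow:3.1.2
          # Configure a single worker process per container here; the exact
          # flag or chart value depends on your image / Helm chart.
```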
- Converting to a discussion - please let us know in the discussion if the fixes in your deployment helped.
-
Apache Airflow version
Other Airflow 2/3 version (please specify below)
If "Other Airflow 2/3 version" selected, which one?
3.1.2
What happened?
After upgrading from Airflow 2.11 to Airflow 3.1.2 (the same issue occurred on all 3.x versions), we are experiencing intermittent task failures when the system is under high load. The failures are caused by timeouts and name-resolution errors when contacting the Airflow execution API, making tasks fail during log setup (_remote_logging_conn → client.connections.get()).
When load is low or when we re-run the same DAGs individually, everything succeeds. The problem appears only when many DAGs run concurrently.
We are running Airflow on Kubernetes with the KubernetesExecutor.
Cluster Setup
Scheduler: 2 replicas
Webserver / API: 2 replicas
Dag-Processor: 1 replica
Workers: KubernetesExecutor (pods)
Worker pods fail early in task execution with repeated retries from the Airflow SDK client, eventually raising:
httpx.ConnectError: [Errno -3] Temporary failure in name resolution
The worker attempts to contact the Airflow Webserver API at:
http://airflow-web.ws-nav-8662-pr.svc.cluster.local/execution/
During high DAG concurrency, the request repeatedly times out, then fails.
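Not part of the report, but one way to check whether cluster DNS itself is degrading under load is to probe the service name repeatedly from inside a worker pod. A minimal sketch; the hostname below is the one from the report, so adjust it to your namespace:

```python
import socket

def count_dns_failures(host: str, port: int = 80, attempts: int = 20) -> int:
    """Resolve `host` repeatedly and count failures.

    Under load, CoreDNS throttling or drops show up as intermittent
    socket.gaierror ("Temporary failure in name resolution"), which is
    the same underlying error surfaced by httpx.ConnectError above.
    """
    failures = 0
    for _ in range(attempts):
        try:
            socket.getaddrinfo(host, port)
        except socket.gaierror:
            failures += 1
    return failures

if __name__ == "__main__":
    # Service name taken from the report; adjust to your namespace.
    host = "airflow-web.ws-nav-8662-pr.svc.cluster.local"
    print(f"{count_dns_failures(host)} of 20 lookups failed")
```

If the failure count is nonzero only while many task pods are starting, that points at cluster DNS capacity rather than the Airflow API server itself.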
Airflow config:
What you think should happen instead?
Tasks should run as expected without timeout error.
How to reproduce
When more than 10 task pods are created, a few of them fail with the above error message.
Operating System
Linux
Versions of Apache Airflow Providers
No response
Deployment
Other 3rd-party Helm chart
Deployment details
Helm chart on an EKS cluster
Anything else?
No response
Are you willing to submit PR?
Code of Conduct