
Plugin daemon fails after ECS task redeployment with connection errors #406

@zukizukizuki

Description


Self Checks

To make sure we get back to you in time, please check the following :)

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • "Please do not modify this template :) and fill in all the required fields."

Versions

  1. dify-plugin-daemon Version: 0.1.3-local
  2. dify-api Version: 1.4.3

Describe the bug

Plugin daemon experiences network connection failures when ECS tasks are redeployed, causing plugins to become inaccessible. Two distinct but related issues occur:

Important Question: Is deploying new versions and restarting ECS tasks a supported use case for dify-plugin-daemon, or are we using the plugin daemon outside its intended scope?

  1. S3 Storage Issue (Previous): When using S3 for plugin storage, plugins fail to load after container restart with error "plugin_unique_identifier is not valid"
  2. Network Connection Issue (Current): After switching to EFS + local storage and forcing new deployment, plugin daemon fails to connect to itself using old IP addresses

To Reproduce

Steps to reproduce the behavior:

Issue 1: S3 Storage Problem

  1. Configure dify-plugin-daemon with PLUGIN_STORAGE_TYPE=aws_s3
  2. Install plugins (OpenAI, Bedrock) via Dify web interface
  3. Restart/redeploy ECS service
  4. Observe error logs: 2025/07/17 04:50:14 watcher.go:73: [ERROR]list installed plugins failed: plugin_unique_identifier is not valid:

Issue 2: Network Connection Problem (After EFS Migration)

  1. Configure dify-plugin-daemon with PLUGIN_STORAGE_TYPE=local and EFS mount
  2. Install plugins via Dify web interface
  3. Force a new deployment via an ECS service update (see the CLI sketch after this list)
  4. Check logs and see connection errors to old task IP addresses
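
For reference, step 3 corresponds to a forced ECS deployment along these lines (cluster and service names are placeholders for our environment):

# Replace the running tasks; in awsvpc mode the new tasks get new ENIs and IP addresses
aws ecs update-service \
  --cluster dify-cluster \
  --service dify-plugin-daemon \
  --force-new-deployment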

Expected behavior

Plugin daemon should:

  1. Successfully reload installed plugins after container restarts/redeployments
  2. Handle IP address changes gracefully without requiring manual intervention
  3. Maintain plugin state consistency across deployments

Screenshots

Error Logs - S3 Storage Issue:

2025/07/17 04:50:14 watcher.go:73: [ERROR]list installed plugins failed: 
plugin_unique_identifier is not valid: 

Error Logs - Network Connection Issue:

2025/07/18 05:14:00 middleware.go:131: [ERROR]redirect request failed: Post "http://169.254.172.2:5002/plugin/a5df51ca-fba9-4170-8369-4ae0eff4f543/dispatch/model/schema": dial tcp 169.254.172.2:5002: connect: cannot assign requested address

2025/07/18 05:14:00 factory.go:28: [ERROR]PluginDaemonInternalServerError: redirect request failed: Post "http://169.254.172.2:5002/plugin/a5df51ca-fba9-4170-8369-4ae0eff4f543/dispatch/model/schema": dial tcp 169.254.172.2:5002: connect: cannot assign requested address

Environment Configuration:

SERVER_HOST=0.0.0.0
SERVER_PORT=5002
PLUGIN_WORKING_PATH=/app/shared_plugins
PLUGIN_INSTALLED_PATH=installed
PLUGIN_PACKAGE_CACHE_PATH=packages
PLUGIN_STORAGE_TYPE=local
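
For local reproduction, the same configuration can be approximated with a plain container run (the image tag matches our build and the mount source is a placeholder for the EFS path):

# Rough local equivalent of the ECS task definition (paths and image tag are ours; adjust as needed)
docker run --rm \
  -p 5002:5002 \
  -e SERVER_HOST=0.0.0.0 \
  -e SERVER_PORT=5002 \
  -e PLUGIN_WORKING_PATH=/app/shared_plugins \
  -e PLUGIN_INSTALLED_PATH=installed \
  -e PLUGIN_PACKAGE_CACHE_PATH=packages \
  -e PLUGIN_STORAGE_TYPE=local \
  -v /mnt/efs/shared_plugins:/app/shared_plugins \
  langgenius/dify-plugin-daemon:0.1.3-local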

Additional context

Infrastructure Details:

  • Platform: AWS ECS Fargate
  • Network: awsvpc mode with dynamic IP assignment
  • Storage:
    • Initial setup: S3 for plugin storage
    • Current setup: EFS (NFSv4.1) mounted at /app/shared_plugins
  • Database: Aurora PostgreSQL 15.10

Analysis:

  1. S3 Issue: Suggests the plugin metadata in the database becomes inconsistent with the actual S3 storage
  2. Network Issue: The plugin daemon appears to cache/store IP addresses internally, causing connection failures when the container IP changes

Potential Root Causes:

  1. Plugin daemon may be storing absolute network references instead of using localhost/relative addressing
  2. Database plugin metadata may contain stale network configuration (a way to inspect this is sketched after this list)
  3. Plugin state management may not be designed for ephemeral container environments
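
To check hypothesis 2, the plugin tables can be inspected directly. The sketch below uses placeholder connection details and only the table names referenced in the workaround; it does not assume any particular columns:

# Inspect plugin metadata for stale state (connection details are placeholders)
psql "host=<aurora-endpoint> dbname=dify user=dify" \
  -c '\d plugin_installations' \
  -c 'SELECT * FROM plugin_installations LIMIT 5;' \
  -c 'SELECT * FROM plugins LIMIT 5;'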

Workaround Applied:

Complete plugin cleanup and reinstallation after each deployment:

# Clear EFS plugin data
rm -rf /app/shared_plugins/langgenius/

# Clear database plugin references
DELETE FROM plugins;
DELETE FROM plugin_installations;
DELETE FROM plugin_declarations;
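
In practice the cleanup was run roughly as follows (cluster, task, container, and connection details are placeholders; ECS Exec must be enabled on the service):

# Remove plugin files inside the running task (requires ECS Exec)
aws ecs execute-command \
  --cluster dify-cluster \
  --task <task-id> \
  --container dify-plugin-daemon \
  --interactive \
  --command "rm -rf /app/shared_plugins/langgenius/"

# Remove database plugin references
psql "host=<aurora-endpoint> dbname=dify user=dify" \
  -c 'DELETE FROM plugins;' \
  -c 'DELETE FROM plugin_installations;' \
  -c 'DELETE FROM plugin_declarations;'

After this, the plugins are reinstalled through the Dify web interface.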

Request:

Is there a configuration option or best practice to make plugin daemon more resilient to container restarts and IP address changes in containerized environments like ECS Fargate?
