A comprehensive implementation and evaluation framework for distributed multi-agent job scheduling in high-performance computing (HPC) environments. This repository contains the complete codebase for the paper "Distributed Multi-Agent Scheduling for Resilient High-Performance Computing: Experimental Evaluation".
The system implements a decentralized scheduling architecture where multiple autonomous agents collaborate to schedule jobs across distributed computing resources, replacing traditional centralized schedulers that create single points of failure.
**1. Multi-Agent Architecture**
- Resource Agents: Autonomous agents managing individual compute nodes/clusters
- Distributed Coordination: No central scheduler - agents negotiate directly
- Competitive Bidding: Agents bid for jobs based on resource availability and capability matching
**2. Event-Driven Scheduling**
- Discrete Event Simulation: Pure event-driven approach (no polling)
- Priority Queue Management: O(log n) complexity for scalable event processing
- Message-Passing Protocol: Asynchronous communication between agents
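The event loop behind this design fits in a few lines. Below is a minimal sketch of a heap-based event queue under these assumptions; the names (`Event`, `EventQueue`) are illustrative, not the repository's actual `DiscreteEventScheduler` API:

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass(order=True)
class Event:
    time: float
    seq: int                                    # tie-breaker for equal timestamps
    handler: Callable = field(compare=False)
    payload: Any = field(compare=False, default=None)

class EventQueue:
    """Heap-backed queue: O(log n) push/pop, no time-driven polling."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def schedule(self, time: float, handler: Callable, payload: Any = None) -> None:
        heapq.heappush(self._heap, Event(time, next(self._counter), handler, payload))

    def run(self) -> None:
        # Purely reactive: the simulation clock jumps from event to event
        while self._heap:
            event = heapq.heappop(self._heap)
            event.handler(event.time, event.payload)

q = EventQueue()
q.schedule(5.0, lambda t, msg: print(f"[t={t}] {msg}"), "job-42 arrives")
q.schedule(1.0, lambda t, msg: print(f"[t={t}] {msg}"), "heartbeat")
q.run()  # fires the heartbeat first: events execute in timestamp order
```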
**3. Fault Tolerance Mechanisms**
- Heartbeat Monitoring: Continuous health checking of all agents
- Automatic Recovery: Failed jobs automatically redistributed
- No Single Point of Failure: System continues operating even if multiple agents fail
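The heartbeat bookkeeping behind failure detection can be sketched as follows; `HeartbeatMonitor` is a hypothetical name for illustration (the repository's `base_agent.py` implements its own variant):

```python
import time
from typing import Dict, List, Optional

class HeartbeatMonitor:
    """Track last-seen timestamps and flag agents that miss their deadline."""

    def __init__(self, timeout: float = 5.0):
        self.timeout = timeout                  # seconds of silence before an agent is presumed failed
        self.last_seen: Dict[str, float] = {}

    def beat(self, agent_id: str, now: Optional[float] = None) -> None:
        """Record a heartbeat from an agent."""
        self.last_seen[agent_id] = time.monotonic() if now is None else now

    def failed_agents(self, now: Optional[float] = None) -> List[str]:
        """Return agents whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return [a for a, t in self.last_seen.items() if now - t > self.timeout]
```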
**Phase 1: Job Arrival**
1. Job submitted to system
2. Resource agents evaluate job requirements
3. Capable agents generate competitive bids
4. Bids include resource availability scores
**Phase 2: Competitive Selection**
1. Agents compete based on multi-factor scoring:
- CPU/Memory/GPU availability match
- Current workload and utilization
- Historical performance metrics
2. Best-fit agent selected automatically
3. Job assignment and execution initiated
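Phases 1 and 2 together form a simple sealed-bid auction. A minimal sketch of the idea; the `Bid` record and the literal scores below are illustrative, not the repository's actual protocol messages:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Bid:
    agent_id: str
    score: float  # multi-factor score: resource match, current load, history

def select_winner(bids: List[Bid]) -> Optional[Bid]:
    """Phase 2: the highest-scoring (best-fit) bidder wins the job."""
    return max(bids, key=lambda b: b.score) if bids else None

# Phase 1: capable agents evaluate the job and submit competitive bids
bids = [Bid("agent-1", 0.82), Bid("agent-2", 0.64), Bid("agent-3", 0.91)]
winner = select_winner(bids)  # agent-3 is assigned the job
```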
**Phase 3: Fault Handling**
1. Continuous monitoring via heartbeat protocol
2. Failure detection triggers automatic recovery
3. Failed jobs redistributed to available agents
4. System maintains >95% availability under failures
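In code, recovery amounts to re-auctioning the failed agent's jobs among the survivors. A sketch under stated assumptions: `bid_for` is a hypothetical agent method, and `assignments` is a plain dict rather than the framework's actual state store:

```python
def redistribute(failed_agent_id, assignments, live_agents):
    """Re-auction every job owned by a failed agent among healthy agents.

    assignments: dict mapping agent_id -> list of jobs
    live_agents: healthy agents, each exposing a (hypothetical) bid_for(job)
    """
    orphaned = assignments.pop(failed_agent_id, [])
    for job in orphaned:
        # Each survivor re-bids; the best fit takes over the job
        winner = max(live_agents, key=lambda agent: agent.bid_for(job))
        assignments.setdefault(winner.agent_id, []).append(job)
```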
**1. Scalable Event Processing**
- Heap-based priority queue for O(log n) event scheduling
- No time-driven polling - purely reactive system
- Efficient message routing and processing
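The routing core of such a protocol is small. Here is a simplified, synchronous sketch of a topic-based pub-sub bus in the spirit of `communication/protocol.py`; in the simulator, deliveries would be scheduled as future events rather than invoked inline:

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

class MessageBus:
    """Topic-based pub-sub: agents subscribe to topics and react to messages."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Any) -> None:
        # Deliver to every subscriber registered for this topic
        for callback in self._subscribers[topic]:
            callback(message)
```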
**2. Intelligent Resource Matching**
- Multi-dimensional scoring algorithm considering CPU, memory, GPU requirements
- Dynamic capability assessment and load balancing
- Preference-based job placement optimization
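One plausible shape for such a score, assuming jobs and nodes are described by `(cpu, mem, gpu)` tuples; the weights and fit heuristic below are illustrative, not the repository's actual algorithm:

```python
def match_score(job_req, capacity, utilization):
    """Score how well a node fits a job across CPU, memory, and GPU dimensions.

    job_req/capacity are (cpu, mem, gpu) tuples. Nodes that cannot satisfy
    any dimension are disqualified; otherwise a tighter fit on a less
    loaded node scores higher.
    """
    if any(need > have for need, have in zip(job_req, capacity)):
        return 0.0                        # incapable node: never bids
    dims = [need / have for need, have in zip(job_req, capacity) if have > 0]
    fit = sum(dims) / len(dims)           # 1.0 = exact fit, near 0 = wasteful
    return 0.6 * fit + 0.4 * (1.0 - utilization)
```

For example, `match_score((4, 16, 1), (32, 128, 4), utilization=0.25)` rewards a lightly loaded node that comfortably fits the job, while a node with no GPU scores 0 and never bids.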
**3. Resilience Through Redundancy**
- Distributed state management across multiple agents
- Automatic job retry and rescheduling mechanisms
- Cascading failure prevention through isolation
- **Scalability**: Linear performance scaling to 500+ jobs and 50+ agents
- **Fault Tolerance**: 96.2% completion rate under 30% agent-failure scenarios
- **Throughput**: 10+ jobs/second sustained processing
- **Recovery Time**: <160 seconds average recovery from major failures
- Eliminates Single Point of Failure: No central scheduler to fail
- Better Fault Tolerance: System degrades gracefully under failures
- Improved Scalability: Distributed decision-making reduces bottlenecks
- Adaptive Resource Management: Agents respond dynamically to changing conditions
- Lower Latency: Direct agent-to-agent communication reduces delays
- Novel distributed coordination protocol for HPC job scheduling
- Competitive bidding mechanism optimizing resource utilization
- Comprehensive fault tolerance framework with automatic recovery
- Scalable event-driven architecture supporting large-scale deployments
This approach represents a paradigm shift from traditional centralized HPC schedulers to resilient, self-organizing distributed systems that maintain performance even under significant infrastructure failures.
- Distributed Multi-Agent Architecture: Autonomous agents with competitive bidding and fault tolerance
- Discrete Event Simulation: High-performance event-driven scheduling simulation
- Comprehensive Evaluation Framework: 26 test configurations across 5 experimental dimensions
- Fault Injection & Recovery: Configurable failure patterns and autonomous recovery mechanisms
- Publication-Ready Results: Automated generation of research figures and statistical analysis
- 96.2% Win Rate: Demonstrated superiority over centralized scheduling approaches
- 25x better completion rate under extreme load (400 concurrent jobs)
- Graceful degradation: maintains 82% completion (vs. 28% for centralized) at 35% failure rates
- Superior scalability: 81-96% completion across varying workload sizes
- Statistical significance: p < 0.001, Cohen's d = 2.84 (large effect size)
```
multiagent/
├── src/                      # Core implementation
│   ├── agents/               # Multi-agent system
│   │   ├── base_agent.py     # Base agent with heartbeat monitoring
│   │   └── resource_agent.py # Resource management and job execution
│   ├── communication/        # Message passing infrastructure
│   │   └── protocol.py       # Pub-sub messaging with fault tolerance
│   ├── scheduler/            # Scheduling algorithms
│   │   └── discrete_event_scheduler.py # Event-driven coordination
│   ├── jobs/                 # Job management
│   │   └── job.py            # Job lifecycle and dependencies
│   └── resources/            # Resource modeling
│       └── resource.py       # HPC resource abstraction
├── evaluation/               # Evaluation framework
│   ├── systematic_resilience_evaluation.py # Main evaluation suite
│   ├── fault_tolerant_test.py # Fault tolerance testing
│   └── high_throughput_test.py # Performance benchmarking
├── demos/                    # Example implementations
├── figures/                  # Generated evaluation results
└── docs/                     # Documentation
```
- Python 3.8 or higher
- Git
```bash
# Clone the repository
git clone https://github.com/username/distributed-multiagent-scheduling.git
cd distributed-multiagent-scheduling

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install package in development mode
pip install -e .
```

Core dependencies:

- `numpy >= 1.21.0` - Numerical computations
- `matplotlib >= 3.5.0` - Visualization and figure generation
- `pandas >= 1.4.0` - Data analysis and manipulation
- `seaborn >= 0.11.0` - Statistical visualization
- `dataclasses` - Data structure definitions (Python 3.7+)
```python
from src.scheduler.discrete_event_scheduler import DiscreteEventScheduler
from src.agents.resource_agent import ResourceAgent
from src.resources.resource import Resource, ResourceCapacity, ResourceType

# Create scheduler and agents
# (`resource` is a Resource instance describing the node's capacity)
scheduler = DiscreteEventScheduler()
agent = ResourceAgent("agent-1", resource, scheduler, failure_rate=0.1)

# Run simulation
scheduler.start()
# Submit jobs and monitor results
```

To run the evaluation suite:

```bash
# Quick evaluation (2-3 minutes)
python evaluation/quick_resilience_test.py

# Ultra-quick demo (30 seconds)
python evaluation/ultra_quick_test.py

# Full systematic evaluation (30-45 minutes)
python evaluation/systematic_resilience_evaluation.py

# Generate publication figures
python create_bw_publication_figures.py
```

To reproduce the paper's results end to end:

```bash
# 1. Run systematic resilience evaluation
python evaluation/systematic_resilience_evaluation.py

# 2. Run fault tolerance tests
python evaluation/fault_tolerant_test.py

# 3. Run high throughput benchmarks
python evaluation/high_throughput_test.py

# 4. Generate all publication figures
python create_bw_publication_figures.py

# 5. Compile LaTeX results document
pdflatex resilience_evaluation_results.tex
```

Estimated runtimes:

- Quick test: 2-3 minutes (5 configurations)
- Systematic evaluation: 30-45 minutes (26 configurations)
- Complete reproduction: 60-90 minutes (all tests + figures)
Generated Results:

```
├── bw_figures/                        # Black & white publication figures (11 files)
├── figures/                           # Color figures and statistical tables
├── resilience_study_results_*.json    # Raw evaluation data
└── resilience_evaluation_results.pdf  # LaTeX compiled results
```
The evaluation framework includes 26 test configurations across 5 dimensions:
- Scale Testing (12 configs): 50-500 jobs, 5-20 agents
- Failure Rate Testing (4 configs): 5%-35% failure rates
- Failure Pattern Testing (3 configs): Random, cascading, network partition
- Load Pattern Testing (3 configs): Constant, burst, Poisson arrivals
- High Load Testing (4 configs): 50-400 concurrent job bursts
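As an illustration, the scale-testing grid can be enumerated programmatically with the `ExperimentConfig` dataclass shown later in this README; the fixed parameter values below are illustrative defaults, not necessarily the paper's exact settings:

```python
from evaluation.systematic_resilience_evaluation import ExperimentConfig

# Scale-testing grid: vary job count and agent count independently.
scale_configs = [
    ExperimentConfig(name=f"Scale-{jobs}j-{agents}a",
                     num_jobs=jobs, num_agents=agents,
                     agent_failure_rate=0.1, scheduler_failure_rate=0.05,
                     job_arrival_pattern='constant', failure_pattern='random',
                     simulation_time=150.0, repetitions=3)
    for jobs in (50, 100, 250, 500)   # 4 job counts
    for agents in (5, 10, 20)         # x 3 agent counts = 12 configurations
]
```

Each configuration is scored on the following metrics: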
- Job Completion Rate: Primary success metric (%)
- System Availability: Operational uptime (%)
- Fault Tolerance Score: Composite resilience index (0-100)
- Mean Time to Recovery: Average failure recovery duration
- Throughput: Jobs completed per time unit
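Completion rate, recovery time, and throughput reduce to simple ratios over a run record. A sketch, where `run.submitted`, `run.completed`, `run.recovery_times`, and `run.duration` are hypothetical field names rather than the framework's actual API:

```python
def summarize(run):
    """Headline metrics for one finished simulation run (illustrative layout)."""
    completion_rate = 100.0 * len(run.completed) / len(run.submitted)   # %
    throughput = len(run.completed) / run.duration                      # jobs per time unit
    mttr = (sum(run.recovery_times) / len(run.recovery_times)           # mean time to recovery
            if run.recovery_times else 0.0)
    return {"completion_rate": completion_rate, "throughput": throughput, "mttr": mttr}
```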
All results include:
- Multiple repetitions (3-5 per configuration)
- Statistical significance testing (p-values, effect sizes)
- Confidence intervals and variance analysis
- Reproducible random seeds for consistency
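The reported p-value and Cohen's d follow from standard formulas. A sketch using NumPy and SciPy (SciPy is assumed here in addition to the dependencies listed above):

```python
import numpy as np
from scipy import stats

def compare(distributed, centralized):
    """Welch's t-test and Cohen's d for two samples of completion rates."""
    _, p = stats.ttest_ind(distributed, centralized, equal_var=False)
    # Cohen's d with a pooled standard deviation (equal group sizes assumed)
    pooled_sd = np.sqrt((np.var(distributed, ddof=1) + np.var(centralized, ddof=1)) / 2)
    d = (np.mean(distributed) - np.mean(centralized)) / pooled_sd
    return p, d
```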
| Experimental Dimension | Configs | Distributed Wins | Win Rate | Avg Advantage |
|---|---|---|---|---|
| Scale Testing | 12 | 11 | 91.7% | +52.3% |
| Failure Rate Testing | 4 | 4 | 100% | +38.5% |
| Failure Pattern Testing | 3 | 3 | 100% | +55.7% |
| Load Pattern Testing | 3 | 3 | 100% | +41.3% |
| High Load Performance | 4 | 4 | 100% | +47.8% |
| Overall Results | 26 | 25 | 96.2% | +47.1% |
Statistical Significance: p < 0.001, Cohen's d = 2.84, Effect Size: Large
```python
from evaluation.systematic_resilience_evaluation import ExperimentConfig, run_resilience_experiment

# Define custom experiment
config = ExperimentConfig(
    name="Custom-Test",
    num_jobs=100,
    num_agents=10,
    agent_failure_rate=0.2,
    scheduler_failure_rate=0.1,
    job_arrival_pattern='burst',
    failure_pattern='cascading',
    simulation_time=150.0,
    repetitions=5
)

# Run evaluation
results = run_resilience_experiment(config)
```

To customize agent behavior, subclass `ResourceAgent`:

```python
from src.agents.resource_agent import ResourceAgent

class CustomAgent(ResourceAgent):
    def _calculate_job_score(self, job_data):
        # Start from the default multi-factor score
        score = super()._calculate_job_score(job_data)
        # Add custom logic, then return the adjusted value
        modified_score = score
        return modified_score
```

To define custom failure injection patterns:

```python
from evaluation.systematic_resilience_evaluation import inject_failure_pattern

# Custom failure injection
def custom_failure_pattern(simulation, pattern, tracker, simulation_time):
    agents = list(simulation.agents.values())
    # Implement custom failure timing and patterns
    for agent in agents:
        agent.failure_time = custom_failure_schedule()  # user-defined schedule
```

- API Documentation: See the `docs/` directory
- Architecture Guide: `docs/ARCHITECTURE.md`
- Evaluation Guide: `docs/EVALUATION.md`
- Figure Descriptions: `bw_figure_descriptions.md`
- LaTeX Results: `resilience_evaluation_results.tex`
```bash
# Run unit tests
python -m pytest tests/

# Run integration tests
python -m pytest tests/integration/

# Run evaluation validation
python tests/validate_evaluation.py
```

Performance benchmarks on standard hardware (Intel i7, 16GB RAM):
- Simulation Rate: ~10,000 events/second
- Agent Scalability: Efficient operation with up to 50 agents
- Job Throughput: 1,000+ jobs per simulation
- Memory Usage: <2GB for largest configurations
We welcome contributions! Please see our Contributing Guidelines.
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

# Run tests before committing
python -m pytest
```

Areas where contributions are especially welcome:

- New evaluation scenarios and metrics
- Performance optimizations for large-scale simulations
- Additional scheduling algorithms for comparison
- Visualization improvements and interactive dashboards
- Documentation and tutorial improvements
If you use this code in your research, please cite:
```bibtex
@article{distributed_multiagent_scheduling_2024,
  title={Fault-Tolerant Distributed Multi-Agent Scheduling for High-Performance Computing: A Resilience-Centric Approach},
  author={Prachi Jadhav and Fred Sutter and Ewa Deelman and Prasanna Balaprakash},
  journal={TBD},
  year={2025}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
- Research conducted through the SWARM project, supported by Department of Energy Award #DE-SC0024387.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
⭐ Star this repository if you find it useful for your research!