Status: Implementation Complete
Version: 1.1 (Public Release)
Purpose: Fully autonomous multi-target calibration of ELM-FATES using the Claude API + HPC + Adaptive Memory
# Copy the Kougarok example (recommended) or the minimal template
cp -r use_cases/Kougarok use_cases/YourSite
# OR
cp -r use_cases/TEMPLATE use_cases/YourSite

Edit your site configuration file with ALL site-specific settings:
vim use_cases/YourSite/config/yoursite_config.sh
# Key settings to modify:
# ========================
# SITE INFORMATION
export A2MC_SITE_NAME="YourSite"
export A2MC_SITE_LAT=45.0
export A2MC_SITE_LON=-120.0
# PFT CONFIGURATION
export A2MC_PFTS="1,2,3" # Your target PFTs
export A2MC_PFT_NAMES="PFT1,PFT2,PFT3"
# DOMAIN AND SURFACE DATA
export A2MC_DOMAIN_FILE="domain_yoursite.nc"
export A2MC_SURFACE_FILE="surfdata_yoursite.nc"
# PARAMETER CONFIGURATION
export A2MC_N_PARAMS=100 # Number of parameters
export A2MC_N_TRAJECTORIES=30 # For Morris method
export A2MC_PARAM_LIST_FILE="${A2MC_USE_CASE_DIR}/parameters/your_param_list.txt"
# VALIDATION
export A2MC_VALIDATION_FILE="${A2MC_USE_CASE_DIR}/validation/your_targets.txt"
# HPC PATHS (ensemble output, parameter files)
export A2MC_PARAM_DIR="/path/to/fates_param_files"
export A2MC_ENSEMBLE_OUTPUT="${A2MC_OUTPUT_ROOT}/YourEnsemble"

Create these files in your use case folder:
# Parameter list with bounds
vim use_cases/YourSite/parameters/your_param_list.txt
# SALib problem definition (optional, for sensitivity analysis)
vim use_cases/YourSite/parameters/salib_problem.txt
# Validation targets
vim use_cases/YourSite/validation/your_targets.txt

Only edit a2mc_config.sh if you need to change HPC-level settings:
vim a2mc_config.sh
# Settings that might need changing:
export A2MC_PROJECT="your_project" # HPC allocation
export A2MC_E3SM_ROOT="/path/to/E3SM" # E3SM source code
export A2MC_OUTPUT_ROOT="/path/to/output" # Simulation output root

Set your AI API key (required for AI-driven phases 2, 3, 4, 6):
# Required: Set your API key
export AI_API_KEY="sk-ant-api03-..."
# Optional: Change AI model (default: claude-sonnet-4-20250514)
export A2MC_AI_MODEL="claude-sonnet-4-20250514" # Balanced (default)
export A2MC_AI_MODEL="claude-opus-4-20250514" # Most capable
export A2MC_AI_MODEL="claude-haiku-3-20240307" # Fastest/cheapest
# Add to ~/.bashrc for persistence
echo 'export AI_API_KEY="your-key-here"' >> ~/.bashrc

# Source BOTH configuration files
source a2mc_config.sh
source use_cases/YourSite/config/yoursite_config.sh
print_config # Verify settings
# Run calibration
python orchestrator.py

Configuration hierarchy:
- a2mc_config.sh - Machine-level defaults (HPC paths, COMPSET, etc.)
- use_cases/{site}/config/{site}_config.sh - ALL site-specific settings
See "Installation & Setup" section below for detailed HPC setup instructions.
A2MC is an autonomous calibration framework that combines:
- Morris/Sobol sensitivity analysis for parameter space exploration
- Claude API reasoning for diagnosis and hypothesis generation
- HPC-native execution for efficient simulation management
- Multi-objective optimization for simultaneous PFT calibration
- Adaptive Memory System for learning from experiments and avoiding repeated failures
The framework runs entirely on NERSC HPC (no SSH tunneling) and uses the Anthropic Claude API for intelligent decision-making. The Adaptive Memory System enables the AI agent to persistently store and retrieve knowledge across sessions.
┌─────────────────────────────────────────────────────────────────────────────┐
│ A2MC FRAMEWORK │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ ORCHESTRATOR (orchestrator.py) │ │
│ │ │ │
│ │ 7-Phase State Machine with Iteration Paths: │ │
│ │ │ │
│ │ ┌─────────┐ ┌─────────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Phase 0 │───►│ Phase 1 │───►│ Phase 2 │───►│ Phase 3 │ │ │
│ │ │ DESIGN │ │ EXPLORATION │ │ SCREENING │ │ DIAGNOSIS │ │ │
│ │ └────▲────┘ └─────────────┘ └───────────┘ └─────┬─────┘ │ │
│ │ │ │ │ │
│ │ │ Redesign: ┌─────────────────┤ │ │
│ │ │ Expand params │ │ │ │
│ │ │ │ ┌─────▼─────┐ │ │
│ │ ┌────┴────┐ ┌───────────┐ ┌──────┴──────┐ │ Phase 4 │ │ │
│ │ │ Phase 7 │◄───│ Phase 5 │◄───│ Phase 6 │◄───│HYPOTHESIS │ │ │
│ │ │CONVERGED│ │ TESTING │ │ REFINEMENT │ └─────┬─────┘ │ │
│ │ └─────────┘ └───────────┘ └──────┬──────┘ │ │ │
│ │ │ │ │ │
│ │ Rethink: │ Skip test: │ │ │
│ │ Hypothesis │ Use existing │ │ │
│ │ proven wrong └────────┬────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Back to Phase 3 │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────┼─────────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────────────┐ ┌───────────────────┐ ┌──────────────────┐ │
│ │ REASONING │ │ INTEGRATION │ │ EXISTING TOOLS │ │
│ │ (reasoning.py) │ │ (integration.py) │ │ │ │
│ │ │ │ │ │ modify_fates_ │ │
│ │ • diagnose() │ │ • ParameterManager│◄──►│ parameters.py │ │
│ │ • hypothesize() │ │ • HPCExecutor │ │ │ │
│ │ • design_exp() │ │ • DataPipeline │◄──►│ extract_monthly_ │ │
│ │ • interpret() │ │ • ExperimentRunner│ │ variables.py │ │
│ │ │ │ │ │ │ │
│ │ Claude Sonnet │ │ Direct sbatch/ │ │ NetCDF handling │ │
│ │ 4.5 API │ │ squeue calls │ │ │ │
│ └──────────────────┘ └───────────────────┘ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
# Load Python module
module load python
# Create a virtual environment (one-time setup)
python -m venv ~/a2mc_env
source ~/a2mc_env/bin/activate
# Install anthropic and A2MC dependencies (including RAG)
pip install anthropic numpy pandas netCDF4 scipy SALib networkx chromadb sentence-transformers pyyaml
# Set your API key (add to ~/.bashrc for persistence)
export AI_API_KEY="your-api-key-here"
# Verify installation
python -c "import anthropic; print(anthropic.__version__)"Note: The virtual environment is auto-activated when you source a2mc_config.sh.
A2MC uses a 7-phase workflow with intelligent iteration paths to minimize HPC costs while maximizing learning.
| Phase | Name | Purpose | AI-Driven? | Scripts |
|---|---|---|---|---|
| 0 | DESIGN | Morris/Sobol sampling, create cases, submit to HPC | No | create_morris_ensemble.py |
| 1 | EXPLORATION | Extract Y matrix, run sensitivity analysis | Yes | extract_sensitivity_outputs.py, morris_sensitivity_analysis.py |
| 2 | SCREENING | Rank ensemble by validation targets | Yes | screen_ensemble.py |
| 3 | DIAGNOSIS | Root cause analysis, edge case detection | Yes | reasoning.py |
| 4 | HYPOTHESIS | Generate experiments OR test with existing data | Yes | reasoning.py |
| 5 | TESTING | Run designed experiments on HPC | No | submit_experiment.sh |
| 6 | REFINEMENT | Evaluate results, extract lessons, check equifinality | Yes | reasoning.py, memory/manager.py |
| 7 | CONVERGED | Final optimal configuration | - | - |
A2MC supports non-linear iteration to avoid unnecessary HPC computation:
Normal Flow:
Phase 0 → [HPC] → Phase 1 → Phase 2 → Phase 3 → Phase 4 → Phase 5 → [HPC] → Phase 6 → Phase 7
Iteration Paths:
Phase 4 → Phase 3: Skip testing when existing data can test hypothesis
Phase 6 → Phase 3: Rethink hypothesis when experiment results disprove it
Phase 6 → Phase 0: Redesign when parameter space needs expansion
Phase 4 → Phase 3 (Skip Testing): When a hypothesis can be tested using existing ensemble data (e.g., P mass balance analysis, comparing PFT responses), skip the HPC testing phase and return to diagnosis with new insights.
Phase 6 → Phase 3 (Rethink Hypothesis): When experiment results disprove the hypothesis, return to diagnosis to revise understanding and generate new hypotheses.
Phase 6 → Phase 0 (Redesign): When all parameter candidates are at bounds and calibration fails, expand parameter ranges and run a new ensemble.
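These transitions are implemented inside the orchestrator's state machine. The sketch below shows how the routing could be expressed; the `Phase` enum names and the `next_phase` helper are illustrative, not the actual orchestrator.py implementation:

```python
from enum import Enum

class Phase(Enum):
    DESIGN = 0
    EXPLORATION = 1
    SCREENING = 2
    DIAGNOSIS = 3
    HYPOTHESIS = 4
    TESTING = 5
    REFINEMENT = 6
    CONVERGED = 7

def next_phase(current: Phase, decision: str) -> Phase:
    """Illustrative routing for the iteration paths listed above."""
    if current is Phase.HYPOTHESIS and decision == "use_existing_data":
        return Phase.DIAGNOSIS       # skip HPC testing
    if current is Phase.REFINEMENT and decision == "hypothesis_rejected":
        return Phase.DIAGNOSIS       # rethink hypothesis
    if current is Phase.REFINEMENT and decision == "expand_parameter_space":
        return Phase.DESIGN          # redesign ensemble with wider bounds
    if current is Phase.REFINEMENT and decision == "all_targets_met":
        return Phase.CONVERGED
    return Phase(current.value + 1)  # normal forward flow
```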
Purpose: Create initial parameter sampling design and submit to HPC
# Morris method: n_trajectories × (n_params + 1) simulations
# Example: 30 trajectories × (162 params + 1) = 4890 simulations
python orchestrator.py --run --start-phase 0

Outputs:
- Morris ensemble matrix (X matrix): phases/phase0_design/FATES_*_Morris_*sets.txt
- Modified parameter files for each ensemble member
- HPC jobs submitted to queue
Purpose: Extract results and run Morris sensitivity analysis
Key operations:
- Extract Y matrix (model outputs) from completed simulations
- Run Morris sensitivity analysis using SALib
- Rank parameters by μ* (mean absolute effect) and σ (interaction effect)
- Generate sensitivity plots and CSV rankings
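The sampling (Phase 0) and analysis (Phase 1) steps above are thin wrappers around SALib. A minimal sketch with a hypothetical 3-parameter problem (the real problem definition has 162 parameters and lives under use_cases/{site}/parameters/):

```python
from SALib.sample.morris import sample
from SALib.analyze import morris

# Hypothetical problem; parameter names and bounds are placeholders
problem = {
    "num_vars": 3,
    "names": ["fates_alloc_storage_cushion", "param_b", "param_c"],
    "bounds": [[1.2, 4.0], [0.0, 1.0], [0.0, 1.0]],
}

# Phase 0: X matrix with n_trajectories * (n_params + 1) rows
X = sample(problem, N=30, num_levels=4)

# Phase 1: Y comes from the completed simulations; a dummy stand-in here
Y = X.sum(axis=1)

Si = morris.analyze(problem, X, Y, num_levels=4)
ranking = sorted(zip(problem["names"], Si["mu_star"], Si["sigma"]),
                 key=lambda r: r[1], reverse=True)
print(ranking)  # parameters ordered by mu* (mean absolute effect)
```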
Outputs:
- Y matrices: MorrisLeafbiomass_*.txt, MorrisFineroootbiomass_*.txt, MorrisAbgbiomass_*.txt
- Sensitivity rankings by PFT (top parameters with μ*, σ values)
- Sensitivity plots (PNG)
Command-line usage:
python orchestrator.py --run --start-phase 1 --start-iteration 2
# Or equivalently:
python orchestrator.py --run --start-phase phase1 --start-iteration 2
python orchestrator.py --run --start-phase exploration --start-iteration 2

Purpose: Rank ensemble members against validation targets
Analysis:
- Calculate cost metrics (RMSRE, NRMSE) across all targets
- Rank all simulations by multi-objective performance
- Identify which targets are met/failed for each case
- Detect edge cases (parameters at bounds)
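A minimal sketch of the cost metrics, using one common convention for RMSRE and NRMSE (the framework's own definitions live in tools/cost_functions.py and may normalize differently):

```python
import numpy as np

def rmsre(sim, obs):
    """Root-mean-square relative error."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return float(np.sqrt(np.mean(((sim - obs) / obs) ** 2)))

def nrmse(sim, obs):
    """RMSE normalized by the mean of the observations."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return float(np.sqrt(np.mean((sim - obs) ** 2)) / np.mean(obs))

def composite_cost(per_target_errors: dict) -> float:
    """Equal-weight multi-objective cost used for ranking (illustrative)."""
    return float(np.mean(list(per_target_errors.values())))
```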
Outputs:
- Ranked case list with composite cost
- Per-target error statistics
- Edge parameter analysis
Purpose: Root cause analysis of calibration failures
Claude API tasks:
- Analyze which targets are failing and why
- Identify mechanistic causes (e.g., P-limitation, allocation issues)
- Find cross-PFT parameter conflicts
- Compare best vs worst cases to identify key differences
- Generate parameter adjustment recommendations
Output: Diagnosis report with root causes, affected mechanisms, and priority rankings
Purpose: Generate testable hypotheses
Claude API tasks:
- Create named hypotheses (e.g., "PFT10 P-starvation hypothesis")
- Specify parameters to modify and expected direction
- Define expected outcomes and success criteria
- Choose approach:
- Run experiments: Submit new simulations to test hypothesis
- Use existing data: Test hypothesis with existing ensemble (e.g., mass balance analysis)
Output: Hypothesis with modification plan or analysis plan
Purpose: Run designed experiments on HPC
Key operations:
- Create modified parameter files based on hypothesis
- Submit experiment simulations to HPC
- Extract and evaluate results
- Compare actual outcomes to expected outcomes
Purpose: Evaluate results and extract lessons
Decision logic:
- If hypothesis confirmed → apply changes, check if more targets remain
- If partially confirmed → adjust hypothesis, return to Phase 4
- If rejected → record failed approach, return to Phase 3
- If all targets met → advance to CONVERGED
- If parameter bounds too restrictive → return to Phase 0 (redesign)
Adaptive Memory Learning:
- Extract lessons from experiment outcomes
- Store successful discoveries in gained_knowledge/discoveries.json
- Record failed approaches in gained_knowledge/failed_approaches.json
- Update parameter knowledge for future reasoning
- Check for equifinality (multiple parameter sets achieving same targets)
Purpose: Finalize calibration
Outputs:
- Best parameter configuration
- Final calibration report
- Complete experiment history
- Extracted knowledge for future calibrations
Validation targets are site-specific and defined in use_cases/{site}/README.md.
Typical target types:
- Biomass: Leaf, fine root, AGB by PFT (g C/m²)
- Ecosystem fluxes: GPP, NPP, NEE (g C/m²/yr)
- Structure: LAI, canopy height
- Phenology: Leaf-on/off dates
See example: use_cases/Kougarok/README.md for a complete target specification.
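For orientation, the in-memory form of the targets (as consumed later by reasoning.diagnose()) looks like the dictionary below; the values and most key names are placeholders, not a real site specification:

```python
# Placeholder targets -- real values live in use_cases/{site}/validation/
validation_targets = {
    "leaf_pft10":     {"mean": 82.7,  "uncertainty": 0.20},   # leaf biomass, g C/m^2
    "fineroot_pft10": {"mean": 120.0, "uncertainty": 0.25},   # fine-root biomass, g C/m^2
    "agb_pft10":      {"mean": 250.0, "uncertainty": 0.20},   # aboveground biomass, g C/m^2
}
```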
Main workflow controller with state persistence.
from orchestrator import CalibrationOrchestrator, Phase
# Initialize
orchestrator = CalibrationOrchestrator(
work_dir="/path/to/work",
param_file="/path/to/fates_params.nc",
output_root="/path/to/simulations"
)
# Run from current phase
orchestrator.run()
# Or run specific phase
orchestrator.run_phase(Phase.DIAGNOSIS)
# Resume from saved state
orchestrator = CalibrationOrchestrator.load_state("/path/to/state.json")
orchestrator.run()

Key Classes:
- Phase - Enum of 8 workflow phases
- ValidationTargets - Dataclass with all target values
- WorkflowState - Persistent state with full history
- CalibrationOrchestrator - Main controller
Claude API interface for intelligent reasoning.
from reasoning import ReasoningModule, Diagnosis, Hypothesis
# Initialize (requires the AI_API_KEY env var; model defaults to A2MC_AI_MODEL from config)
reasoning = ReasoningModule() # Uses config defaults
# Diagnose calibration failure
diagnosis = reasoning.diagnose(
results={"leaf_pft10": 45.2, ...},
targets={"leaf_pft10": {"mean": 82.7, "uncertainty": 0.20}, ...},
morris_rankings={"leaf_pft10": [{"param": "...", "mu_star": 0.45}]},
iteration=1
)
# Generate hypothesis
hypothesis = reasoning.generate_hypothesis(
diagnosis=diagnosis,
morris_data={...},
previous_experiments=[]
)
# Design experiment
experiments = reasoning.design_experiments(
hypothesis=hypothesis,
base_case={"case_id": 2678, "parameters": {...}}
)
# Interpret results
interpretation = reasoning.interpret_results(
experiment=experiments[0],
actual_results={...},
targets={...}
)

Output Structures:
- Diagnosis - Failing targets, causes, recommendations
- Hypothesis - Name, mechanism, parameter modifications
- Experiment - Base case, modifications, expected results
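A sketch of what these structures carry; the field names here are illustrative, and the authoritative dataclasses are defined in reasoning.py:

```python
from dataclasses import dataclass

@dataclass
class Diagnosis:
    failing_targets: list[str]
    root_causes: list[str]            # e.g. "PFT10 P-limitation"
    recommendations: list[dict]       # suggested parameter adjustments

@dataclass
class Hypothesis:
    name: str                         # e.g. "PFT10 P-starvation hypothesis"
    mechanism: str
    modifications: list[dict]         # [{"parameter": ..., "pft": ..., "value": ...}]
    expected_outcome: str

@dataclass
class Experiment:
    base_case: int
    modifications: list[dict]
    expected_results: dict
```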
HPC-native interfaces for simulation management.
from integration import (
HPCConfig, ParameterManager, HPCExecutor,
DataPipeline, ExperimentRunner
)
# Configure for NERSC
config = HPCConfig(
scratch_root="/pscratch/sd/j/jingtao",
cfs_root="/global/cfs/cdirs/m2467/jingtao",
project="m2467",
qos="regular"
)
# Modify parameters
param_mgr = ParameterManager(config)
new_param_file = param_mgr.create_modified_file(
base_file="fates_params.nc",
modifications=[
{"parameter": "fates_alloc_storage_cushion", "pft": 10, "value": 3.0}
],
output_file="fates_params_modified.nc"
)
# Submit jobs
executor = HPCExecutor(config)
job_id = executor.submit_case(case_name="PtCNPEn100_TRANS")
# Wait for completion
results = executor.wait_for_jobs([job_id], poll_interval=300)
# Extract data
pipeline = DataPipeline(config)
data = pipeline.extract_case_data(case_name="PtCNPEn100_TRANS")
evaluation = pipeline.evaluate_against_targets(data)

Key Classes:
- HPCConfig - NERSC paths, project, QOS settings
- ParameterManager - Wraps modify_fates_parameters.py
- HPCExecutor - Direct sbatch/squeue execution
- DataPipeline - Wraps extract_monthly_variables_FATES.py
- ExperimentRunner - High-level experiment coordinator
A2MC uses a three-tier architecture for FATES knowledge, ensuring the AI has access via multiple retrieval paths:
| Tier | Location | Format | Purpose |
|---|---|---|---|
| Static Documentation | docs/fates-knowledge-base/ | Markdown | Human reference, RAG indexing |
| RAG/GraphRAG | rag/ | ChromaDB + JSON graph | AI semantic search, graph traversal |
| Adaptive Memory | memory/gained_knowledge/ | JSON | AI reasoning context, learned discoveries |
Key resources for CNP calibration:
- START HERE: docs/fates-knowledge-base/fates-codebase-wiki/advanced/cnp_calibration_guide.md (Knox 2026)
- PID controller: docs/fates-knowledge-base/fates-codebase-wiki/plant-physiology/parteh/cnp_allocation.md
- ECA/RD competition: docs/fates-knowledge-base/fates-codebase-wiki/advanced/nutrient_competition.md
- Nutrient uptake: docs/fates-knowledge-base/fates-codebase-wiki/plant-physiology/parteh/soil_plant_interface.md
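Programmatic access goes through rag/hybrid_retriever.py; below is a lower-level sketch that queries the ChromaDB index directly. The collection name and embedding model are assumptions — see rag/vector_store.py for the actual setup:

```python
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path="rag/chroma_db")
collection = client.get_collection("fates_docs")        # collection name is an assumption

model = SentenceTransformer("all-MiniLM-L6-v2")          # embedding model is an assumption
query = "How does the PID controller regulate nutrient storage in FATES?"
embedding = model.encode(query).tolist()

hits = collection.query(query_embeddings=[embedding], n_results=5)
for doc in hits["documents"][0]:
    print(doc[:120], "...")
```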
Two-tier knowledge architecture enabling learning across sessions while keeping site-specific knowledge separate.
┌─────────────────────────────────────────────────────────────────┐
│ A2MC Knowledge System │
├─────────────────────────────────────────────────────────────────┤
│ GENERIC KNOWLEDGE (memory/gained_knowledge/) │
│ ───────────────────────────────────────────── │
│ • General FATES mechanistic insights │
│ • Applies to all sites │
│ │
│ SITE-SPECIFIC KNOWLEDGE (use_cases/{site}/memory/) │
│ ───────────────────────────────────────────── │
│ • Site-specific discoveries and experiments │
│ • Phase execution logs with AI reasoning │
│ • Lessons learned from site calibration │
│ │
│ KNOWLEDGE PROMOTION │
│ ───────────────────────────────────────────── │
│ • AI evaluates site-specific discoveries │
│ • Generalizable lessons promoted to generic knowledge │
└─────────────────────────────────────────────────────────────────┘
Generic Knowledge (memory/gained_knowledge/):
| Store | Purpose |
|---|---|
| discoveries.json | General FATES mechanistic insights |
| experiments.json | Generic experiment patterns |
| parameters.json | Parameter knowledge (not site-specific) |
| failed_approaches.json | Generic approaches to NOT repeat |
Site-Specific Knowledge (use_cases/{site}/memory/gained_knowledge/):
| Store | Purpose |
|---|---|
| discoveries.json | Site-specific insights (e.g., "Kougarok Allocation Paradox") |
| experiments.json | Site experiments with outcomes |
| failed_approaches.json | Site-specific approaches to NOT repeat |
Phase Execution Logs (use_cases/{site}/memory/logs/):
| Directory | Purpose |
|---|---|
| phase2_screening/ | Screening analysis logs (Markdown) |
| phase3_diagnosis/ | Root cause analysis with AI reasoning |
| phase4_hypothesis/ | Hypothesis generation logs |
| phase6_refinement/ | Lessons learned and knowledge extraction |
from memory import MemoryManager
# Generic knowledge
memory = MemoryManager("memory/gained_knowledge")
# Site-specific knowledge
memory = MemoryManager("use_cases/Kougarok/memory/gained_knowledge")
# Query methods
context = memory.get_relevant_context(targets, parameters, phase)
failed = memory.get_failed_experiments(parameters)
knowledge = memory.get_parameter_knowledge("fates_alloc_storage_cushion")
stats = memory.stats()
# Update methods
memory.record_experiment(experiment_id, base_case, modifications, results, outcome)
memory.add_discovery(name, description, mechanism, affects, confidence)
memory.add_failed_approach(approach, experiment_id, why_failed, severity, alternatives)
memory.update_parameter_knowledge(param_name, knowledge)

The ReasoningModule automatically queries memory during:
- Diagnosis: Retrieves relevant discoveries and parameter knowledge
- Hypothesis Generation: Checks failed approaches to avoid repetition
- Refinement: Extracts lessons and updates memory with new discoveries
When A2MC performs diagnosis or generates hypotheses, three knowledge sources are combined into the Claude API prompt:
| Source | Content | Role |
|---|---|---|
| RAG/GraphRAG | FATES + ELM documentation (3,914 chunks) | General knowledge - "how does the PID controller work?" |
| Adaptive Memory | Discoveries, failed approaches, parameter insights | Learned knowledge - "what failed before? what worked?" |
| Task Data | Results, targets, sensitivity rankings | Current context - "what are we trying to calibrate?" |
Prompt Structure (in order):
┌─────────────────────────────────────────────────────────────┐
│ ## FATES Knowledge Base Context (RAG/GraphRAG) │
│ [Vector search results from docs + Graph traversal] │
│ │
│ ## Adaptive Memory Context │
│ [Relevant discoveries, FAILED APPROACHES - DO NOT REPEAT] │
│ │
│ ## Current Data │
│ [Simulation results, validation targets, sensitivity] │
│ │
│ ## Task Instructions + Response Format │
└─────────────────────────────────────────────────────────────┘
Key Safeguard: The system explicitly marks failed approaches with "DO NOT REPEAT" and instructs Claude to avoid proposing them unless there's strong justification.
No strict priority - the sources serve complementary roles:
- RAG provides the "textbook" knowledge (how FATES mechanisms work)
- Memory provides the "experience" (what we learned from previous iterations)
- Both inform the AI's reasoning about the current task data
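A sketch of how the three sources could be stitched into one prompt following the structure shown above; the actual assembly happens inside reasoning.py, and this function is illustrative only:

```python
def build_diagnosis_prompt(rag_context: str, memory_context: str,
                           results: dict, targets: dict) -> str:
    """Combine textbook knowledge, learned experience, and current task data."""
    return "\n\n".join([
        "## FATES Knowledge Base Context (RAG/GraphRAG)\n" + rag_context,
        "## Adaptive Memory Context\n"
        "FAILED APPROACHES - DO NOT REPEAT:\n" + memory_context,
        f"## Current Data\nResults: {results}\nTargets: {targets}",
        "## Task Instructions\n"
        "Diagnose why the failing targets are missed and respond in JSON.",
    ])
```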
When calibrating a new site, you can reference knowledge from existing sites with similar characteristics:
| Your Site Type | Reference Site | Transferable Knowledge |
|---|---|---|
| Arctic/tundra | use_cases/Kougarok/ | Allocation Paradox, P-limitation dynamics, graminoid-shrub competition |
| CNP-enabled | use_cases/Kougarok/ | PID controller behavior, ECA competition, vmax calibration strategies |
What transfers: Mechanistic insights, diagnostic patterns, failed approaches to avoid.
What doesn't transfer: Exact parameter values (these are site-specific).
# Reference another site's knowledge
from memory import MemoryManager
# Load Kougarok knowledge for reference
kougarok_memory = MemoryManager("use_cases/Kougarok/memory/gained_knowledge")
# Check discoveries relevant to your calibration
discoveries = kougarok_memory.discoveries.get('discoveries', [])
for d in discoveries:
    print(f"- {d['name']}: {d['description'][:80]}...")
# Check failed approaches to avoid
failed = kougarok_memory.failed_approaches.get('failed_approaches', [])
for f in failed:
    print(f"AVOID: {f['approach']}")

See also: use_cases/Kougarok/README.md → "Reference for Similar Sites" section for detailed applicability guidance.
To seed memory with curated knowledge:
# Create curated_knowledge.yaml from template
cp scripts/curated_knowledge_template.yaml scripts/curated_knowledge.yaml
# Edit with your discoveries...
# Run seeding script
python scripts/seed_memory_from_yaml.py --input scripts/curated_knowledge.yaml

Enable memory in the orchestrator:
orchestrator = CalibrationOrchestrator(
work_dir="/path/to/work",
param_file="/path/to/fates_params.nc",
output_root="/path/to/simulations",
use_memory=True, # Enable Adaptive Memory
auto_learn=True, # Automatically extract lessons
memory_dir="memory/data" # Memory storage location
)

# 1. Clone the repository
cd /global/homes/$USER
git clone https://github.com/jingtao-lbl/A2MC.git
cd A2MC
# 2. Set up Python environment (one-time setup)
module load python
python -m venv ~/a2mc_env
source ~/a2mc_env/bin/activate
pip install anthropic netCDF4 numpy scipy SALib pandas networkx chromadb sentence-transformers pyyaml
# 3. Set API key (add to ~/.bashrc for persistence)
export AI_API_KEY="sk-ant-..."
# 4. Verify setup
python -c "import anthropic; print('Anthropic OK')"
python -c "from orchestrator import CalibrationOrchestrator; print('Orchestrator OK')"Note: After initial setup, the virtual environment is auto-activated when you source a2mc_config.sh.
# Start new calibration (run in screen/tmux for long runs)
screen -S a2mc
cd /global/homes/j/jingtao/A2MC
python -c "
from orchestrator import CalibrationOrchestrator
orch = CalibrationOrchestrator(
work_dir='/pscratch/sd/j/jingtao/A2MC_calibration',
param_file='/path/to/base_params.nc',
output_root='/global/cfs/cdirs/m2467/jingtao/A2MC_runs'
)
orch.run()
"
# Resume from checkpoint
python -c "
from orchestrator import CalibrationOrchestrator
orch = CalibrationOrchestrator.load_state('/pscratch/sd/j/jingtao/A2MC_calibration/workflow_state.json')
orch.run()
"
# Monitor progress
tail -f /pscratch/sd/j/jingtao/A2MC_calibration/a2mc.log

Test parameters sequentially, adding one at a time:
Exp1: param_A only
Exp2: param_A + param_B
Exp3: param_A + param_B + param_C
Use when: Parameters act through sequential mechanisms (A → B → C)
Test all combinations of parameters:
Exp1: param_A=low, param_B=low
Exp2: param_A=low, param_B=high
Exp3: param_A=high, param_B=low
Exp4: param_A=high, param_B=high
Use when: Parameters may interact (synergistic or antagonistic effects)
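A minimal sketch of generating the factorial modification set with itertools; the parameter names and levels below are placeholders:

```python
from itertools import product

# Placeholder parameters and low/high levels for a 2x2 factorial design
levels = {
    "fates_alloc_storage_cushion": [1.5, 3.0],
    "param_b":                     [0.05, 0.15],
}

experiments = []
for i, combo in enumerate(product(*levels.values()), start=1):
    mods = [{"parameter": name, "pft": 10, "value": value}
            for name, value in zip(levels, combo)]
    experiments.append({"name": f"Exp{i}_factorial", "modifications": mods})
# -> 4 experiments covering every low/high combination of both parameters
```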
All workflow state is saved to JSON for resumability:
{
"phase": "DIAGNOSIS",
"iteration": 3,
"start_time": "2025-01-06T10:30:00",
"config": {
"work_dir": "/pscratch/sd/j/jingtao/A2MC",
"param_file": "fates_params.nc",
"output_root": "/global/cfs/cdirs/m2467/jingtao/A2MC_runs"
},
"design": {
"method": "morris", // or "lhs", "sobol", "custom"
"n_params": 162,
"n_trajectories": 30, // Morris: total = traj × (params+1)
"n_samples": 1000, // LHS/Sobol
"total_ensemble": 4890 // Auto-calculated from scheme
},
"screening": {
"top_cases": [2678, 845, 3930],
"best_composite_nrmse": 0.493
},
"experiments": [
{
"name": "Exp1_storage_cushion",
"base_case": 2678,
"modifications": [...],
"results": {...},
"interpretation": {...}
}
],
"phase_history": [
{"phase": "DESIGN", "completed": "2025-01-06T11:00:00"},
{"phase": "EXPLORATION", "completed": "2025-01-08T14:30:00"}
]
}

A2MC wraps existing well-tested tools rather than reimplementing:
modify_fates_parameters.py
├── create_modified_parameter_file(input, output, modifications)
├── Handles 1D/2D parameters
├── Supports absolute values or percent changes
└── Verifies modifications after applying
extract_monthly_variables_FATES.py
├── Extracts site-level, PFT-level, SZPF-level variables
├── Outputs NetCDF (all vars) + CSV (site/PFT only)
├── Processes yearly files (12 months each)
└── ~50-100× faster than daily extraction
Direct SLURM commands:
├── sbatch case.submit
├── squeue -u $USER
├── scancel job_id
└── sacct -j job_id --format=...
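The HPCExecutor issues these commands via subprocess rather than a SLURM client library. A simplified sketch of the submit-and-poll behavior (not the actual integration.py code):

```python
import subprocess
import time

def submit_case(case_dir: str) -> str:
    """Submit a case script and return the SLURM job id."""
    out = subprocess.run(["sbatch", "--parsable", "case.submit"],
                         cwd=case_dir, capture_output=True, text=True, check=True)
    return out.stdout.strip().split(";")[0]   # --parsable prints "jobid[;cluster]"

def wait_for_job(job_id: str, poll_interval: int = 300) -> None:
    """Poll squeue until the job is no longer queued or running."""
    while True:
        out = subprocess.run(["squeue", "-h", "-j", job_id],
                             capture_output=True, text=True)
        if not out.stdout.strip():
            return
        time.sleep(poll_interval)
```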
- Automatic retry with exponential backoff
- Maximum 3 retries per job
- Log failed jobs for manual inspection
- Rate limiting with automatic backoff
- Fallback to rule-based reasoning if API unavailable
- Cache repeated queries to reduce costs
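A sketch of the retry-with-backoff pattern around Claude API calls; this is illustrative rather than the code in reasoning.py, and the model name follows the default from the configuration section:

```python
import os
import time
import anthropic

client = anthropic.Anthropic(api_key=os.environ["AI_API_KEY"])

def call_with_backoff(prompt: str, model: str = "claude-sonnet-4-20250514",
                      max_retries: int = 3) -> str:
    """Retry transient API failures with exponential backoff."""
    for attempt in range(max_retries):
        try:
            msg = client.messages.create(
                model=model,
                max_tokens=2000,
                messages=[{"role": "user", "content": prompt}],
            )
            return msg.content[0].text
        except (anthropic.RateLimitError, anthropic.APIStatusError):
            time.sleep(10 * 2 ** attempt)   # 10 s, 20 s, 40 s
    raise RuntimeError("Claude API unavailable; falling back to rule-based reasoning")
```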
- Verify expected files before proceeding
- Clear error messages with suggested fixes
- Option to skip incomplete cases
- Diagnosis: ~2K tokens input, ~1K output
- Hypothesis: ~3K tokens input, ~1K output
- Experiment design: ~2K tokens input, ~500 output
- Interpretation: ~2K tokens input, ~1K output
Estimated cost per iteration: ~$0.10-0.20 (Sonnet)
- Morris ensemble (4890 sims): ~50K node-hours
- Single experiment: ~10 node-hours
- Data extraction: ~0.1 node-hours per case
A2MC/
├── README.md # This file
├── a2mc_config.sh # Machine-level configuration (HPC paths, defaults)
├── orchestrator.py # Main workflow controller
├── reasoning.py # Claude API interface
├── integration.py # HPC integration layer
│
├── use_cases/ # Site-specific case studies
│ ├── README.md # Overview and instructions
│ ├── TEMPLATE/ # Template for new sites
│ └── Kougarok/ # Kougarok, Alaska (NGEE-Arctic)
│ ├── README.md # Site description and discoveries
│ ├── config/
│ │ └── kougarok_config.sh # ALL site-specific settings
│ ├── parameters/
│ │ ├── FATES_Parameter_List_Full_162_Finalized.txt
│ │ └── salib_problem_162params.txt
│ ├── validation/
│ │ └── validation_targets_leafroot.txt
│ └── memory/ # SITE-SPECIFIC KNOWLEDGE
│ ├── logs/ # Phase execution logs (Markdown with AI reasoning)
│ │ ├── phase2_screening/
│ │ ├── phase3_diagnosis/
│ │ ├── phase4_hypothesis/
│ │ └── phase6_refinement/
│ ├── extracted/ # Extracted lessons (YAML)
│ └── gained_knowledge/ # Site-specific knowledge (JSON)
│ ├── discoveries.json
│ ├── experiments.json
│ └── failed_approaches.json
│
├── phases/ # Phase-specific scripts
│ ├── CLAUDE.md # Phase overview for AI assistants
│ ├── phase0_design/ # Morris sampling, case creation
│ ├── phase1_exploration/# Sensitivity analysis
│ ├── phase2_screening/ # Ensemble ranking
│ ├── phase3_diagnosis/ # Root cause analysis
│ ├── phase4_hypothesis/ # Hypothesis generation
│ ├── phase5_testing/ # Run experiments
│ └── phase6_refinement/ # Learn from results
│
├── tools/ # Shared utilities
│ ├── config.py # Python config loader (reads a2mc_config.sh)
│ ├── phase_logger.py # Site-specific Markdown logging
│ ├── workflow_status.py # Master workflow status
│ ├── cost_functions.py # Error metrics (RE, RMSE, NSE, KGE)
│ ├── optimize_function.py # Ensemble ranking
│ ├── fates_utils.py # FATES data utilities
│ ├── modify_fates_parameters.py
│ ├── diagnose_ensemble_status.py
│ └── extract_knowledge.py # Knowledge extraction from logs
│
├── memory/ # GENERIC KNOWLEDGE (framework-level)
│ ├── __init__.py # Package exports
│ ├── store.py # JSON persistence utilities
│ ├── manager.py # MemoryManager class
│ ├── gained_knowledge/ # Generic FATES knowledge (JSON)
│ │ ├── discoveries.json
│ │ ├── experiments.json
│ │ ├── parameters.json
│ │ └── failed_approaches.json
│ ├── logs/ # A2MC DEVELOPMENT session logs (Markdown)
│ ├── extracted/ # Generic extracted lessons (YAML)
│ └── workflow_log.json # Master workflow status
│
├── rag/ # RAG/GraphRAG System (FATES + ELM knowledge)
│ ├── loader.py # Document loading
│ ├── vector_store.py # ChromaDB wrapper (3,914 chunks)
│ ├── knowledge_graph.py # NetworkX graph (220 nodes, 562 edges)
│ ├── graph_builder.py # Build from YAML
│ ├── hybrid_retriever.py# Combined retrieval
│ ├── data/
│ │ └── curated_relationships.yaml # Knowledge source of truth
│ ├── chroma_db/ # Vector index
│ └── fates_knowledge_graph.json # Serialized graph
│
├── docs/ # Documentation
│ └── fates-knowledge-base/ # FATES documentation (official + wiki)
│
├── scripts/ # Utility scripts
│ ├── seed_memory_from_yaml.py
│ ├── build_rag_index.py
│ ├── migrate_fates_wiki.py
│ ├── curated_knowledge.yaml
│ └── curated_knowledge_template.yaml
│
└── plot/ # Visualization scripts
└── visualize_a2mc_horizontal.py
- Anthropic Claude API
- NERSC SLURM Documentation
- ELM-FATES Technical Reference
- SALib Morris Sensitivity
- v1.1 (2026-02-02) - Knowledge system enhancements
  - Knox 2026 CNP Guidebook integrated into three-tier knowledge system
  - Knowledge graph expanded: 220 nodes, 562 edges (added RD_Competition, 15+ output variables)
  - Cross-site knowledge reference documentation added
  - CNP calibration guide: vmax tuning, PID diagnostics, spinup workflow
- v1.0 (2026-01-24) - Initial public release
  - 7-phase calibration workflow with intelligent iteration paths
  - RAG/GraphRAG knowledge retrieval system for FATES
  - Adaptive Memory System for learning across sessions
  - Morris/Sobol sensitivity analysis via SALib
  - HPC-native execution on NERSC Perlmutter
  - Kougarok use case example included
Author: Jing Tao
Email: jingtao@lbl.gov
Project: NGEE-Arctic ELM-FATES calibration
GitHub: https://github.com/jingtao-lbl/A2MC-elm