feat(dht): implement K-replication for DHT store operations #3

mickvandijke wants to merge 5 commits into `main`
Conversation
Add K-way replication to ensure data redundancy across the DHT network:

- Add optional transport to DhtCoreEngine for network communication
- Modify store() to replicate data to K-1 remote nodes via DhtMessage::Replicate
- Activate ReplicationManager with schedule_repair, take_pending_repairs methods
- Add background repair task to maintenance loop (60s interval)
- Implement node failure handling that schedules repairs for affected keys
- Add DataIntegrityMonitor accessors: get_storage_nodes, remove_node_from_all, add_storage_node

Partial replication is accepted; the background repair task completes missing replicas. Content-addressed storage means identical keys guarantee identical data.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add O(1) has_key check to DhtCoreEngine and use it in the replication handler to skip storing data we already have. Since storage is content-addressed, identical keys guarantee identical data. This prevents wasted I/O when multiple nodes send replication requests for the same key to the same target node. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
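A minimal sketch of the dedup check this commit describes, with hypothetical types standing in for the crate's real ones: because storage is content-addressed, the replication handler can skip any key it already holds.

```rust
use std::collections::HashMap;

/// Illustrative engine with a content-addressed store (32-byte keys).
struct Engine {
    store: HashMap<[u8; 32], Vec<u8>>,
}

impl Engine {
    /// O(1) existence check backed by the hash map.
    fn has_key(&self, key: &[u8; 32]) -> bool {
        self.store.contains_key(key)
    }

    /// Handle an incoming replication request. Returns true only when the
    /// value was actually written; false means the write was skipped.
    fn handle_replicate(&mut self, key: [u8; 32], value: Vec<u8>) -> bool {
        if self.has_key(&key) {
            // Identical keys guarantee identical data, so this is a no-op.
            return false;
        }
        self.store.insert(key, value);
        true
    }
}
```

The second request for the same key performs no I/O, which is exactly the waste the commit eliminates when several nodes replicate the same key to one target.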
Change replication maintenance interval from 60 seconds to 3600 seconds (1 hour) to match the standard Kademlia tReplicate parameter. Also increase MAX_REPAIRS_PER_CYCLE from 10 to 100 since with hourly checks we can afford larger batches. References: - Kademlia spec: tReplicate = 3600s - libp2p/IPFS: 1 hour default replication interval Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
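The hourly cadence can be sketched as a simple due-check that the background maintenance task would evaluate on each tick; the constant matches the commit, while the function name is an assumption:

```rust
use std::time::{Duration, Instant};

/// Hourly replication check, per the Kademlia tReplicate parameter.
const REPLICATION_INTERVAL_SECS: u64 = 3600;

/// Returns true once a full replication interval has elapsed since the
/// last maintenance pass (sketch of the timer logic, not the crate's API).
fn repair_due(last_run: Instant, now: Instant) -> bool {
    now.duration_since(last_run) >= Duration::from_secs(REPLICATION_INTERVAL_SECS)
}
```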
Reduce from 100 to 20 to align with standard Kademlia k parameter used in libp2p/IPFS. With K=8 replication this means ~160 network messages max per hourly cycle, handling ~480 repairs/day. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
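The message budget quoted above works out as follows; the constants come from these commits, and the helper names are illustrative:

```rust
/// Per-cycle repair cap, aligned with the standard Kademlia k parameter.
const MAX_REPAIRS_PER_CYCLE: u64 = 20;
/// K-way replication factor used in the estimate above.
const REPLICATION_FACTOR_K: u64 = 8;
/// Hourly cycles per day.
const CYCLES_PER_DAY: u64 = 24;

/// Each repaired key may be pushed to up to K nodes.
fn max_messages_per_cycle() -> u64 {
    MAX_REPAIRS_PER_CYCLE * REPLICATION_FACTOR_K // 20 * 8 = 160
}

/// Throughput ceiling with one repair cycle per hour.
fn max_repairs_per_day() -> u64 {
    MAX_REPAIRS_PER_CYCLE * CYCLES_PER_DAY // 20 * 24 = 480
}
```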
Add store_local_only() method for Replicate handlers to store without triggering further replication. Original Store requests still replicate to K-1 nodes, but Replicate message receivers now persist locally only. This prevents the cascade where node A sends Replicate to B, then B would also try to replicate to C, D, etc. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
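The cascade fix can be sketched like this, with illustrative types rather than the crate's real API: an original Store fans out to K-1 peers, while a received Replicate is persisted locally only, so B never re-replicates what A sent it.

```rust
use std::collections::HashMap;

/// Illustrative subset of the DHT message enum.
enum DhtMessage {
    Store { key: u64, value: Vec<u8> },
    Replicate { key: u64, value: Vec<u8> },
}

struct Node {
    local: HashMap<u64, Vec<u8>>,
    replicates_sent: usize, // stand-in counter for actual network sends
}

impl Node {
    /// Persist without triggering further replication.
    fn store_local_only(&mut self, key: u64, value: Vec<u8>) {
        self.local.insert(key, value);
    }

    fn handle(&mut self, msg: DhtMessage) {
        match msg {
            DhtMessage::Store { key, value } => {
                self.store_local_only(key, value);
                // Fan out to K-1 = 7 peers (counted here, not actually sent).
                self.replicates_sent += 7;
            }
            // Replicate receivers do NOT fan out again, so no cascade.
            DhtMessage::Replicate { key, value } => self.store_local_only(key, value),
        }
    }
}
```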
@claude Please review this
@mickvandijke be careful here: this is not a simple DHT, it uses a multi-layer model for trust, LSTM, etc. So this needs a deeper dive into the layers, how they choose nodes, how they use geo-distribution, and so on.
Claude finished @dirvine's task in 3m 45s

## Review: K-Replication Integration with Multi-Layer Architecture
### Critical Issues Found

#### 🚨 MAJOR: Bypassing the Intelligent Placement System

The K-replication implementation selects replica targets directly, bypassing the intelligent placement system.

Current flow (WRONG):
Expected flow (from architecture):
Evidence:
#### 🔴 CRITICAL: No EigenTrust Integration

Current state in `src/dht/core_engine.rs`:
What should happen (from `src/placement/algorithms.rs:403-404`):

```rust
let trust_score = 0.8;     // Mock trust score     // ← Should be: trust_system.get_trust(&node_id)
let stability_score = 0.9; // Mock stability score // ← Should be: churn_predictor.predict(&node_id)
```

The placement algorithm has these mock values hardcoded, showing this integration was never completed.

#### 🟠 HIGH: Geographic Diversity Only at Add, Not Replication

`src/dht/core_engine.rs:1317-1337` - the geographic diversity enforcer exists and is used when adding nodes to the routing table. BUT `src/dht/core_engine.rs:940-1013` (the store operation) does NOT calculate geographic diversity factors when selecting replica targets.

Compare to the placement system:
This logic exists but is never called in the replication path!

#### 🟡 MEDIUM: Adaptive Network Layer Not Consulted

The review comment mentions a "multi-layer model for trust, LSTM, etc.", but:

- `src/adaptive/multi_armed_bandit.rs:82-106`
- `src/adaptive/routing.rs` defines strategy selection for:
BUT `src/dht/core_engine.rs:943` hardcodes Kademlia XOR distance. No MAB consultation, no strategy adaptation, no learning from past successes or failures.

### Architecture Violations

#### Layer Separation Broken

The current implementation conflates concerns:
#### Missing Integration Points
### Recommendations

#### Option A: Fix Integration (Recommended)

Refactor `store()` to consult the placement system. Add to `DhtCoreEngine`:

```rust
pub struct DhtCoreEngine {
    // ... existing fields ...
    placement_engine: Option<Arc<RwLock<PlacementEngine>>>,
    trust_system: Option<Arc<EigenTrustEngine>>,
    performance_monitor: Option<Arc<PerformanceMonitor>>,
}
```

Then in `store()`:

```rust
pub async fn store(&mut self, key: &DhtKey, value: Vec<u8>) -> Result<StoreReceipt> {
    // Get candidate nodes from routing table
    let routing = self.routing_table.read().await;
    let candidate_nodes = routing.find_closest_nodes(key, 20); // Get more candidates
    drop(routing);

    let selected_nodes = if let (Some(placement), Some(trust), Some(perf)) =
        (&self.placement_engine, &self.trust_system, &self.performance_monitor)
    {
        // Use trust-weighted placement
        let available: HashSet<NodeId> = candidate_nodes.iter()
            .map(|n| n.id.clone()).collect();
        let metadata = self.get_node_metadata(&candidate_nodes).await?;
        let mut engine = placement.write().await;
        let decision = engine.select_nodes(
            &available,
            DEFAULT_REPLICATION_FACTOR as u8,
            trust,     // ← EigenTrust scores
            perf,      // ← Stability/churn prediction
            &metadata, // ← Geographic diversity
        ).await?;
        decision.selected_nodes
    } else {
        // Fallback to basic Kademlia (existing logic)
        let load_balancer = self.load_balancer.read().await;
        load_balancer.select_least_loaded(&candidate_nodes, 8)
    };

    // Now replicate to selected_nodes (existing code from line 955+)
    // ...
}
```

Benefits:
#### Option B: Document the Tradeoff

If Option A is too complex for this PR, explicitly document that this is a simplified implementation and file an issue to integrate with the placement system later.

Add to the `src/dht/core_engine.rs:936` docstring:

```rust
/// Store data in the DHT with K-replication
///
/// # Current Limitations (FIXME: Issue #XXX)
///
/// This implementation uses basic Kademlia XOR distance and load balancing.
/// It does NOT integrate with:
/// - PlacementEngine's trust-weighted selection (EigenTrust scores)
/// - Geographic diversity enforcement beyond routing table limits
/// - Adaptive routing strategy selection (Multi-Armed Bandit)
/// - Churn prediction for stability scoring
///
/// Replica selection currently ignores:
/// - Node reputation (τ_i from EigenTrust)
/// - Performance/stability scores (p_i from PerformanceMonitor)
/// - Geographic diversity bonuses (d_i from DiversityEnforcer)
///
/// This means malicious or unreliable nodes may be selected for storage,
/// and geographic concentration attacks are not mitigated during replication.
```

Also required: create a tracking issue with a title like "Integrate DHT K-replication with trust-weighted placement system".

### Positive Aspects

✅ Good infrastructure:
✅ Security components present:
❌ But none of these are used during replica selection in `store()`!

### Test Coverage Gap

`tests/dht_replication_test.rs` tests:
Suggested additional tests:

```rust
#[tokio::test]
async fn test_store_respects_trust_scores() {
    // Verify high-trust nodes are preferred over low-trust nodes
}

#[tokio::test]
async fn test_store_enforces_geographic_diversity() {
    // Verify replicas spread across regions, not clustered
}

#[tokio::test]
async fn test_store_uses_placement_engine_when_available() {
    // Verify PlacementEngine is consulted if provided
}
```

### Summary

This PR implements solid DHT replication mechanics but bypasses the intelligent placement system that is core to saorsa's architecture. Before merging, the team must decide:
The current implementation is essentially "traditional Kademlia" when the architecture calls for "trust-weighted, geo-diverse, adaptive Kademlia".

### Security Implications

Without trust-weighted selection:
The Byzantine fault tolerance and witness systems (`src/dht/witness.rs`) are present but weakened if replica selection ignores trust scores.

Recommendation: Implement Option B immediately (document + issue), then prioritize Option A for the next sprint. This PR has good foundational work but needs the trust layer integrated before production use.
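The trust-weighted selection the review asks for can be sketched as a composite score over the three ignored factors (trust τ_i, stability p_i, diversity d_i). The weights and helper names below are assumptions for illustration, not values from the codebase:

```rust
/// Composite placement score; weights are illustrative assumptions.
fn placement_score(trust: f64, stability: f64, diversity: f64) -> f64 {
    const W_TRUST: f64 = 0.5; // τ_i from EigenTrust
    const W_STAB: f64 = 0.3;  // p_i from the performance monitor
    const W_DIV: f64 = 0.2;   // d_i from the diversity enforcer
    W_TRUST * trust + W_STAB * stability + W_DIV * diversity
}

/// Rank (node_id, score) candidates by score descending and keep the top K,
/// instead of picking purely by XOR distance.
fn select_top_k(mut scored: Vec<(u32, f64)>, k: usize) -> Vec<u32> {
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored.into_iter().take(k).map(|(id, _)| id).collect()
}
```

Under this kind of scoring, a low-trust node near the key in XOR space loses out to a slightly more distant but reputable one, which is the mitigation the Security Implications section calls for.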
Ah thanks for raising that, I'll make sure to consider these other layers when I implement replication on the node side. Closing this PR because the logic will be moved to the node. |
…ing)

Implements 11 hardening improvements from PR #21 review comments:

1. Request Leak/Cleanup:
   - Added RAII RequestCleanupGuard for automatic cleanup on drop/cancel
   - Atomic check+insert under a single write lock to prevent races
2. Input Validation:
   - Added TransportError::ValidationError for input validation errors
   - validate_protocol_name() now uses ValidationError instead of StreamError
3. Timeout Handling:
   - Added MIN_REQUEST_TIMEOUT (100ms) to prevent Duration::ZERO immediate timeout
   - send_request() clamps timeout to the [100ms, 5min] range
4. Response Routing:
   - Improved logging for failed pending.send() to clarify the timeout scenario
5. Documentation:
   - Added documentation to ReplicationResult clarifying remote-only counts
   - Fixed a brittle error assertion in tests to check the error variant
6. Testing:
   - New tests/request_response_e2e_test.rs with 7 comprehensive tests:
     * Successful request/response routing
     * Timeout cleanup behavior
     * Invalid protocol rejection (empty, /, \, \0)
     * Protocol validation in send_response()
     * Minimum timeout enforcement
     * Trust reporting on failure

Items already correct (verified):
- #3: Protocol validation in send_response() already present
- #6: Response-origin mismatch uses get() before remove()
- #7: Unmatched /rr/ responses already suppressed
- #9: Trust reporting on send_message() failure already implemented
- #10: PeerStoreOutcome docs correct (no latency mention)

Closes #23

Co-authored-by: David Irvine <dirvine@users.noreply.github.com>
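Items 1 and 3 of this commit can be sketched together: an RAII guard that removes the pending-request entry on drop, and a clamp that keeps requested timeouts inside [100ms, 5min]. Names mirror the commit message but the shapes are assumptions:

```rust
use std::collections::HashSet;
use std::time::Duration;

/// Item 3: clamp bounds so Duration::ZERO can't cause an immediate timeout.
const MIN_REQUEST_TIMEOUT: Duration = Duration::from_millis(100);
const MAX_REQUEST_TIMEOUT: Duration = Duration::from_secs(300); // 5 min

fn clamp_timeout(requested: Duration) -> Duration {
    requested.clamp(MIN_REQUEST_TIMEOUT, MAX_REQUEST_TIMEOUT)
}

/// Item 1: RAII guard that unregisters a pending request when dropped,
/// so cancellation or an early return cannot leak the entry.
struct RequestCleanupGuard<'a> {
    pending: &'a mut HashSet<u64>,
    id: u64,
}

impl Drop for RequestCleanupGuard<'_> {
    fn drop(&mut self) {
        self.pending.remove(&self.id);
    }
}
```

Dropping the guard, whether by scope exit, `?`, or task cancellation, always runs the cleanup, which is the point of using RAII instead of manual removal at each exit path.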
## Summary

## Changes
### K-Replication in store()

- Added a `transport` field to `DhtCoreEngine` for network communication
- `store()` now sends `DhtMessage::Replicate` to K-1 remote nodes
- `StoreReceipt.stored_at`

### ReplicationManager
- `HashSet<DhtKey>` for O(1) pending repair lookups
- `schedule_repair()`, `take_pending_repairs_batch()` methods
- `MAX_REPAIRS_PER_CYCLE = 20` (matches Kademlia k)

### Background Repair Task
- `start_maintenance_tasks()`
- `REPLICATION_INTERVAL_SECS = 3600` (1 hour, per the Kademlia tReplicate spec)
- `DataIntegrityMonitor` recommendations

### Node Failure Handling
- `handle_node_failure()` now finds affected keys and schedules repairs
- `DataIntegrityMonitor::remove_node_from_all()` helper

### Deduplication
- `DhtCoreEngine::has_key()` for O(1) existence check

### Constants (per Kademlia spec)
- `DEFAULT_REPLICATION_FACTOR`
- `REPLICATION_INTERVAL_SECS = 3600`
- `MAX_REPAIRS_PER_CYCLE = 20`

### Test plan
- `cargo test --lib` - 1318 tests pass
- `cargo test --test dht_replication_test` - 10 tests pass
- `cargo clippy -- -D warnings -D clippy::unwrap_used -D clippy::expect_used` - passes

### References
🤖 Generated with Claude Code