Skip to content

enostr: add network resilience integration tests#1239

Draft
alltheseas wants to merge 3 commits intodamus-io:masterfrom
alltheseas:feature/network-resilience-tests
Draft

enostr: add network resilience integration tests#1239
alltheseas wants to merge 3 commits intodamus-io:masterfrom
alltheseas:feature/network-resilience-tests

Conversation

@alltheseas
Copy link
Copy Markdown
Contributor

@alltheseas alltheseas commented Jan 6, 2026

Summary

  • Add network resilience integration tests for relay connections under degraded network conditions (3G, packet loss, disconnects)
  • Add local test runner script with strfry relay and throttle network simulation
  • Add GitHub Actions workflow for automated network resilience testing

Uses sitespeedio/throttle for system-level network simulation (latency, bandwidth, packet loss) via pfctl (macOS) or tc (Linux).

Test Plan

  • cargo fmt --check passes
  • cargo clippy passes
  • cargo test --package enostr passes
  • Integration tests skip without TEST_RELAY_URL set (allows regular CI to pass)
  • Manual test with Docker strfry relay: all 10 tests pass
  • Manual test with 3G throttling: all tests pass with expected latency increase

Local Testing

# Start strfry relay
mkdir -p /tmp/strfry-test-db/strfry-db
docker run -d --name strfry-test -p 7777:7777 \
  -v /tmp/strfry-test-db/strfry-db:/app/strfry-db \
  dockurr/strfry

# Run tests
TEST_RELAY_URL="ws://127.0.0.1:7777" TEST_SCENARIO="baseline" \
  cargo test --package enostr --test network_resilience -- --nocapture

# Cleanup
docker stop strfry-test && docker rm strfry-test

🤖 Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 6, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

This pull request introduces a comprehensive network resilience testing infrastructure for the enostr relay client. It includes a GitHub Actions CI workflow (supporting Linux and macOS), Rust integration tests with configurable scenarios, a Bash orchestration script, and test configuration files. The system validates client behavior under various network conditions including throttling, packet loss, and disconnection scenarios.

Changes

Cohort / File(s) Summary
GitHub Actions Workflow
\.github/workflows/network-resilience\.yml
New CI workflow with two jobs (Linux and macOS). Linux job runs matrix of five network scenario tests (baseline, 3G, 3G slow, packet loss, disconnect) with strfry relay setup and kernel module loading. macOS job runs baseline and throttled unit tests with conditional strfry availability handling. Both include toolchain setup, caching, and environment variable configuration.
Test Implementation
crates/enostr/tests/network_resilience\.rs
Comprehensive test module introducing three public structs: ScenarioConfig (encapsulates per-scenario timing and tolerances), TestContext (aggregates relay pool and configuration), and EventStats (tracks connection/message/error metrics). Includes six integration tests validating relay connections, reconnection logic, subscriptions under load, latency, and sustained connections. Adds helper functions for event-driven statistics collection and per-scenario threshold validation.
Test Orchestration
scripts/network-resilience-test\.sh
Bash script orchestrating full test workflow: validates prerequisites, starts strfry relay, applies network throttling profiles or simulates disconnects, runs Rust tests with scenario-specific configuration, and handles cleanup. Supports scenarios: baseline, 3G, 3G slow, packet loss, disconnect, and all. Provides colored logging and exit codes indicating success/failure/prerequisites-missing.
Dependencies & Configuration
crates/enostr/Cargo\.toml, test/strfry-test\.conf
Adds profiling dev-dependency to Cargo.toml. Introduces strfry test relay configuration specifying database path, bind address (0.0.0.0:7777), relay metadata, and compression settings.

Sequence Diagram

sequenceDiagram
    actor User
    participant CI as GitHub Actions
    participant Script as Bash Script
    participant Relay as strfry Relay
    participant Network as Network Layer
    participant Tests as Cargo Tests
    participant Stats as EventStats

    User->>CI: Trigger workflow (schedule/dispatch)
    CI->>Script: Run network-resilience-test.sh SCENARIO
    Script->>Script: Check prerequisites
    Script->>Relay: Start strfry relay process
    Relay-->>Script: Listening (port 7777)
    
    alt Based on Scenario
        Script->>Network: Apply throttle/packet loss config<br/>(baseline/3G/3Gslow/packetloss)
        Note over Network: Network conditions active
    else Disconnect Scenario
        Script->>Network: Setup intermittent disconnect<br/>simulation (background process)
        Note over Network: Simulating disconnects
    end
    
    Script->>Tests: cargo test --test network_resilience
    Tests->>Tests: Initialize TestContext<br/>(load STRFRY_PATH, TEST_SCENARIO)
    Tests->>Relay: Establish connection to relay
    Relay-->>Tests: Connected
    
    loop Event Processing
        Tests->>Relay: Send subscription requests
        Relay-->>Tests: Relay events, EOSE
        Tests->>Stats: Collect connection/error/message<br/>metrics
    end
    
    Tests->>Tests: Validate against ScenarioConfig<br/>thresholds (timeouts, error rates)
    Tests-->>Script: Test result (pass/fail)
    
    Script->>Network: Remove throttle config
    Script->>Relay: Stop relay process
    Script-->>CI: Exit code (0/1/2)
    CI-->>User: Workflow result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A warren of tests through the network we hop,\br
With throttles and relays, we just never stop!$br
From Linux to macOS, we resilient remain,$br
Through packet loss, disconnect—no network pain! 🌐

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main change: adding network resilience integration tests to the enostr crate, which aligns with the primary focus of the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Comment @coderabbitai help to get the list of available commands and usage tips.

alltheseas and others added 2 commits January 6, 2026 02:54
Add integration tests for relay connections under degraded network
conditions. Tests cover connection establishment, subscription handling,
reconnection logic, and sustained connections with configurable
timeouts for different network scenarios (baseline, 3G, packet loss).

Integration tests are skipped when TEST_RELAY_URL is not set, allowing
regular cargo test to pass without a running relay.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: alltheseas <[email protected]>
Add shell script to orchestrate network resilience testing locally
using strfry relay and throttle for network simulation. Supports
baseline, 3G, slow 3G, packet loss, and disconnect scenarios.

Includes strfry Docker configuration for test relay.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: alltheseas <[email protected]>
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI Agents
In @.github/workflows/network-resilience.yml:
- Around line 201-213: The macOS throttled test step suppresses failures by
appending "|| true" to the cargo test command (cargo test --package enostr --
--test-threads=1 || true); either remove "|| true" so test failures fail the
workflow, or if suppression is intentional add a concise comment above that
command explaining why macOS test failures are allowed (e.g., flaky or
non-blocking) and reference the exact command "cargo test --package enostr --
--test-threads=1" to locate the line to change.
🧹 Nitpick comments (4)
.github/workflows/network-resilience.yml (2)

85-91: Quote command substitution to prevent word splitting.

Per static analysis hint (SC2046), the $(nproc) substitution should be quoted.

🔎 Proposed fix
       - name: Build strfry
         if: steps.strfry-cache.outputs.cache-hit != 'true'
         working-directory: strfry
         run: |
           git submodule update --init
           make setup-golpe
-          make -j$(nproc)
+          make -j"$(nproc)"

182-189: Placeholder logic for macOS strfry installation.

This step never attempts to install strfry and always sets strfry_available=false. If this is intentional (strfry not available via Homebrew), consider simplifying to remove the misleading step name. Also, quote the output path per SC2086:

🔎 Proposed fix
-      - name: Install strfry via Homebrew (if available) or skip
+      - name: Check strfry availability (not available on macOS)
         id: strfry-install
-        continue-on-error: true
         run: |
           # strfry may not be available on Homebrew for macOS
           # Fall back to a simple WebSocket echo server for basic testing
           echo "strfry not easily available on macOS, using simplified tests"
-          echo "strfry_available=false" >> $GITHUB_OUTPUT
+          echo "strfry_available=false" >> "$GITHUB_OUTPUT"
scripts/network-resilience-test.sh (1)

183-193: Consider nc availability on different systems.

The port check uses nc -z which may not be available on all systems (e.g., some minimal Docker images). Consider adding a fallback or checking for nc in prerequisites:

🔎 Alternative using bash built-in
     # Wait for strfry to be ready (early return on timeout)
     local max_wait=30
     local waited=0
-    while ! nc -z 127.0.0.1 "$STRFRY_PORT" 2>/dev/null; do
+    while ! (echo > /dev/tcp/127.0.0.1/"$STRFRY_PORT") 2>/dev/null; do
         sleep 1
         waited=$((waited + 1))
         if [ $waited -ge $max_wait ]; then
             log_error "strfry failed to start within ${max_wait}s"
             exit 1
         fi
     done

Note: The /dev/tcp approach is bash-specific but removes the nc dependency.

crates/enostr/tests/network_resilience.rs (1)

483-499: Latency measurement includes pool creation.

The start instant is captured before TestContext::new(), so the measured latency includes pool creation time, not just the connection handshake. If the intent is to measure only connection latency, consider moving the start instant after context creation:

🔎 To measure connection-only latency
     println!("[{}] Measuring connection latency", scenario);

-    let start = Instant::now();
-
     let mut ctx = match TestContext::new() {
         Ok(ctx) => ctx,
         Err(e) => {
             panic!("Failed to create test context: {}", e);
         }
     };

+    let start = Instant::now();
     let connected = wait_for_connection(&mut ctx);
     let latency = start.elapsed();
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ffe7d5c and 7c0c8a7.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • .github/workflows/network-resilience.yml
  • crates/enostr/Cargo.toml
  • crates/enostr/tests/network_resilience.rs
  • scripts/network-resilience-test.sh
  • test/strfry-test.conf
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-05T20:25:35.921Z
Learnt from: CR
Repo: damus-io/notedeck PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-05T20:25:35.921Z
Learning: Applies to crates/notedeck*/src/**/*.rs : Tests should live alongside modules (e.g., in test submodules), often using `#[tokio::test]` when async behavior is involved

Applied to files:

  • crates/enostr/tests/network_resilience.rs
🧬 Code graph analysis (1)
crates/enostr/tests/network_resilience.rs (1)
crates/enostr/src/relay/pool.rs (1)
  • status (71-76)
🪛 actionlint (1.7.9)
.github/workflows/network-resilience.yml

88-88: shellcheck reported issue in this script: SC2046:warning:3:8: Quote this to prevent word splitting

(shellcheck)


185-185: shellcheck reported issue in this script: SC2086:info:4:34: Double quote to prevent globbing and word splitting

(shellcheck)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Test (macOS) / run
  • GitHub Check: Test (Linux) / run
  • GitHub Check: Check (android)
  • GitHub Check: Test (Windows) / run
  • GitHub Check: Rustfmt + Clippy
  • GitHub Check: Network Resilience (Linux)
🔇 Additional comments (15)
test/strfry-test.conf (1)

1-19: Configuration looks appropriate for containerized testing.

This config binds to 0.0.0.0 (suitable for Docker) while the shell script dynamically generates a config with 127.0.0.1 for local runs. The separation is sensible for different execution contexts.

Consider adding a brief comment at the top of the file indicating this is specifically for Docker/container usage to distinguish it from the dynamically generated config in the script.

crates/enostr/Cargo.toml (1)

23-26: LGTM!

The profiling dev-dependency is correctly added and aligns with its usage in the network resilience tests (e.g., #[profiling::function] attributes on wait_for_connection and process_events_for_duration).

.github/workflows/network-resilience.yml (1)

1-32: Well-structured workflow with appropriate triggers.

The workflow design is solid: path-filtered PR triggers, nightly schedule, and manual dispatch with scenario selection. The conditional step execution based on scenario input is correctly implemented.

scripts/network-resilience-test.sh (5)

1-67: Excellent documentation and configuration.

The header documentation is comprehensive with clear usage instructions, environment variables, and exit codes. The configuration section provides sensible defaults with environment variable overrides.


88-111: Robust cleanup implementation.

The cleanup function properly handles edge cases with || true for error suppression, checks if the process exists before killing, and is correctly registered as an EXIT trap.


117-149: Thorough prerequisites validation.

Good use of early returns with clear error messages and remediation instructions. The sudo check gracefully warns without failing.


306-337: Effective disconnect simulation approach.

Using high RTT (5000ms) to simulate disconnects is a reasonable approach that's more portable than actual interface manipulation. The background process lifecycle is properly managed with cleanup.


343-399: Clean entry point with proper failure aggregation.

The main function correctly routes scenarios and aggregates failures without short-circuiting in "all" mode. Exit codes align with the documented values.

crates/enostr/tests/network_resilience.rs (7)

42-116: Well-designed scenario configuration with progressive thresholds.

The timeout and error rate thresholds appropriately scale with network degradation severity. The fallback for unknown scenarios provides sensible defaults.


118-167: Clean test context encapsulation.

The TestContext properly encapsulates all test state and handles initialization with appropriate defaults and error propagation.


196-238: Robust connection waiting with dual detection.

The function correctly uses both event-based (WsEvent::Opened) and status-based (RelayStatus::Connected) detection for connection establishment, ensuring reliability across different timing scenarios.


284-312: Clean event statistics tracking.

The error_rate() calculation correctly handles the zero-division case and provides a meaningful metric.


354-410: Good reconnection test with conditional behavior.

The test appropriately differentiates between disconnect and non-disconnect scenarios. The assertion logic correctly verifies that reconnections occur after disconnects.


267-274: Pragmatic EOSE detection for testing.

Using text.contains("EOSE") is a simple heuristic that works for test purposes. For production code, consider parsing the JSON properly, but for these integration tests this approach is acceptable.


663-705: Good unit test coverage.

The unit tests properly verify scenario configuration and error rate calculations, including edge cases like empty stats and unknown scenarios.

Comment on lines +201 to +213
- name: Run throttled unit tests (macOS)
env:
RUST_LOG: info
run: |
echo "=== macOS Throttled Tests ==="
# Apply 3G throttling
sudo throttle 3g --localhost || true

# Run unit tests
cargo test --package enostr -- --test-threads=1 || true

# Stop throttling
sudo throttle --stop --localhost || true
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Test failures are suppressed by || true.

The cargo test command on line 210 uses || true, which means test failures won't fail the workflow. If this is intentional for macOS (non-critical path), consider adding a comment explaining why. Otherwise, remove || true from the test command to surface failures:

🔎 Proposed fix if failures should be reported
       - name: Run throttled unit tests (macOS)
         env:
           RUST_LOG: info
         run: |
           echo "=== macOS Throttled Tests ==="
           # Apply 3G throttling
           sudo throttle 3g --localhost || true

           # Run unit tests
-          cargo test --package enostr -- --test-threads=1 || true
+          cargo test --package enostr -- --test-threads=1

           # Stop throttling
           sudo throttle --stop --localhost || true
🤖 Prompt for AI Agents
In @.github/workflows/network-resilience.yml around lines 201 - 213, The macOS
throttled test step suppresses failures by appending "|| true" to the cargo test
command (cargo test --package enostr -- --test-threads=1 || true); either remove
"|| true" so test failures fail the workflow, or if suppression is intentional
add a concise comment above that command explaining why macOS test failures are
allowed (e.g., flaky or non-blocking) and reference the exact command "cargo
test --package enostr -- --test-threads=1" to locate the line to change.

Add GitHub Actions workflow to run network resilience tests on PRs
touching relay code, nightly, and manual dispatch. Tests run on
Linux with strfry relay and throttle for network simulation.

Closes damus-io#1238

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: alltheseas <[email protected]>
@alltheseas alltheseas marked this pull request as draft January 6, 2026 08:56
@alltheseas
Copy link
Copy Markdown
Contributor Author

Need help figuring out how to set up strfry in CI

@alltheseas alltheseas force-pushed the feature/network-resilience-tests branch from 7c0c8a7 to 88455e3 Compare January 6, 2026 08:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant