The same architectural property serves both directions:
INFERENCE: network encoders → K-vectors (one round-trip) → LOCAL DECODER (autoregressive)
TRAINING: LOCAL ENCODER (backprop on user data) → weight deltas (one round-trip) → NETWORK AGGREGATION
Autoregressive decoding can't be distributed (N tokens = N round-trips at network latency). Backprop through encoder layers can't be distributed either (layer N depends on layer N-1 activations). JEPA makes both local:
- Inference: only the encoder pipeline runs on the network; the decoder is local
- Training: only the weight deltas cross the network; backprop is local
User activity produces three streams of data:
| Stream | Source | Format | Volume |
|---|---|---|---|
| Text | Agent conversations, tool calls, execution records | JSONL (role, content, timestamp) | High — every interaction |
| Visual | Screen capture, browser frames | PNG/JPEG → [B,3,H,W] float32 | Medium — opt-in, 2-5 fps |
| Audio | Voice input (STT), ambient (future) | int16 numpy @ 16kHz | Low — opt-in, push-to-talk |
These map to VL-JEPA's architecture, which already has:
- `VisionTransformerEncoder` — processes image patches (2D positional embeddings)
- `TextEncoder` — processes token sequences (1D positional embeddings)
- `CrossModalFusion` — cross-attention between visual and text
- `SemanticPredictor` — produces K-vectors from fused representation
- `TextDecoder` — local autoregressive generation from K-vectors
The question is whether these need separate training loops or one unified loop.
They do NOT need separate loops. VL-JEPA's self_supervised_loss() already handles
the case where both modalities are present. The missing piece is handling the case
where only one modality is available in a given batch:
Text-only batches (agent framework output):
- `images` → zero tensor (or learned "no-image" embedding)
- `token_ids` → conversation text, byte-tokenized
- Loss: predict masked text embeddings from context (text JEPA)
- The cross-modal fusion degrades gracefully: text cross-attends to zero-valued visual features, whose contribution collapses to (near) zero, so the residual path carries the text representation through essentially unchanged
Visual-only batches (screen capture, browser frames):
- `images` → captured frames, preprocessed [B,3,224,224]
- `token_ids` → zero/padding (or learned "no-text" embedding)
- Loss: predict masked visual patches from context (I-JEPA, already working)
Multimodal batches (user looking at screen while talking to agent):
- `images` → concurrent screen capture
- `token_ids` → concurrent conversation text
- Loss: predict masked patches AND masked tokens from cross-modal context
- This is the richest signal — model learns to relate visual context to actions
This is elegant because:
- The solver doesn't need different code paths per modality
- The proposer just specifies which data sources are included in a task
- Verification works the same way (cosine similarity of K-vectors)
- FedAvg works the same way (average weight deltas)
- The model naturally learns cross-modal associations when both are present
The self-supervised objective for text JEPA:
Input: [user: "check my calendar for tomorrow"] [assistant: "I'll use..."] [MASK] [MASK] [tool: calendar_read] [result: "3 meetings"]
Target: predict embeddings for masked tokens from visible context
This teaches the model:
- Action prediction: given user intent + context, predict what tool/action comes next
- State understanding: given partial conversation, reconstruct the full situation
- Behavioral patterns: correlations between user phrasing and agent responses
Unlike a standard language model (predict next token), JEPA predicts in embedding space. The model learns abstract representations of "what happens next" rather than specific token sequences. This is more robust to paraphrasing and generalizes better across users.
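The difference can be made concrete with a small sketch. The function below is a pure-Python stand-in for the smooth L1 (Huber) loss applied in embedding space; the embeddings and names are illustrative, not the actual VL-JEPA implementation:

```python
def smooth_l1(pred: list[float], target: list[float], beta: float = 1.0) -> float:
    """Mean smooth-L1 distance between a predicted and a target embedding."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # quadratic near zero, linear in the tails (robust to outliers)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

# Two paraphrases should map to nearby embeddings, so the embedding-space
# loss is small even though their token sequences differ completely.
emb_a = [0.90, 0.10, 0.00]  # hypothetical embedding of "check my calendar"
emb_b = [0.85, 0.12, 0.02]  # hypothetical embedding of "what's on my schedule"
loss = smooth_l1(emb_a, emb_b)
```

A token-level cross-entropy would score these two phrasings as entirely different targets; the embedding-space loss treats them as near-identical.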
Visual JEPA masks rectangular blocks of patches (spatial locality). Text needs an analogous strategy that respects conversational structure:
Option A — Span masking (like SpanBERT):
- Mask contiguous spans of 3-15 tokens
- Preserves local context, tests global understanding
- Simple, well-studied
Option B — Turn masking (conversation-aware):
- Mask entire conversation turns (user or assistant)
- Forces cross-turn prediction (given user question, predict assistant action)
- More aligned with behavioral learning
Option C — Role masking (structural):
- Mask all tool calls, or all results, or all assistant text
- Forces the model to predict actions from intent (or intent from actions)
- Most aligned with the training goal
Recommendation: Start with span masking (simpler, well-understood), add turn masking as a training curriculum once the pipeline is working.
Audio from voice interactions is already converted to text via STT (faster-whisper)
before it enters the agent framework. The transcript is tagged "🎤 [Voice Input]"
and injected as a regular conversation turn. So:
- Voice text already flows through the text pipeline — no separate audio training needed initially
- Raw audio (waveform) could eventually feed a future audio encoder, but this is Phase 2 at earliest — the text transcription captures the semantic content
Screen capture and browser frames are already preprocessed by FramePreprocessor
into [B,3,224,224] tensors normalized with ImageNet stats. The existing I-JEPA
training path in train_jepa_on_task() handles this — it just needs a data source
swap from CIFAR/FakeData to the actual capture buffer.
When both text and visual data are available simultaneously (user working with agent
while screen is captured), they should be paired into multimodal batches. The
CrossModalFusion module handles the rest.
Goal: Agent framework JSONL → VL-JEPA text encoder training batches
Create nodes/common/text_data.py:
- Read conversation JSONL from `~/.atn/conversations/`
- Read execution records from `~/.atn/agents/*/execution.jsonl`
- Byte-level tokenization (matching `VLJEPAConfig.vocab_size=260`)
- Sliding window chunking to `max_seq_length` (512 default, should be configurable)
- Produces `(token_ids: Tensor[B, S], attention_mask: Tensor[B, S])`
- Privacy: respect exclusion patterns from autonet.yaml privacy config
- Consent: only read data when user has opted into training
Interface:
class TextTrainingDataSource:
    def __init__(self, data_dir: Path, config: VLJEPAConfig, privacy_config):
        ...

    def __iter__(self) -> Iterator[dict[str, Tensor]]:
        # yields {"token_ids": [B, S], "attention_mask": [B, S]}
        ...

    def __len__(self) -> int:
        ...

Acceptance criteria:
- Reads real ATN conversation data
- Byte tokenization matches VLJEPAConfig vocab (4 special + 256 bytes)
- Handles empty/corrupt JSONL gracefully
- Respects privacy exclusion list
- Unit tests with synthetic JSONL
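A minimal sketch of the byte-level tokenization, assuming the stated vocab layout of 4 special tokens followed by the 256 raw byte values (vocab_size = 260); the specific special-token ids are an assumption for illustration:

```python
# Assumed layout: 4 special tokens, then byte b maps to token id b + 4.
PAD, BOS, EOS, MASK = 0, 1, 2, 3
BYTE_OFFSET = 4

def encode(text: str, max_len: int = 512) -> tuple[list[int], list[int]]:
    """Byte-tokenize text, truncate/pad to max_len, return (ids, attention_mask)."""
    ids = [BOS] + [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS]
    ids = ids[:max_len]
    attn = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [PAD] * (max_len - len(ids))
    return ids, attn

def decode(ids: list[int]) -> str:
    """Drop special tokens and padding, recover the original bytes."""
    raw = bytes(i - BYTE_OFFSET for i in ids if i >= BYTE_OFFSET)
    return raw.decode("utf-8", errors="replace")
```

Byte-level round-trips are lossless for any UTF-8 input, which is why no external tokenizer artifact is needed.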
Add TextMasker to nodes/common/jepa.py (alongside JEPAMasker):
- Span masking: randomly select contiguous spans of 3-15 tokens to mask
- ~15-25% of tokens masked per sequence (configurable)
- Returns `(context_mask, target_masks)` — same interface as `JEPAMasker`
- 1D positions instead of 2D grid (text is sequential, not spatial)
Acceptance criteria:
- Masks produce valid context/target splits
- Masked ratio stays within configured bounds
- Works with variable-length sequences (attention_mask respected)
- Unit tests
Extend the JEPA training path to handle text input:
- `JEPAConfig(modality="text")` → use `TextEncoder` instead of `VisionTransformerEncoder`
- `TextEncoder` needs masking support (analogous to `VisionTransformerEncoder`'s mask param)
- Token embedding + 1D positional embedding + mask → context embeddings
- Predictor gets context embeddings + target indices → predicted target embeddings
- Target encoder (EMA) provides supervision signal
- Loss: smooth L1 in embedding space (same as visual JEPA)
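The EMA update that keeps the target encoder trailing the online encoder can be sketched as follows. Parameters are plain floats here for clarity (in practice they are torch tensors), and the 0.996 momentum is a typical JEPA value, not necessarily this codebase's setting:

```python
def ema_update(target: dict[str, float], online: dict[str, float],
               momentum: float = 0.996) -> dict[str, float]:
    """Per-parameter update: target <- momentum * target + (1 - momentum) * online.

    The target encoder is never backpropped through; it drifts slowly toward
    the online encoder, giving a stable supervision signal.
    """
    return {k: momentum * target[k] + (1.0 - momentum) * online[k]
            for k in target}
```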
The TextEncoder in vl_jepa.py already has the right structure but lacks:
- Mask parameter support (only processes full sequences currently)
- Integration with JEPAPredictor (needs 1D positional embeddings in predictor)
Acceptance criteria:
- Text-only JEPA training runs end-to-end
- Loss decreases over epochs on held-out text
- Cosine similarity metric works for text embeddings
- Weight deltas are compatible with FedAvg (same dict structure)
Goal: Single training function handles text-only, visual-only, and multimodal batches
Create nodes/common/multimodal_data.py:
- Combines `TextTrainingDataSource` with visual capture data
- Time-aligns text and visual data when both are available
- Produces batches that may have:
- Both modalities (text + concurrent screen frame)
- Text only (agent interaction without screen capture)
- Visual only (screen capture without concurrent agent interaction)
- Missing modality → zero tensor with a modality-present flag
Interface:
class MultimodalDataLoader:
def __init__(self, text_source, visual_source, config):
...
def __iter__(self) -> Iterator[dict]:
# yields {
# "images": [B, 3, H, W] or zeros,
# "token_ids": [B, S] or zeros,
# "attention_mask": [B, S],
# "has_visual": [B] bool,
# "has_text": [B] bool,
# }

Acceptance criteria:
- Correctly pairs temporally-aligned text+visual data
- Fills zeros for missing modality
- Shuffles across modality combinations
- Handles unbalanced data (text >> visual or vice versa)
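The temporal pairing logic can be sketched in miniature. This is illustrative only, assuming a ±2 second window and nearest-timestamp matching; the real `MultimodalDataLoader` works on tensors and batches rather than strings:

```python
def pair_by_time(frames: list[tuple[float, str]],
                 turns: list[tuple[float, str]],
                 window: float = 2.0) -> list[dict]:
    """Pair each frame with the nearest unused turn within +/- window seconds.

    Unmatched frames become visual-only samples; unmatched turns become
    text-only samples, so no data is dropped."""
    samples, used = [], set()
    for ft, frame in frames:
        best = None
        for j, (tt, _) in enumerate(turns):
            if j in used or abs(tt - ft) > window:
                continue
            if best is None or abs(tt - ft) < abs(turns[best][0] - ft):
                best = j
        if best is not None:
            used.add(best)
            samples.append({"frame": frame, "text": turns[best][1],
                            "has_visual": True, "has_text": True})
        else:
            samples.append({"frame": frame, "text": None,
                            "has_visual": True, "has_text": False})
    for j, (_, text) in enumerate(turns):
        if j not in used:
            samples.append({"frame": None, "text": text,
                            "has_visual": False, "has_text": True})
    return samples
```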
New training function in nodes/common/ml.py (alongside train_jepa_on_task()):
- Uses VL-JEPA model instead of vision-only JEPA
- Accepts multimodal batches from MultimodalDataLoader
- Training step:
- Text masking (TextMasker) + visual masking (JEPAMasker)
- Encode context through text encoder + visual encoder
- Cross-modal fusion on visible context
- Predict target embeddings (both text and visual targets)
- Loss against target encoder outputs
- Backprop through context encoders + predictor + fusion
- EMA update target encoder
- Returns weight deltas compatible with FedAvg
- Falls back gracefully when only one modality is present
Acceptance criteria:
- Trains on text-only batches (visual = zeros)
- Trains on visual-only batches (text = zeros)
- Trains on multimodal batches
- Loss decreases on all three batch types
- Weight deltas are FedAvg-compatible
- Metrics include per-modality cosine similarity
Extend verify_jepa_solution() in ml.py:
- Accept text and/or visual validation data
- Compute cosine similarity in shared embedding space
- Verification threshold applies to K-vector similarity regardless of source modality
- Coordinators don't need to know which modalities were used
Acceptance criteria:
- Verification works for text-only, visual-only, and multimodal solutions
- Same cosine similarity threshold applies uniformly
- Compatible with existing Yuma consensus voting
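The modality-agnostic check reduces to a cosine-similarity threshold on K-vectors. A minimal sketch, with an illustrative threshold of 0.85 (the network's actual value is an assumption here):

```python
def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def verify(k_pred: list[float], k_ref: list[float],
           threshold: float = 0.85) -> bool:
    """Pass/fail on K-vector similarity, regardless of source modality."""
    return cosine(k_pred, k_ref) >= threshold
```

Because both text and visual paths emit K-vectors in the same space, the coordinator never branches on modality.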
Goal: When user opts in, agent activity feeds local JEPA training in real time
Add consent mechanism to ATN:
- New config field: `autonet.train_on_agent_data: bool` (default false)
- UI toggle in network page (alongside existing training switch)
- Separate from screen capture opt-in (user may share agent data but not screen)
- When enabled, ConversationStore and ExecutionLog emit events that the training data adapter subscribes to
Acceptance criteria:
- Toggle persists in config.yaml
- Training only reads agent data when explicitly enabled
- Can be toggled independently of screen/browser capture
- Clear UI indication of what data is being used
Bridge ATN event bus to training data pipeline:
- Subscribe to `EXECUTION_COMPLETED` events
- Subscribe to `STEP_COMPLETED` events (for cognitive step outputs)
- Extract conversation text + tool call data
- Buffer into training batches (accumulate N interactions before training step)
- Feed to `TextTrainingDataSource` ring buffer
Acceptance criteria:
- New agent interactions appear in training data within one batch cycle
- Buffering prevents training on every single message (wasteful)
- Old data ages out (ring buffer or time window)
- Works with training service running or stopped (buffer persists)
Wire existing capture infrastructure to VL-JEPA training:
- `FramePreprocessor` already produces `[B, 3, 224, 224]` tensors
- Route preprocessed frames to `MultimodalDataLoader`
- Time-align with concurrent conversation data
- Respect fps_cap and resolution settings from capture config
Acceptance criteria:
- Screen frames flow into visual training batches
- Browser relay frames flow into visual training batches
- Time alignment with conversation data works (±2 second window)
- fps_cap respected (no training on more frames than configured)
Update solver node to use new training functions:
- Task spec includes `modalities: ["text", "visual"]` field
- Solver calls `train_vljepa_on_task()` instead of `train_jepa_on_task()` when VL-JEPA config is provided
- Data comes from local agent framework (not CIFAR/FakeData)
- Weight deltas flow through existing commit-reveal pipeline
Acceptance criteria:
- Solver trains VL-JEPA on real agent data
- Commit-reveal protocol works with VL-JEPA weight deltas
- Coordinator verification works on VL-JEPA outputs
- Aggregator FedAvg works on VL-JEPA weight deltas
- Full loop: train → commit → verify → reward → aggregate → publish
Goal: Training is gated on wallet + stake + jurisdiction membership
Modify AutonetBridge.start():
- Require `wallet_connected == True` before starting training
- Require valid RPC connection to chain
- Check wallet has sufficient ATN for solver stake (50 ATN)
- If not staked, call `ParticipantStaking.stake()` before starting
- Return clear error messages when prerequisites aren't met
Acceptance criteria:
- Can't start training without wallet
- Can't start training without stake
- Auto-stakes if wallet has sufficient balance
- Clear error if balance insufficient
Wire training completions to epoch attestation:
- After each successful training cycle, call `attestUsage(serviceId, units)`
- Units = number of training steps completed (not number of conversations read)
- Service ID = registered training service for user's jurisdiction
- Attestation flows to `Autonet.sol` epoch tracking
Acceptance criteria:
- Training cycles produce on-chain attestation
- Attestation count matches actual training work
- Epoch rewards claimable after attestation
- Works with existing emission schedule
Add jurisdiction discovery and joining:
- Query `GuildRegistry.sol` for available jurisdictions
- Show jurisdictions in UI with their specialization (text, visual, multimodal)
- User selects jurisdiction → joins via contract call
- Jurisdiction membership determines which guild aggregates your deltas
Acceptance criteria:
- UI shows available jurisdictions
- User can join a jurisdiction
- Training tasks are scoped to jurisdiction
- Aggregation happens within jurisdiction first, then cross-jurisdiction
Goal: Inference pricing is driven by demonstrated behavioral alignment over time, not point-in-time configuration. Nodes that consistently do aligned work get cheaper (potentially free) inference. Misaligned work pays a premium that funds the aligned subsidy. Pricing can't exist until inference exists — during bootstrap, nodes accumulate behavioral profiles that will determine their pricing tier when inference activates.
Build the alignment track record during training, before inference exists:
- After each training cycle, compute mean-pooled K-vectors from the node's agent interaction embeddings (the text/multimodal data that was just trained on)
- Update a local behavioral EMA (exponential moving average):
  `profile_t = decay * profile_{t-1} + (1 - decay) * current_embeddings`
  With `decay = 0.998` and daily updates, it takes ~500 days for old behavior to decay to 1/e. This prevents gaming — you can't flip your agent prompts today and get cheap inference tomorrow.
- Persist profile locally (it's a single `[K, D]` tensor, ~50KB)
- Publish profile hash on-chain per epoch (not the profile itself — privacy)
- The profile hash links to the training attestation, creating a verifiable chain: "this node trained on data that produced this behavioral signature"
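The EMA update and its decay horizon can be checked directly. A minimal sketch on plain floats (the real profile is a [K, D] tensor):

```python
import math

def update_profile(profile: list[float], current: list[float],
                   decay: float = 0.998) -> list[float]:
    """profile_t = decay * profile_{t-1} + (1 - decay) * current_embeddings"""
    return [decay * p + (1.0 - decay) * c for p, c in zip(profile, current)]

# With daily updates, old behavior decays to 1/e after n days where
# decay**n = 1/e, i.e. n = 1 / -ln(decay):
days_to_1_over_e = 1.0 / -math.log(0.998)  # roughly 500 days
```

This is the arithmetic behind the anti-gaming claim: flipping behavior today moves the profile by only 0.2% per day.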
What the profile captures:
- Semantic distribution of agent interactions (healthcare, education, finance, etc.)
- Tool usage patterns (what kinds of actions the node's agents take)
- Conversation topic distribution (what users ask about)
- Goal alignment signals (from user profile standards)
What the profile does NOT reveal:
- Individual conversations or queries
- Specific tool call contents
- Personal information (PII scrubbed before training)
Anti-gaming properties:
- EMA with slow decay means months of consistent behavior required
- Profile is derived from actual training data (weight deltas), not agent config
- You can't fake training data — coordinators verify weight delta quality
- Switching agent system prompts changes future behavior but doesn't erase history
Acceptance criteria:
- Profile accumulates over training cycles (EMA update verified)
- Profile persists across restarts
- Profile hash published on-chain per epoch
- Profile is deterministic (same training data → same profile update)
- Unit tests verify EMA decay behavior over simulated epochs
Inference activation is a governance decision, not a hardcoded threshold:
- New proposal type in `EvolutionProposal.sol`: `INFERENCE_ACTIVATION`
- Proposal includes:
- Benchmark suite (CID of evaluation dataset)
- Minimum quality metrics (cosine similarity, perplexity, task accuracy)
- Jurisdiction scope (which jurisdictions can serve inference)
- RPB evaluator assesses proposal (Phase 1: external AI, Phase 3: self-evaluation)
- Jurisdiction coordinators vote via existing Yuma consensus
- If adopted: `InferencePipeline` activates for that jurisdiction
Different jurisdictions may activate inference at different times based on their specialization. A text-heavy jurisdiction may go live before a multimodal one.
Acceptance criteria:
- New proposal type registered in EvolutionProposal contract
- Proposal includes benchmark CID and quality thresholds
- Voting follows existing Yuma consensus path
- Activation is per-jurisdiction (not global)
- Inference service checks activation status before serving
When a node requests inference, compute dynamic pricing:
- Load the requesting node's behavioral profile (accumulated EMA from 5.1)
- Load the jurisdiction's standards embedding (from Registry, published on-chain)
- Encode the inference request's semantic content (via text encoder → K-vectors)
- Compute alignment as k-NN distance in embedding space between:
- Node's behavioral profile ↔ jurisdiction standards (long-term alignment)
- Request semantics ↔ jurisdiction standards (task-level alignment)
- Node's behavioral profile ↔ request semantics (consistency check — is this node doing what it usually does, or something unusual?)
- Final alignment score = geometric mean of all three distances
  (matches existing `AlignmentPricing` formula structure)
Pricing tiers:
alignment > 0.8 → subsidized (network pays part/all of inference cost)
alignment 0.5-0.8 → base cost (node pays full ATN burn)
alignment < 0.5 → premium (node pays base + surcharge)
Premium revenue flows to a jurisdiction-level subsidy treasury. Subsidy draws from that treasury.
Key property: Alignment is demonstrated over time through work done, not through what agents you have configured at any given moment. A node that has been doing healthcare-related agent work for 6 months gets subsidized healthcare inference. If they suddenly request financial trading inference, their behavioral profile doesn't match the request (distance 3 is high), so they pay full price even if their jurisdiction alignment is fine.
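A minimal sketch of the scoring and tiering, assuming the three components are expressed as similarities in (0, 1] with higher meaning more aligned (the real system computes k-NN distances; the conversion is an assumption here):

```python
def alignment_score(profile_vs_standards: float,
                    request_vs_standards: float,
                    profile_vs_request: float) -> float:
    """Geometric mean of the three alignment components.

    The geometric mean means one very low component drags the whole
    score down, which a plain average would not."""
    return (profile_vs_standards * request_vs_standards
            * profile_vs_request) ** (1.0 / 3.0)

def pricing_tier(score: float) -> str:
    """Map an alignment score to the pricing tiers above."""
    if score > 0.8:
        return "subsidized"
    if score >= 0.5:
        return "base"
    return "premium"
```

The healthcare-node example above falls out directly: strong jurisdiction alignment (say 0.9) cannot rescue a near-zero profile-vs-request consistency, so the geometric mean lands in the premium tier.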
Acceptance criteria:
- K-NN distance computation works in embedding space
- Geometric mean of three alignment distances matches paper formula
- Pricing tiers produce correct burn amounts
- Premium surcharges route to jurisdiction subsidy treasury
- Subsidy draws reduce burn amount for aligned nodes
- Integration tests with simulated node profiles
On-chain treasury that balances aligned subsidies with misaligned premiums:
- Each jurisdiction has a `subsidy_treasury` balance
- Premium surcharges (from misaligned inference) deposit to treasury
- Subsidies (for aligned inference) withdraw from treasury
- If treasury is empty, subsidized nodes pay base cost (no free inference)
- If treasury is full (cap), premium rate decreases (self-balancing)
Self-balancing properties:
- More misaligned inference → more premium revenue → bigger subsidy pool
- More aligned inference → more subsidy draws → smaller pool → premiums stay in place to replenish it
- Equilibrium: subsidy pool size reflects the alignment ratio of the jurisdiction
- Jurisdictions with mostly aligned nodes have small treasuries (little premium revenue, little subsidy needed)
- Jurisdictions with mixed alignment have larger treasuries (active flow)
Treasury parameters (governance-configurable per jurisdiction):
- `max_subsidy_rate`: Maximum fraction of inference cost the network covers (e.g., 0.9 = 90% subsidy max)
- `treasury_cap`: Maximum treasury balance (prevents unbounded accumulation)
- `premium_multiplier`: How much extra misaligned nodes pay (e.g., 1.5x = 50% surcharge)
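A settlement sketch for a single request using the three parameters above. The function name and the tier cutoffs reuse the pricing tiers from 5.3; this is hypothetical logic, not the treasury contract:

```python
def settle_inference(treasury: float, base_cost: float, score: float,
                     max_subsidy_rate: float = 0.9,
                     treasury_cap: float = 1000.0,
                     premium_multiplier: float = 1.5) -> tuple[float, float]:
    """Return (amount_node_pays, new_treasury_balance) for one request."""
    if score > 0.8:
        # subsidy is capped by the available balance: empty treasury = no subsidy
        subsidy = min(base_cost * max_subsidy_rate, treasury)
        return base_cost - subsidy, treasury - subsidy
    if score >= 0.5:
        return base_cost, treasury  # base tier: treasury untouched
    # misaligned: surcharge deposits into the treasury, up to the cap
    surcharge = base_cost * (premium_multiplier - 1.0)
    return base_cost * premium_multiplier, min(treasury + surcharge, treasury_cap)
```

Running many such settlements shows the self-balancing behavior: premiums fill the pool until subsidies drain it at the same rate.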
Acceptance criteria:
- Treasury contract holds and disburses funds correctly
- Subsidy rate scales with treasury balance (empty = no subsidy)
- Premium deposits tracked per-node for audit
- Treasury balance queryable from dashboard
- Governance can update parameters via proposal
- Self-balancing verified in simulation (treasury converges to equilibrium)
Show users their alignment status and pricing implications:
- Current behavioral profile summary (top semantic clusters, not raw embeddings)
- Alignment score vs. jurisdiction standards
- Historical alignment trajectory (line chart over epochs)
- Estimated inference pricing tier based on current profile
- Comparison: "If you maintain current behavior, your inference cost in 30/90/180 days will be approximately X ATN per request"
This replaces the simpler "earnings display" — the dashboard now shows not just what you've earned but what your behavioral track record means for future costs.
Acceptance criteria:
- Dashboard shows alignment score with breakdown (3 distance components)
- Historical trajectory visible (per-epoch data points)
- Pricing tier estimate based on current profile + treasury state
- Updates after each training cycle
- Works before inference is active (shows projected tier)
Goal: Handle real conversation lengths beyond 512 tokens
The current VLJEPAConfig.max_seq_length = 512 is ~512 characters in byte
tokenization. A typical agent conversation turn is 200-2000 characters. Options:
- Increase to 2048 (covers most single turns, 4x memory)
- Increase to 4096 (covers multi-turn context, 16x memory)
- Use chunked encoding: split long sequences into 512-token chunks, encode separately, pool/concatenate
Recommendation: Start with 2048 (practical for consumer GPUs), add chunking later for longer context.
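The chunked-encoding option can be sketched as split, encode per chunk, then pool. The inner `encode` here is a placeholder for the real `TextEncoder` (which would return a [D] embedding per chunk); mean-pooling is one of several plausible aggregations:

```python
def chunk_and_pool(token_ids: list[int], chunk_len: int = 512) -> list[float]:
    """Split a long sequence into fixed-size chunks, encode each chunk
    separately, and mean-pool the chunk embeddings."""
    def encode(chunk: list[int]) -> list[float]:
        # placeholder encoder: 1-dim "embedding" (mean token id) for illustration
        return [sum(chunk) / max(len(chunk), 1)]

    chunks = [token_ids[i:i + chunk_len]
              for i in range(0, len(token_ids), chunk_len)]
    embs = [encode(c) for c in chunks]
    # average each embedding dimension across chunks
    return [sum(col) / len(embs) for col in zip(*embs)]
```

The trade-off: memory stays bounded by `chunk_len`, but pooling discards cross-chunk token interactions, which is why raising `max_seq_length` first is the simpler win.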
Acceptance criteria:
- TextEncoder handles sequences up to configured max_seq_length
- Positional embeddings scale to new length
- Memory usage stays within consumer GPU budget (8-16GB)
- Training stability verified (longer sequences may need adjusted learning rate)
The byte-level vocab (260) is simple but inefficient for English text (~4x more tokens than BPE). Consider:
- BPE tokenizer trained on agent interaction data (domain-specific)
- SentencePiece with vocab ~8000 (balances efficiency and simplicity)
- Keep byte-level as fallback for non-text content
NOTE: This is optional. Byte-level works, just uses longer sequences. The simplicity of no-external-tokenizer may be worth the sequence length cost.
Epic 1 (Text Data Adapter)
├── Story 1.1: TextTrainingDataSource
├── Story 1.2: TextMasker (depends on 1.1 for testing)
└── Story 1.3: TextEncoder JEPA mode (depends on 1.2)
│
Epic 2 (Unified Training Loop)
├── Story 2.1: MultimodalDataLoader (depends on 1.1)
├── Story 2.2: train_vljepa_on_task (depends on 1.3, 2.1)
└── Story 2.3: Multimodal verification (depends on 2.2)
│
Epic 3 (Wire Agent Framework)
├── Story 3.1: Consent & opt-in (independent)
├── Story 3.2: Real-time data feed (depends on 1.1, 3.1)
├── Story 3.3: Visual capture wiring (depends on 2.1)
└── Story 3.4: Solver integration (depends on 2.2, 3.2, 3.3)
│
Epic 4 (Economic Gating)
├── Story 4.1: Gate on wallet (independent, can parallelize)
├── Story 4.2: Attestation (depends on 3.4)
└── Story 4.3: Jurisdiction join (depends on 4.1)
│
Epic 5 (Alignment-Based Inference Pricing)
├── Story 5.1: Behavioral profile accumulation (depends on 3.4 — needs training
│ cycles to accumulate from; CAN START as soon as solver trains on
│ real agent data)
├── Story 5.2: Inference-ready governance (depends on 4.3 — needs jurisdictions)
├── Story 5.3: K-NN alignment scoring (depends on 5.1, 5.2 — needs profiles +
│ active inference)
├── Story 5.4: Subsidy/premium treasury (depends on 5.3 — needs pricing tiers)
└── Story 5.5: Alignment dashboard (depends on 5.1 — can show profile before
inference is active)
Epic 6 (Tokenizer Scaling) ← can run in parallel with Epics 1-3
├── Story 6.1: Increase max_seq_length
└── Story 6.2: Vocabulary expansion (optional)
The shortest path to agent framework data training the model and earning tokens:
1.1 → 1.2 → 1.3 → 2.2 → 3.2 → 3.4 → 4.2
Text data adapter → text masking → text JEPA training → unified training function → real-time data feed → solver integration → on-chain attestation.
Visual and multimodal support (2.1, 3.3) can come after the text-only path is working. Economic gating (4.1, 4.3) can be parallelized.
As soon as Path A reaches 3.4 (solver trains on real agent data), start:
3.4 → 5.1 → 5.5
Behavioral profile accumulation → alignment dashboard. Nodes begin building their behavioral track records immediately. By the time inference is ready, early adopters have months/years of demonstrated alignment — their pricing tier is already established.
This path can't start until the model reaches useful quality, which is a governance decision:
5.2 → 5.3 → 5.4
Governance votes to enable inference → K-NN pricing activates → treasury mechanics go live. At this point, behavioral profiles from Path B determine each node's pricing tier.
Path B is the bootstrap incentive. Early participants:
- Earn ATN tokens through training (Path A)
- Accumulate long behavioral track records (Path B)
- When inference activates (Path C), they have the best alignment scores
- Best alignment = cheapest inference = most value from their earned tokens
- Late joiners start with no track record → pay base/premium rates
This creates a natural first-mover advantage that doesn't require vesting contracts or hardcoded exchange rates. The advantage is earned through demonstrated behavior, not through being early per se.