Releases: VectorArc/avp-python

v0.6.1

05 Apr 00:21

Added

  • generate_on_context() — Third latent primitive on LlamaCppConnector. Autoregressive generation on a caller-owned context with streaming via token_callback, n_ctx awareness for capacity checking, extra_stop_strings for custom stops, and generated_ids in the return tuple. Completes the create/think/generate primitive set alongside create_inference_context() and run_latent_steps().
  • tokenize(add_bos=True) — Optional add_bos parameter on tokenize() across all connectors (ABC, LlamaCpp, HuggingFace, vLLM). Default False preserves backward compatibility. Use True when tokenizing for manual decoding onto a fresh context.
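The n_ctx awareness described above amounts to a capacity check before decoding onto a caller-owned context. A minimal sketch of that idea, where check_capacity is a hypothetical helper and not part of the API:

```python
def check_capacity(n_ctx: int, n_past: int, max_new_tokens: int) -> int:
    """Reject generation requests that cannot fit in the remaining
    context window. Illustrative sketch, not the library's code."""
    remaining = n_ctx - n_past
    if max_new_tokens > remaining:
        raise ValueError(
            f"requested {max_new_tokens} tokens but only {remaining} slots remain"
        )
    return remaining
```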

Changed

  • _generate_on_think_ctx refactored — Now delegates to generate_on_context() for the generation loop. Context lifecycle (free/keep) still managed by the wrapper. No behavior change for existing callers.

Fixed

  • run_latent_steps docstring — Fixed duplicate Args section and incorrect default value (was 10, should be 20).
  • _generate_on_think_ctx n_cur scoping — Fixed potential UnboundLocalError in finally block if generate_on_context raised.
  • Closed context validation — generate_on_context() raises ValueError on a closed LlamaCppInferenceContext instead of segfaulting.
  • HF add_bos semantics — tokenize(add_bos=True) on HuggingFace now prepends only the BOS token (not all special tokens via add_special_tokens=True).
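The n_cur scoping fix follows a standard Python pattern: bind the counter before the try block so the finally clause always sees a defined name, even if the generation loop raises on its first step. A minimal sketch; run_step and on_cleanup are hypothetical stand-ins:

```python
def run_latent_generation(run_step, n_steps, on_cleanup):
    """Illustrates the n_cur scoping fix: the counter is bound before
    the try block, so finally never hits an UnboundLocalError."""
    n_cur = 0  # defined unconditionally, so finally can always read it
    try:
        for _ in range(n_steps):
            run_step()
            n_cur += 1
    finally:
        on_cleanup(n_cur)  # safe even when run_step raised immediately
    return n_cur
```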

v0.6.0

04 Apr 16:02

v0.6.0 — Public Connector API

Highlights

  • Public connector API — hidden_dim, num_layers, context_length, vocab_size, device, dtype, tokenize(), detokenize(), apply_chat_template(), stop_token_ids, stop_strings on the EngineConnector ABC
  • OutputType/PayloadType split — API-level enum (OutputType) separated from wire-level enum (PayloadType). PayloadType.AUTO removed.
  • LlamaCpp latent primitives — create_inference_context(), run_latent_steps(), keep_context=True, grammar= for constrained generation
  • 11 bug fixes including critical HF inject_and_generate() crash and LlamaCpp use-after-free

Breaking changes

  • tokenize() returns List[int] (was Any/torch.Tensor)
  • PayloadType.AUTO removed — use OutputType.AUTO
  • extract_hidden_state()/inject_and_generate() removed from ABC
  • LlamaCpp tokenize() no longer adds BOS

See CHANGELOG.md for full details.

v0.5.1

03 Apr 06:49

v0.5.1 — Flexible Transfer Granularity

Added

  • output=PayloadType on think() — Controls what the returned context contains:
    • PayloadType.AUTO (default): system decides
    • PayloadType.KV_CACHE: full KV-cache + hidden state
    • PayloadType.HIDDEN_STATE: only last hidden state [1, D] (~14KB vs ~76MB, KV freed)
  • PayloadType.AUTO — SDK-only sentinel (-1), resolves at runtime, never serialized
  • AVPContext.payload_type property — Derived from data present
  • Same-model hidden state injection — generate() accepts hidden-state-only contexts for same-model transfer (previously cross-model only)
  • Type validation for output= with actionable ConfigurationError

Removed

  • PayloadType.EMBEDDING — Removed from SDK and proto schema (dead code, never used)

Install

pip install avp==0.5.1

Full changelog: https://github.com/VectorArc/avp-python/blob/main/CHANGELOG.md

v0.4.2

30 Mar 06:25

What's New

model= accepts Union[str, EngineConnector] — The Easy API (think(), generate()) now accepts either a model name string or a pre-built EngineConnector instance. All backends (Ollama, llama.cpp, vLLM, HuggingFace) are now first-class in the Easy API.

import avp
from avp import OllamaConnector

# With any connector
conn = OllamaConnector.from_ollama("qwen2.5:7b")
context = avp.think("Analyze this", model=conn)
answer = avp.generate("Solve it", model=conn, context=context)

# With a model name (still works, auto-creates HuggingFace backend)
context = avp.think("Analyze this", model="Qwen/Qwen2.5-7B-Instruct")
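One way a Union[str, EngineConnector] parameter can be resolved internally is a small dispatcher. A minimal sketch, using a hypothetical HFConnector stand-in rather than the library's actual classes:

```python
class EngineConnector:
    """Stand-in for the real ABC (illustrative only)."""

class HFConnector(EngineConnector):
    """Hypothetical auto-created HuggingFace-style backend."""
    def __init__(self, model_name: str):
        self.model_name = model_name

def resolve_connector(model):
    """Strings auto-create a backend; connector instances pass through."""
    if isinstance(model, str):
        return HFConnector(model)
    if isinstance(model, EngineConnector):
        return model
    raise TypeError(f"model= expects str or EngineConnector, got {type(model)!r}")
```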

Added

  • ModelSpec type alias — Union[str, EngineConnector], importable from top-level avp
  • EngineConnector top-level export — from avp import EngineConnector now works
  • can_think validation in generate() — clear error with actionable message when a connector without think support is passed with steps > 0
  • transformers 5.4 compatibility — removed explicit cache_position kwarg (now managed internally by generate())
  • 19 new tests for connector parameter handling and backward compatibility

Changed

  • generate() reduced from 17 to 15 parameters
  • source_model= also accepts Union[str, EngineConnector] for cross-model projection
  • Framework integrations (ChatAVP, AVPLLM, AVPChatCompletionClient) resolve connectors internally

Full Changelog

v0.4.1...v0.4.2

v0.4.1

26 Mar 04:09

API stability release. All public APIs audited against stable protocol design principles (Protobuf, Arrow, gRPC). 33 issues found and fixed. 500 tests pass, cloud validated on A100.

Highlights

Stable return types — think() and generate() now return ThinkResult and GenerateResult objects instead of Union types. GenerateResult is a str subclass, so all existing string operations work. Access metrics via result.metrics instead of tuple unpacking.

result = avp.generate("Solve: 2+2", model="Qwen/Qwen2.5-7B-Instruct", collect_metrics=True)
print(result)          # works — it's a str
print(result.metrics)  # GenerateMetrics
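The str-subclass pattern behind GenerateResult can be sketched as follows (illustrative, not the library's actual implementation):

```python
class GenerateMetrics:
    def __init__(self, tokens_generated: int):
        self.tokens_generated = tokens_generated

class GenerateResult(str):
    """A str subclass: every string operation keeps working, and the
    metrics object rides along as an attribute."""
    def __new__(cls, text: str, metrics=None):
        obj = super().__new__(cls, text)
        obj.metrics = metrics
        return obj

result = GenerateResult("The answer is 4", GenerateMetrics(tokens_generated=5))
```

Because construction of an immutable type happens in `__new__`, the subclass attaches `metrics` there rather than in `__init__`.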

Payload integrity — CRC32 checksum on all wire payloads. Catches corruption and truncation. Zero overhead for same-process transfers (optional field).
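A CRC32 trailer of this kind can be sketched with the standard library. The 4-byte little-endian framing below is an assumption for illustration, not AVP's actual wire layout:

```python
import struct
import zlib

def frame_payload(data: bytes) -> bytes:
    """Append a CRC32 so corruption or truncation is detected on decode."""
    return data + struct.pack("<I", zlib.crc32(data) & 0xFFFFFFFF)

def unframe_payload(framed: bytes) -> bytes:
    """Verify and strip the trailing checksum."""
    data, crc = framed[:-4], struct.unpack("<I", framed[-4:])[0]
    if zlib.crc32(data) & 0xFFFFFFFF != crc:
        raise ValueError("payload checksum mismatch: corrupted or truncated")
    return data
```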

Simpler connector API — EngineConnector ABC reduced from 6 required methods to 1. Writing a custom connector now requires only get_model_identity(). Extension policy documented: new methods will always have defaults.
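A one-required-method ABC with defaulted extensions can be sketched like this; the method bodies are illustrative assumptions, not the shipped code:

```python
from abc import ABC, abstractmethod

class EngineConnector(ABC):
    """Only get_model_identity() is abstract; later additions ship with
    defaults so existing connectors never break."""

    @abstractmethod
    def get_model_identity(self) -> dict: ...

    def tokenize(self, text: str, add_bos: bool = False):
        # Default provided: subclasses are not forced to implement this.
        raise NotImplementedError("this connector does not support tokenize()")

class MinimalConnector(EngineConnector):
    def get_model_identity(self) -> dict:
        return {"name": "demo", "hidden_dim": 4096}
```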

Breaking changes

These are pre-launch changes with zero known external users affected.

  • generate(content=) renamed to generate(prompt=). Old name works with deprecation warning.
  • think() returns ThinkResult (delegates to AVPContext via __getattr__). Tuple unpacking still works: ctx, metrics = avp.think(...).
  • generate() returns GenerateResult (str subclass). text, metrics = avp.generate(...) tuple unpacking no longer works — use result.metrics.
  • AVPContext requires keyword-only construction.
  • ConfigurationError replaces bare TypeError/ValueError in easy API. Catchable via except avp.AVPError.
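The ThinkResult behavior described above, delegation to AVPContext via __getattr__ plus tuple unpacking, can be sketched as follows (illustrative, not the shipped class):

```python
class AVPContext:
    def __init__(self, payload: bytes):
        self.payload = payload

class ThinkResult:
    """Unknown attributes fall through to the wrapped AVPContext, and
    iteration yields (context, metrics) so ctx, metrics = avp.think(...)
    keeps working."""
    def __init__(self, context, metrics):
        self._context = context
        self.metrics = metrics
    def __getattr__(self, name):
        # only called when normal attribute lookup fails
        return getattr(self._context, name)
    def __iter__(self):
        return iter((self._context, self.metrics))

res = ThinkResult(AVPContext(b"kv"), {"steps": 10})
ctx, metrics = res  # tuple unpacking still works
```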

Bug fixes

  • OllamaConnector.get_model_identity() used wrong field names — runtime crash
  • LlamaCppConnector.get_model_identity() — same bug
  • Codec silently corrupted data on unknown dtype values — now raises DecodeError
  • to_bytes() hardcoded FLOAT32 regardless of actual tensor dtype
  • Framework integrations (LangChain, CrewAI, AutoGen) stored wrong type in ContextStore

Install

pip install --upgrade avp

Full changelog: CHANGELOG.md

v0.4.0

23 Mar 06:58

Ollama, llama.cpp, vLLM, LangChain, CrewAI, AutoGen – all shipped. torch is now optional.

AVP v0.4.0 ships 4 engine backends, 3 framework integrations, and makes torch an optional dependency. pip install avp[ollama] is 85 MB instead of 3 GB.

New engines

Ollama – use models you already have:

from avp.connectors.ollama import OllamaConnector

researcher = OllamaConnector.from_ollama("qwen2.5:7b")
solver = OllamaConnector.from_ollama("llama3.2:3b")
ctx = researcher.think("Analyze this", steps=10)
answer = solver.generate("Solve it", context=ctx, source=researcher, cross_model=True)

llama.cpp – any GGUF file, CPU or GPU. No torch, no forks, no custom builds.

vLLM – production latent communication via KV connector + model plugin. Qwen2, Llama, Mistral, Gemma. CUDA graphs validated.

New frameworks

| Framework | Integration             | Install        |
| --------- | ----------------------- | -------------- |
| LangChain | ChatAVP                 | avp[langchain] |
| CrewAI    | AVPLLM                  | avp[crewai]    |
| AutoGen   | AVPChatCompletionClient | avp[autogen]   |

torch is optional

Projection math rewritten in numpy. Pick what you need:

pip install avp[ollama]     # 85 MB – local GGUF models
pip install avp[hf]         # 625 MB – HuggingFace models
pip install avp[vllm]       # ~2 GB – production serving

Breaking changes

  • pack(), unpack(), PackedMessage removed (deprecated since v0.3.0 – use think()/generate())
  • PackMetrics, UnpackMetrics removed (use ThinkMetrics/GenerateMetrics)
  • Python >=3.10 required (was >=3.9)
  • transformers>=5.0 required for [hf] extra (was >=4.36)
  • RIDGE and PROCRUSTES removed from ProjectionMethod enum
  • Base pip install avp no longer includes torch – use avp[hf] for HuggingFace models

Also in this release

  • Docs rewritten with per-engine code examples for every backend
  • Protocol spec synced to v0.4
  • 493 tests, all CI green

Full changelog: CHANGELOG.md

v0.3.2

13 Mar 06:40

What's New

  • Colab quickstart notebook — notebooks/avp_quick_start.ipynb. Runs on a free T4 GPU in ~8 minutes. Compares direct, latent, and text chains on 10 GSM8K problems.
  • think() and generate() can now use different prompts – e.g., researcher prompt for think(), solver prompt for generate().
  • Cross-model projection is now opt-in – pass cross_model=True to enable Rosetta Stone projection.

Bug Fixes

  • Critical: prompt_len bug in connector.generate() – prompt length was computed after extending the attention mask with KV-cache entries, causing empty or truncated output when using context=.
  • Easy API cross-model path dropped user-provided context= and ignored store/store_key/prior_key.

Install

pip install avp==0.3.2

Full changelog: https://github.com/VectorArc/avp-python/blob/main/CHANGELOG.md

v0.3.1

08 Mar 07:36

Fix protobuf compatibility

Removes the protobuf gencode version check from avp_pb2.py that required protobuf >=6.31.1 at runtime. AVP now works with protobuf >=4.21 as declared in dependencies.

This fixes pip install avp on Google Colab and other environments running protobuf 4.x or 5.x.

Install

pip install avp==0.3.1

Full changelog: CHANGELOG.md

v0.3.0

07 Mar 20:53

AVP v0.3.0 — the think() / generate() release.

Highlights

New API. think() and generate() replace pack() / unpack(). Zero-friction entry point:

import avp

answer = avp.generate("Solve: 24 * 17 + 3", model="Qwen/Qwen2.5-7B-Instruct")

Cross-model transfer, zero ceremony. One parameter handles model loading, handshake, calibration, and projection:

answer = avp.generate("Solve: 24 * 17 + 3",
                       model="meta-llama/Llama-3.2-3B-Instruct",
                       source_model="Qwen/Qwen2.5-7B-Instruct")

Install just works. pip install avp — torch and transformers are now required deps. No extras needed for core functionality.

Results

| Benchmark                   | Direct | Latent (AVP) | Text  |
| --------------------------- | ------ | ------------ | ----- |
| HumanEval (Qwen 7B, n=164)  | 58.5%  | 67.1%        | 53.0% |
| GSM8K (Qwen 7B, n=200)      | 91.0%  | 90.5%        | 87.0% |
| DebugBench (Qwen 7B, n=100) | 50.0%  | 51.0%        | 49.0% |

+8.6pp on code generation (p=0.029). 46-78% fewer tokens. 2-4x faster.

Cross-model (zero training, 6 KB wire):

| Source → Target    | GSM8K | HumanEval |
| ------------------ | ----- | --------- |
| Llama 3B → Qwen 7B | 90.0% | 79.3%     |
| Qwen 7B → Llama 3B | 74.5% | 47.0%     |

What's New

Added

  • think() / generate() API — replaces pack() / unpack()
  • Cross-model source= parameter — connector.generate(prompt, context=ctx, source=other)
  • Easy API cross-model — avp.generate(prompt, model=target, source_model=source)
  • ContextStore — thread-safe, TTL-backed store for multi-turn latent conversations
  • avp.inspect(data) — decode AVP binary header/metadata without loading models
  • Debug mode — debug=True surfaces TransferDiagnostics: norm trajectory, projection metrics, quality gate
  • Always-on warnings — RuntimeWarning for empty output, NaN/Inf in hidden states
  • Vocabulary-overlap projection — cross-family zero-parameter projection (~85% shared BPE tokens for Qwen/Llama)
  • Per-transfer quality gate — assess_transfer() recommends latent vs JSON based on prompt length
  • Projection validation — cosine similarity + pseudo-perplexity two-tier gate
  • vLLM connector (experimental) — text generation and identity extraction work; KV-cache transfer plugin not yet validated end-to-end
  • 8 benchmark suites — GSM8K, HotpotQA, MATH, HumanEval, ClassEval, DebugBench with cloud results
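A thread-safe, TTL-backed store of the kind ContextStore describes can be sketched as follows; the put/get names and the TTL default are assumptions for illustration, not the library's API:

```python
import threading
import time

class ContextStore:
    """Sketch of a thread-safe store whose entries expire after a TTL,
    suitable for keeping latent contexts across conversation turns."""

    def __init__(self, ttl_seconds: float = 300.0):
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._items = {}  # key -> (value, expires_at)

    def put(self, key, value):
        with self._lock:
            self._items[key] = (value, time.monotonic() + self._ttl)

    def get(self, key, default=None):
        with self._lock:
            entry = self._items.get(key)
            if entry is None:
                return default
            value, expires_at = entry
            if time.monotonic() > expires_at:
                del self._items[key]  # lazily evict expired entries
                return default
            return value
```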

Changed

  • API rename: pack() → think(), unpack() → generate() (old names still work with deprecation warnings)
  • Protocol version bumped to 0.3.0
  • CommunicationMode simplified to LATENT = 0, JSON = 1
  • Package extras — torch/transformers now required. [vllm] extra for production serving. Removed [latent], [hf], [demo], [all]

Removed

  • Hybrid mode — wire format bundling latent + text fallback (never consumed)
  • Universal representation mode — learned cross-model adapters (validated negative: 0% accuracy)
  • FallbackRequest, FallbackRequested, bytes_to_embedding(), confidence_score — unused code
  • v0.1.0 proto backward-compat fields

Fixed

  • Tied-weight models — softmax projection fixes cosine similarity from ~0.24 to ~1.0
  • Vocab size mismatch — truncation to shared prefix for Qwen 7B vs 1.5B
  • KV-cache serialization — bfloat16 support and transformers 5.x compatibility
  • Cross-platform — Windows console encoding, MPS device detection, pre-Ampere GPU support

Full changelog

See CHANGELOG.md for all versions.

v0.2.3

02 Mar 05:49

AVP Python SDK v0.2.3

Multi-agent text handoffs discard KV-cache, embeddings, and attention state the previous agent already computed. AVP transfers that state directly — 51-78% fewer tokens, 1.5-5x faster, across models and families.

Cross-Model Communication (Phase 4)

  • Cross-family vocabulary overlap projection: Transfer hidden states between different model families (e.g. Qwen → Llama) via shared BPE tokens (~85% overlap). Zero training needed.
  • Handshake auto-discovery: CompatibilityResolver.resolve() now auto-detects vocab overlap and selects the right projection method.
  • Pre-indexed lm_head optimization: ~15% faster projection by pre-indexing shared vocabulary at calibration time.
  • Configurable projection temperature: projection_temperature parameter for softmax tuning in cross-model projection.
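The vocabulary-overlap projection with a tunable softmax temperature can be sketched in numpy; all names and shapes here are illustrative assumptions, not the SDK's internals:

```python
import numpy as np

def vocab_overlap_project(hidden, src_lm_head, tgt_embed,
                          src_idx, tgt_idx, temperature=1.0):
    """Score the source hidden state against the shared-token rows of the
    source lm_head, softmax with a temperature, then mix the matching
    target embedding rows into a target-space vector."""
    logits = src_lm_head[src_idx] @ hidden          # [n_shared]
    z = (logits - logits.max()) / temperature       # stabilized softmax
    probs = np.exp(z)
    probs /= probs.sum()
    return probs @ tgt_embed[tgt_idx]               # [tgt_dim]
```

Pre-indexing `src_lm_head[src_idx]` once at calibration time, rather than on every call, is the kind of optimization the pre-indexed lm_head bullet above refers to.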

Cross-Model Benchmark Results (A100, n=50)

| Direction           | GSM8K 2-Agent | HotpotQA | Fan-out |
| ------------------- | ------------- | -------- | ------- |
| Qwen 7B → Llama 3B  | 72%           | 10%      | 34%     |
| Llama 3B → Qwen 7B  | 88%           | 22%      | 48%     |
| Qwen 7B → Qwen 1.5B | 74%           | 8%       | 34%     |
| Qwen 1.5B → Qwen 7B | 88%           | 22%      | 50%     |

Cross-model accuracy tracks solver (target model) capability. Full results: BENCHMARKS.md

Developer Experience

  • Fixed Connector API docs: think() and generate() examples now use consistent prompts (mismatched prompts caused empty output)
  • CommunicationMode display: Now shows LATENT instead of 0
  • API reference: Added generate(), ContextStore to docs
  • Dead code cleanup: Removed unused imports, functions, and duplicate helpers
  • Fixed vLLM dependency: >=0.15.0 (was >=0.8.0)
  • Expanded __all__: All cross-model exports accessible via avp.*

Stats

  • 398 tests passing
  • 5 models validated: Qwen2.5 (1.5B, 7B), DeepSeek-R1 (1.5B), Llama 3.2 (1B, 3B)
  • 2 model families: Qwen, Llama

Install

pip install avp                    # core
pip install "avp[latent]"          # + torch/transformers
pip install "avp[vllm]"            # + vLLM 0.15+ connector

Full documentation: README · Benchmarks · Spec