Releases: VectorArc/avp-python

v0.6.1

05 Apr 00:21

Added

  • generate_on_context() — Third latent primitive on LlamaCppConnector. Autoregressive generation on a caller-owned context with streaming via token_callback, n_ctx awareness for capacity checking, extra_stop_strings for custom stops, and generated_ids in the return tuple. Completes the create/think/generate primitive set alongside create_inference_context() and run_latent_steps().
  • tokenize(add_bos=True) — Optional add_bos parameter on tokenize() across all connectors (ABC, LlamaCpp, HuggingFace, vLLM). Default False preserves backward compatibility. Use True when tokenizing for manual decoding onto a fresh context.
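The n_ctx awareness described above amounts to a capacity check before decoding onto a caller-owned context. A minimal sketch of that idea, where check_capacity is a hypothetical helper and not part of the API:

```python
def check_capacity(n_ctx: int, n_past: int, max_new_tokens: int) -> int:
    """Reject generation requests that cannot fit in the remaining
    context window. Illustrative sketch, not the library's code."""
    remaining = n_ctx - n_past
    if max_new_tokens > remaining:
        raise ValueError(
            f"requested {max_new_tokens} tokens but only {remaining} slots remain"
        )
    return remaining
```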

Changed

  • _generate_on_think_ctx refactored — Now delegates to generate_on_context() for the generation loop. Context lifecycle (free/keep) still managed by the wrapper. No behavior change for existing callers.

Fixed

  • run_latent_steps docstring — Fixed duplicate Args section and incorrect default value (was 10, should be 20).
  • _generate_on_think_ctx n_cur scoping — Fixed potential UnboundLocalError in finally block if generate_on_context raised.
  • Closed context validation — generate_on_context() raises ValueError on a closed LlamaCppInferenceContext instead of segfaulting.
  • HF add_bos semantics — tokenize(add_bos=True) on HuggingFace now prepends only the BOS token (not all special tokens via add_special_tokens=True).
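The n_cur scoping fix follows a standard Python pattern: bind the counter before the try block so the finally clause always sees a defined name, even if the generation loop raises on its first step. A minimal sketch; run_step and on_cleanup are hypothetical stand-ins:

```python
def run_latent_generation(run_step, n_steps, on_cleanup):
    """Illustrates the n_cur scoping fix: the counter is bound before
    the try block, so finally never hits an UnboundLocalError."""
    n_cur = 0  # defined unconditionally, so finally can always read it
    try:
        for _ in range(n_steps):
            run_step()
            n_cur += 1
    finally:
        on_cleanup(n_cur)  # safe even when run_step raised immediately
    return n_cur
```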

v0.6.0

04 Apr 16:02

v0.6.0 — Public Connector API

Highlights

  • Public connector API — hidden_dim, num_layers, context_length, vocab_size, device, dtype, tokenize(), detokenize(), apply_chat_template(), stop_token_ids, stop_strings on the EngineConnector ABC
  • OutputType/PayloadType split — API-level enum (OutputType) separated from wire-level enum (PayloadType). PayloadType.AUTO removed.
  • LlamaCpp latent primitives — create_inference_context(), run_latent_steps(), keep_context=True, grammar= for constrained generation
  • 11 bug fixes including critical HF inject_and_generate() crash and LlamaCpp use-after-free

Breaking changes

  • tokenize() returns List[int] (was Any/torch.Tensor)
  • PayloadType.AUTO removed — use OutputType.AUTO
  • extract_hidden_state()/inject_and_generate() removed from ABC
  • LlamaCpp tokenize() no longer adds BOS

See CHANGELOG.md for full details.

v0.5.1

03 Apr 06:49

v0.5.1 — Flexible Transfer Granularity

Added

  • output=PayloadType on think() — Controls what the returned context contains:
    • PayloadType.AUTO (default): system decides
    • PayloadType.KV_CACHE: full KV-cache + hidden state
    • PayloadType.HIDDEN_STATE: only last hidden state [1, D] (~14KB vs ~76MB, KV freed)
  • PayloadType.AUTO — SDK-only sentinel (-1), resolves at runtime, never serialized
  • AVPContext.payload_type property — Derived from data present
  • Same-model hidden state injection — generate() accepts hidden-state-only contexts for same-model transfer (previously cross-model only)
  • Type validation for output= with actionable ConfigurationError

Removed

  • PayloadType.EMBEDDING — Removed from SDK and proto schema (dead code, never used)

Install

pip install avp==0.5.1

Full changelog: https://github.com/VectorArc/avp-python/blob/main/CHANGELOG.md

v0.4.2

30 Mar 06:25

What's New

model= accepts Union[str, EngineConnector] — The Easy API (think(), generate()) now accepts either a model name string or a pre-built EngineConnector instance. All backends (Ollama, llama.cpp, vLLM, HuggingFace) are now first-class in the Easy API.

import avp
from avp import OllamaConnector

# With any connector
conn = OllamaConnector.from_ollama("qwen2.5:7b")
context = avp.think("Analyze this", model=conn)
answer = avp.generate("Solve it", model=conn, context=context)

# With a model name (still works, auto-creates HuggingFace backend)
context = avp.think("Analyze this", model="Qwen/Qwen2.5-7B-Instruct")
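One way a Union[str, EngineConnector] parameter can be resolved internally is a small dispatcher. A minimal sketch, using a hypothetical HFConnector stand-in rather than the library's actual classes:

```python
class EngineConnector:
    """Stand-in for the real ABC (illustrative only)."""

class HFConnector(EngineConnector):
    """Hypothetical auto-created HuggingFace-style backend."""
    def __init__(self, model_name: str):
        self.model_name = model_name

def resolve_connector(model):
    """Strings auto-create a backend; connector instances pass through."""
    if isinstance(model, str):
        return HFConnector(model)
    if isinstance(model, EngineConnector):
        return model
    raise TypeError(f"model= expects str or EngineConnector, got {type(model)!r}")
```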

Added

  • ModelSpec type alias — Union[str, EngineConnector], importable from top-level avp
  • EngineConnector top-level export — from avp import EngineConnector now works
  • can_think validation in generate() — clear error with actionable message when a connector without think support is passed with steps > 0
  • transformers 5.4 compatibility — removed explicit cache_position kwarg (now managed internally by generate())
  • 19 new tests for connector parameter handling and backward compatibility

Changed

  • generate() reduced from 17 to 15 parameters
  • source_model= also accepts Union[str, EngineConnector] for cross-model projection
  • Framework integrations (ChatAVP, AVPLLM, AVPChatCompletionClient) resolve connectors internally

Full Changelog

v0.4.1...v0.4.2

v0.4.1

26 Mar 04:09

API stability release. All public APIs audited against stable protocol design principles (Protobuf, Arrow, gRPC). 33 issues found and fixed. 500 tests pass, cloud validated on A100.

Highlights

Stable return types — think() and generate() now return ThinkResult and GenerateResult objects instead of Union types. GenerateResult is a str subclass, so all existing string operations work. Access metrics via result.metrics instead of tuple unpacking.

result = avp.generate("Solve: 2+2", model="Qwen/Qwen2.5-7B-Instruct", collect_metrics=True)
print(result)          # works — it's a str
print(result.metrics)  # GenerateMetrics
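The str-subclass pattern behind GenerateResult can be sketched as follows (illustrative, not the library's actual implementation):

```python
class GenerateMetrics:
    def __init__(self, tokens_generated: int):
        self.tokens_generated = tokens_generated

class GenerateResult(str):
    """A str subclass: every string operation keeps working, and the
    metrics object rides along as an attribute."""
    def __new__(cls, text: str, metrics=None):
        obj = super().__new__(cls, text)
        obj.metrics = metrics
        return obj

result = GenerateResult("The answer is 4", GenerateMetrics(tokens_generated=5))
```

Because construction of an immutable type happens in `__new__`, the subclass attaches `metrics` there rather than in `__init__`.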

Payload integrity — CRC32 checksum on all wire payloads. Catches corruption and truncation. Zero overhead for same-process transfers (optional field).
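A CRC32 trailer of this kind can be sketched with the standard library. The 4-byte little-endian framing below is an assumption for illustration, not AVP's actual wire layout:

```python
import struct
import zlib

def frame_payload(data: bytes) -> bytes:
    """Append a CRC32 so corruption or truncation is detected on decode."""
    return data + struct.pack("<I", zlib.crc32(data) & 0xFFFFFFFF)

def unframe_payload(framed: bytes) -> bytes:
    """Verify and strip the trailing checksum."""
    data, crc = framed[:-4], struct.unpack("<I", framed[-4:])[0]
    if zlib.crc32(data) & 0xFFFFFFFF != crc:
        raise ValueError("payload checksum mismatch: corrupted or truncated")
    return data
```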

Simpler connector API — EngineConnector ABC reduced from 6 required methods to 1. Writing a custom connector now requires only get_model_identity(). Extension policy documented: new methods will always have defaults.
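A one-required-method ABC with defaulted extensions can be sketched like this; the method bodies are illustrative assumptions, not the shipped code:

```python
from abc import ABC, abstractmethod

class EngineConnector(ABC):
    """Only get_model_identity() is abstract; later additions ship with
    defaults so existing connectors never break."""

    @abstractmethod
    def get_model_identity(self) -> dict: ...

    def tokenize(self, text: str, add_bos: bool = False):
        # Default provided: subclasses are not forced to implement this.
        raise NotImplementedError("this connector does not support tokenize()")

class MinimalConnector(EngineConnector):
    def get_model_identity(self) -> dict:
        return {"name": "demo", "hidden_dim": 4096}
```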

Breaking changes

These are pre-launch changes with zero known external users affected.

  • generate(content=) renamed to generate(prompt=). Old name works with deprecation warning.
  • think() returns ThinkResult (delegates to AVPContext via __getattr__). Tuple unpacking still works: ctx, metrics = avp.think(...).
  • generate() returns GenerateResult (str subclass). text, metrics = avp.generate(...) tuple unpacking no longer works — use result.metrics.
  • AVPContext requires keyword-only construction.
  • ConfigurationError replaces bare TypeError/ValueError in easy API. Catchable via except avp.AVPError.
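The ThinkResult behavior described above, delegation to AVPContext via __getattr__ plus tuple unpacking, can be sketched as follows (illustrative, not the shipped class):

```python
class AVPContext:
    def __init__(self, payload: bytes):
        self.payload = payload

class ThinkResult:
    """Unknown attributes fall through to the wrapped AVPContext, and
    iteration yields (context, metrics) so ctx, metrics = avp.think(...)
    keeps working."""
    def __init__(self, context, metrics):
        self._context = context
        self.metrics = metrics
    def __getattr__(self, name):
        # only called when normal attribute lookup fails
        return getattr(self._context, name)
    def __iter__(self):
        return iter((self._context, self.metrics))

res = ThinkResult(AVPContext(b"kv"), {"steps": 10})
ctx, metrics = res  # tuple unpacking still works
```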

Bug fixes

  • OllamaConnector.get_model_identity() used wrong field names — runtime crash
  • LlamaCppConnector.get_model_identity() — same bug
  • Codec silently corrupted data on unknown dtype values — now raises DecodeError
  • to_bytes() hardcoded FLOAT32 regardless of actual tensor dtype
  • Framework integrations (LangChain, CrewAI, AutoGen) stored wrong type in ContextStore

Install

pip install --upgrade avp

Full changelog: CHANGELOG.md

v0.4.0

23 Mar 06:58

Ollama, llama.cpp, vLLM, LangChain, CrewAI, AutoGen – all shipped. torch is now optional.

AVP v0.4.0 ships 4 engine backends, 3 framework integrations, and makes torch an optional dependency. pip install avp[ollama] is 85 MB instead of 3 GB.

New engines

Ollama – use models you already have:

from avp.connectors.ollama import OllamaConnector

researcher = OllamaConnector.from_ollama("qwen2.5:7b")
solver = OllamaConnector.from_ollama("llama3.2:3b")
ctx = researcher.think("Analyze this", steps=10)
answer = solver.generate("Solve it", context=ctx, source=researcher, cross_model=True)

llama.cpp – any GGUF file, CPU or GPU. No torch, no forks, no custom builds.

vLLM – production latent communication via KV connector + model plugin. Qwen2, Llama, Mistral, Gemma. CUDA graphs validated.

New frameworks

| Framework | Integration             | Install        |
| --------- | ----------------------- | -------------- |
| LangChain | ChatAVP                 | avp[langchain] |
| CrewAI    | AVPLLM                  | avp[crewai]    |
| AutoGen   | AVPChatCompletionClient | avp[autogen]   |

torch is optional

Projection math rewritten in numpy. Pick what you need:

pip install avp[ollama]     # 85 MB – local GGUF models
pip install avp[hf]         # 625 MB – HuggingFace models
pip install avp[vllm]       # ~2 GB – production serving

Breaking changes

  • pack(), unpack(), PackedMessage removed (deprecated since v0.3.0 – use think()/generate())
  • PackMetrics, UnpackMetrics removed (use ThinkMetrics/GenerateMetrics)
  • Python >=3.10 required (was >=3.9)
  • transformers>=5.0 required for [hf] extra (was >=4.36)
  • RIDGE and PROCRUSTES removed from ProjectionMethod enum
  • Base pip install avp no longer includes torch – use avp[hf] for HuggingFace models

Also in this release

  • Docs rewritten with per-engine code examples for every backend
  • Protocol spec synced to v0.4
  • 493 tests, all CI green

Full changelog: CHANGELOG.md

v0.3.2

13 Mar 06:40

What's New

  • Colab quickstart notebook — notebooks/avp_quick_start.ipynb. Runs on a free T4 GPU in ~8 minutes. Compares direct, latent, and text chains on 10 GSM8K problems.
  • think() and generate() can now use different prompts – e.g., researcher prompt for think(), solver prompt for generate().
  • Cross-model projection is now opt-in – pass cross_model=True to enable Rosetta Stone projection.

Bug Fixes

  • Critical: prompt_len bug in connector.generate() – prompt length was computed after extending the attention mask with KV-cache entries, causing empty or truncated output when using context=.
  • Easy API cross-model path dropped user-provided context= and ignored store/store_key/prior_key.

Install

pip install avp==0.3.2

Full changelog: https://github.com/VectorArc/avp-python/blob/main/CHANGELOG.md

v0.3.1

08 Mar 07:36

Fix protobuf compatibility

Removes the protobuf gencode version check from avp_pb2.py that required protobuf >=6.31.1 at runtime. AVP now works with protobuf >=4.21 as declared in dependencies.

This fixes pip install avp on Google Colab and other environments running protobuf 4.x or 5.x.

Install

pip install avp==0.3.1

Full changelog: CHANGELOG.md

v0.3.0

07 Mar 20:53

AVP v0.3.0 — the think() / generate() release.

Highlights

New API. think() and generate() replace pack() / unpack(). Zero-friction entry point:

import avp

answer = avp.generate("Solve: 24 * 17 + 3", model="Qwen/Qwen2.5-7B-Instruct")

Cross-model transfer, zero ceremony. One parameter handles model loading, handshake, calibration, and projection:

answer = avp.generate("Solve: 24 * 17 + 3",
                       model="meta-llama/Llama-3.2-3B-Instruct",
                       source_model="Qwen/Qwen2.5-7B-Instruct")

Install just works. pip install avp — torch and transformers are now required deps. No extras needed for core functionality.

Results

| Benchmark                   | Direct | Latent (AVP) | Text  |
| --------------------------- | ------ | ------------ | ----- |
| HumanEval (Qwen 7B, n=164)  | 58.5%  | 67.1%        | 53.0% |
| GSM8K (Qwen 7B, n=200)      | 91.0%  | 90.5%        | 87.0% |
| DebugBench (Qwen 7B, n=100) | 50.0%  | 51.0%        | 49.0% |

+8.6pp on code generation (p=0.029). 46-78% fewer tokens. 2-4x faster.

Cross-model (zero training, 6 KB wire):

| Source → Target    | GSM8K | HumanEval |
| ------------------ | ----- | --------- |
| Llama 3B → Qwen 7B | 90.0% | 79.3%     |
| Qwen 7B → Llama 3B | 74.5% | 47.0%     |

What's New

Added

  • think() / generate() API — replaces pack() / unpack()
  • Cross-model source= parameter — connector.generate(prompt, context=ctx, source=other)
  • Easy API cross-model — avp.generate(prompt, model=target, source_model=source)
  • ContextStore — thread-safe, TTL-backed store for multi-turn latent conversations
  • avp.inspect(data) — decode AVP binary header/metadata without loading models
  • Debug mode — debug=True surfaces TransferDiagnostics: norm trajectory, projection metrics, quality gate
  • Always-on warnings — RuntimeWarning for empty output, NaN/Inf in hidden states
  • Vocabulary-overlap projection — cross-family zero-parameter projection (~85% shared BPE tokens for Qwen/Llama)
  • Per-transfer quality gate — assess_transfer() recommends latent vs JSON based on prompt length
  • Projection validation — cosine similarity + pseudo-perplexity two-tier gate
  • vLLM connector (experimental) — text generation and identity extraction work; KV-cache transfer plugin not yet validated end-to-end
  • 8 benchmark suites — GSM8K, HotpotQA, MATH, HumanEval, ClassEval, DebugBench with cloud results
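A thread-safe, TTL-backed store of the kind ContextStore describes can be sketched as follows; the put/get names and the TTL default are assumptions for illustration, not the library's API:

```python
import threading
import time

class ContextStore:
    """Sketch of a thread-safe store whose entries expire after a TTL,
    suitable for keeping latent contexts across conversation turns."""

    def __init__(self, ttl_seconds: float = 300.0):
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._items = {}  # key -> (value, expires_at)

    def put(self, key, value):
        with self._lock:
            self._items[key] = (value, time.monotonic() + self._ttl)

    def get(self, key, default=None):
        with self._lock:
            entry = self._items.get(key)
            if entry is None:
                return default
            value, expires_at = entry
            if time.monotonic() > expires_at:
                del self._items[key]  # lazily evict expired entries
                return default
            return value
```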

Changed

  • API rename: pack() → think(), unpack() → generate() (old names still work with deprecation warnings)
  • Protocol version bumped to 0.3.0
  • CommunicationMode simplified to LATENT = 0, JSON = 1
  • Package extras — torch/transformers now required. [vllm] extra for production serving. Removed [latent], [hf], [demo], [all]

Removed

  • Hybrid mode — wire format bundling latent + text fallback (never consumed)
  • Universal representation mode — learned cross-model adapters (validated negative: 0% accuracy)
  • FallbackRequest, FallbackRequested, bytes_to_embedding(), confidence_score — unused code
  • v0.1.0 proto backward-compat fields

Fixed

  • Tied-weight models — softmax projection fixes cosine similarity from ~0.24 to ~1.0
  • Vocab size mismatch — truncation to shared prefix for Qwen 7B vs 1.5B
  • KV-cache serialization — bfloat16 support and transformers 5.x compatibility
  • Cross-platform — Windows console encoding, MPS device detection, pre-Ampere GPU support

Full changelog

See CHANGELOG.md for all versions.

v0.2.3

02 Mar 05:49

AVP Python SDK v0.2.3

Multi-agent text handoffs discard KV-cache, embeddings, and attention state the previous agent already computed. AVP transfers that state directly — 51-78% fewer tokens, 1.5-5x faster, across models and families.

Cross-Model Communication (Phase 4)

  • Cross-family vocabulary overlap projection: Transfer hidden states between different model families (e.g. Qwen → Llama) via shared BPE tokens (~85% overlap). Zero training needed.
  • Handshake auto-discovery: CompatibilityResolver.resolve() now auto-detects vocab overlap and selects the right projection method.
  • Pre-indexed lm_head optimization: ~15% faster projection by pre-indexing shared vocabulary at calibration time.
  • Configurable projection temperature: projection_temperature parameter for softmax tuning in cross-model projection.
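The vocabulary-overlap projection with a tunable softmax temperature can be sketched in numpy; all names and shapes here are illustrative assumptions, not the SDK's internals:

```python
import numpy as np

def vocab_overlap_project(hidden, src_lm_head, tgt_embed,
                          src_idx, tgt_idx, temperature=1.0):
    """Score the source hidden state against the shared-token rows of the
    source lm_head, softmax with a temperature, then mix the matching
    target embedding rows into a target-space vector."""
    logits = src_lm_head[src_idx] @ hidden          # [n_shared]
    z = (logits - logits.max()) / temperature       # stabilized softmax
    probs = np.exp(z)
    probs /= probs.sum()
    return probs @ tgt_embed[tgt_idx]               # [tgt_dim]
```

Pre-indexing `src_lm_head[src_idx]` once at calibration time, rather than on every call, is the kind of optimization the pre-indexed lm_head bullet above refers to.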

Cross-Model Benchmark Results (A100, n=50)

| Direction           | GSM8K 2-Agent | HotpotQA | Fan-out |
| ------------------- | ------------- | -------- | ------- |
| Qwen 7B → Llama 3B  | 72%           | 10%      | 34%     |
| Llama 3B → Qwen 7B  | 88%           | 22%      | 48%     |
| Qwen 7B → Qwen 1.5B | 74%           | 8%       | 34%     |
| Qwen 1.5B → Qwen 7B | 88%           | 22%      | 50%     |

Cross-model accuracy tracks solver (target model) capability. Full results: BENCHMARKS.md

Developer Experience

  • Fixed Connector API docs: think() and generate() examples now use consistent prompts (mismatched prompts caused empty output)
  • CommunicationMode display: Now shows LATENT instead of 0
  • API reference: Added generate(), ContextStore to docs
  • Dead code cleanup: Removed unused imports, functions, and duplicate helpers
  • Fixed vLLM dependency: >=0.15.0 (was >=0.8.0)
  • Expanded __all__: All cross-model exports accessible via avp.*

Stats

  • 398 tests passing
  • 5 models validated: Qwen2.5 (1.5B, 7B), DeepSeek-R1 (1.5B), Llama 3.2 (1B, 3B)
  • 2 model families: Qwen, Llama

Install

pip install avp                    # core
pip install "avp[latent]"          # + torch/transformers
pip install "avp[vllm]"            # + vLLM 0.15+ connector

Full documentation: README · Benchmarks · Spec