Releases: VectorArc/avp-python
v0.6.1
Added
- `generate_on_context()` — Third latent primitive on `LlamaCppConnector`. Autoregressive generation on a caller-owned context with streaming via `token_callback`, `n_ctx` awareness for capacity checking, `extra_stop_strings` for custom stops, and `generated_ids` in the return tuple. Completes the create/think/generate primitive set alongside `create_inference_context()` and `run_latent_steps()`.
- `tokenize(add_bos=True)` — Optional `add_bos` parameter on `tokenize()` across all connectors (ABC, LlamaCpp, HuggingFace, vLLM). Default `False` preserves backward compatibility. Use `True` when tokenizing for manual decoding onto a fresh context.
Changed
- `_generate_on_think_ctx` refactored — Now delegates to `generate_on_context()` for the generation loop. Context lifecycle (free/keep) is still managed by the wrapper. No behavior change for existing callers.
Fixed
- `run_latent_steps` docstring — Fixed duplicate Args section and incorrect default value (was 10, should be 20).
- `_generate_on_think_ctx` `n_cur` scoping — Fixed potential `UnboundLocalError` in the finally block if `generate_on_context` raised.
- Closed context validation — `generate_on_context` raises `ValueError` on a closed `LlamaCppInferenceContext` instead of segfaulting.
- HF `add_bos` semantics — `tokenize(add_bos=True)` on HuggingFace now prepends only the BOS token (not all special tokens via `add_special_tokens=True`).
v0.6.0
v0.6.0 — Public Connector API
Highlights
- Public connector API — `hidden_dim`, `num_layers`, `context_length`, `vocab_size`, `device`, `dtype`, `tokenize()`, `detokenize()`, `apply_chat_template()`, `stop_token_ids`, `stop_strings` on the `EngineConnector` ABC
- OutputType/PayloadType split — API-level enum (`OutputType`) separated from wire-level enum (`PayloadType`). `PayloadType.AUTO` removed.
- LlamaCpp latent primitives — `create_inference_context()`, `run_latent_steps()`, `keep_context=True`, `grammar=` for constrained generation
- 11 bug fixes, including a critical HF `inject_and_generate()` crash and a LlamaCpp use-after-free
Breaking changes
- `tokenize()` returns `List[int]` (was `Any`/`torch.Tensor`)
- `PayloadType.AUTO` removed — use `OutputType.AUTO`
- `extract_hidden_state()`/`inject_and_generate()` removed from the ABC
- LlamaCpp `tokenize()` no longer adds BOS
See CHANGELOG.md for full details.
v0.5.1
v0.5.1 — Flexible Transfer Granularity
Added
- `output=PayloadType` on `think()` — Controls what the returned context contains:
  - `PayloadType.AUTO` (default): system decides
  - `PayloadType.KV_CACHE`: full KV-cache + hidden state
  - `PayloadType.HIDDEN_STATE`: only the last hidden state `[1, D]` (~14 KB vs ~76 MB, KV freed)
- `PayloadType.AUTO` — SDK-only sentinel (-1), resolves at runtime, never serialized
- `AVPContext.payload_type` property — Derived from the data present
- Same-model hidden state injection — `generate()` accepts hidden-state-only contexts for same-model transfer (previously cross-model only)
- Type validation for `output=` with an actionable `ConfigurationError`
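The ~14 KB vs ~76 MB gap above can be sanity-checked with back-of-envelope arithmetic. A hidden-state-only payload is a single `[1, D]` float32 vector, while a KV-cache grows with sequence length. The hidden size, layer count, and KV dimension below are assumptions chosen to illustrate the magnitudes, not values taken from the SDK:

```python
# Back-of-envelope payload sizes (illustrative; D, n_layers, kv_dim are assumptions).
D = 3584                 # hypothetical hidden size of a ~7B model
FLOAT32 = 4              # bytes per element

hidden_state_bytes = 1 * D * FLOAT32      # one [1, D] last hidden state
print(hidden_state_bytes)                 # 14336 bytes, i.e. ~14 KB

def kv_cache_bytes(n_layers: int, kv_dim: int, seq_len: int, bytes_per_elem: int = 2) -> int:
    """Rough fp16 KV-cache size for one sequence: K and V tensors per layer."""
    return 2 * n_layers * kv_dim * seq_len * bytes_per_elem

# With assumed values (28 layers, kv_dim 512), ~1400 cached tokens already
# reach the tens-of-megabytes range of the KV_CACHE payload.
print(kv_cache_bytes(n_layers=28, kv_dim=512, seq_len=1400) // 2**20, "MiB")
```

The point of the arithmetic: the hidden-state payload is constant in size, while the KV-cache payload scales linearly with cached tokens, which is why freeing the KV for `HIDDEN_STATE` transfers saves three to four orders of magnitude.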
Removed
- `PayloadType.EMBEDDING` — Removed from the SDK and proto schema (dead code, never used)
Install
```shell
pip install avp==0.5.1
```
Full changelog: https://github.com/VectorArc/avp-python/blob/main/CHANGELOG.md
v0.4.2
What's New
model= accepts Union[str, EngineConnector] — The Easy API (think(), generate()) now accepts either a model name string or a pre-built EngineConnector instance. All backends (Ollama, llama.cpp, vLLM, HuggingFace) are now first-class in the Easy API.
```python
import avp
from avp import OllamaConnector

# With any connector
conn = OllamaConnector.from_ollama("qwen2.5:7b")
context = avp.think("Analyze this", model=conn)
answer = avp.generate("Solve it", model=conn, context=context)

# With a model name (still works, auto-creates a HuggingFace backend)
context = avp.think("Analyze this", model="Qwen/Qwen2.5-7B-Instruct")
```
Added
- `ModelSpec` type alias — `Union[str, EngineConnector]`, importable from top-level `avp`
- `EngineConnector` top-level export — `from avp import EngineConnector` now works
- `can_think` validation in `generate()` — clear error with an actionable message when a connector without think support is passed with `steps > 0`
- transformers 5.4 compatibility — removed explicit `cache_position` kwarg (now managed internally by `generate()`)
- 19 new tests for connector parameter handling and backward compatibility
Changed
- `generate()` reduced from 17 to 15 parameters
- `source_model=` also accepts `Union[str, EngineConnector]` for cross-model projection
- Framework integrations (ChatAVP, AVPLLM, AVPChatCompletionClient) resolve connectors internally
Full Changelog
v0.4.1
API stability release. All public APIs audited against stable protocol design principles (Protobuf, Arrow, gRPC). 33 issues found and fixed. 500 tests pass, cloud validated on A100.
Highlights
Stable return types — think() and generate() now return ThinkResult and GenerateResult objects instead of Union types. GenerateResult is a str subclass, so all existing string operations work. Access metrics via result.metrics instead of tuple unpacking.
```python
result = avp.generate("Solve: 2+2", model="Qwen/Qwen2.5-7B-Instruct", collect_metrics=True)
print(result)          # works — it's a str
print(result.metrics)  # GenerateMetrics
```
Payload integrity — CRC32 checksum on all wire payloads. Catches corruption and truncation. Zero overhead for same-process transfers (optional field).
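The checksum scheme is a standard trailer pattern: append a CRC32 of the payload bytes and verify it on decode. A minimal stdlib sketch of the idea (the function names and wire layout here are illustrative, not AVP's actual field encoding):

```python
import struct
import zlib

def encode_payload(data: bytes) -> bytes:
    """Append a little-endian CRC32 trailer so the receiver can detect corruption."""
    return data + struct.pack("<I", zlib.crc32(data) & 0xFFFFFFFF)

def decode_payload(wire: bytes) -> bytes:
    """Strip and verify the CRC32 trailer; raise on any mismatch."""
    data, (expected,) = wire[:-4], struct.unpack("<I", wire[-4:])
    if zlib.crc32(data) & 0xFFFFFFFF != expected:
        raise ValueError("payload checksum mismatch (corrupted or truncated)")
    return data

wire = encode_payload(b"latent-state-bytes")
assert decode_payload(wire) == b"latent-state-bytes"

# A single flipped byte is caught:
corrupted = bytes([wire[0] ^ 0xFF]) + wire[1:]
try:
    decode_payload(corrupted)
except ValueError:
    print("corruption detected")
```

Truncation is caught for free because dropping trailing bytes shifts which four bytes are read as the checksum, which then no longer matches the remaining data.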
Simpler connector API — EngineConnector ABC reduced from 6 required methods to 1. Writing a custom connector now requires only get_model_identity(). Extension policy documented: new methods will always have defaults.
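The "one required method, everything else defaulted" design is the classic ABC-with-defaults pattern: only one abstract method, and every later extension ships with a safe default so old subclasses keep working. A toy sketch of the shape (class name and defaults here are illustrative, not the real `EngineConnector` surface):

```python
from abc import ABC, abstractmethod

class ToyConnector(ABC):
    """Illustrative ABC: one required method; extensions get safe defaults."""

    @abstractmethod
    def get_model_identity(self) -> dict:
        """The single method a custom connector must implement."""

    # Methods added later default to "capability not provided", so existing
    # subclasses are never broken by new API surface.
    def tokenize(self, text: str) -> list:
        raise NotImplementedError("optional capability not provided")

    @property
    def can_think(self) -> bool:
        return False

class MyConnector(ToyConnector):
    def get_model_identity(self) -> dict:
        return {"name": "my-model", "hidden_dim": 1024}

conn = MyConnector()
print(conn.get_model_identity()["name"])   # my-model
print(conn.can_think)                      # False
```

The design choice this illustrates is the extension policy stated above: because defaults are mandatory for new methods, the only contract a third-party connector depends on is the single abstract method.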
Breaking changes
These are pre-launch changes with zero known external users affected.
- `generate(content=)` renamed to `generate(prompt=)`. Old name works with a deprecation warning.
- `think()` returns `ThinkResult` (delegates to `AVPContext` via `__getattr__`). Tuple unpacking still works: `ctx, metrics = avp.think(...)`.
- `generate()` returns `GenerateResult` (str subclass). `text, metrics = avp.generate(...)` tuple unpacking no longer works — use `result.metrics`.
- `AVPContext` requires keyword-only construction.
- `ConfigurationError` replaces bare `TypeError`/`ValueError` in the easy API. Catchable via `except avp.AVPError`.
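The str-subclass-with-metrics return type is easy to picture in isolation: string code keeps working unchanged, and metrics ride along as an attribute. A self-contained sketch of the pattern (a toy class, not the actual `GenerateResult` implementation):

```python
class ToyGenerateResult(str):
    """A str subclass carrying metrics, so existing string code keeps working."""

    def __new__(cls, text: str, metrics=None):
        # str is immutable, so the attribute is attached in __new__.
        obj = super().__new__(cls, text)
        obj.metrics = metrics
        return obj

result = ToyGenerateResult("The answer is 4.", metrics={"tokens": 7})

# All string operations still work...
assert result.upper() == "THE ANSWER IS 4."
assert "answer" in result

# ...and metrics are an attribute instead of a second tuple element.
assert result.metrics == {"tokens": 7}
```

This is why the migration cost is asymmetric: code that treated the old return value as a string needs no change, while only tuple-unpacking call sites must switch to `result.metrics`.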
Bug fixes
- `OllamaConnector.get_model_identity()` used wrong field names — runtime crash
- `LlamaCppConnector.get_model_identity()` — same bug
- Codec silently corrupted data on unknown dtype values — now raises `DecodeError`
- `to_bytes()` hardcoded FLOAT32 regardless of the actual tensor dtype
- Framework integrations (LangChain, CrewAI, AutoGen) stored the wrong type in `ContextStore`
Install
pip install --upgrade avp
Full changelog: CHANGELOG.md
v0.4.0
Ollama, llama.cpp, vLLM, LangChain, CrewAI, AutoGen – all shipped. torch is now optional.
AVP v0.4.0 ships 4 engine backends, 3 framework integrations, and makes torch an optional dependency. pip install avp[ollama] is 85 MB instead of 3 GB.
New engines
Ollama – use models you already have:
```python
from avp.connectors.ollama import OllamaConnector

researcher = OllamaConnector.from_ollama("qwen2.5:7b")
solver = OllamaConnector.from_ollama("llama3.2:3b")

ctx = researcher.think("Analyze this", steps=10)
answer = solver.generate("Solve it", context=ctx, source=researcher, cross_model=True)
```
llama.cpp – any GGUF file, CPU or GPU. No torch, no forks, no custom builds.
vLLM – production latent communication via KV connector + model plugin. Qwen2, Llama, Mistral, Gemma. CUDA graphs validated.
New frameworks
| Framework | Integration | Install |
|---|---|---|
| LangChain | ChatAVP | `avp[langchain]` |
| CrewAI | AVPLLM | `avp[crewai]` |
| AutoGen | AVPChatCompletionClient | `avp[autogen]` |
torch is optional
Projection math rewritten in numpy. Pick what you need:
```shell
pip install avp[ollama]   # 85 MB – local GGUF models
pip install avp[hf]       # 625 MB – HuggingFace models
pip install avp[vllm]     # ~2 GB – production serving
```
Breaking changes
- `pack()`, `unpack()`, `PackedMessage` removed (deprecated since v0.3.0 – use `think()`/`generate()`)
- `PackMetrics`, `UnpackMetrics` removed (use `ThinkMetrics`/`GenerateMetrics`)
- Python >=3.10 required (was >=3.9)
- `transformers>=5.0` required for the `[hf]` extra (was >=4.36)
- `RIDGE` and `PROCRUSTES` removed from the `ProjectionMethod` enum
- Base `pip install avp` no longer includes torch – use `avp[hf]` for HuggingFace models
Also in this release
- Docs rewritten with per-engine code examples for every backend
- Protocol spec synced to v0.4
- 493 tests, all CI green
Full changelog: CHANGELOG.md
v0.3.2
What's New
- Colab quickstart notebook – `notebooks/avp_quick_start.ipynb`. Runs on a free T4 GPU in ~8 minutes. Compares direct, latent, and text chains on 10 GSM8K problems.
- `think()` and `generate()` can now use different prompts – e.g., a researcher prompt for `think()`, a solver prompt for `generate()`.
- Cross-model projection is now opt-in – pass `cross_model=True` to enable Rosetta Stone projection.
Bug Fixes
- Critical: `prompt_len` bug in `connector.generate()` – prompt length was computed after extending the attention mask with KV-cache entries, causing empty or truncated output when using `context=`.
- Easy API cross-model path dropped user-provided `context=` and ignored `store`/`store_key`/`prior_key`.
Install
```shell
pip install avp==0.3.2
```
Full changelog: https://github.com/VectorArc/avp-python/blob/main/CHANGELOG.md
v0.3.1
Fix protobuf compatibility
Removes the protobuf gencode version check from avp_pb2.py that required protobuf >=6.31.1 at runtime. AVP now works with protobuf >=4.21 as declared in dependencies.
This fixes pip install avp on Google Colab and other environments running protobuf 4.x or 5.x.
Install
```shell
pip install avp==0.3.1
```
Full changelog: CHANGELOG.md
v0.3.0
AVP v0.3.0 — the think() / generate() release.
Highlights
New API. think() and generate() replace pack() / unpack(). Zero-friction entry point:
```python
import avp

answer = avp.generate("Solve: 24 * 17 + 3", model="Qwen/Qwen2.5-7B-Instruct")
```
Cross-model transfer, zero ceremony. One parameter handles model loading, handshake, calibration, and projection:
```python
answer = avp.generate("Solve: 24 * 17 + 3",
                      model="meta-llama/Llama-3.2-3B-Instruct",
                      source_model="Qwen/Qwen2.5-7B-Instruct")
```
Install just works. `pip install avp` — torch and transformers are now required deps. No extras needed for core functionality.
Results
| | Direct | Latent (AVP) | Text |
|---|---|---|---|
| HumanEval (Qwen 7B, n=164) | 58.5% | 67.1% | 53.0% |
| GSM8K (Qwen 7B, n=200) | 91.0% | 90.5% | 87.0% |
| DebugBench (Qwen 7B, n=100) | 50.0% | 51.0% | 49.0% |
+8.6pp on code generation (p=0.029). 46-78% fewer tokens. 2-4x faster.
Cross-model (zero training, 6 KB wire):
| Source → Target | GSM8K | HumanEval |
|---|---|---|
| Llama 3B → Qwen 7B | 90.0% | 79.3% |
| Qwen 7B → Llama 3B | 74.5% | 47.0% |
What's New
Added
- `think()`/`generate()` API — replaces `pack()`/`unpack()`
- Cross-model `source=` parameter — `connector.generate(prompt, context=ctx, source=other)`
- Easy API cross-model — `avp.generate(prompt, model=target, source_model=source)`
- `ContextStore` — thread-safe, TTL-backed store for multi-turn latent conversations
- `avp.inspect(data)` — decode the AVP binary header/metadata without loading models
- Debug mode — `debug=True` surfaces `TransferDiagnostics`: norm trajectory, projection metrics, quality gate
- Always-on warnings — `RuntimeWarning` for empty output, NaN/Inf in hidden states
- Vocabulary-overlap projection — cross-family zero-parameter projection (~85% shared BPE tokens for Qwen/Llama)
- Per-transfer quality gate — `assess_transfer()` recommends latent vs JSON based on prompt length
- Projection validation — cosine similarity + pseudo-perplexity two-tier gate
- vLLM connector (experimental) — text generation and identity extraction work; KV-cache transfer plugin not yet validated end-to-end
- 8 benchmark suites — GSM8K, HotpotQA, MATH, HumanEval, ClassEval, DebugBench with cloud results
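A thread-safe, TTL-backed store of the kind `ContextStore` describes can be sketched with nothing but the stdlib. This toy version (lock-guarded dict, lazy expiry on read) illustrates the pattern only; the SDK's actual class and method names may differ:

```python
import threading
import time

class ToyContextStore:
    """Minimal thread-safe store with per-entry TTL expiry (illustrative)."""

    def __init__(self, default_ttl: float = 300.0):
        self._lock = threading.Lock()
        self._data = {}                      # key -> (context, expiry deadline)
        self._default_ttl = default_ttl

    def put(self, key: str, ctx, ttl: float = None) -> None:
        deadline = time.monotonic() + (ttl if ttl is not None else self._default_ttl)
        with self._lock:
            self._data[key] = (ctx, deadline)

    def get(self, key: str):
        with self._lock:
            entry = self._data.get(key)
            if entry is None:
                return None
            ctx, deadline = entry
            if time.monotonic() >= deadline:  # lazily evict expired entries
                del self._data[key]
                return None
            return ctx

store = ToyContextStore()
store.put("turn-1", {"hidden": [0.1, 0.2]}, ttl=0.05)
assert store.get("turn-1") is not None
time.sleep(0.06)
assert store.get("turn-1") is None            # expired after the TTL
```

Using `time.monotonic()` rather than `time.time()` keeps expiry correct across wall-clock adjustments, which matters for long-lived multi-turn sessions.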
Changed
- API rename: `pack()` → `think()`, `unpack()` → `generate()` (old names still work with deprecation warnings)
- Protocol version bumped to 0.3.0
- `CommunicationMode` simplified to `LATENT = 0`, `JSON = 1`
- Package extras — torch/transformers now required. `[vllm]` extra for production serving. Removed `[latent]`, `[hf]`, `[demo]`, `[all]`
Removed
- Hybrid mode — wire format bundling latent + text fallback (never consumed)
- Universal representation mode — learned cross-model adapters (validated negative: 0% accuracy)
- `FallbackRequest`, `FallbackRequested`, `bytes_to_embedding()`, `confidence_score` — unused code
- v0.1.0 proto backward-compat fields
Fixed
- Tied-weight models — softmax projection fixes cosine similarity from ~0.24 to ~1.0
- Vocab size mismatch — truncation to shared prefix for Qwen 7B vs 1.5B
- KV-cache serialization — bfloat16 support and transformers 5.x compatibility
- Cross-platform — Windows console encoding, MPS device detection, pre-Ampere GPU support
Full changelog
See CHANGELOG.md for all versions.
v0.2.3
AVP Python SDK v0.2.3
Multi-agent text handoffs discard KV-cache, embeddings, and attention state the previous agent already computed. AVP transfers that state directly — 51-78% fewer tokens, 1.5-5x faster, across models and families.
Cross-Model Communication (Phase 4)
- Cross-family vocabulary overlap projection: Transfer hidden states between different model families (e.g. Qwen → Llama) via shared BPE tokens (~85% overlap). Zero training needed.
- Handshake auto-discovery: `CompatibilityResolver.resolve()` now auto-detects vocab overlap and selects the right projection method.
- Pre-indexed lm_head optimization: ~15% faster projection by pre-indexing the shared vocabulary at calibration time.
- Configurable projection temperature: `projection_temperature` parameter for softmax tuning in cross-model projection.
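The building blocks above reduce to two small operations: intersecting the two tokenizers' vocabularies to get shared-token index pairs, and a temperature-scaled softmax over logits restricted to those shared tokens. A toy sketch with made-up four-token vocabularies (pure Python; not the SDK's implementation):

```python
import math

# Hypothetical tokenizer vocabularies: token string -> token id.
vocab_a = {"the": 0, "cat": 1, "sat": 2, "<special_a>": 3}
vocab_b = {"the": 10, "cat": 11, "ran": 12, "<special_b>": 13}

# Shared tokens anchor the projection; each yields an (id_in_a, id_in_b) pair
# that can be pre-indexed once at calibration time.
shared = sorted(set(vocab_a) & set(vocab_b))
index_pairs = [(vocab_a[t], vocab_b[t]) for t in shared]
overlap = len(shared) / min(len(vocab_a), len(vocab_b))
print(shared, f"overlap={overlap:.0%}")

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower T sharpens, higher T flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Softmaxing logits restricted to shared-token indices is the zero-parameter
# projection step; temperature is the tuning knob mentioned above.
logits_over_shared = [2.0, 0.5]
print(softmax(logits_over_shared, temperature=1.0))
print(softmax(logits_over_shared, temperature=5.0))   # flatter distribution
```

On real Qwen/Llama tokenizers the same intersection reportedly covers ~85% of BPE tokens, which is what makes the training-free projection viable.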
Cross-Model Benchmark Results (A100, n=50)
| Direction | GSM8K 2-Agent | HotpotQA | Fan-out |
|---|---|---|---|
| Qwen 7B → Llama 3B | 72% | 10% | 34% |
| Llama 3B → Qwen 7B | 88% | 22% | 48% |
| Qwen 7B → Qwen 1.5B | 74% | 8% | 34% |
| Qwen 1.5B → Qwen 7B | 88% | 22% | 50% |
Cross-model accuracy tracks solver (target model) capability. Full results: BENCHMARKS.md
Developer Experience
- Fixed Connector API docs: `think()` and `generate()` examples now use consistent prompts (mismatched prompts caused empty output)
- `CommunicationMode` display: Now shows `LATENT` instead of `0`
- API reference: Added `generate()`, `ContextStore` to docs
- Dead code cleanup: Removed unused imports, functions, and duplicate helpers
- Fixed vLLM dependency: `>=0.15.0` (was `>=0.8.0`)
- Expanded `__all__`: All cross-model exports accessible via `avp.*`
Stats
- 398 tests passing
- 5 models validated: Qwen2.5 (1.5B, 7B), DeepSeek-R1 (1.5B), Llama 3.2 (1B, 3B)
- 2 model families: Qwen, Llama
Install
```shell
pip install avp              # core
pip install "avp[latent]"    # + torch/transformers
pip install "avp[vllm]"      # + vLLM 0.15+ connector
```
Full documentation: README · Benchmarks · Spec