All notable changes to the AVP Python SDK are documented in this file.
Format follows Keep a Changelog. Versions follow Semantic Versioning.
- Public connector API – `hidden_dim`, `num_layers`, `context_length`, `vocab_size`, `device`, `dtype` properties on the `EngineConnector` ABC.
- Tokenization API – `tokenize()` (returns `List[int]`), `detokenize()`, `apply_chat_template()` on the ABC. `can_tokenize` capability flag.
- Stop conditions – `stop_token_ids`, `stop_strings` properties on the ABC.
- `OutputType` enum – API-level enum (AUTO/KV_CACHE/HIDDEN_STATE) for `think(output=)`. Replaces `PayloadType.AUTO`.
- LlamaCpp `think(context=)` – Continue thinking from a prior context (ABC compliance).
- LlamaCpp `think(prompt=Union[str, List[Dict]])` – Accept chat message format (ABC compliance).
- LlamaCpp `grammar=` – GBNF grammar on `generate()` for constrained generation. All code paths.
- LlamaCpp `keep_context=True` – Preserve live context after `generate()`.
- LlamaCpp latent primitives – `create_inference_context()`, `release_inference_context()`, `run_latent_steps()` for consumers managing their own context lifecycle.
- `LlamaCppInferenceContext` – Dataclass with `close()`/context manager for C context handles.
- `OutputType.resolve()` – Single method for the OutputType → PayloadType mapping.
- `**kwargs` forwarding – Easy API and all connectors forward engine-specific kwargs.
- Ollama `/api/show` fallback – Resolves custom models created via `ollama create`.
- ContextStore TTL warning – Logs expiry instead of a silent `None` return.
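The `OutputType.resolve()` entry above centralizes the mapping from the API-level enum to the wire-only payload type. A minimal sketch of that shape, using the names from this changelog; the member values and the AUTO heuristic shown here are assumptions for illustration, not the SDK's actual definitions:

```python
from enum import Enum


class PayloadType(Enum):
    """Wire-only payload types (0 and 1), per the entries above."""
    HIDDEN_STATE = 0
    KV_CACHE = 1


class OutputType(Enum):
    """API-level enum for think(output=); AUTO is never serialized."""
    AUTO = -1
    KV_CACHE = 0
    HIDDEN_STATE = 1

    def resolve(self, prefer_kv_cache: bool = True) -> PayloadType:
        # Single place for the OutputType -> PayloadType mapping.
        # The AUTO rule (prefer KV-cache when the engine supports it)
        # is our assumption, not the SDK's documented behavior.
        if self is OutputType.KV_CACHE:
            return PayloadType.KV_CACHE
        if self is OutputType.HIDDEN_STATE:
            return PayloadType.HIDDEN_STATE
        return PayloadType.KV_CACHE if prefer_kv_cache else PayloadType.HIDDEN_STATE
```

Callers then pass `OutputType` everywhere and only the encoder ever sees a `PayloadType`.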
- HF `inject_and_generate()` crash – Was missing the `do_sample` parameter. TypeError on any `HIDDEN_STATE` context (CRITICAL).
- LlamaCpp `model_hash=""` – Now computed from GGUF metadata.
- LlamaCpp `get_embedding_weights()` – Was returning `(None, None)`. Now returns actual GGUF weights.
- LlamaCpp `tokenize()` – Now uses `add_bos=False, special=True` (was `add_bos=True`).
- `to_bytes()` hardcoded `PayloadType.KV_CACHE` – Now uses `self.payload_type`. Rejects HIDDEN_STATE contexts.
- HF `think(steps=0)` ValueError – Now allowed for pure KV prefill.
- HF NaN latent steps – `num_steps` reflects the actual completed steps, not the requested count.
- `ThinkResult.__getattr__` recursion – Guard for `context=None`.
- `GenerateResult` pickle – Added `__reduce__` for serialization.
- `ModelIdentity.hidden_dim=0` warning – Warns in `extract_model_identity()`.
- LlamaCpp `think(context=)` use-after-free – Finalizer ownership transfer on context reuse.
- `steps_completed` NameError – After the dedup refactor of `think()` calling `run_latent_steps()`.
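The use-after-free fix above hinges on transferring finalizer ownership when a live C context handle is reused. A toy sketch of the pattern with `weakref.finalize`; the class and helper names are ours, not the SDK's:

```python
import weakref

FREED = []  # records which handles were released, for demonstration


def _free_handle(handle_id: int) -> None:
    # Stand-in for the C-side llama.cpp context free.
    FREED.append(handle_id)


class InferenceContext:
    """Toy sketch of finalizer ownership transfer on context reuse.

    Mirrors the bug described above: if think(context=...) reuses a live
    handle without detaching the donor's finalizer, the handle is freed
    while the new context still uses it.
    """

    def __init__(self, handle_id: int):
        self.handle_id = handle_id
        self._finalizer = weakref.finalize(self, _free_handle, handle_id)

    @classmethod
    def adopt(cls, other: "InferenceContext") -> "InferenceContext":
        # Transfer ownership: detach the donor's finalizer first so the
        # handle is freed exactly once, by the new owner.
        other._finalizer.detach()
        return cls(other.handle_id)


ctx1 = InferenceContext(42)
ctx2 = InferenceContext.adopt(ctx1)
del ctx1            # no free here: ownership was transferred
ctx2._finalizer()   # explicit close frees the handle exactly once
```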
- `PayloadType.AUTO` removed – Use `OutputType.AUTO` instead. `PayloadType` is now wire-only (0 and 1).
- `tokenize()` returns `List[int]` – Was `Any`/`torch.Tensor` on HF/vLLM. Breaking.
- `has_tokenizer` → `can_tokenize` – Matches the `can_think` naming pattern.
- HF `self.device` → property – Internal `self._device` with a property wrapper. Read access unchanged.
- `extract_hidden_state()`/`inject_and_generate()` removed from ABC – Kept on the HF connector only.
- `steps=20` harmonized – LlamaCpp was 10, now 20 everywhere.
- Stale `0.4.2` version string in `handshake.py` – Now uses the `AVP_VERSION_STRING` constant.
- `output=PayloadType` on `think()` – Controls what the returned context contains. `PayloadType.AUTO` (default) lets the system decide, `PayloadType.KV_CACHE` returns the full KV-cache, `PayloadType.HIDDEN_STATE` returns only the last hidden state vector `[1, D]` (KV-cache freed immediately, ~14 KB vs ~76 MB). Enables flexible transfer granularity for bandwidth-constrained or cross-process scenarios.
- `PayloadType.AUTO` – SDK-only sentinel (-1) that resolves to the optimal payload type at runtime. Never serialized.
- `AVPContext.payload_type` property – Derived from the data present: `KV_CACHE` if `past_key_values` is set, `HIDDEN_STATE` if only a hidden state. Raises `ValueError` on an empty context.
- Same-model hidden state injection in `generate()` – `generate()` now accepts same-model contexts with only a hidden state (previously gated behind `cross_model=True`). Uses the existing embedding injection path.
- Type validation for `output=` in the Easy API – Rejects non-`PayloadType` values with an actionable `ConfigurationError`.
- `PayloadType.EMBEDDING` – Removed from the SDK enum and proto schema. Was never used in production code. Proto value 2 is available for future use.
- `PayloadType` enum reordered: `AUTO = -1`, `HIDDEN_STATE = 0`, `KV_CACHE = 1`.
- Proto schema cleaned: `EMBEDDING` removed from both `avp-python/proto/avp.proto` and `avp-spec/schemas/avp.proto`.
- `model=` accepts `Union[str, EngineConnector]` – The Easy API (`think()`, `generate()`) now accepts either a model name string (auto-creates a `HuggingFaceConnector`) or a pre-built `EngineConnector` instance. All backends (Ollama, llama.cpp, vLLM) are now first-class in the Easy API. `source_model=` widened the same way for cross-model projection.
- `ModelSpec` type alias – `Union[str, EngineConnector]`, importable from the top-level `avp` package.
- `EngineConnector` top-level export – `from avp import EngineConnector` now works (previously required `from avp.connectors.base import EngineConnector`).
- `can_think` validation in `generate()` – Raises `ConfigurationError` with an actionable message (suggests `steps=0`) when a connector without think support is passed with `steps > 0`.
- 19 new tests for connector parameter handling and backward compatibility.
- `generate()` parameter count reduced – 15 parameters (down from 17 in the unreleased `connector=` approach). Removed `connector=` and `source_connector=` before they shipped to PyPI.
- Framework integrations (ChatAVP, AVPLLM, AVPChatCompletionClient) keep separate `model`/`connector` fields for Pydantic compatibility and resolve them internally before calling the Easy API.
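The `Union[str, EngineConnector]` widening above implies a small resolution step inside the Easy API. A hypothetical sketch with stand-in classes (the real `EngineConnector` and `HuggingFaceConnector` live in the SDK; the helper name `resolve_model` is ours):

```python
from typing import Union


class EngineConnector:
    """Stand-in for avp's connector ABC."""

    def __init__(self, name: str):
        self.name = name


class HuggingFaceConnector(EngineConnector):
    """Stand-in for the default backend auto-created from a name string."""


ModelSpec = Union[str, EngineConnector]


def resolve_model(model: ModelSpec) -> EngineConnector:
    # Pass pre-built connectors through untouched; auto-wrap bare
    # model name strings in the default HF backend.
    if isinstance(model, EngineConnector):
        return model
    if isinstance(model, str):
        return HuggingFaceConnector(model)
    raise TypeError(
        f"model= must be str or EngineConnector, got {type(model).__name__}"
    )
```

The same shape would apply to `source_model=`, which the entry above says was widened identically.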
- Result objects – `think()` returns `ThinkResult`, `generate()` returns `GenerateResult` (a str subclass). No more Union return types. Metrics accessible via `result.metrics` instead of tuple unpacking.
- `InspectResult` – `avp.inspect()` returns a typed dataclass instead of `Dict[str, Any]`.
- CRC32 payload checksum – Optional integrity check on wire payloads (proto field 15). Encode always writes it, decode verifies when present. Catches corruption and truncation.
- `ConfigurationError` – New error class for invalid arguments to `think()`/`generate()`. Subclass of `AVPError`, catchable by `except AVPError`.
- `ProjectionError` – New error class for cross-model projection failures. `RealignmentError` kept as an alias.
- `TensorLike`, `KVCache` – Type aliases on `EngineConnector` for documentation.
- `ContextStore.__contains__` and `__len__` – `"key" in store` and `len(store)` now work.
- Spec test vector validation – 8 new tests cross-validate the SDK against published spec hex baselines.
- `generate(prompt=)` replaces `generate(content=)` – `content=` is deprecated with a `DeprecationWarning` and will be removed in v2.0.
- `EngineConnector` ABC simplified – 1 abstract method (`get_model_identity`) instead of 6. All others have concrete defaults. Third-party connectors now need minimal boilerplate.
- `AVPContext` is `kw_only` – Positional construction is no longer allowed.
- `store` parameter typed – `Optional[Any]` changed to `Optional[ContextStore]` in `generate()`.
- `to_bytes()` dtype fix – Reads the actual dtype from the KV-cache header instead of hardcoding FLOAT32.
- Enum decode safety – Unknown wire values for
`DataType`, `PayloadType`, `CommunicationMode` now raise `DecodeError` with an actionable message instead of crashing or silently corrupting.
- Endianness enforced – `embedding_to_bytes()` forces little-endian on big-endian hosts.
- `inject_and_generate` default – `max_new_tokens` aligned to 512 across all connectors (was 256 on some).
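The enum decode safety change above is a classic wire-format guard. A minimal sketch of the pattern with stand-in classes (`DecodeError` and the enum are defined locally here; only the behavior, raise on unknown values with an actionable message, comes from the entry above):

```python
from enum import Enum


class DecodeError(ValueError):
    """Stand-in for avp's decode error class."""


class CommunicationMode(Enum):
    LATENT = 0
    JSON = 1


def decode_enum(enum_cls, wire_value: int):
    # Unknown wire values raise DecodeError with an actionable message
    # instead of crashing or silently corrupting state.
    try:
        return enum_cls(wire_value)
    except ValueError:
        valid = ", ".join(f"{m.name}={m.value}" for m in enum_cls)
        raise DecodeError(
            f"Unknown {enum_cls.__name__} wire value {wire_value}; "
            f"expected one of: {valid}. The payload may come from a "
            f"newer protocol version."
        ) from None
```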
- `OllamaConnector.get_model_identity()` – Used wrong field names (`hidden_size` instead of `hidden_dim`, nonexistent `vocab_size`). Runtime crash.
- `LlamaCppConnector.get_model_identity()` – Same bug as Ollama.
- Integrations stored `ThinkResult` instead of `AVPContext` – LangChain, CrewAI, AutoGen now unwrap before storing.
- `AVPMetadata` compat properties – `.embedding_dim`, `.data_type`, `.agent_id`, `.task_id` removed (never on PyPI, zero external users).
- Dead code – `_get_local_identity()` and its cache (~37 lines), tuple guards in integrations.
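The CRC32 payload checksum added above ("encode always writes, decode verifies when present") can be sketched with `zlib.crc32`; the function names and framing are ours, only the write-always/verify-when-present behavior comes from the changelog:

```python
import zlib
from typing import Optional, Tuple


def encode_payload(payload: bytes) -> Tuple[bytes, int]:
    # The encoder always computes and writes the checksum
    # (carried in proto field 15 in the real wire format).
    return payload, zlib.crc32(payload) & 0xFFFFFFFF


def decode_payload(payload: bytes, checksum: Optional[int]) -> bytes:
    # The decoder verifies only when a checksum is present, so
    # payloads from older encoders still decode.
    if checksum is not None and (zlib.crc32(payload) & 0xFFFFFFFF) != checksum:
        raise ValueError("payload checksum mismatch (corruption or truncation)")
    return payload
```

Masking with `0xFFFFFFFF` keeps the value an unsigned 32-bit integer regardless of platform.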
- llama.cpp connector – `LlamaCppConnector`: full think/generate latent pipeline on GGUF models via llama.cpp's embeddings API. Stop token fix, Jinja2 chat templates, memory leak fix (`weakref.finalize`), GGUF weight caching. GPU validated on A100. `pip install avp[llamacpp]`.
- Ollama connector – `OllamaConnector.from_ollama("qwen2.5:7b")`: resolves Ollama model names to GGUF blobs on disk, auto-unloads from the Ollama server to free VRAM. Inherits the full latent pipeline from `LlamaCppConnector`. `pip install avp[ollama]`.
- vLLM latent communication – Full integration: same-model KV transfer, cross-model rosetta, 4 model architectures (Qwen2, Llama, Mistral, Gemma), CUDA graph support, prefix caching, explicit store key API. GPU validated on A100. `pip install avp[vllm]`.
- Framework integrations – LangChain (`ChatAVP`, a `BaseChatModel`), CrewAI (`AVPLLM`, a `BaseLLM`), AutoGen (`AVPChatCompletionClient`). All support same-model latent + cross-model rosetta.
- `[huggingface]` extra – Discoverable alias for `[hf]`.
- `[all]` extra – Convenience bundle: hf + llamacpp + frameworks + transport (excludes vLLM).
- torch is now optional – Projection math (`rosetta/project.py`, `realign.py`) rewritten in numpy. `pip install avp[ollama]` drops from ~3 GB to ~85 MB. torch is only required for the HuggingFace connector (`[hf]`) and the vLLM plugin.
- Base install is lightweight – `pip install avp` installs only numpy, protobuf, zstandard (~25 MB). Engine-specific deps via extras.
- Protocol version bumped to 0.4.0.
- Python requirement raised from >=3.9 to >=3.10 (3.9 EOL October 2025).
- transformers requirement raised from >=4.36 to >=5.0 (the 4.x line is dead).
- huggingface-hub requirement raised from >=0.20 to >=1.0.
- README rewritten – 361 lines down to 139. Single install command, no redundant sections, SVG diagram, handshake negotiation prominent.
- Framework Integration Guide rewritten – Per-engine code examples for all 4 engines + 4 frameworks.
- `calibrate()` simplified – Signature reduced to `(source_model, target_model, source_tokenizer, target_tokenizer, device, auto_save)`. Raises `ValueError` for incompatible models instead of falling through to broken ridge regression.
- `pack()`/`unpack()`/`PackedMessage` – Deprecated in v0.3.0, now removed. Use `think()`/`generate()`.
- `PackMetrics`/`UnpackMetrics` – Deprecated aliases, now removed. Use `ThinkMetrics`/`GenerateMetrics`.
- `HandshakeMetrics` – Exported but never instantiated. Removed.
- `AVPAsyncClient` – Exported but never used. Removed.
- `encode_hidden_state()` – Unused convenience wrapper. Removed.
- Ridge regression / Procrustes projection – `_ridge_regression()`, `_orthogonal_procrustes()`, `_extract_hidden_states()`, `DEFAULT_ANCHORS`. Failed at 0.004 cosine similarity, never benchmarked on real tasks. `RIDGE` and `PROCRUSTES` removed from the `ProjectionMethod` enum.
- `page_convert.py` – PagedAttention conversion module, never imported.
- `make_eval_callback()` and ctypes tensor helpers in `_llamacpp_compat.py` – Old `cb_eval` approach replaced by the embeddings API.
- llama.cpp stop token – Model emits `<|endoftext|>` (token 151643), not `<|im_end|>`, on an embeddings context. Fix: token-level + text-based stop detection. GSM8K: 0% to 68%.
- vLLM CUDA graph compatibility – Pass dummy `input_ids` during latent steps for graph capture.
- vLLM projection performance – Cache numpy weights at setup instead of copying 600 MB+ from GPU to CPU per latent iteration.
- Error messages – All "pip install avp should include this dependency" updated to "pip install avp[hf]".
- Version consistency – All Modal benchmarks updated from `@engine_integration` to `@main`, transformers >=5.0, vLLM upper bound added.
- Colab quickstart notebook – `notebooks/avp_quick_start.ipynb`. Runs on a free T4 GPU in ~8 minutes. Compares direct, latent (AVP), and text chain on 10 GSM8K problems.
- "Open in Colab" badge in README.
- Cross-model projection is now opt-in – `source_model=` (Easy API) and `source=` (connector API) now require `cross_model=True` for latent transfer. Without it, generation falls back to text-only with a `UserWarning` explaining how to opt in. Rosetta Stone projection is experimental – accuracy varies by task type (structured tasks work well, comprehension may degrade). Same-model latent transfer is unaffected.
- `think()` and `generate()` can now use different prompts – e.g., a researcher prompt for `think()` and a solver prompt for `generate()`. Previously this returned empty output due to the `prompt_len` bug below.
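The opt-in gate described above (warn and fall back to text when cross-model transfer is requested without `cross_model=True`) can be sketched as a small decision function; the function name and return values are ours:

```python
import warnings


def transfer_mode(source_model, cross_model: bool) -> str:
    # Sketch of the opt-in gate: cross-model latent transfer requires
    # an explicit cross_model=True; otherwise warn and fall back to
    # text-only generation. Same-model transfer is unaffected.
    if source_model is None:
        return "latent"
    if not cross_model:
        warnings.warn(
            "Cross-model projection is experimental; pass "
            "cross_model=True to enable latent transfer. "
            "Falling back to text-only generation.",
            UserWarning,
        )
        return "text"
    return "latent"
```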
- Critical: `prompt_len` bug in `connector.generate()` – `prompt_len` was computed after extending the attention mask with KV-cache entries, causing generated tokens to be sliced incorrectly. This made `generate()` with `context=` return empty or truncated output, especially with different prompts for `think()` and `generate()`.
- Easy API cross-model path dropped user-provided `context=` – Always ran a fresh `think()` instead of using the caller's context.
- Easy API cross-model path ignored `store`/`store_key`/`prior_key` – ContextStore was not consulted or updated in the cross-model code path.
- Protobuf compatibility – Removed the gencode version check from `avp_pb2.py` that required protobuf >=6.31.1 at runtime. Now works with protobuf >=4.21 as declared in dependencies. Fixes installation on Google Colab and other environments with protobuf 4.x/5.x.
- `think()`/`generate()` API – New primary API replacing `pack()`/`unpack()`. Zero-friction entry point: `avp.generate("Solve: 2+2", model="Qwen/Qwen2.5-7B-Instruct")`.
- Cross-model `source=` parameter – `connector.generate(prompt, context=ctx, source=other)` automatically calibrates and projects across models. Zero ceremony.
- Easy API cross-model – `avp.generate(prompt, model=target, source_model=source)` handles everything: model loading, handshake, projection.
- `ContextStore` – Thread-safe, TTL-backed store for `AVPContext` objects. Enables multi-turn latent conversations.
- `avp.inspect(data)` – Decodes the AVP binary header/metadata without loading models. Returns a dict with version, flags, model_id, dimensions, etc.
- Debug mode – `debug=True` on `think()`/`generate()` surfaces `TransferDiagnostics`: norm trajectory, projection metrics, quality gate result, text baseline comparison.
- Always-on warnings – `RuntimeWarning` for empty output and NaN/Inf in hidden states. NaN early exit in the latent loop.
- Vocabulary-overlap projection – Cross-family zero-parameter projection through shared BPE tokens (~85% overlap for Qwen/Llama). Strict generalization of vocab-mediated projection.
- Per-transfer quality gate – `assess_transfer(prompt_tokens)` recommends latent vs JSON based on prompt length. Advisory only. Default threshold: 300 tokens.
- Projection validation – Two-tier gate: cosine similarity (fast, ~1 ms) + pseudo-perplexity (~30 ms). `validate_projection()` for model-pair diagnostics.
- `resolution_path` on `SessionInfo` – Exposes which handshake rule matched: `hash_match`, `structural_match`, `shared_tokenizer`, `avp_map_file`, `vocab_overlap`, `json_fallback`.
- `tokenizer_hash` on `ModelIdentity` – SHA-256 of the sorted tokenizer vocabulary. Enables automatic cross-model projection via shared tokenizer detection.
- vLLM connector (experimental) – `VLLMConnector` (SDK wrapper) + `AVPKVConnectorV1Dynamic` (`KVConnectorBase_V1` plugin). Text generation and identity extraction work. The KV-cache transfer plugin has not been validated end-to-end with a real vLLM engine – known issues with PagedAttention format conversion, CUDA graph compatibility, and concurrent request isolation. Use `HuggingFaceConnector` for production latent transfer.
- `GenerateMetrics` – Observability for `generate()`: think + generate durations, context/store flags, debug diagnostics.
- `HandshakeMetrics` – Resolution path, mode, avp_map_id, duration.
- 7 benchmark suites – GSM8K (4-agent, 2-agent), HotpotQA, fan-out, MATH 2-agent, HumanEval, DebugBench. Cloud results on all.
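The per-transfer quality gate above is advisory and driven by a single prompt-length threshold. A sketch of a plausible decision rule; only the function name `assess_transfer` and the 300-token default come from the entries here, and the direction of the recommendation (latent above the threshold, JSON below) is our assumption:

```python
def assess_transfer(prompt_tokens: int, threshold: int = 300) -> str:
    # Advisory-only sketch: below the threshold a prompt is assumed
    # cheap enough to re-send as JSON/text; at or above it, latent
    # transfer is recommended. The real gate's rule may differ.
    return "latent" if prompt_tokens >= threshold else "json"
```

Being advisory, the caller is free to ignore the recommendation, for example when debugging a projection.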
- API rename: `pack()` → `think()`, `unpack()` → `generate()`. Old names still work with deprecation warnings. `PackMetrics` is now an alias for `ThinkMetrics`.
- Protocol version bumped to 0.3.0.
- `CommunicationMode` enum – `LATENT = 0`, `JSON = 1`. Simplified from three values.
- Flag bits renumbered – `FLAG_COMPRESSED = 0x01`, `FLAG_HAS_MAP = 0x02`, `FLAG_KV_CACHE = 0x04`.
- Handshake resolution now checks vocabulary overlap (>= 100 shared tokens) before falling back to JSON.
- Quality gate threshold lowered from 512 to 300 tokens based on cross-benchmark validation.
- Package extras – torch and transformers are now required deps. `pip install avp` just works. `[vllm]` extra for production serving. Removed `[latent]`, `[hf]`, `[demo]`, `[all]`.
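The renumbered flag bits above compose into a single header flags byte. A minimal sketch using the constants from the entry; the helper functions are illustrative, not SDK API:

```python
# Flag bit layout from the entry above.
FLAG_COMPRESSED = 0x01
FLAG_HAS_MAP = 0x02
FLAG_KV_CACHE = 0x04


def pack_flags(compressed: bool, has_map: bool, kv_cache: bool) -> int:
    # Compose the header flags byte by OR-ing the relevant bits.
    flags = 0
    if compressed:
        flags |= FLAG_COMPRESSED
    if has_map:
        flags |= FLAG_HAS_MAP
    if kv_cache:
        flags |= FLAG_KV_CACHE
    return flags


def has_flag(flags: int, bit: int) -> bool:
    # Test a single bit in the flags byte.
    return bool(flags & bit)
```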
- Hybrid mode – Wire format bundling latent + text fallback. Never consumed by any pipeline. `encode_hybrid()`, `FLAG_HYBRID`, `CommunicationMode.HYBRID`, `HybridPayload`, `HybridChunk`, `ChunkType` all removed.
- Universal representation mode – Learned cross-model adapters via `inputs_embeds`. Validated negative (0% same-model accuracy). `src/avp/universal/` deleted.
- `FallbackRequest` – Dataclass for requesting JSON fallback. Never used by any pipeline.
- `FallbackRequested` – Exception for fallback signaling. Never raised.
- `bytes_to_embedding()` – Utility function, never called.
- `confidence_score` – Metadata field, never set to a non-zero value. Removed from the wire format.
- v0.1.0 proto backward-compat fields – `embedding_dim` (100), `data_type` (101), `agent_id` (102), `task_id` (103) removed from the protobuf schema.
- Tied-weight models – Softmax projection (`project_to_embedding_space()`) instead of simple normalization. Fixes cosine similarity from ~0.24 to ~1.0.
- Vocab size mismatch – `vocabulary_mediated_projection()` truncates to the shared prefix when embedding tables differ (e.g. Qwen 7B vs 1.5B: 152,064 vs 151,936).
- Pseudo-perplexity alignment – Compare `projected[i]` to `target_embed[token_ids[i+1]]` for next-token prediction. Cast `inputs_embeds` to the model dtype.
- KV-cache serialization – Fix bfloat16 support and transformers 5.x compatibility.
- Cross-platform – Windows console encoding, MPS device detection, pre-Ampere GPU support.
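The vocab-size-mismatch fix above relies on the fact that same-family models usually share the leading rows of their embedding tables and differ only in a few trailing added tokens (e.g. Qwen 7B vs 1.5B: 152,064 vs 151,936 rows). A toy sketch of the truncation step; the helper name is ours:

```python
def shared_prefix_rows(source_embed, target_embed):
    # Truncate both embedding tables to their shared leading rows so
    # the vocabulary-mediated projection sees matching token indices.
    # Trailing rows (added tokens on one side only) are dropped.
    n = min(len(source_embed), len(target_embed))
    return source_embed[:n], target_embed[:n]
```

In the SDK this happens inside `vocabulary_mediated_projection()`; the snippet just isolates the alignment idea.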
- Vocabulary-overlap projection for cross-family models (Qwen/Llama).
- Auto-discovery of vocabulary overlap in handshake.
- Configurable projection temperature with empirical default.
- Cross-model benchmark mode and cross-process demo.
- Rosetta mode for HotpotQA and fan-out benchmarks.
- Vocab size mismatch in vocabulary-mediated projection.
- New-user experience issues found during docs audit.
- `generate()` API: combined think + store + generate in one call.
- `ContextStore` for multi-turn latent context management.
- Observability metrics (`ThinkMetrics`, `GenerateMetrics`, `HandshakeMetrics`).
- 3 new benchmarks (MATH-500, HumanEval, fan-out) with a cloud runner.
- Confidence intervals and per-sample agreement analysis.
- MATH-500 answer normalization: strip LaTeX sizing commands.
- Zero-friction `pack()`/`unpack()` easy API.
- Model validation, cleaner exports, example improvements.
- Cold-start developer experience: model size warnings, deprecation notices.
- Rosetta Stone v2: vocabulary-mediated cross-model projection (zero learned parameters).
- Cross-model handshake with `tokenizer_hash` and `avp_map_id`.
- Projection quality validation (cosine similarity + pseudo-perplexity).
- Mixed-model demo with automatic LATENT/JSON handshake.
- KV-cache truncation modes (sequential, latent_only).
- GSM8K 4-agent benchmark harness.
- End-to-end integration tests.
- HTTP/2 transport (sync + async) with FastAPI server.
- Session management with TTL.
- zstd compression.
- Hidden state normalization before injection (LatentMAS port).
- KV-cache serialization for bfloat16.
- Initial release: same-model latent communication.
- Binary codec: encode/decode hidden states, KV-cache, embeddings.
- Protobuf metadata with 12-byte header.
- Realignment for untied-weight models.
- HuggingFace Transformers connector.