### Describe the issue as clearly as possible:

I found two likely bugs in `outlines.models.llamacpp.LlamaCppTokenizer`:

- `encode()` builds an attention mask that marks EOS token ids as padding (masked out), even for a single, non-padded prompt.
- The fallback vocabulary extraction (used when `hf_tokenizer` is unavailable) can truncate token pieces to a fixed buffer size, collapsing distinct tokens onto a single vocabulary key.

I'm sharing a self-contained repro script below (no branch changes required).
### Steps/code to reproduce the bug:

```python
# save as: repro_llamacpp_tokenizer_bugs.py
import llama_cpp

from outlines.models.llamacpp import LlamaCppTokenizer


def repro_attention_mask_bug():
    class FakeHFTokenizer:
        eos_token_id = 2
        eos_token = "</s>"

        def get_vocab(self):
            return {"a": 1, "</s>": 2, "b": 3}

    class FakeTokenizerWrapper:
        def __init__(self):
            self.hf_tokenizer = FakeHFTokenizer()

    class FakeInnerTokenizer:
        def tokenize(self, _prompt_bytes, add_bos=True, special=True):
            return [1, 2, 3]  # EOS id in the middle of real prompt tokens

        def detokenize(self, token_ids):
            return b""

    class FakeModel:
        def __init__(self):
            self.tokenizer_ = FakeTokenizerWrapper()

        def tokenizer(self):
            return FakeInnerTokenizer()

    tok = LlamaCppTokenizer(FakeModel())
    token_ids, attention_mask = tok.encode("anything")
    print("token_ids:", token_ids)
    print("attention_mask:", attention_mask)
    # Expected for a non-padded single prompt: all positions attended
    assert attention_mask == [1, 1, 1], attention_mask


def repro_fallback_vocab_collision_bug():
    # Two pieces share the first 32 bytes and differ only after that
    prefix = b"a" * 32
    pieces = {
        0: prefix + b"X",
        1: prefix + b"Y",
    }

    def fake_llama_model_get_vocab(_model):
        return object()

    def fake_llama_token_to_piece(_vocab, token_id, buffer, size, _lstrip, _special):
        piece = pieces[token_id]
        for i, byte in enumerate(piece[:size]):
            buffer[i] = byte
        # Simulate a piece longer than the fixed buffer
        return len(piece)

    llama_cpp.llama_model_get_vocab = fake_llama_model_get_vocab
    llama_cpp.llama_token_to_piece = fake_llama_token_to_piece

    class FakeInnerTokenizer:
        def tokenize(self, _prompt_bytes, add_bos=True, special=True):
            return [0]

        def detokenize(self, token_ids):
            return b""

    class FakeModel:
        model = object()

        def tokenizer(self):
            return FakeInnerTokenizer()

        def token_eos(self):
            return 1

        def n_vocab(self):
            return 2

    tok = LlamaCppTokenizer(FakeModel())
    print("vocab:", tok.vocabulary)
    # Expected: 2 unique entries preserved
    assert len(tok.vocabulary) == 2, tok.vocabulary


if __name__ == "__main__":
    repro_attention_mask_bug()
    repro_fallback_vocab_collision_bug()
```

Run it with `python repro_llamacpp_tokenizer_bugs.py`.
### Expected result:

- `repro_attention_mask_bug()` should keep all tokens attended for a non-padded single prompt: expected `attention_mask == [1, 1, 1]` (see the sketch after this list).
- `repro_fallback_vocab_collision_bug()` should preserve both distinct token pieces: expected `len(vocabulary) == 2`.
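
For the attention-mask case, here is a minimal sketch of the expected semantics, assuming the mask's only job is to flag padding; the `build_attention_mask` helper and its `pad_token_id` parameter are hypothetical illustrations, not outlines code:

```python
# Hypothetical helper, not the outlines implementation: an EOS id occurring
# inside the prompt should stay attended; only padding positions (if any)
# should be masked out.
def build_attention_mask(token_ids, pad_token_id=None):
    return [
        0 if (pad_token_id is not None and t == pad_token_id) else 1
        for t in token_ids
    ]

# A single, non-padded prompt: every position attended, even though id 2 is EOS.
assert build_attention_mask([1, 2, 3]) == [1, 1, 1]
```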
### Error message:

- `repro_attention_mask_bug()`: `AssertionError: [1, 0, 1]`
- `repro_fallback_vocab_collision_bug()`: `AssertionError: {'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa': 1}`
### Outlines/Python version information:

```
python -c "from outlines import _version; print(_version.version)"
python -c "import sys; print('Python', sys.version)"
pip freeze
```
### Context for the issue:
These look like correctness issues in tokenizer behavior, and they can surface as subtle constrained-decoding errors. I may be wrong about the fix details, but the repros above appear to show real behavioral problems. If this direction is valid, I can open a focused PR with the fixes split by issue.
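
For the fallback vocab path, one possible direction (a rough sketch only, under the assumption that the piece lookup can simply retry when the reported length exceeds the buffer; the `read_piece` helper and its fake `c_call` below are hypothetical stand-ins, not llama-cpp-python or outlines code):

```python
# Hypothetical stand-in for the fallback piece extraction; `c_call` mimics the
# contract assumed in the repro above (fills at most `size` bytes and returns
# the full piece length).
def read_piece(token_id, pieces, initial_size=32):
    def c_call(buffer, size):
        piece = pieces[token_id]
        n = min(size, len(piece))
        buffer[:n] = piece[:n]
        return len(piece)

    size = initial_size
    buffer = bytearray(size)
    n = c_call(buffer, size)
    if n > size:  # piece was truncated: retry with a buffer that fits it exactly
        buffer = bytearray(n)
        n = c_call(buffer, n)
    return bytes(buffer[:n])


# The two pieces from the repro no longer collapse onto one vocabulary key.
pieces = {0: b"a" * 32 + b"X", 1: b"a" * 32 + b"Y"}
vocab = {read_piece(t, pieces).decode(): t for t in pieces}
assert len(vocab) == 2
```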
cc @robinpicard