### Describe the issue as clearly as possible:

I found two likely bugs in `outlines.models.llamacpp.LlamaCppTokenizer`:

- `encode()` builds an attention mask that marks EOS token ids as padding (masked out), even for a single, non-padded prompt.
- The fallback vocabulary extraction (used when `hf_tokenizer` is unavailable) can truncate token pieces to a fixed buffer size, collapsing distinct tokens onto a single vocabulary key.

I'm sharing a self-contained repro script below (no branch changes required).
### Steps/code to reproduce the bug:

```python
# save as: repro_llamacpp_tokenizer_bugs.py
import llama_cpp

from outlines.models.llamacpp import LlamaCppTokenizer


def repro_attention_mask_bug():
    class FakeHFTokenizer:
        eos_token_id = 2
        eos_token = "</s>"

        def get_vocab(self):
            return {"a": 1, "</s>": 2, "b": 3}

    class FakeTokenizerWrapper:
        def __init__(self):
            self.hf_tokenizer = FakeHFTokenizer()

    class FakeInnerTokenizer:
        def tokenize(self, _prompt_bytes, add_bos=True, special=True):
            return [1, 2, 3]  # EOS id in the middle of real prompt tokens

        def detokenize(self, token_ids):
            return b""

    class FakeModel:
        def __init__(self):
            self.tokenizer_ = FakeTokenizerWrapper()

        def tokenizer(self):
            return FakeInnerTokenizer()

    tok = LlamaCppTokenizer(FakeModel())
    token_ids, attention_mask = tok.encode("anything")
    print("token_ids:", token_ids)
    print("attention_mask:", attention_mask)
    # Expected for a non-padded single prompt: all positions attended
    assert attention_mask == [1, 1, 1], attention_mask


def repro_fallback_vocab_collision_bug():
    # Two pieces share the first 32 bytes and differ only after that
    prefix = b"a" * 32
    pieces = {
        0: prefix + b"X",
        1: prefix + b"Y",
    }

    def fake_llama_model_get_vocab(_model):
        return object()

    def fake_llama_token_to_piece(_vocab, token_id, buffer, size, _lstrip, _special):
        piece = pieces[token_id]
        for i, byte in enumerate(piece[:size]):
            buffer[i] = byte
        # Simulate a piece longer than the fixed buffer
        return len(piece)

    llama_cpp.llama_model_get_vocab = fake_llama_model_get_vocab
    llama_cpp.llama_token_to_piece = fake_llama_token_to_piece

    class FakeInnerTokenizer:
        def tokenize(self, _prompt_bytes, add_bos=True, special=True):
            return [0]

        def detokenize(self, token_ids):
            return b""

    class FakeModel:
        model = object()

        def tokenizer(self):
            return FakeInnerTokenizer()

        def token_eos(self):
            return 1

        def n_vocab(self):
            return 2

    tok = LlamaCppTokenizer(FakeModel())
    print("vocab:", tok.vocabulary)
    # Expected: 2 unique entries preserved
    assert len(tok.vocabulary) == 2, tok.vocabulary


if __name__ == "__main__":
    repro_attention_mask_bug()
    repro_fallback_vocab_collision_bug()
```

Run it with `python repro_llamacpp_tokenizer_bugs.py`.
### Expected result:

- `repro_attention_mask_bug()` should keep all tokens attended for a non-padded single prompt: expected `attention_mask == [1, 1, 1]` (see the sketch after this list).
- `repro_fallback_vocab_collision_bug()` should preserve both distinct token pieces: expected `len(vocabulary) == 2`.
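
For the attention-mask case, here is a minimal sketch of the expected semantics, assuming the mask's only job is to flag padding; the `build_attention_mask` helper and its `pad_token_id` parameter are hypothetical illustrations, not outlines code:

```python
# Hypothetical helper, not the outlines implementation: an EOS id occurring
# inside the prompt should stay attended; only padding positions (if any)
# should be masked out.
def build_attention_mask(token_ids, pad_token_id=None):
    return [
        0 if (pad_token_id is not None and t == pad_token_id) else 1
        for t in token_ids
    ]

# A single, non-padded prompt: every position attended, even though id 2 is EOS.
assert build_attention_mask([1, 2, 3]) == [1, 1, 1]
```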
### Error message:

- `repro_attention_mask_bug()`: `AssertionError: [1, 0, 1]`
- `repro_fallback_vocab_collision_bug()`: `AssertionError: {'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa': 1}`
### Outlines/Python version information:

```
python -c "from outlines import _version; print(_version.version)"
python -c "import sys; print('Python', sys.version)"
pip freeze
```
### Context for the issue:
These look like correctness issues in tokenizer behavior, and they can surface as subtle constrained-decoding errors. I may be wrong about the fix details, but the repros above appear to show real behavioral problems. If this direction is valid, I can open a focused PR with the fixes split by issue.
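
For the fallback vocab path, one possible direction (a rough sketch only, under the assumption that the piece lookup can simply retry when the reported length exceeds the buffer; the `read_piece` helper and its fake `c_call` below are hypothetical stand-ins, not llama-cpp-python or outlines code):

```python
# Hypothetical stand-in for the fallback piece extraction; `c_call` mimics the
# contract assumed in the repro above (fills at most `size` bytes and returns
# the full piece length).
def read_piece(token_id, pieces, initial_size=32):
    def c_call(buffer, size):
        piece = pieces[token_id]
        n = min(size, len(piece))
        buffer[:n] = piece[:n]
        return len(piece)

    size = initial_size
    buffer = bytearray(size)
    n = c_call(buffer, size)
    if n > size:  # piece was truncated: retry with a buffer that fits it exactly
        buffer = bytearray(n)
        n = c_call(buffer, n)
    return bytes(buffer[:n])


# The two pieces from the repro no longer collapse onto one vocabulary key.
pieces = {0: b"a" * 32 + b"X", 1: b"a" * 32 + b"Y"}
vocab = {read_piece(t, pieces).decode(): t for t in pieces}
assert len(vocab) == 2
```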
cc @robinpicard