TypeError: expected str, bytes or os.PathLike object, not NoneType when initializing FoundationPredictor - Missing tokenizer files in layout

## 🧨 Describe the Bug

When attempting to use the layout predictor, the `FoundationPredictor` initialization fails with a `TypeError` because tokenizer vocabulary files are missing from the downloaded layout model checkpoint.

## 📤 Output Trace / Stack Trace

TypeError: expected str, bytes or os.PathLike object, not NoneType
  File "surya/foundation/__init__.py", line 96, in __init__
    super().__init__(checkpoint, device, dtype)
  File "surya/common/predictor.py", line 29, in __init__
    self.processor = loader.processor()
  File "surya/foundation/loader.py", line 64, in processor
    ocr_tokenizer = SuryaOCRTokenizer(
        special_tokens=config.special_ocr_tokens, model_checkpoint=self.checkpoint
    )
  File "surya/common/surya/processor/tokenizer.py", line 262, in __init__
    self.qwen_tokenizer = Qwen2Tokenizer.from_pretrained(model_checkpoint)
  File "transformers/tokenization_utils_base.py", line 2014, in from_pretrained
    return cls._from_pretrained(...)
  File "transformers/models/qwen2/tokenization_qwen2.py", line 172, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType
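
The final frame can be reproduced without Surya or transformers, because calling `open()` with `None` as the path raises the same `TypeError`. Below is a minimal sketch of that failure mode. `load_vocab` and `try_load` are hypothetical stand-ins written for illustration, not the actual transformers code.

```python
import json
import os
import tempfile


def load_vocab(vocab_file):
    """Hypothetical stand-in for a tokenizer constructor that reads a vocab file."""
    # Mirrors the failing line in the trace: open() is called on whatever path
    # was resolved, even when resolution produced None.
    with open(vocab_file, encoding="utf-8") as vocab_handle:
        return json.load(vocab_handle)


def try_load(vocab_file):
    """Return the vocab dict, or the error text if loading raises TypeError."""
    try:
        return load_vocab(vocab_file)
    except TypeError as exc:
        return f"TypeError: {exc}"


# A real file loads normally.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"hello": 0}, handle)
    loaded = try_load(path)

# A path that resolved to None reproduces the error from the trace.
missing = try_load(None)
print(missing)
```

A guard such as `if vocab_file is None: raise FileNotFoundError(...)` before the `open()` call would turn this into a clearer error, though where such a check belongs is up to the maintainers.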

## ⚙️ Environment
- **Surya version:** 0.15.4
- **Python version:** 3.13
- **OS:** Linux

## ✅ Expected Behavior

One of the following should happen:
- The layout model checkpoint includes all necessary tokenizer files in the S3 download, OR
- The layout model checkpoint references a separate tokenizer checkpoint that is downloaded automatically, OR
- The documentation specifies how to provide the tokenizer files manually
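
The first two options could be combined into a loader-side fallback. Below is a minimal sketch under several assumptions. `resolve_tokenizer_dir`, `has_tokenizer_files`, and the fallback directory are hypothetical names and not part of Surya. The required file names (`vocab.json`, `merges.txt`) are an assumption about what a Qwen2-style slow tokenizer reads.

```python
import tempfile
from pathlib import Path

# Assumption: the files a Qwen2-style slow tokenizer needs on disk.
REQUIRED_TOKENIZER_FILES = ("vocab.json", "merges.txt")


def has_tokenizer_files(directory):
    """True if every required tokenizer file exists in the directory."""
    directory = Path(directory)
    return all((directory / name).is_file() for name in REQUIRED_TOKENIZER_FILES)


def resolve_tokenizer_dir(model_dir, fallback_dir=None):
    """Prefer the model checkpoint; otherwise use a separate tokenizer checkpoint."""
    if has_tokenizer_files(model_dir):
        return Path(model_dir)
    if fallback_dir is not None and has_tokenizer_files(fallback_dir):
        return Path(fallback_dir)
    # Fail early with a descriptive error instead of passing None to open().
    wanted = ", ".join(REQUIRED_TOKENIZER_FILES)
    raise FileNotFoundError(f"No tokenizer files ({wanted}) found in {model_dir}")


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "layout"
    fallback_dir = Path(tmp) / "tokenizer"
    model_dir.mkdir()
    fallback_dir.mkdir()
    # The model checkpoint has weights/config only, as described in this report.
    (model_dir / "config.json").write_text("{}", encoding="utf-8")
    for name in REQUIRED_TOKENIZER_FILES:
        (fallback_dir / name).write_text("{}", encoding="utf-8")

    chosen = resolve_tokenizer_dir(model_dir, fallback_dir)
    used_fallback = chosen == fallback_dir
    try:
        resolve_tokenizer_dir(model_dir)
        error_message = None
    except FileNotFoundError as exc:
        error_message = str(exc)

print(used_fallback)
print(error_message)
```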

## 📟 Command or Code Used

```python
from surya.layout import LayoutPredictor
from surya.foundation import FoundationPredictor
from surya.settings import settings

# This fails during initialization
foundation_predictor = FoundationPredictor(checkpoint=settings.LAYOUT_MODEL_CHECKPOINT)
layout_predictor = LayoutPredictor(foundation_predictor)
```

## 📎 Additional Context

The layout model checkpoint (`s3://layout/2025_02_18`) downloads to `~/.cache/datalab/models/layout/2025_02_18/` but only includes:
- `.gitattributes`
- `README.md`
- `config.json`
- `manifest.json`
- `model.safetensors`
- `preprocessor_config.json`

Missing files required by `Qwen2Tokenizer`:
- `vocab.json` or `vocab.txt`
- `tokenizer.json`
- `tokenizer_config.json`
- `merges.txt` (if applicable)

When `Qwen2Tokenizer.from_pretrained()` tries to load the tokenizer from the layout model checkpoint path, it cannot find the vocabulary file, so `vocab_file` becomes `None`, causing the crash.



Other model checkpoints (like ocr_error_detection/2025_02_18) do include tokenizer files, so this appears to be specific to the layout model.
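
To compare checkpoints quickly, a small script can report which tokenizer files are absent from a downloaded directory. This is a sketch using only the standard library. `missing_tokenizer_files` is a hypothetical helper, and the expected file names are the ones listed in this report rather than an authoritative list.

```python
import tempfile
from pathlib import Path

# File names taken from the "missing files" list in this report.
TOKENIZER_FILES = (
    "vocab.json",
    "merges.txt",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_tokenizer_files(checkpoint_dir, expected=TOKENIZER_FILES):
    """Return the expected tokenizer files that are not present in the directory."""
    checkpoint_dir = Path(checkpoint_dir)
    return [name for name in expected if not (checkpoint_dir / name).is_file()]


# Simulate the layout checkpoint contents described above.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp)
    for name in ("config.json", "manifest.json", "model.safetensors",
                 "preprocessor_config.json"):
        (checkpoint / name).write_bytes(b"")
    missing = missing_tokenizer_files(checkpoint)

    # Adding one file removes it from the report.
    (checkpoint / "vocab.json").write_text("{}", encoding="utf-8")
    still_missing = missing_tokenizer_files(checkpoint)

print(missing)
```

Against a real download, the same function can be pointed at `Path.home() / ".cache/datalab/models/layout/2025_02_18"` and at the `ocr_error_detection` checkpoint directory to confirm the difference.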

---

**GitHub Repository:** https://github.com/VikParuchuri/surya

**Issue:** #472
