🧨 Describe the Bug
Initializing FoundationPredictor with the layout model checkpoint fails with a TypeError because the tokenizer vocabulary files are missing from the downloaded checkpoint.
📤 Output Trace / Stack Trace
TypeError: expected str, bytes or os.PathLike object, not NoneType
  File "surya/foundation/__init__.py", line 96, in __init__
    super().__init__(checkpoint, device, dtype)
  File "surya/common/predictor.py", line 29, in __init__
    self.processor = loader.processor()
  File "surya/foundation/loader.py", line 64, in processor
    ocr_tokenizer = SuryaOCRTokenizer(
        special_tokens=config.special_ocr_tokens, model_checkpoint=self.checkpoint
    )
  File "surya/common/surya/processor/tokenizer.py", line 262, in __init__
    self.qwen_tokenizer = Qwen2Tokenizer.from_pretrained(model_checkpoint)
  File "transformers/tokenization_utils_base.py", line 2014, in from_pretrained
    return cls._from_pretrained(...)
  File "transformers/models/qwen2/tokenization_qwen2.py", line 172, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType
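For reference, the final frame reduces to calling open() on None. A minimal sketch that reproduces the same message without Surya or transformers installed; `open_vocab` is a hypothetical stand-in for the tokenizer's constructor, not the library's actual code:

```python
def open_vocab(vocab_file):
    # Hypothetical stand-in mirroring the call shown in the last frame of the trace.
    with open(vocab_file, encoding="utf-8") as vocab_handle:
        return vocab_handle.read()


message = None
try:
    # vocab_file is None when no vocabulary file was resolved for the checkpoint.
    open_vocab(None)
except TypeError as exc:
    message = str(exc)

print(message)
```

This suggests the failure is about file resolution rather than the contents of any tokenizer file.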
⚙️ Environment
- Surya version: 0.15.4
- Python version: 3.13
- OS: Linux
✅ Expected Behavior
One of the following should hold:
The layout model checkpoint includes all necessary tokenizer files in the S3 download, OR
The layout model checkpoint references a separate tokenizer checkpoint that gets downloaded automatically, OR
The documentation specifies how to manually provide tokenizer files
📟 Command or Code Used
from surya.layout import LayoutPredictor
from surya.foundation import FoundationPredictor
from surya.settings import settings
# This fails during initialization
foundation_predictor = FoundationPredictor(checkpoint=settings.LAYOUT_MODEL_CHECKPOINT)
layout_predictor = LayoutPredictor(foundation_predictor)
📎 Additional Context
The layout model checkpoint (s3://layout/2025_02_18) downloads to ~/.cache/datalab/models/layout/2025_02_18/ but only includes:
.gitattributes
README.md
config.json
manifest.json
model.safetensors
preprocessor_config.json
Missing tokenizer files expected by Qwen2Tokenizer:
vocab.json (the vocab_file that resolves to None in the trace)
merges.txt
tokenizer.json
tokenizer_config.json
When Qwen2Tokenizer.from_pretrained() is pointed at the layout model checkpoint path, it finds no vocabulary file there, so vocab_file is passed as None and the open(vocab_file, ...) call in the tokenizer's constructor raises the TypeError.
Other model checkpoints (like ocr_error_detection/2025_02_18) do include tokenizer files, so this appears to be specific to the layout model.
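To confirm which files are absent before constructing the predictor, a small check along these lines may help. It is a sketch only: `missing_tokenizer_files` and `EXPECTED_TOKENIZER_FILES` are hypothetical names, not part of Surya, and the file list is assumed from the description above.

```python
from pathlib import Path

# Assumption: these are the tokenizer files the checkpoint directory should contain.
EXPECTED_TOKENIZER_FILES = (
    "vocab.json",
    "merges.txt",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_tokenizer_files(checkpoint_dir, expected=EXPECTED_TOKENIZER_FILES):
    """Return the expected tokenizer file names that are not present in checkpoint_dir."""
    root = Path(checkpoint_dir).expanduser()
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_tokenizer_files("~/.cache/datalab/models/layout/2025_02_18")` on the affected machine should list all four names if the download matches the listing above; an empty list would mean the files are present and the cause lies elsewhere.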
---
**GitHub Repository:** https://github.com/VikParuchuri/surya