
TypeError: expected str, bytes or os.PathLike object, not NoneType when initializing FoundationPredictor - Missing tokenizer files in layout #472

@mahdibpPersonal


🧨 Describe the Bug

When attempting to use the layout predictor, the FoundationPredictor initialization fails with a TypeError because tokenizer vocabulary files are missing from the downloaded layout model checkpoint.

📤 Output Trace / Stack Trace

TypeError: expected str, bytes or os.PathLike object, not NoneType
  File "surya/foundation/__init__.py", line 96, in __init__
    super().__init__(checkpoint, device, dtype)
  File "surya/common/predictor.py", line 29, in __init__
    self.processor = loader.processor()
  File "surya/foundation/loader.py", line 64, in processor
    ocr_tokenizer = SuryaOCRTokenizer(
        special_tokens=config.special_ocr_tokens, model_checkpoint=self.checkpoint
    )
  File "surya/common/surya/processor/tokenizer.py", line 262, in __init__
    self.qwen_tokenizer = Qwen2Tokenizer.from_pretrained(model_checkpoint)
  File "transformers/tokenization_utils_base.py", line 2014, in from_pretrained
    return cls._from_pretrained(...)
  File "transformers/models/qwen2/tokenization_qwen2.py", line 172, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType
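
The last frame of the trace can be reproduced in isolation, which may help confirm that the crash comes from a path that was never resolved rather than from a corrupt file. This is a minimal sketch: `open_vocab` is a hypothetical stand-in for the tokenizer constructor, not code from surya or transformers.

```python
def open_vocab(vocab_file):
    """Hypothetical stand-in for the failing call: open a vocab file by path."""
    with open(vocab_file, encoding="utf-8") as vocab_handle:
        return vocab_handle.read()


try:
    # Passing None mimics the case where no vocabulary file was found.
    open_vocab(None)
except TypeError as exc:
    message = str(exc)
    print(f"TypeError: {message}")
```

The built-in `open()` rejects `None` before touching the filesystem, so the error says nothing about which file was expected.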

⚙️ Environment

  • Surya version: 0.15.4
  • Python version: 3.13
  • OS: Linux

✅ Expected Behavior

One of the following should hold:
  • The layout model checkpoint includes all necessary tokenizer files in the S3 download, or
  • the checkpoint references a separate tokenizer checkpoint that is downloaded automatically, or
  • the documentation specifies how to provide the tokenizer files manually.

📟 Command or Code Used

from surya.layout import LayoutPredictor
from surya.foundation import FoundationPredictor
from surya.settings import settings

# This fails during initialization
foundation_predictor = FoundationPredictor(checkpoint=settings.LAYOUT_MODEL_CHECKPOINT)
layout_predictor = LayoutPredictor(foundation_predictor)

📎 Additional Context

The layout model checkpoint (s3://layout/2025_02_18) downloads to ~/.cache/datalab/models/layout/2025_02_18/ but only includes:
  • .gitattributes
  • README.md
  • config.json
  • manifest.json
  • model.safetensors
  • preprocessor_config.json

Missing files required by Qwen2Tokenizer:
  • vocab.json or vocab.txt
  • tokenizer.json
  • tokenizer_config.json
  • merges.txt (if applicable)
When Qwen2Tokenizer.from_pretrained() is pointed at the layout model checkpoint path, it finds no vocabulary file there, so vocab_file is None by the time the tokenizer calls open(vocab_file, ...), which raises the TypeError shown in the trace.
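
To check a local cache for the same gap, a small script along these lines could list which tokenizer files are absent. It is a sketch only: `missing_tokenizer_files` is a hypothetical helper, the file names are copied from the list in this report and may not match exactly what the tokenizer requires, and the cache path is the one quoted above.

```python
from pathlib import Path

# File names taken from this report; treat the exact set as an assumption.
TOKENIZER_FILES = (
    "vocab.json",
    "merges.txt",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_tokenizer_files(checkpoint_dir, required=TOKENIZER_FILES):
    """Return the names in `required` that are not files in checkpoint_dir."""
    root = Path(checkpoint_dir).expanduser()
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Cache location quoted in this report.
    cache_dir = "~/.cache/datalab/models/layout/2025_02_18"
    missing = missing_tokenizer_files(cache_dir)
    print("missing:", ", ".join(missing) if missing else "none")
```

Running the same helper against another checkpoint directory, such as the ocr_error_detection one, gives a quick side-by-side comparison.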



Other model checkpoints (like ocr_error_detection/2025_02_18) do include tokenizer files, so this appears to be specific to the layout model.

---

**GitHub Repository:** https://github.com/VikParuchuri/surya



Labels: bug: breaking (crashes, errors, anything that stops execution or is runtime-breaking)
