Releases: mideind/Tokenizer
Version 3.6.0
Changes
- Removed the deprecated --handle_kludgy_ordinals CLI flag
- Removed the handle_kludgy_ordinals API option and related constants (KLUDGY_ORDINALS_PASS_THROUGH, KLUDGY_ORDINALS_MODIFY, KLUDGY_ORDINALS_TRANSLATE)
- Kludgy ordinals (e.g. 1sti, 3ja) are now always passed through unchanged as word tokens
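For callers that build their tokenizer options in a dictionary, one possible migration step is to filter out the removed key before the options are passed on. The sketch below is only an illustration: the helper drop_removed_options and the constant REMOVED_IN_3_6_0 are hypothetical names, not part of the tokenizer package, and these notes do not say how the library itself reacts if the removed keyword is still supplied.

```python
# Hypothetical migration helper; not part of the tokenizer package.
# Option names that the 3.6.0 notes list as removed from the API.
REMOVED_IN_3_6_0 = frozenset({"handle_kludgy_ordinals"})


def drop_removed_options(options: dict) -> dict:
    """Return a copy of `options` without keys removed in 3.6.0."""
    return {key: value for key, value in options.items()
            if key not in REMOVED_IN_3_6_0}


# The value 1 is a placeholder for whatever the old code passed.
legacy_options = {"normalize": True, "handle_kludgy_ordinals": 1}
current_options = drop_removed_options(legacy_options)
print(current_options)
```

The filtered dictionary can then be unpacked into the call that previously received the legacy options.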
CI Improvements
- Fixed PyPy 3.11 build by only installing dev dependencies (including mypy) on Python 3.9
- Added [test] optional dependency for lightweight test-only installations
Version 3.5.5
Version 3.5.5: Maintenance Release
This maintenance release focuses on infrastructure improvements and enhanced package metadata, making Tokenizer more discoverable and easier to work with.
What's Changed
Package Metadata & Infrastructure:
- Modernized pyproject.toml with PEP 517/518 build system configuration
- Enhanced PyPI discoverability with comprehensive keywords (nlp, natural-language-processing, sentence-segmentation, etc.)
- Added project URLs for documentation, issues, and changelog
- Improved package description for better clarity
Development & Tooling:
- Enhanced development dependencies with minimum version requirements (pytest>=7.0, ruff>=0.1.0, mypy>=1.0)
- Consolidated tooling configuration (ruff, mypy, pytest) in pyproject.toml
- Type checking improvements across source and test files
- Removed obsolete type: ignore comments
CI/CD:
- Updated build matrix for setuptools compatibility with PyPy
- Now testing: CPython 3.9-3.14, PyPy 3.11
Installation
pip install tokenizer==3.5.5
Full Changelog: 3.5.4...3.5.5
Version 3.5.4: Improved dash and hyphen handling
Summary
Version 3.5.4 improves dash and hyphen handling in the tokenizer, addressing GitHub issue #51 and several related edge cases.
Key Improvements
Dash Handling
- Free-standing hyphens: Hyphens with spaces around them (e.g., word - word) now preserve those spaces instead of being collapsed
- Year ranges: Hyphens between years normalize to en-dash when normalize=True, following Icelandic spelling rules (1914-1918 → 1914–1918)
- Em-dashes: Always treated as centered punctuation with spaces on both sides (word—word → word — word)
- Consecutive dashes: Multiple identical dashes (--, ––, ——) are handled as single tokens and preserve their spacing
- Edge case fix: Correctly handles 1914 -1918, where -1918 might appear to be a negative number but is actually part of a year range
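The rules above can be approximated with two regular expressions. This is a rough standalone sketch of the described behavior, not the library's implementation: the function name respace_dashes and both patterns are assumptions made for illustration. In particular, rewriting "1914 -1918" to an en-dash range without spaces is this sketch's own choice; the notes only say that the pair is recognized as a year range.

```python
import re

EN_DASH = "\u2013"
EM_DASH = "\u2014"

# A hyphen between two four-digit years, with optional spaces around it.
YEAR_RANGE = re.compile(r"\b(\d{4})\s*-\s*(\d{4})\b")
# A single em-dash (not part of a run of em-dashes) and any spaces around it.
LONE_EM_DASH = re.compile(r"\s*(?<!\u2014)\u2014(?!\u2014)\s*")


def respace_dashes(text: str, normalize: bool = True) -> str:
    """Approximate the dash rules described in the 3.5.4 notes."""
    if normalize:
        # Year ranges take an en-dash only when normalization is requested.
        text = YEAR_RANGE.sub(r"\1" + EN_DASH + r"\2", text)
    # A lone em-dash is always centered, with one space on each side.
    return LONE_EM_DASH.sub(" " + EM_DASH + " ", text)


print(respace_dashes("1914-1918"))
print(respace_dashes("word\u2014word"))
print(respace_dashes("word - word"))
```

Free-standing hyphens and runs of identical dashes fall through both patterns unchanged, which mirrors the "preserve their spacing" behavior described in the list.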
Documentation Updates
- Added comprehensive "Dash and Hyphen Handling" section explaining context-specific behavior
- Updated normalize option descriptions to be specific about dash handling
- Fixed CSV format documentation (was showing outdated 3-column format, now shows current 5-column format)
- Added CSV Output Format section matching JSON documentation style
Testing
- Added comprehensive test suite (test/test_dashes.py) with 93 tests covering all dash scenarios
- All 150 tests pass (93 new + 57 existing)
Installation
pip install --upgrade tokenizer
View on PyPI: https://pypi.org/project/tokenizer/3.5.4/
Version 3.5.3
Changes in Version 3.5.3
This is a documentation fix release to ensure the project description displays correctly on PyPI.
Fix
- Updated pyproject.toml to correctly reference README.md instead of README.rst
- This ensures the project description is properly displayed on PyPI
Installation
pip install --upgrade tokenizer
View on PyPI: https://pypi.org/project/tokenizer/3.5.3/
Version 3.5.2
Changes in Version 3.5.2
Improvements
- BIN_Tuple representation: Now displays as plain tuple format in __str__() and __repr__() methods, matching documentation examples
- Cleaner JSON output: No longer includes empty "t" field for non-text tokens (BEGIN SENT, END SENT)
- Documentation updates: Fixed JSON output examples to reflect actual output including "o" (original) and "s" (span) fields
Bug Fixes
- Fixed test cases to match new JSON output format
- Updated README.md to accurately document JSON output fields
Installation
pip install --upgrade tokenizer
View on PyPI: https://pypi.org/project/tokenizer/3.5.2/
Version 3.5.1
- Fixed bug in composite glyph handling
Full Changelog: 3.5.0...3.5.1
Version 3.5.0
- Minor refactoring and all-round modernization
- Better handling of colon-separated timestamps
- Improved URI handling (more supported schemes)
- New --version flag in command line interface
- Much more extensive test coverage
- Explicit compatibility with CPython 3.14 and PyPy 3.11
- Moved from black to ruff for linting and formatting
- Moved to pyproject.toml for project configuration and metadata
Full Changelog: 3.4.5...3.5.0
Version 3.4.5
- Compatibility with Python 3.13
- Now requires Python 3.9 or later
Full Changelog: 3.4.4...3.4.5
Version 3.4.4
- Better handling of abbreviations
Full Changelog: 3.4.3...3.4.4
Version 3.4.3
- Various minor fixes.
- Now requires Python 3.8 or later.
Full Changelog: 3.4.2...3.4.3