Releases: mideind/Tokenizer

Version 3.6.0

11 Dec 17:28

Changes

  • Removed the deprecated --handle_kludgy_ordinals CLI flag
  • Removed the handle_kludgy_ordinals API option and related constants (KLUDGY_ORDINALS_PASS_THROUGH, KLUDGY_ORDINALS_MODIFY, KLUDGY_ORDINALS_TRANSLATE)
  • Kludgy ordinals (e.g. 1sti, 3ja) are now always passed through unchanged as word tokens
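Callers that relied on the removed option to spell out kludgy ordinals can apply a similar step to the word tokens themselves. The following is a minimal sketch of such a post-processing pass, not part of the Tokenizer API: the helper name `expand_kludgy_ordinals` and the two-entry mapping are assumptions chosen for illustration, and the mapping does not reproduce the table the library formerly used.

```python
# Hypothetical post-processing helper; NOT part of the Tokenizer API.
# The mapping is a small illustrative sample, not the library's former table.
KLUDGY_ORDINAL_WORDS = {
    "1sti": "fyrsti",
    "3ja": "þriðja",
}


def expand_kludgy_ordinals(words):
    """Replace known kludgy ordinals with spelled-out forms; keep other words."""
    return [KLUDGY_ORDINAL_WORDS.get(word, word) for word in words]


# Words that are not in the mapping are returned unchanged.
print(expand_kludgy_ordinals(["Hann", "varð", "1sti", "í", "röðinni"]))
```

Keeping the mapping in application code makes the substitution explicit and lets each project decide which forms it wants to rewrite.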

CI Improvements

  • Fixed PyPy 3.11 build by only installing dev dependencies (including mypy) on Python 3.9
  • Added [test] optional dependency for lightweight test-only installations

Version 3.5.5

13 Nov 10:35

Version 3.5.5: Maintenance Release

This maintenance release focuses on infrastructure improvements and enhanced package metadata, making Tokenizer more discoverable and easier to work with.

What's Changed

Package Metadata & Infrastructure:

  • Modernized pyproject.toml with PEP 517/518 build system configuration
  • Enhanced PyPI discoverability with comprehensive keywords (nlp, natural-language-processing, sentence-segmentation, etc.)
  • Added project URLs for documentation, issues, and changelog
  • Improved package description for better clarity

Development & Tooling:

  • Added minimum version constraints to development dependencies (pytest>=7.0, ruff>=0.1.0, mypy>=1.0)
  • Consolidated tooling configuration (ruff, mypy, pytest) in pyproject.toml
  • Type checking improvements across source and test files
  • Removed obsolete type: ignore comments

CI/CD:

  • Updated build matrix for setuptools compatibility with PyPy
  • Now testing: CPython 3.9-3.14, PyPy 3.11

Installation

pip install tokenizer==3.5.5

Full Changelog: 3.5.4...3.5.5

Version 3.5.4: Improved dash and hyphen handling

12 Nov 18:13

Summary

Version 3.5.4 improves dash and hyphen handling in the tokenizer, addressing GitHub issue #51 and several related edge cases.

Key Improvements

Dash Handling

  • Free-standing hyphens: Hyphens with spaces around them (e.g., word - word) now preserve those spaces instead of being collapsed
  • Year ranges: Hyphens between years normalize to en-dash when normalize=True, following Icelandic spelling rules (1914-1918 → 1914–1918)
  • Em-dashes: Always treated as centered punctuation with spaces on both sides (word—word → word — word)
  • Consecutive dashes: Multiple identical dashes (--, ––, ——) are handled as single tokens and preserve their spacing
  • Edge case fix: Correctly handles 1914 -1918 where -1918 might appear to be a negative number but is actually part of a year range
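Two of the rules above can be illustrated with a short standalone sketch. This is not the library's implementation: the function `normalize_dashes` and its regular expressions are assumptions made for this example, and the sketch covers only the simple cases (a hyphen directly between two four-digit years, and runs of em-dashes), not edge cases such as 1914 -1918.

```python
import re

EN_DASH = "\u2013"
EM_DASH = "\u2014"

# Two four-digit years joined directly by a hyphen, e.g. 1914-1918
YEAR_RANGE = re.compile(r"\b(\d{4})-(\d{4})\b")
# One or more em-dashes, together with any whitespace around them
EM_DASH_RUN = re.compile(r"\s*(" + EM_DASH + r"+)\s*")


def normalize_dashes(text):
    """Illustrative only: en-dash for year ranges, centered em-dashes."""
    # Replace the hyphen between the two years with an en-dash
    text = YEAR_RANGE.sub(r"\1" + EN_DASH + r"\2", text)
    # Put exactly one space on each side of every em-dash run
    return EM_DASH_RUN.sub(r" \1 ", text)


print(normalize_dashes("1914-1918"))
print(normalize_dashes("word" + EM_DASH + "word"))
```

A free-standing hyphen such as word - word matches neither pattern, so its surrounding spaces are left untouched, in line with the first rule in the list.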

Documentation Updates

  • Added comprehensive "Dash and Hyphen Handling" section explaining context-specific behavior
  • Updated normalize option descriptions to be specific about dash handling
  • Fixed CSV format documentation (was showing outdated 3-column format, now shows current 5-column format)
  • Added CSV Output Format section matching JSON documentation style

Testing

  • Added comprehensive test suite (test/test_dashes.py) with 93 tests covering all dash scenarios
  • All 150 tests pass (93 new + 57 existing)

Installation

pip install --upgrade tokenizer

View on PyPI: https://pypi.org/project/tokenizer/3.5.4/

Version 3.5.3

26 Sep 15:33

Changes in Version 3.5.3

This is a documentation fix release to ensure the project description displays correctly on PyPI.

Fix

  • Updated pyproject.toml to reference README.md instead of README.rst, so that the project description renders properly on PyPI

Installation

pip install --upgrade tokenizer

View on PyPI: https://pypi.org/project/tokenizer/3.5.3/

Version 3.5.2

26 Sep 14:41

Changes in Version 3.5.2

Improvements

  • BIN_Tuple representation: Now displays as plain tuple format in __str__() and __repr__() methods, matching documentation examples
  • Cleaner JSON output: No longer includes empty "t" field for non-text tokens (BEGIN SENT, END SENT)
  • Documentation updates: Fixed JSON output examples to reflect actual output including "o" (original) and "s" (span) fields

Bug Fixes

  • Fixed test cases to match new JSON output format
  • Updated README.md to accurately document JSON output fields

Installation

pip install --upgrade tokenizer

View on PyPI: https://pypi.org/project/tokenizer/3.5.2/

Version 3.5.1

28 Aug 12:51

  • Fixed bug in composite glyph handling

Full Changelog: 3.5.0...3.5.1

Version 3.5.0

27 Aug 11:05

  • Minor refactoring and all-round modernization
  • Better handling of colon-separated timestamps
  • Improved URI handling (more supported schemes)
  • New --version flag in command line interface
  • Much more extensive test coverage
  • Explicit compatibility with CPython 3.14 and PyPy 3.11
  • Moved from black to ruff for linting and formatting
  • Moved to pyproject.toml for project configuration and metadata

Full Changelog: 3.4.5...3.5.0

Version 3.4.5

23 Aug 15:58

  • Compatibility with Python 3.13
  • Now requires Python 3.9 or later

Full Changelog: 3.4.4...3.4.5

Version 3.4.4

07 Aug 14:05

  • Better handling of abbreviations

Full Changelog: 3.4.3...3.4.4

Version 3.4.3

11 Aug 16:21

  • Various minor fixes
  • Now requires Python 3.8 or later

Full Changelog: 3.4.2...3.4.3