Releases · mideind/Tokenizer

11 Dec 17:28

vthorsteinsson

v3.6.0

7bffb58

Version 3.6.0 Latest

Latest

Changes

Removed the deprecated --handle_kludgy_ordinals CLI flag
Removed the handle_kludgy_ordinals API option and related constants (KLUDGY_ORDINALS_PASS_THROUGH, KLUDGY_ORDINALS_MODIFY, KLUDGY_ORDINALS_TRANSLATE)
Kludgy ordinals (e.g. 1sti, 3ja) are now always passed through unchanged as word tokens

CI Improvements

Fixed PyPy 3.11 build by only installing dev dependencies (including mypy) on Python 3.9
Added [test] optional dependency for lightweight test-only installations

Assets 2

13 Nov 10:35

vthorsteinsson

3.5.5

28ce86e

Version 3.5.5

Version 3.5.5: Maintenance Release

This maintenance release focuses on infrastructure improvements and enhanced package metadata, making Tokenizer more discoverable and easier to work with.

What's Changed

Package Metadata & Infrastructure:

Modernized pyproject.toml with PEP 517/518 build system configuration
Enhanced PyPI discoverability with comprehensive keywords (nlp, natural-language-processing, sentence-segmentation, etc.)
Added project URLs for documentation, issues, and changelog
Improved package description for better clarity

Development & Tooling:

Enhanced development dependencies with version pins (pytest>=7.0, ruff>=0.1.0, mypy>=1.0)
Consolidated tooling configuration (ruff, mypy, pytest) in pyproject.toml
Type checking improvements across source and test files
Removed obsolete type: ignore comments

CI/CD:

Updated build matrix for setuptools compatibility with PyPy
Now testing: CPython 3.9-3.14, PyPy 3.11

Installation

pip install tokenizer==3.5.5

Full Changelog: 3.5.4...3.5.5

Assets 2

12 Nov 18:13

vthorsteinsson

3.5.4

5fcb625

Version 3.5.4: Improved dash and hyphen handling

Summary

Version 3.5.4 improves dash and hyphen handling in the tokenizer, addressing GitHub issue #51 and several related edge cases.

Key Improvements

Dash Handling

Free-standing hyphens: Hyphens with spaces around them (e.g., word - word) now preserve those spaces instead of being collapsed
Year ranges: Hyphens between years normalize to en-dash when normalize=True, following Icelandic spelling rules (1914-1918 → 1914–1918)
Em-dashes: Always treated as centered punctuation with spaces on both sides (word—word → word — word)
Consecutive dashes: Multiple identical dashes (--, ––, ——) are handled as single tokens and preserve their spacing
Edge case fix: Correctly handles 1914 -1918 where -1918 might appear to be a negative number but is actually part of a year range

Documentation Updates

Added comprehensive "Dash and Hyphen Handling" section explaining context-specific behavior
Updated normalize option descriptions to be specific about dash handling
Fixed CSV format documentation (was showing outdated 3-column format, now shows current 5-column format)
Added CSV Output Format section matching JSON documentation style

Testing

Added comprehensive test suite (test/test_dashes.py) with 93 tests covering all dash scenarios
All 150 tests pass (93 new + 57 existing)

Installation

pip install --upgrade tokenizer

View on PyPI: https://pypi.org/project/tokenizer/3.5.4/

Assets 2

26 Sep 15:33

vthorsteinsson

v3.5.3

0fc2853

Version 3.5.3

Changes in Version 3.5.3

This is a documentation fix release to ensure the project description displays correctly on PyPI.

Fix

Updated pyproject.toml to correctly reference README.md instead of README.rst
This ensures the project description is properly displayed on PyPI

Installation

pip install --upgrade tokenizer

View on PyPI: https://pypi.org/project/tokenizer/3.5.3/

Assets 2

26 Sep 14:41

vthorsteinsson

v3.5.2

407b609

Version 3.5.2

Changes in Version 3.5.2

Improvements

BIN_Tuple representation: Now displays as plain tuple format in __str__() and __repr__() methods, matching documentation examples
Cleaner JSON output: No longer includes empty "t" field for non-text tokens (BEGIN SENT, END SENT)
Documentation updates: Fixed JSON output examples to reflect actual output including "o" (original) and "s" (span) fields

Bug Fixes

Fixed test cases to match new JSON output format
Updated README.md to accurately document JSON output fields

Installation

pip install --upgrade tokenizer

View on PyPI: https://pypi.org/project/tokenizer/3.5.2/

Assets 2

28 Aug 12:51

sveinbjornt

3.5.1

4912e97

Version 3.5.1

Fixed bug in composite glyph handling

Full Changelog: 3.5.0...3.5.1

Assets 2

27 Aug 11:05

sveinbjornt

3.5.0

497d44d

Version 3.5.0

Minor refactoring and all-round modernization
Better handling of colon-separated timestamps
Improved URI handling (more supported schemes)
New --version flag in command line interface
Much more extensive test coverage
Explicit compatibility with CPython 3.14 and PyPy 3.11
Moved from black to ruff for linting and formatting
Moved to pyproject.toml for project configuration and metadata

Full Changelog: 3.4.5...3.5.0

Assets 2

23 Aug 15:58

sveinbjornt

3.4.5

340ecb7

Version 3.4.5

Compatibility with Python 3.13
Now requires Python 3.9 or later

Full Changelog: 3.4.4...3.4.5

Assets 2

07 Aug 14:05

sveinbjornt

3.4.4

8750e9c

Version 3.4.4

Better handling of abbreviations

Full Changelog: 3.4.3...3.4.4

Assets 2

11 Aug 16:21

sveinbjornt

3.4.3

23db0b2

Version 3.4.3

Various minor fixes.
Now requires Python 3.8 or later.

Full Changelog: 3.4.2...3.4.3

Assets 2

Releases: mideind/Tokenizer

Version 3.6.0

Changes

CI Improvements

Uh oh!

Version 3.5.5

Version 3.5.5: Maintenance Release

What's Changed

Installation

Uh oh!

Version 3.5.4: Improved dash and hyphen handling

Summary

Key Improvements

Dash Handling

Documentation Updates

Testing

Installation

Uh oh!

Version 3.5.3

Changes in Version 3.5.3

Fix

Installation

Uh oh!

Version 3.5.2

Changes in Version 3.5.2

Improvements

Bug Fixes

Installation

Uh oh!

Version 3.5.1

Uh oh!

Version 3.5.0

Uh oh!

Version 3.4.5

Uh oh!

Version 3.4.4

Uh oh!

Version 3.4.3

Uh oh!