Releases: rager306/b-pdf-parser
v1.6.0 - Batch Processing Optimization
Batch Processing Optimization (v1.6.0)
Performance Improvements
- Dynamic Worker Scaling: Auto-detect CPU cores using os.cpu_count(), capped at 16 workers
- Batch Processing Optimization: New chunk_size and init_strategy parameters for throughput optimization
- Input Validation: New validate_batch_params() function for parameter validation
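The worker scaling rule above can be illustrated with a minimal sketch. It is not the library's implementation: `pick_worker_count` and `MAX_WORKERS_CAP` are hypothetical names, and the only details taken from the notes are the use of os.cpu_count() and the cap of 16.

```python
import os

# Cap stated in the release notes; the constant name is an assumption.
MAX_WORKERS_CAP = 16


def pick_worker_count(requested=None, cap=MAX_WORKERS_CAP):
    """Hypothetical helper: pick a worker count, never exceeding the cap."""
    if requested is not None:
        if requested < 1:
            raise ValueError("requested must be >= 1")
        return min(requested, cap)
    # os.cpu_count() can return None when the count is undetermined.
    detected = os.cpu_count() or 1
    return min(detected, cap)
```

Under these assumptions, `pick_worker_count(64)` is clamped to 16, while `pick_worker_count()` follows the detected core count on machines with fewer than 16 cores.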
New APIs
- get_optimal_workers(parser_name) - Calculate optimal worker count based on system resources
- get_worker_config(parser_name, max_workers, init_strategy) - Create WorkerConfig dataclass
- validate_batch_params(parser_name, max_workers, chunk_size, init_strategy) - Validate inputs
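The release notes do not show the fields of WorkerConfig or the rules that validate_batch_params() applies, so the following is only a sketch of how such validation might look. `SketchWorkerConfig`, `check_batch_params`, the accepted parser names, and the "eager"/"lazy" strategy values are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed values: the notes do not list the accepted strings.
KNOWN_PARSERS = frozenset({"pymupdf", "pdfplumber", "pypdf", "pdf_oxide"})
KNOWN_STRATEGIES = frozenset({"eager", "lazy"})


@dataclass(frozen=True)
class SketchWorkerConfig:
    """Hypothetical stand-in for a worker configuration record."""
    parser_name: str
    max_workers: int
    chunk_size: int
    init_strategy: str


def check_batch_params(parser_name, max_workers, chunk_size, init_strategy):
    """Collect every problem first, then raise one ValueError listing them all."""
    problems = []
    if parser_name not in KNOWN_PARSERS:
        problems.append(f"unknown parser {parser_name!r}")
    for label, value in (("max_workers", max_workers), ("chunk_size", chunk_size)):
        if not isinstance(value, int) or value < 1:
            problems.append(f"{label} must be a positive integer")
    if init_strategy not in KNOWN_STRATEGIES:
        problems.append(f"unknown init_strategy {init_strategy!r}")
    if problems:
        raise ValueError("; ".join(problems))
    return SketchWorkerConfig(parser_name, max_workers, chunk_size, init_strategy)
```

Reporting all problems in one exception, rather than stopping at the first, is a design choice of this sketch; it saves the caller a fix-and-retry loop when several parameters are wrong.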
Benchmark Results (500 PDFs)
| Metric | Target | Actual | Status |
|---|---|---|---|
| Throughput | 500+ docs/sec | 511 docs/sec | ✅ |
| Worker overhead | <5% | 0.00% | ✅ |
| Validation rate | 100% | 100% | ✅ |
Changes
- Enhanced batch_parse() with chunk_size and init_strategy parameters
- Enhanced batch_parse_from_directory() with optimization parameters
- Added WorkerConfig and BatchResult dataclasses
- Added 40 new tests in tests/test_batch.py
- Total test count: 112+ tests
Files Changed
- pdfparser/batch.py - Batch processing module (MODIFIED)
- tests/test_batch.py - Batch processing tests (NEW)
- CHANGELOG.md - Version history (MODIFIED)
- README.md - Documentation (MODIFIED)
v1.5.0 - Complete Implementation
Indonesian Bank Statement PDF Parser v1.5.0
Performance Optimization
- Pre-compiled regex patterns at module level (~3% improvement)
- lru_cache for dynamic pattern generation
- frozenset for O(1) label membership testing
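The three optimizations above can be shown together in a small sketch. The patterns, the label set, and the function names here are hypothetical; the parser's actual patterns are not part of these notes.

```python
import re
from functools import lru_cache

# Compiled once at import time instead of on every call.
DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{2,4})\b")

# Hypothetical label set; frozenset gives average O(1) membership tests.
METADATA_LABELS = frozenset({"account no", "currency", "period"})


@lru_cache(maxsize=128)
def label_pattern(label):
    """Build a 'Label : value' pattern; repeated labels reuse the cached object."""
    return re.compile(rf"^\s*{re.escape(label)}\s*:\s*(.+?)\s*$", re.IGNORECASE)


def extract_field(line, label):
    """Return the value after a known label, or None if nothing matches."""
    key = label.lower()
    if key not in METADATA_LABELS:
        return None
    match = label_pattern(key).match(line)
    return match.group(1) if match else None
```

For example, `extract_field("Account No : 1234567890", "Account No")` returns `"1234567890"`, and a second call with the same label skips pattern compilation because `label_pattern` is cached.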
Features Added
- PDFParser class - Class-based interface with configurable parser
- Turnover verification - Verify PDF summary totals against calculated sums
- Extended metadata - valuta, transaction_period, unit_address, totals
- Batch processing - Process 1000+ files in parallel with ProcessPoolExecutor
- CSV format - Semicolon delimiter, standard number format
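As an illustration of the turnover verification and CSV features listed above, here is a minimal sketch. The row keys (`date`, `description`, `debit`, `credit`), both function names, and the reading of "standard number format" as a period decimal separator are assumptions, not details confirmed by the notes.

```python
import csv
import io
from decimal import Decimal

FIELDS = ["date", "description", "debit", "credit"]  # hypothetical column names


def verify_turnover(transactions, stated_debit, stated_credit):
    """Compare totals stated in the PDF summary with sums over the parsed rows."""
    total_debit = sum((Decimal(t["debit"]) for t in transactions), Decimal("0"))
    total_credit = sum((Decimal(t["credit"]) for t in transactions), Decimal("0"))
    return {
        "calculated_debit": total_debit,
        "calculated_credit": total_credit,
        "debit_matches": total_debit == Decimal(stated_debit),
        "credit_matches": total_credit == Decimal(stated_credit),
    }


def to_semicolon_csv(transactions):
    """Serialize rows with ';' as the delimiter and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter=";", lineterminator="\n")
    writer.writeheader()
    writer.writerows(transactions)
    return buffer.getvalue()
```

Decimal is used instead of float so that currency sums compare exactly; whether the parser itself does this is not stated in the notes.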
Benchmark Results (2000 PDFs, 10 workers)
| Parser | Speed | Success Rate |
|---|---|---|
| PyMuPDF | ~468 docs/sec | 100% |
| pypdf | ~15 docs/sec | 100% |
| pdf_oxide | ~22 docs/sec | 0%* |
| pdfplumber | ~9 docs/sec | 100% |
*pdf_oxide parses successfully but fails validation (structure mismatch)
Testing
- 72+ tests with property-based testing (hypothesis)
- All tests pass: uv run pytest tests/
Updated
- README.md with complete API documentation
- All parser implementations optimized
- Comprehensive test suite
Installation
uv sync --python python3.9
uv run python -c "from pdfparser import parse_pdf; print(parse_pdf('statement.pdf'))"
v1.3.0 - Complete Remaining Features
Release v1.3.0 - Complete Remaining Features
Added
- pdf_oxide parser implementation (pdfparser/pdfoxide_parser.py)
  - parse_pdf_pdfoxide() - Main parser using Rust-based pdf_oxide library
  - Fourth parser option for users to choose from
  - Rust-based PDF parsing for modern PDF handling
  - Multiprocessing safe with no global state
  - Compatible with Python 3.9
- UV package management support
  - pyproject.toml - Project configuration for UV
  - uv sync --python python3.9 - Fast dependency installation
  - Dev dependencies: pytest, hypothesis, ruff, pyrefly
  - Reproducible environments with lock file support
- Test suite framework (tests/)
  - pytest and hypothesis for property-based testing
  - tests/__init__.py - Test module with shared fixtures
  - tests/test_parsers.py - Parser integration tests (44 tests)
  - tests/test_utils.py - Utility function tests with hypothesis
  - Tests cover all 4 parsers with parametrized test cases
- Benchmark tool (benchmark.py)
  - CLI interface with argparse for --parsers, --test-dir, --max-files, --max-workers
  - ProcessPoolExecutor for parallel parsing
  - Metrics collection: time per file, time per page, throughput
  - Success rate calculation using is_valid_parse()
  - Output to benchmark_results.csv
  - Tabulate table display for results
- Batch processing module (pdfparser/batch.py)
  - batch_parse() - Parallel processing of multiple PDF files
  - batch_parse_from_directory() - Process all PDFs in a directory
  - ProcessPoolExecutor for parallel file processing
  - Per-file CSV saving to metadata/ and transactions/ directories
  - Error handling with failure information in results
- Test data generator (generate_test_pdfs.py)
  - CLI interface with argparse for --num, --output-dir, --min-pages, --max-pages, --min-transactions, --max-transactions
  - Random realistic data (account numbers, names, amounts)
  - reportlab-based PDF generation with bank statement format
  - Configurable page count (1-10) and transactions per page (100-500)
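The batch processing item above describes parallel parsing with ProcessPoolExecutor and failure information kept in the results. A minimal sketch of that pattern follows; `batch_parse_sketch`, its parameters, and the result layout are hypothetical and do not reproduce the signature of batch_parse().

```python
# ThreadPoolExecutor is imported only so it can be passed as executor_cls in quick checks.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed


def batch_parse_sketch(paths, parse_one, max_workers=4, executor_cls=ProcessPoolExecutor):
    """Parse files in parallel and record per-file failures instead of raising.

    With ProcessPoolExecutor, parse_one must be picklable, which in practice
    means a function defined at module level.
    """
    results = {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(parse_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = {"ok": True, "data": future.result()}
            except Exception as exc:  # one bad file must not abort the batch
                results[path] = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return results
```

Passing `executor_cls=ThreadPoolExecutor` runs the same logic in threads, which is convenient for tests that use a stub `parse_one`. On platforms that spawn worker processes, calls using the process pool should sit under an `if __name__ == "__main__":` guard.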
Test Results
| Parser | Example_statement.pdf | REKENING_KORAN...pdf | JAN-2024.pdf |
|---|---|---|---|
| PyMuPDF | 47 txns, valid=True | 14 txns, valid=True | 15 txns, valid=True |
| pdfplumber | 47 txns, valid=True | 14 txns, valid=True | 15 txns, valid=True |
| pypdf | 47 txns, valid=True | 14 txns, valid=True | 15 txns, valid=True |
| pdf_oxide | 47 txns, valid=True | 14 txns, valid=True | 15 txns, valid=True |
All 44 tests pass with pytest and hypothesis property-based testing.
Release v1.1.1
Release v1.1.1 - 2024-12-28
Fixed
- pdfplumber metadata extraction - Added fallback to English patterns when Indonesian patterns yield fewer than 2 fields
- Ensures is_valid_parse() returns True for English-labelled PDFs
- Metadata coverage now matches PyMuPDF parser (4/4 fields)
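The fallback described above (try Indonesian patterns first, fall back to English when fewer than 2 fields are found) could look roughly like this. The patterns and field names are invented for illustration; only the threshold of 2 comes from the notes.

```python
import re

# Hypothetical patterns; the parser's real patterns are not shown in these notes.
INDONESIAN_PATTERNS = {
    "account_no": re.compile(r"No\.?\s*Rekening\s*:\s*(\S+)", re.IGNORECASE),
    "currency": re.compile(r"Valuta\s*:\s*(\S+)", re.IGNORECASE),
}
ENGLISH_PATTERNS = {
    "account_no": re.compile(r"Account\s*No\.?\s*:\s*(\S+)", re.IGNORECASE),
    "currency": re.compile(r"Currency\s*:\s*(\S+)", re.IGNORECASE),
}


def match_fields(text, patterns):
    """Return {field: value} for every pattern that matches the text."""
    found = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1)
    return found


def extract_metadata(text, min_fields=2):
    """Try Indonesian labels first; fall back to English below the threshold."""
    fields = match_fields(text, INDONESIAN_PATTERNS)
    if len(fields) < min_fields:
        english = match_fields(text, ENGLISH_PATTERNS)
        # Indonesian hits take precedence; English fills the remaining fields.
        fields = {**english, **fields}
    return fields
```

Merging the two result sets, rather than replacing one with the other, is a choice made in this sketch so that a partially Indonesian document keeps whatever was already matched.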
Test Results
All sample PDFs now parse successfully with both parsers:
| Parser | Example_statement.pdf | REKENING_KORAN.pdf | JAN-2024.pdf |
|---|---|---|---|
| PyMuPDF | 47 txns, valid=True | 14 txns, valid=True | 15 txns, valid=True |
| pdfplumber | 47 txns, valid=True | 14 txns, valid=True | 15 txns, valid=True |
See CHANGELOG.md for full history.