30 Dec 07:46

rager306

924ecb9

v1.6.0 - Batch Processing Optimization

Batch Processing Optimization (v1.6.0)

Performance Improvements

Dynamic Worker Scaling: Auto-detect CPU cores using os.cpu_count(), capped at 16 workers
Batch Processing Optimization: New chunk_size and init_strategy parameters for tuning throughput
Input Validation: New validate_batch_params() function for parameter validation
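The worker-scaling rule above can be sketched roughly as follows. This is only an illustration, not the project's get_optimal_workers() implementation: the helper name cap_workers and the fallback to 1 when os.cpu_count() returns None are assumptions. Only the use of os.cpu_count() and the cap of 16 come from these notes.

```python
import os
from typing import Optional

MAX_WORKERS_CAP = 16  # cap stated in the release notes


def cap_workers(requested: Optional[int] = None) -> int:
    """Return a worker count clamped to the range 1..MAX_WORKERS_CAP.

    Hypothetical helper: when `requested` is None, use the detected
    CPU count. os.cpu_count() can return None, so fall back to 1.
    """
    detected = os.cpu_count() or 1
    wanted = detected if requested is None else requested
    return max(1, min(wanted, MAX_WORKERS_CAP))
```

Clamping at both ends means an explicit request such as 64 is reduced to 16, and a request of 0 still yields one worker.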

New APIs

get_optimal_workers(parser_name) - Calculate optimal worker count based on system resources
get_worker_config(parser_name, max_workers, init_strategy) - Create WorkerConfig dataclass
validate_batch_params(parser_name, max_workers, chunk_size, init_strategy) - Validate inputs
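A minimal sketch of how parameter validation and a frozen config dataclass might fit together. Everything here is hypothetical: the names SketchWorkerConfig, check_batch_params, and make_config, the parser identifier strings, and the init_strategy values "eager" and "lazy" are assumptions for illustration, since these notes do not list the values the real API accepts.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed identifiers: the notes name four parsers but not the exact strings.
KNOWN_PARSERS = frozenset({"pymupdf", "pdfplumber", "pypdf", "pdfoxide"})
# Assumed values: the accepted init_strategy options are not documented here.
KNOWN_STRATEGIES = frozenset({"eager", "lazy"})


@dataclass(frozen=True)
class SketchWorkerConfig:
    """Immutable bundle of validated worker settings (hypothetical shape)."""
    parser_name: str
    max_workers: int
    init_strategy: str


def check_batch_params(parser_name: str, max_workers: int,
                       chunk_size: Optional[int], init_strategy: str) -> List[str]:
    """Collect every problem found instead of stopping at the first one."""
    errors = []
    if parser_name not in KNOWN_PARSERS:
        errors.append(f"unknown parser: {parser_name!r}")
    if not isinstance(max_workers, int) or max_workers < 1:
        errors.append("max_workers must be a positive integer")
    if chunk_size is not None and (not isinstance(chunk_size, int) or chunk_size < 1):
        errors.append("chunk_size must be a positive integer or None")
    if init_strategy not in KNOWN_STRATEGIES:
        errors.append(f"unknown init_strategy: {init_strategy!r}")
    return errors


def make_config(parser_name: str, max_workers: int,
                init_strategy: str) -> SketchWorkerConfig:
    """Validate first, then build the config; raise on any error."""
    errors = check_batch_params(parser_name, max_workers, None, init_strategy)
    if errors:
        raise ValueError("; ".join(errors))
    return SketchWorkerConfig(parser_name, max_workers, init_strategy)
```

Returning a list of messages lets a caller report all invalid parameters at once, which tends to be friendlier for batch jobs than failing on the first bad value.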

Benchmark Results (500 PDFs)

Metric	Target	Actual	Status
Throughput	500+ docs/sec	511 docs/sec	✅
Worker overhead	<5%	0.00%	✅
Validation rate	100%	100%	✅
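For context on the throughput row: if throughput is computed as documents divided by wall-clock time (an assumption, since the measurement method is not described here), 500 PDFs at 511 docs/sec corresponds to roughly 0.98 seconds of elapsed time. A small sketch of such a measurement, with hypothetical helper names:

```python
import time
from typing import Callable, Sequence


def docs_per_second(n_docs: int, elapsed_seconds: float) -> float:
    """Throughput as documents per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return n_docs / elapsed_seconds


def time_batch(run_batch: Callable[[Sequence[str]], object],
               paths: Sequence[str]) -> float:
    """Run `run_batch` once over `paths` and return the observed throughput."""
    start = time.perf_counter()
    run_batch(paths)
    return docs_per_second(len(paths), time.perf_counter() - start)
```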

Changes

Enhanced batch_parse() with chunk_size and init_strategy parameters
Enhanced batch_parse_from_directory() with optimization parameters
Added WorkerConfig and BatchResult dataclasses
Added 40 new tests in tests/test_batch.py
Total test count: 112+ tests
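To illustrate what a chunk_size parameter typically controls, here is a sketch built on the chunksize argument of ProcessPoolExecutor.map from the standard library. It is not the project's batch_parse() code: pick_chunk_size, parse_all, and the "four chunks per worker" heuristic are assumptions chosen for the example.

```python
import math
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Iterable, List, Optional


def pick_chunk_size(n_files: int, n_workers: int, chunks_per_worker: int = 4) -> int:
    """Heuristic (assumed): give each worker a few chunks so that slow
    files do not leave other workers idle at the end of the run."""
    if n_files <= 0 or n_workers <= 0:
        return 1
    return max(1, math.ceil(n_files / (n_workers * chunks_per_worker)))


def parse_all(paths: Iterable[str], parse_one: Callable[[str], object],
              max_workers: int, chunk_size: Optional[int] = None) -> List[object]:
    """Parse files in parallel, sending `chunk_size` paths per task.

    `parse_one` must be a picklable, module-level function.
    """
    paths = list(paths)
    if chunk_size is None:
        chunk_size = pick_chunk_size(len(paths), max_workers)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # Larger chunks reduce inter-process messaging overhead per file.
        return list(pool.map(parse_one, paths, chunksize=chunk_size))
```

With 500 files and 10 workers this heuristic picks chunks of 13 files. On platforms that start workers with spawn, call parse_all from under an `if __name__ == "__main__":` guard.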

Files Changed

pdfparser/batch.py - Batch processing module (MODIFIED)
tests/test_batch.py - Batch processing tests (NEW)
CHANGELOG.md - Version history (MODIFIED)
README.md - Documentation (MODIFIED)

28 Dec 14:30

rager306

v1.5.0

36eb3c7

v1.5.0 - Complete Implementation

Indonesian Bank Statement PDF Parser v1.5.0

Performance Optimization

Pre-compiled regex patterns at module level (~3% improvement)
lru_cache for dynamic pattern generation
frozenset for O(1) label membership testing
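The three techniques listed above can be sketched together as follows. The patterns and label names are illustrative assumptions (including the assumed 1.234.567,89 amount format), not the project's actual regexes.

```python
import re
from functools import lru_cache
from typing import Optional

# Compiled once at import time rather than on every call.
# Assumed amount format: dot thousands separators, comma decimals.
AMOUNT_RE = re.compile(r"^\d{1,3}(?:\.\d{3})*,\d{2}$")

# frozenset gives average O(1) membership tests and cannot be mutated.
METADATA_LABELS = frozenset({"valuta", "transaction_period", "unit_address"})


@lru_cache(maxsize=128)
def label_pattern(label: str) -> "re.Pattern[str]":
    """Build and cache a pattern for 'Label : value' lines."""
    return re.compile(rf"^\s*{re.escape(label)}\s*:\s*(.+?)\s*$", re.IGNORECASE)


def extract_field(line: str, label: str) -> Optional[str]:
    """Return the value after 'label :' on this line, or None."""
    match = label_pattern(label).match(line)
    return match.group(1) if match else None
```

Because label_pattern is cached, repeated lookups for the same label reuse one compiled pattern object instead of rebuilding it for every line.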

Features Added

PDFParser class - Class-based interface with configurable parser
Turnover verification - Verify PDF summary totals against calculated sums
Extended metadata - valuta, transaction_period, unit_address, totals
Batch processing - Process 1000+ files in parallel with ProcessPoolExecutor
CSV format - Semicolon delimiter, standard number format
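Turnover verification, as described above, amounts to comparing the totals printed in the PDF summary with sums recomputed from the parsed rows. A minimal sketch, assuming each transaction is a mapping with hypothetical "debit" and "credit" keys holding Decimal values (the project's real data structure is not shown in these notes):

```python
from decimal import Decimal
from typing import Iterable, Mapping


def verify_turnover(transactions: Iterable[Mapping[str, Decimal]],
                    reported_debit: Decimal,
                    reported_credit: Decimal) -> bool:
    """Return True when recomputed sums equal the reported summary totals."""
    rows = list(transactions)  # materialise so the rows can be summed twice
    debit_sum = sum((row["debit"] for row in rows), Decimal("0"))
    credit_sum = sum((row["credit"] for row in rows), Decimal("0"))
    return debit_sum == reported_debit and credit_sum == reported_credit
```

Decimal is used rather than float so that currency amounts compare exactly instead of differing by rounding error.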

Benchmark Results (2000 PDFs, 10 workers)

Parser	Speed	Success Rate
PyMuPDF	~468 docs/sec	100%
pypdf	~15 docs/sec	100%
pdf_oxide	~22 docs/sec	0%*
pdfplumber	~9 docs/sec	100%

*pdf_oxide parses successfully but fails validation (structure mismatch)

Testing

72+ tests with property-based testing (hypothesis)
All tests pass: uv run pytest tests/

Updated

README.md with complete API documentation
All parser implementations optimized
Comprehensive test suite

Installation

uv sync --python python3.9
uv run python -c "from pdfparser import parse_pdf; print(parse_pdf('statement.pdf'))"

28 Dec 10:59

rager306

v1.3.0

36eb3c7

v1.3.0 - Complete Remaining Features

Release v1.3.0 - Complete Remaining Features

Added

pdf_oxide parser implementation (pdfparser/pdfoxide_parser.py)
- parse_pdf_pdfoxide() - Main entry point, built on the Rust-based pdf_oxide library
- Fourth parser option, alongside PyMuPDF, pdfplumber, and pypdf
- Multiprocessing safe with no global state
- Compatible with Python 3.9
UV package management support
- pyproject.toml - Project configuration for UV
- uv sync --python python3.9 - Fast dependency installation
- Dev dependencies: pytest, hypothesis, ruff, pyrefly
- Reproducible environments with lock file support
Test suite framework (tests/)
- pytest and hypothesis for property-based testing
- tests/__init__.py - Test module with shared fixtures
- tests/test_parsers.py - Parser integration tests (44 tests)
- tests/test_utils.py - Utility function tests with hypothesis
- Tests cover all 4 parsers with parametrized test cases
Benchmark tool (benchmark.py)
- CLI interface with argparse for --parsers, --test-dir, --max-files, --max-workers
- ProcessPoolExecutor for parallel parsing
- Metrics collection: time per file, time per page, throughput
- Success rate calculation using is_valid_parse()
- Output to benchmark_results.csv
- Tabulate table display for results
Batch processing module (pdfparser/batch.py)
- batch_parse() - Parallel processing of multiple PDF files
- batch_parse_from_directory() - Process all PDFs in a directory
- ProcessPoolExecutor for parallel file processing
- Per-file CSV saving to metadata/ and transactions/ directories
- Error handling with failure information in results
Test data generator (generate_test_pdfs.py)
- CLI interface with argparse for --num, --output-dir, --min-pages, --max-pages, --min-transactions, --max-transactions
- Random realistic data (account numbers, names, amounts)
- reportlab-based PDF generation with bank statement format
- Configurable page count (1-10) and transactions per page (100-500)
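As an illustration of the benchmark CLI described above, the four flags could be declared with argparse along these lines. The flag names come from these notes; the defaults, help strings, and the build_parser name are assumptions, and the real benchmark.py may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the benchmark flags named in the notes (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Benchmark PDF parsers (sketch)")
    parser.add_argument("--parsers", nargs="+", default=["pymupdf"],
                        help="one or more parser names to benchmark")
    parser.add_argument("--test-dir", default="test_pdfs",
                        help="directory containing the PDFs to parse")
    parser.add_argument("--max-files", type=int, default=None,
                        help="limit on the number of files (None means all)")
    parser.add_argument("--max-workers", type=int, default=4,
                        help="number of worker processes")
    return parser
```

argparse maps dashes to underscores, so `--max-files 100` is read back as `args.max_files`.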

Test Results

Parser	Example_statement.pdf	REKENING_KORAN...pdf	JAN-2024.pdf
PyMuPDF	47 txns, valid=True	14 txns, valid=True	15 txns, valid=True
pdfplumber	47 txns, valid=True	14 txns, valid=True	15 txns, valid=True
pypdf	47 txns, valid=True	14 txns, valid=True	15 txns, valid=True
pdf_oxide	47 txns, valid=True	14 txns, valid=True	15 txns, valid=True

All 44 tests pass with pytest and hypothesis property-based testing.

28 Dec 08:43

rager306

v1.1.1

39ff072

Release v1.1.1

Release v1.1.1 - 2024-12-28

Fixed

pdfplumber metadata extraction - Added fallback to English patterns when Indonesian patterns yield fewer than 2 fields
Ensures is_valid_parse() returns True for English-labelled PDFs
Metadata coverage now matches PyMuPDF parser (4/4 fields)
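The fallback described above can be sketched as follows. The label text, field names, and the choice to merge English matches with any Indonesian ones are assumptions for illustration. Only the "fewer than 2 fields" threshold comes from these notes.

```python
import re
from typing import Dict

# Hypothetical label sets; the project's real patterns are not shown here.
INDONESIAN_PATTERNS = {
    "account_no": re.compile(r"No\.?\s*Rekening\s*:\s*(\S+)", re.IGNORECASE),
    "product_name": re.compile(r"Nama\s+Produk\s*:\s*(.+)", re.IGNORECASE),
    "currency": re.compile(r"Valuta\s*:\s*(\S+)", re.IGNORECASE),
    "period": re.compile(r"Periode\s+Transaksi\s*:\s*(.+)", re.IGNORECASE),
}
ENGLISH_PATTERNS = {
    "account_no": re.compile(r"Account\s+No\.?\s*:\s*(\S+)", re.IGNORECASE),
    "product_name": re.compile(r"Product\s+Name\s*:\s*(.+)", re.IGNORECASE),
    "currency": re.compile(r"Currency\s*:\s*(\S+)", re.IGNORECASE),
    "period": re.compile(r"Transaction\s+Period\s*:\s*(.+)", re.IGNORECASE),
}
MIN_FIELDS = 2  # threshold from the notes: fall back below 2 matched fields


def _extract(text: str, patterns: Dict[str, "re.Pattern[str]"]) -> Dict[str, str]:
    """Apply each pattern and keep the first match per field."""
    found = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1).strip()
    return found


def extract_metadata(text: str) -> Dict[str, str]:
    """Try Indonesian labels first; fall back to English when too few match."""
    fields = _extract(text, INDONESIAN_PATTERNS)
    if len(fields) < MIN_FIELDS:
        # Assumed merge policy: keep Indonesian hits, fill gaps from English.
        fields = {**_extract(text, ENGLISH_PATTERNS), **fields}
    return fields
```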

Test Results

All sample PDFs now parse successfully with both parsers:

Parser	Example_statement.pdf	REKENING_KORAN.pdf	JAN-2024.pdf
PyMuPDF	47 txns, valid=True	14 txns, valid=True	15 txns, valid=True
pdfplumber	47 txns, valid=True	14 txns, valid=True	15 txns, valid=True

See CHANGELOG.md for full history.

Releases: rager306/b-pdf-parser
