Releases: rager306/b-pdf-parser

v1.6.0 - Batch Processing Optimization

30 Dec 07:46

Batch Processing Optimization (v1.6.0)

Performance Improvements

  • Dynamic Worker Scaling: Auto-detect CPU cores using os.cpu_count(), capped at 16 workers
  • Batch Processing Optimization: New chunk_size and init_strategy parameters for throughput optimization
  • Input Validation: New validate_batch_params() function for parameter validation
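The worker scaling rule described above can be sketched in a few lines. This is a minimal illustration, not the code in pdfparser/batch.py: the function name pick_worker_count and the lower bound of 1 are assumptions; only os.cpu_count() and the cap of 16 come from the release notes.

```python
import os

MAX_WORKERS_CAP = 16  # cap stated in the release notes


def pick_worker_count(cap: int = MAX_WORKERS_CAP) -> int:
    """Return the detected CPU count clamped to the range [1, cap].

    Hypothetical helper: the name and the floor of 1 are assumptions.
    """
    detected = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(detected, cap))


if __name__ == "__main__":
    print(f"Would start {pick_worker_count()} worker(s)")
```

The `or 1` fallback matters because os.cpu_count() returns None when the core count cannot be determined.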

New APIs

  • get_optimal_workers(parser_name) - Calculate optimal worker count based on system resources
  • get_worker_config(parser_name, max_workers, init_strategy) - Create WorkerConfig dataclass
  • validate_batch_params(parser_name, max_workers, chunk_size, init_strategy) - Validate inputs
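To show the kind of checks a validation step like this might perform, here is a hedged sketch. Everything in it is illustrative: the names check_batch_params, build_config and SketchWorkerConfig are hypothetical stand-ins, and the accepted parser names and init_strategy values ("eager", "lazy") are assumptions, since the release notes do not list them.

```python
import os
from dataclasses import dataclass
from typing import List, Optional

# Assumed identifiers; the real accepted values are not given in the notes.
KNOWN_PARSERS = frozenset({"pymupdf", "pdfplumber", "pypdf", "pdf_oxide"})
KNOWN_STRATEGIES = frozenset({"eager", "lazy"})


@dataclass(frozen=True)
class SketchWorkerConfig:
    """Hypothetical stand-in for the WorkerConfig dataclass."""
    parser_name: str
    max_workers: int
    init_strategy: str


def check_batch_params(parser_name: str, max_workers: Optional[int],
                       chunk_size: int, init_strategy: str) -> List[str]:
    """Return a list of error messages; an empty list means the input is valid."""
    errors = []
    if parser_name not in KNOWN_PARSERS:
        errors.append(f"unknown parser: {parser_name!r}")
    # None is treated here as "auto-detect" (an assumption).
    if max_workers is not None and (not isinstance(max_workers, int) or max_workers < 1):
        errors.append("max_workers must be a positive integer or None")
    if not isinstance(chunk_size, int) or chunk_size < 1:
        errors.append("chunk_size must be a positive integer")
    if init_strategy not in KNOWN_STRATEGIES:
        errors.append(f"unknown init_strategy: {init_strategy!r}")
    return errors


def build_config(parser_name: str, max_workers: Optional[int],
                 chunk_size: int, init_strategy: str) -> SketchWorkerConfig:
    """Validate first, then resolve max_workers=None to a CPU-based default."""
    errors = check_batch_params(parser_name, max_workers, chunk_size, init_strategy)
    if errors:
        raise ValueError("; ".join(errors))
    workers = max_workers if max_workers is not None else min(os.cpu_count() or 1, 16)
    return SketchWorkerConfig(parser_name, workers, init_strategy)
```

Collecting all errors before raising lets a caller fix every bad parameter in one pass instead of discovering them one at a time.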

Benchmark Results (500 PDFs)

| Metric | Target | Actual |
|---|---|---|
| Throughput | 500+ docs/sec | 511 docs/sec |
| Worker overhead | <5% | 0.00% |
| Validation rate | 100% | 100% |

Changes

  • Enhanced batch_parse() with chunk_size and init_strategy parameters
  • Enhanced batch_parse_from_directory() with optimization parameters
  • Added WorkerConfig and BatchResult dataclasses
  • Added 40 new tests in tests/test_batch.py
  • Total test count: 112+ tests

Files Changed

  • pdfparser/batch.py - Batch processing module (MODIFIED)
  • tests/test_batch.py - Batch processing tests (NEW)
  • CHANGELOG.md - Version history (MODIFIED)
  • README.md - Documentation (MODIFIED)

v1.5.0 - Complete Implementation

28 Dec 14:30

Indonesian Bank Statement PDF Parser v1.5.0

Performance Optimization

  • Pre-compiled regex patterns at module level (~3% improvement)
  • lru_cache for dynamic pattern generation
  • frozenset for O(1) label membership testing
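The three techniques listed above combine naturally. The following sketch shows the general pattern only; the regex, the label set and the function names are hypothetical and are not taken from the parser source.

```python
import re
from functools import lru_cache
from typing import Optional

# Compiled once at import time instead of on every call (hypothetical pattern).
DATE_PATTERN = re.compile(r"^\d{2}/\d{2}/\d{2,4}\b")

# frozenset gives average O(1) membership tests (hypothetical labels).
METADATA_LABELS = frozenset({"No. Rekening", "Nama", "Valuta"})


@lru_cache(maxsize=128)
def label_pattern(label: str) -> "re.Pattern[str]":
    """Build and cache a 'Label : value' pattern for a given label."""
    return re.compile(rf"{re.escape(label)}\s*:\s*(.+)")


def extract_field(line: str, label: str) -> Optional[str]:
    """Return the value after 'label:' in the line, or None if absent."""
    if label not in METADATA_LABELS:
        return None
    match = label_pattern(label).search(line)
    return match.group(1).strip() if match else None
```

Because label_pattern is cached, repeated calls with the same label return the same compiled object rather than recompiling it.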

Features Added

  • PDFParser class - Class-based interface with configurable parser
  • Turnover verification - Verify PDF summary totals against calculated sums
  • Extended metadata - valuta, transaction_period, unit_address, totals
  • Batch processing - Process 1000+ files in parallel with ProcessPoolExecutor
  • CSV format - Semicolon delimiter, standard number format
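Turnover verification and semicolon-delimited CSV output can both be illustrated briefly. This is a sketch under stated assumptions: the row keys (date, debit, credit), the function names and the use of Decimal are choices made for the example, not details confirmed by the release notes.

```python
import csv
import io
from decimal import Decimal
from typing import Dict, List


def verify_turnover(transactions: List[dict], reported_debit: Decimal,
                    reported_credit: Decimal) -> Dict[str, bool]:
    """Compare summed debit and credit columns with the totals printed in the PDF."""
    total_debit = sum((row["debit"] for row in transactions), Decimal("0"))
    total_credit = sum((row["credit"] for row in transactions), Decimal("0"))
    return {
        "debit_ok": total_debit == reported_debit,
        "credit_ok": total_credit == reported_credit,
    }


def to_semicolon_csv(transactions: List[dict]) -> str:
    """Serialize rows as CSV text using ';' as the field delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";")
    writer.writerow(["date", "debit", "credit"])
    for row in transactions:
        writer.writerow([row["date"], str(row["debit"]), str(row["credit"])])
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [
        {"date": "2024-01-02", "debit": Decimal("100.50"), "credit": Decimal("0")},
        {"date": "2024-01-03", "debit": Decimal("0"), "credit": Decimal("250.00")},
    ]
    print(verify_turnover(rows, Decimal("100.50"), Decimal("250.00")))
    print(to_semicolon_csv(rows))
```

Decimal is used instead of float so that summed amounts compare exactly against the reported totals.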

Benchmark Results (2000 PDFs, 10 workers)

| Parser | Speed | Success Rate |
|---|---|---|
| PyMuPDF | ~468 docs/sec | 100% |
| pypdf | ~15 docs/sec | 100% |
| pdf_oxide | ~22 docs/sec | 0%* |
| pdfplumber | ~9 docs/sec | 100% |

*pdf_oxide parses successfully but fails validation (structure mismatch)

Testing

  • 72+ tests with property-based testing (hypothesis)
  • All tests pass: uv run pytest tests/

Updated

  • README.md with complete API documentation
  • All parser implementations optimized
  • Comprehensive test suite

Installation

uv sync --python python3.9
uv run python -c "from pdfparser import parse_pdf; print(parse_pdf('statement.pdf'))"

v1.3.0 - Complete Remaining Features

28 Dec 10:59

Release v1.3.0 - Complete Remaining Features

Added

  • pdf_oxide parser implementation (pdfparser/pdfoxide_parser.py)

    • parse_pdf_pdfoxide() - Main parser using Rust-based pdf_oxide library
    • Fourth parser option for users to choose from
    • Rust-based PDF parsing for modern PDF handling
    • Multiprocessing safe with no global state
    • Compatible with Python 3.9
  • UV package management support

    • pyproject.toml - Project configuration for UV
    • uv sync --python python3.9 - Fast dependency installation
    • Dev dependencies: pytest, hypothesis, ruff, pyrefly
    • Reproducible environments with lock file support
  • Test suite framework (tests/)

    • pytest and hypothesis for property-based testing
    • tests/__init__.py - Test module with shared fixtures
    • tests/test_parsers.py - Parser integration tests (44 tests)
    • tests/test_utils.py - Utility function tests with hypothesis
    • Tests cover all 4 parsers with parametrized test cases
  • Benchmark tool (benchmark.py)

    • CLI interface with argparse for --parsers, --test-dir, --max-files, --max-workers
    • ProcessPoolExecutor for parallel parsing
    • Metrics collection: time per file, time per page, throughput
    • Success rate calculation using is_valid_parse()
    • Output to benchmark_results.csv
    • Tabulate table display for results
  • Batch processing module (pdfparser/batch.py)

    • batch_parse() - Parallel processing of multiple PDF files
    • batch_parse_from_directory() - Process all PDFs in a directory
    • ProcessPoolExecutor for parallel file processing
    • Per-file CSV saving to metadata/ and transactions/ directories
    • Error handling with failure information in results
  • Test data generator (generate_test_pdfs.py)

    • CLI interface with argparse for --num, --output-dir, --min-pages, --max-pages, --min-transactions, --max-transactions
    • Random realistic data (account numbers, names, amounts)
    • reportlab-based PDF generation with bank statement format
    • Configurable page count (1-10) and transactions per page (100-500)
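The benchmark tool's command-line surface, as listed above, could be wired up roughly as follows. Only the four flag names come from the release notes; the argument types, the defaults and the throughput helper are assumptions made for this sketch and may differ from benchmark.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create a CLI parser with the flags named in the release notes.

    Types and defaults below are assumptions for illustration.
    """
    parser = argparse.ArgumentParser(description="Benchmark PDF parsers (sketch)")
    parser.add_argument("--parsers", nargs="+", default=["pymupdf"],
                        help="one or more parser names to benchmark")
    parser.add_argument("--test-dir", default="test_pdfs",
                        help="directory containing the PDFs to parse")
    parser.add_argument("--max-files", type=int, default=None,
                        help="limit on the number of files (None = all)")
    parser.add_argument("--max-workers", type=int, default=None,
                        help="process pool size (None = auto)")
    return parser


def throughput(file_count: int, elapsed_seconds: float) -> float:
    """Return documents per second, or 0.0 when no time has elapsed."""
    if elapsed_seconds <= 0:
        return 0.0
    return file_count / elapsed_seconds


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

argparse converts the dashes in flag names to underscores, so --max-files is read back as args.max_files.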

Test Results

| Parser | Example_statement.pdf | REKENING_KORAN...pdf | JAN-2024.pdf |
|---|---|---|---|
| PyMuPDF | 47 txns, valid=True | 14 txns, valid=True | 15 txns, valid=True |
| pdfplumber | 47 txns, valid=True | 14 txns, valid=True | 15 txns, valid=True |
| pypdf | 47 txns, valid=True | 14 txns, valid=True | 15 txns, valid=True |
| pdf_oxide | 47 txns, valid=True | 14 txns, valid=True | 15 txns, valid=True |

All 44 tests pass with pytest and hypothesis property-based testing.

Release v1.1.1

28 Dec 08:43

Release v1.1.1 - 2024-12-28

Fixed

  • pdfplumber metadata extraction - Added fallback to English patterns when Indonesian patterns yield fewer than 2 fields
  • Ensures is_valid_parse() returns True for English-labelled PDFs
  • Metadata coverage now matches PyMuPDF parser (4/4 fields)
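The fallback described in this fix can be sketched as a two-pass extraction. The patterns, field names and merge behavior below are assumptions for illustration; only the rule "fall back to English patterns when Indonesian patterns yield fewer than 2 fields" comes from the release notes.

```python
import re
from typing import Dict

# Hypothetical patterns; the real ones in the pdfplumber parser may differ.
INDONESIAN_PATTERNS = {
    "account_no": re.compile(r"No\.?\s*Rekening\s*:\s*(\S+)"),
    "account_name": re.compile(r"Nama\s*:\s*(.+)"),
}
ENGLISH_PATTERNS = {
    "account_no": re.compile(r"Account\s*No\.?\s*:\s*(\S+)"),
    "account_name": re.compile(r"Account\s*Name\s*:\s*(.+)"),
}
MIN_FIELDS = 2  # threshold stated in the release notes


def extract_with(patterns: Dict[str, "re.Pattern[str]"], text: str) -> Dict[str, str]:
    """Return every field whose pattern matches somewhere in the text."""
    fields = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1).strip()
    return fields


def extract_metadata(text: str) -> Dict[str, str]:
    """Try Indonesian labels first; fall back to English if too few fields match."""
    fields = extract_with(INDONESIAN_PATTERNS, text)
    if len(fields) < MIN_FIELDS:
        english = extract_with(ENGLISH_PATTERNS, text)
        # Assumption: Indonesian hits win over English ones on conflict.
        fields = {**english, **fields}
    return fields
```

Merging rather than replacing keeps any field the first pass did find, which is one reasonable reading of "fallback" but not necessarily what the parser does.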

Test Results

All sample PDFs now parse successfully with both parsers:

| Parser | Example_statement.pdf | REKENING_KORAN.pdf | JAN-2024.pdf |
|---|---|---|---|
| PyMuPDF | 47 txns, valid=True | 14 txns, valid=True | 15 txns, valid=True |
| pdfplumber | 47 txns, valid=True | 14 txns, valid=True | 15 txns, valid=True |

See CHANGELOG.md for full history.