
Add Python bindings for ArcadeDB #2696

Merged
robfrank merged 1 commit into ArcadeData:main from humemai:python-embedded
Nov 5, 2025

@tae898 tae898 commented Oct 23, 2025

Previous discussion (outdated)

Python Embedded Bindings for ArcadeDB

What does this PR do?

This PR introduces native Python bindings for ArcadeDB that embed the Java database engine directly in Python processes using JPype. It provides a Pythonic API to ArcadeDB's multi-model database capabilities (Graph, Document, Key/Value, Vector, Time Series) with three distribution options tailored to different use cases.

Key Additions:

  • Complete Python package (arcadedb_embedded) with ~3,200 lines of production code
  • Three distribution variants: headless (~94MB), minimal (~97MB with Studio UI), full (~158MB with Gremlin + GraphQL)
  • Comprehensive test suite: 41 tests across 6 test files, 1,847 lines of test code, 100% passing
  • Full documentation site: 63 markdown files with MkDocs, including API reference, user guides, and examples
  • Automated build system: Docker-based multi-stage builds for all three distributions
  • CI/CD workflows: Automated testing, building, PyPI publishing, and docs deployment via GitHub Actions

Motivation

ArcadeDB is a Java-based multi-model database, but Python is the dominant language in data science, AI/ML, and modern web development. This integration enables:

  1. Embedded database access: Run ArcadeDB directly in Python processes without external servers
  2. Simplified deployment: Self-contained wheels with all JARs bundled (just needs JRE 11+)
  3. AI/ML integration: Native vector storage with HNSW indexing for embeddings
  4. Developer experience: Pythonic API with context managers, type hints, and proper error handling
  5. Multi-model flexibility: Access Graph, Document, Key/Value, Vector, and Time Series models from Python

Why Three Distributions?

  • Headless (production): Core database only, minimal size, no UI dependencies
  • Minimal (development): Adds Studio web UI (~2MB overhead) for visual debugging
  • Full (Gremlin users): Adds Gremlin graph traversal language and GraphQL support

Related Issues

#2662

Architecture Decisions:

  1. Embedded vs Client/Server: Chose embedded mode as primary use case (client mode via HTTP is also supported)
  2. Three packages vs one: Allows users to choose minimal dependencies based on their needs
  3. MkDocs for docs: Material theme provides excellent UX and search functionality

Technical Overview

Package Structure

bindings/python/
├── src/arcadedb_embedded/        # Main package (316 lines core.py + 7 other modules)
│   ├── __init__.py               # Public API exports
│   ├── core.py                   # Database and DatabaseFactory
│   ├── server.py                 # ArcadeDBServer for HTTP mode (225 lines)
│   ├── results.py                # ResultSet and Result wrappers
│   ├── transactions.py           # TransactionContext manager
│   ├── vector.py                 # Vector search and HNSW indexing (142 lines)
│   ├── importer.py               # CSV, JSON, JSONL, Neo4j import (726 lines)
│   ├── exceptions.py             # ArcadeDBError exception
│   └── jvm.py                    # JVM lifecycle management
├── tests/                        # 41 tests across 6 files (1,847 lines total)
│   ├── test_core.py              # 13 tests: CRUD, transactions, queries, graphs, vectors
│   ├── test_server.py            # 6 tests: HTTP API, Studio, configuration
│   ├── test_concurrency.py       # 4 tests: File locking, thread safety, multi-process
│   ├── test_server_patterns.py   # 4 tests: Embedded + HTTP best practices
│   ├── test_importer.py          # 13 tests: CSV, JSON, JSONL, Neo4j import
│   └── test_gremlin.py           # 1 test: Gremlin query language (full only)
├── docs/                         # 63 markdown files (15,000+ lines)
│   ├── getting-started/          # Installation, quickstart, distributions
│   ├── guide/                    # User guides (core, server, vectors, import, graphs)
│   ├── api/                      # API reference for all modules
│   └── development/              # Testing, contributing, architecture, troubleshooting
├── build-all.sh                  # Unified Docker build script for all distributions
├── Dockerfile.build              # Multi-stage Docker build (177 lines)
├── setup_jars.py                 # Copies JARs to package based on distribution (172 lines)
├── extract_version.py            # Extracts version from parent pom.xml (61 lines)
├── write_version.py              # Writes _version.py during build (41 lines)
├── pyproject.toml                # Python package configuration
└── mkdocs.yml                    # Documentation site configuration

Build System

Docker-based multi-stage builds ensure reproducibility:

  1. Stage 1: Build Java components with Maven (all modules)
  2. Stage 2: Build Python wheel with specific JAR subset based on distribution
  3. Stage 3: Run pytest test suite in isolated environment
  4. Stage 4: Export built wheel for distribution

Single command builds all three distributions:

cd bindings/python && ./build-all.sh

CI/CD Workflows

Three GitHub Actions workflows added to .github/workflows/:

  1. test-python-bindings.yml: Runs pytest on every push/PR
  2. release-python-packages.yml: Builds and publishes to PyPI when the release tag contains "python"
  3. deploy-python-docs.yml: Builds and deploys MkDocs to GitHub Pages

API Coverage

The bindings provide ~85% coverage of Java API features relevant to Python developers:

| Feature | Coverage | Notes |
|---|---|---|
| Database CRUD | ✅ 100% | create, open, drop, exists |
| Queries | ✅ 100% | SQL, Cypher, Gremlin (full), MongoDB syntax |
| Transactions | ✅ 100% | Context manager pattern |
| Schema | ✅ 100% | Document types, vertex types, edge types |
| Indexes | ✅ 90% | LSM, full-text, HNSW vector |
| Server Mode | ✅ 100% | HTTP API + Studio UI |
| Vector Search | ✅ 100% | HNSW similarity search |
| Data Import | ✅ 100% | CSV, JSON, JSONL, Neo4j |
| Graph API | ⚠️ 60% | Basic operations (Python-relevant subset) |
| Gremlin | ⚠️ 70% | Query execution (full dist only) |

Testing

41 tests; every test that runs passes in all three distributions, and tests for features a distribution does not bundle are skipped:

  • Headless: 34 passed, 7 skipped (server/Gremlin tests)
  • Minimal: 38 passed, 3 skipped (Gremlin tests)
  • Full: 41 passed, 0 skipped
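
The pass/skip split above can be illustrated with a small, self-contained sketch. It uses the standard-library `unittest` skip decorators rather than the PR's actual pytest setup, and the `FEATURES` table, the `requires` helper, and the three placeholder tests are hypothetical names chosen for illustration; the real suite may detect features differently.

```python
import io
import unittest

# Hypothetical feature sets per distribution (an assumption for this sketch);
# the real suite may instead probe for bundled JARs at runtime.
FEATURES = {
    "headless": set(),
    "minimal": {"server"},
    "full": {"server", "gremlin"},
}


def requires(feature, distribution):
    """Skip the decorated test unless `distribution` bundles `feature`."""
    return unittest.skipUnless(
        feature in FEATURES[distribution],
        f"{feature} is not bundled in the {distribution} distribution",
    )


def build_suite(distribution):
    """Build a tiny placeholder suite whose skips depend on the distribution."""

    class DistributionTests(unittest.TestCase):
        def test_core(self):
            self.assertTrue(True)  # core tests run everywhere

        @requires("server", distribution)
        def test_server(self):
            self.assertTrue(True)

        @requires("gremlin", distribution)
        def test_gremlin(self):
            self.assertTrue(True)

    return unittest.defaultTestLoader.loadTestsFromTestCase(DistributionTests)


def run(distribution):
    """Return (tests started, tests skipped, overall success)."""
    runner = unittest.TextTestRunner(stream=io.StringIO())
    result = runner.run(build_suite(distribution))
    return result.testsRun, len(result.skipped), result.wasSuccessful()
```

A skipped test still counts as "started" but never as a failure, which is why a distribution can report skips and a fully passing run at the same time.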

Test categories:

  • Core operations: Database lifecycle, queries, transactions, schema
  • Server mode: HTTP endpoints, Studio UI, configuration
  • Concurrency: Thread safety, file locking, multi-process isolation
  • Vector search: HNSW indexing, similarity queries, distance metrics
  • Data import: CSV, JSON, JSONL, Neo4j graph import
  • Graph operations: Vertices, edges, traversals
  • Gremlin: Graph query language (full distribution only)

Additional Notes

Documentation

Comprehensive documentation site built with MkDocs (Material theme):

  • Getting Started: Installation guide, 5-minute quickstart, distribution comparison
  • User Guide: Database operations, queries, transactions, vectors, import, graphs, server mode
  • API Reference: Detailed documentation for all 8 modules
  • Development: Testing guide, architecture overview, contributing, troubleshooting
  • Java API Coverage: Comparison table showing what's implemented

Live site: https://humemai.github.io/arcadedb/latest/

Examples

Added examples/basic.py demonstrating:

  • Database creation and cleanup
  • Schema definition
  • Transactions
  • Queries with multiple languages (SQL, Cypher)
  • Graph operations (vertices, edges)
  • Vector search with HNSW
  • Data import from CSV/JSON

Dependencies

Minimal Python dependencies:

  • Required: jpype1>=1.5.0 (JVM integration)
  • Optional: numpy>=1.20.0 (for vector operations)
  • Dev: pytest, pytest-cov, black, isort, mypy

Java dependencies: All bundled in wheel (no external JARs needed)

Installation Requirements

  • Python 3.8 to 3.12
  • Java Runtime Environment (JRE) 11 or later
  • That's it! Everything else is bundled.

Backward Compatibility

This PR adds a new bindings/python/ directory with no changes to existing Java code or other bindings. It's completely isolated and won't affect existing functionality.

Performance Considerations

  • Direct JVM integration: JPype provides near-native performance
  • No serialization overhead: Direct Java object access in Python
  • Transaction batching: Pythonic context managers ensure proper transaction handling
  • Lazy result iteration: ResultSet provides memory-efficient iteration over large result sets
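
The context-manager pattern mentioned above can be sketched as follows. This is a minimal illustration, not the code in `transactions.py`: `FakeDatabase` is a stand-in that only records calls, and the `begin`/`commit`/`rollback` method names are assumptions made for the sketch.

```python
from contextlib import contextmanager


class FakeDatabase:
    """Stand-in that records calls; NOT the real arcadedb_embedded Database."""

    def __init__(self):
        self.log = []

    def begin(self):
        self.log.append("begin")

    def commit(self):
        self.log.append("commit")

    def rollback(self):
        self.log.append("rollback")


@contextmanager
def transaction(db):
    """Commit if the block finishes normally, roll back if it raises."""
    db.begin()
    try:
        yield db
    except Exception:
        db.rollback()
        raise  # re-raise so the caller still sees the original error
    else:
        db.commit()
```

The point of the design is that the caller cannot forget to commit or roll back: leaving the `with` block always ends the transaction one way or the other.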

Known Limitations

  1. Java required: Cannot run without JRE installed
  2. Single process: File-based locking prevents multiple processes from accessing the same database file (use server mode for multi-process access)
  3. JVM startup time: First database operation incurs ~1-2 second JVM initialization
  4. Memory: JVM requires additional memory overhead (~100-200MB base)

Checklist

  • I have run the build using mvn clean package command
    • ✅ All Java modules build successfully
    • ✅ Docker-based Python build tested for all three distributions
  • My unit tests cover both failure and success scenarios
    • ✅ 41 tests covering happy path and error cases
    • ✅ Transaction rollback on errors
    • ✅ File locking edge cases
    • ✅ Invalid query handling
    • ✅ Missing JAR error handling
    • ✅ Concurrency edge cases

Additional Testing Completed

  • Distribution builds: All three distributions build cleanly via Docker
  • Test coverage: 100% test pass rate across all distributions
  • Documentation: Full docs build without errors via MkDocs
  • Example code: examples/basic.py runs successfully
  • CI workflows: GitHub Actions workflows validated (test, release, docs deploy)
  • Package metadata: PyPI metadata complete (classifiers, keywords, URLs)

tae898 commented Oct 23, 2025

Previous discussion (outdated): I'm sorry to keep closing and recreating PRs. I won't do it again; I'm still trying to find the best way to structure this.

The last PR (#2686) got broken somehow, so I recreated it as this PR.

@gemini-code-assist

Summary of Changes

Hello @tae898, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a significant new feature: native Python bindings for ArcadeDB. It allows Python developers to seamlessly integrate and interact with the powerful Java-based multi-model database directly within their Python applications. The implementation provides a Pythonic interface to ArcadeDB's diverse data models, ensuring high performance through direct JVM integration. This initiative aims to enhance the developer experience for Python users, particularly in data science and AI/ML fields, by offering flexible deployment options and robust tooling for building and managing applications.

Highlights

  • Native Python Bindings: Introduces native Python bindings for ArcadeDB, embedding the Java database engine directly into Python processes using JPype, offering a Pythonic API for multi-model capabilities (Graph, Document, Key/Value, Vector, Time Series).
  • Multiple Distribution Options: Provides three distinct distribution variants: 'headless' (~94MB) for production, 'minimal' (~97MB) including the Studio UI for development, and 'full' (~158MB) with Gremlin and GraphQL support.
  • Comprehensive Ecosystem: Includes a complete Python package (~3,200 lines of code), a comprehensive test suite (41 tests, 100% passing), a full documentation site (63 Markdown files), an automated Docker-based build system, and CI/CD workflows for testing, publishing, and docs deployment.
  • Enhanced Developer Experience: Motivated by Python's dominance in data science, these bindings enable embedded database access, simplified deployment (self-contained wheels), native AI/ML integration (vector storage with HNSW), and a Pythonic API with context managers and type hints.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a massive and impressive pull request that introduces comprehensive Python bindings for ArcadeDB. The work is well-structured, including a robust Docker-based build system, an extensive test suite, and exceptionally detailed documentation. My review focuses on improving the maintainability of the build scripts and correcting some inconsistencies and potential issues within the documentation to ensure clarity and correctness for future users and contributors.

Comment on lines 248 to 250
- `Database` instances are **NOT thread-safe**
- Each thread needs its own `Database` instance
- Transactions are thread-local

high

The documentation incorrectly states that Database instances are not thread-safe. This contradicts the implementation shown in tests/test_concurrency.py::test_thread_safe_operations, where a single Database instance is successfully and safely shared across multiple threads. The underlying Java engine is thread-safe, and this is a key feature. The documentation should be corrected to reflect that Database instances are indeed thread-safe and can be shared across threads within the same process.

Suggested change
- - `Database` instances are **NOT thread-safe**
- - Each thread needs its own `Database` instance
- - Transactions are thread-local
+ - `Database` instances are **thread-safe**
+ - A single `Database` instance can be shared across multiple threads
+ - Transactions are thread-local
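
To make the combination "one shared instance, thread-local transactions" concrete, here is a minimal pure-Python sketch. `SharedDatabase` is a stand-in written for this illustration and is not the bindings' `Database` class; it only models the idea that each thread buffers its own pending writes while committed data lives in one shared, lock-protected place.

```python
import threading


class SharedDatabase:
    """Illustrative stand-in; NOT the real arcadedb_embedded Database."""

    def __init__(self):
        self._local = threading.local()  # per-thread transaction buffer
        self._lock = threading.Lock()    # guards the committed records
        self.records = []

    def begin(self):
        self._local.pending = []

    def insert(self, record):
        self._local.pending.append(record)

    def commit(self):
        with self._lock:
            self.records.extend(self._local.pending)
        self._local.pending = []


def worker(db, worker_id, count):
    """Each thread runs its own transaction against the shared instance."""
    db.begin()
    for i in range(count):
        db.insert((worker_id, i))
    db.commit()


db = SharedDatabase()
threads = [threading.Thread(target=worker, args=(db, w, 100)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the pending buffer lives in `threading.local()`, one thread's uncommitted inserts are never visible to, or mixed with, another thread's transaction.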

Comment on lines 386 to 408
CREATE HNSW INDEX Document.embedding
ON Document(embedding)
WITH m=16, ef=128, efConstruction=128
""")

# Insert vectors
with db.transaction():
    db.command("sql", "INSERT INTO Document SET name = 'doc1', embedding = [1.0, 0.0, 0.0]")
    db.command("sql", "INSERT INTO Document SET name = 'doc2', embedding = [0.9, 0.1, 0.0]")
    db.command("sql", "INSERT INTO Document SET name = 'doc3', embedding = [0.0, 1.0, 0.0]")

# Similarity search
result = db.query("sql", """
SELECT name, cosine_similarity(embedding, [1.0, 0.0, 0.0]) as similarity
FROM Document
ORDER BY similarity DESC
LIMIT 2
""")

docs = list(result)
assert docs[0].get_property("name") == "doc1" # Closest match
assert docs[1].get_property("name") == "doc2" # Second closest
```

high

This documentation block demonstrates creating a vector index and searching using SQL syntax (CREATE HNSW INDEX, cosine_similarity). This is confusing and likely incorrect in the context of the Python bindings, which use the db.create_vector_index() and index.find_nearest() methods as shown in the test code. The SQL-based vector search syntax is not the primary or documented way to perform this operation via the Python API and may not be supported in the embedded engine. This section should be revised to accurately reflect the Python API's usage.
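
Whichever API the docs end up showing, the ranking asserted in the quoted excerpt can be checked without a database. The sketch below is a brute-force cosine-similarity search in plain Python; it is not HNSW and does not use `db.create_vector_index()` or `index.find_nearest()`, whose signatures are not shown in this thread. The helper names `cosine_similarity` and `nearest` are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vectors, query, k):
    """Return the names of the k vectors most similar to `query` (exact search)."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(vectors[name], query),
        reverse=True,
    )
    return ranked[:k]


# The three vectors inserted in the excerpt above.
documents = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.9, 0.1, 0.0],
    "doc3": [0.0, 1.0, 0.0],
}

top2 = nearest(documents, [1.0, 0.0, 0.0], k=2)
```

For these inputs `doc1` is identical to the query, `doc2` is nearly parallel to it, and `doc3` is orthogonal, so an approximate index such as HNSW would be expected to return the same top two on a dataset this small.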

Comment on lines 74 to 99
RUN export ARCADEDB_VERSION=$(python3 extract_version.py --format=pep440 /arcadedb/pom.xml) && \
echo "📦 Python package version: ${ARCADEDB_VERSION}" && \
case ${DISTRIBUTION} in \
headless) \
PACKAGE_NAME="arcadedb-embedded-headless" && \
DESCRIPTION="ArcadeDB embedded Python bindings - Headless distribution (excludes Gremlin, GraphQL, MongoDB/Redis wire protocols, and Studio)" \
;; \
minimal) \
PACKAGE_NAME="arcadedb-embedded-minimal" && \
DESCRIPTION="ArcadeDB embedded Python bindings - Minimal distribution (excludes Gremlin, GraphQL, MongoDB/Redis wire protocols)" \
;; \
full) \
PACKAGE_NAME="arcadedb-embedded" && \
DESCRIPTION="ArcadeDB embedded Python bindings - Full distribution (includes Gremlin, GraphQL, MongoDB/Redis wire protocols, and Studio)" \
;; \
*) \
PACKAGE_NAME="arcadedb-embedded" && \
DESCRIPTION="ArcadeDB embedded Python bindings" \
;; \
esac && \
sed -i 's|^name = .*|name = "'"${PACKAGE_NAME}"'"|' pyproject.toml && \
sed -i 's|^version = .*|version = "'"${ARCADEDB_VERSION}"'"|' pyproject.toml && \
sed -i 's|^description = .*|description = "'"${DESCRIPTION}"'"|' pyproject.toml && \
python3 -m build --wheel && \
echo "✅ Wheel built successfully!" && \
ls -lh dist/

medium

This RUN command is very long and performs multiple distinct tasks (version extraction, modifying pyproject.toml with sed, and building the wheel). For better readability, maintainability, and to leverage Docker's layer caching more effectively, consider splitting this into smaller, more focused RUN commands. Alternatively, this logic could be moved into a helper shell script that is COPY'd and executed.
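
One way to act on this suggestion is to move the `case` statement and the three `sed` calls into a helper that the Dockerfile copies in and runs. The sketch below uses Python rather than shell, on the assumption that it would sit next to the existing helper scripts such as `extract_version.py`; the function names `set_field` and `configure` are hypothetical, and the package names and descriptions are copied from the quoted `RUN` command.

```python
import re

# Package name and description per distribution, copied from the RUN command above.
PACKAGES = {
    "headless": (
        "arcadedb-embedded-headless",
        "ArcadeDB embedded Python bindings - Headless distribution "
        "(excludes Gremlin, GraphQL, MongoDB/Redis wire protocols, and Studio)",
    ),
    "minimal": (
        "arcadedb-embedded-minimal",
        "ArcadeDB embedded Python bindings - Minimal distribution "
        "(excludes Gremlin, GraphQL, MongoDB/Redis wire protocols)",
    ),
    "full": (
        "arcadedb-embedded",
        "ArcadeDB embedded Python bindings - Full distribution "
        "(includes Gremlin, GraphQL, MongoDB/Redis wire protocols, and Studio)",
    ),
}
DEFAULT = ("arcadedb-embedded", "ArcadeDB embedded Python bindings")


def set_field(toml_text, key, value):
    """Rewrite every line starting with `key = `, mirroring `sed 's|^key = .*|...|'`."""
    pattern = re.compile(rf"^{re.escape(key)} = .*$", flags=re.MULTILINE)
    replacement = f'{key} = "{value}"'
    # A function replacement avoids backslash interpretation in `value`.
    new_text, count = pattern.subn(lambda _match: replacement, toml_text)
    if count == 0:
        raise KeyError(f"no '{key} = ...' line found")
    return new_text


def configure(toml_text, distribution, version):
    """Return pyproject.toml text with name, version, and description filled in."""
    name, description = PACKAGES.get(distribution, DEFAULT)
    toml_text = set_field(toml_text, "name", name)
    toml_text = set_field(toml_text, "version", version)
    return set_field(toml_text, "description", description)
```

Unlike the `sed` calls, this version fails loudly when a field is missing from `pyproject.toml`, which is a deliberate difference rather than something the original command does.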

Comment on lines +129 to +158
RUN echo '#!/usr/bin/env python3\n\
import arcadedb_embedded as arcadedb\n\
import tempfile\n\
import shutil\n\
import os\n\
\n\
print("🎮 Testing ArcadeDB Python bindings...")\n\
print(f"📦 Version: {arcadedb.__version__}")\n\
\n\
temp_dir = tempfile.mkdtemp()\n\
db_path = os.path.join(temp_dir, "test_db")\n\
\n\
try:\n\
    with arcadedb.create_database(db_path) as db:\n\
        print("✅ Database created")\n\
\n\
        with db.transaction():\n\
            db.command("sql", "CREATE DOCUMENT TYPE TestDoc")\n\
            db.command("sql", "INSERT INTO TestDoc SET name = '\''docker_test'\'', value = 123")\n\
        print("✅ Transaction committed")\n\
\n\
        result = db.query("sql", "SELECT FROM TestDoc")\n\
        for record in result:\n\
            print(f"✅ Query result: {record.get_property('\''name'\'')} = {record.get_property('\''value'\'')}")\n\
\n\
    print("🎉 All tests passed!")\n\
finally:\n\
    if os.path.exists(temp_dir):\n\
        shutil.rmtree(temp_dir)\n\
' > /test/test_install.py && chmod +x /test/test_install.py

medium

Embedding a large, multi-line script directly into the Dockerfile using echo makes it difficult to read, maintain, and lint. It would be a better practice to store this script in a separate file (e.g., docker/test_install.py) and use a COPY instruction to add it to the image. This would significantly improve the readability and maintainability of the Dockerfile.

2. Click **Draft a new release**
3. Click **Choose a tag** dropdown

2. Click **Choose a tag** → Type `vX.Y.Z-python` → **Create new tag**

medium

This line appears to be a duplicate of the instruction in the preceding step. It should be removed to improve the clarity of the release process documentation.


# Count
result = db.query("sql", "SELECT count(*) as total FROM Person")
total = result[0].get_property('total')

medium

The code `result[0]` suggests that the ResultSet object supports direct indexing. However, based on the API documentation and other examples, ResultSet is an iterator, so accessing it by index will raise an error. To get the first item, either convert it to a list (`list(result)[0]`) or, more efficiently, call `result.next()`.

Suggested change
- total = result[0].get_property('total')
+ total = result.next().get_property('total')
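
The difference between indexing and iterating can be shown with a small stand-in. Both classes below are hypothetical and written only for this sketch: `JavaStyleCursor` imitates a Java iterator with `hasNext()`/`next()`, and `ResultSet` is an assumed shape for a lazy wrapper, not the class in `results.py`.

```python
class JavaStyleCursor:
    """Imitates a Java iterator (hasNext/next); a stand-in for the JVM object."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0

    def hasNext(self):
        return self._pos < len(self._rows)

    def next(self):
        row = self._rows[self._pos]
        self._pos += 1
        return row


class ResultSet:
    """Hypothetical lazy wrapper: rows are pulled one at a time, never indexed."""

    def __init__(self, cursor):
        self._cursor = cursor

    def __iter__(self):
        return self

    def __next__(self):
        if not self._cursor.hasNext():
            raise StopIteration
        return self._cursor.next()

    def next(self):
        """Convenience alias so callers can write result.next()."""
        return next(self)


result = ResultSet(JavaStyleCursor([{"total": 3}]))
try:
    result[0]  # no __getitem__ is defined, so Python raises TypeError
    indexing_failed = False
except TypeError:
    indexing_failed = True

total = result.next()["total"]
```

Since the wrapper defines no `__getitem__`, `result[0]` fails with `TypeError`, while `result.next()` consumes exactly one row and leaves the rest of the cursor untouched.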

Comment on lines 156 to 157
alice = result_alice.next()._java_result
bob = result_bob.next()._java_result

medium

The example for creating edges uses ._java_result, which is a private attribute. Exposing private attributes in documentation is not a good practice as it indicates a leaky abstraction and can be brittle if the internal implementation changes. A public method like as_java() on the Result object would provide a more stable and explicit API. Alternatively, the Result wrapper could proxy the newEdge method.

Suggested change
- alice = result_alice.next()._java_result
- bob = result_bob.next()._java_result
+ alice = result_alice.next().as_java()
+ bob = result_bob.next().as_java()
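
A minimal sketch of the proposed accessor is shown below. `as_java()` does not exist in the PR as quoted; it is the reviewer's suggestion, and `FakeJavaResult` is a stand-in for the wrapped JVM object so that the example runs without a JVM.

```python
class FakeJavaResult:
    """Stand-in for the wrapped Java result object (illustration only)."""

    def __init__(self, properties):
        self._properties = dict(properties)

    def getProperty(self, name):
        return self._properties[name]


class Result:
    """Hypothetical wrapper exposing the Java object through a public method."""

    def __init__(self, java_result):
        self._java_result = java_result  # private: may change between releases

    def get_property(self, name):
        return self._java_result.getProperty(name)

    def as_java(self):
        """Public, documented escape hatch to the underlying Java object."""
        return self._java_result


java_alice = FakeJavaResult({"name": "Alice"})
alice = Result(java_alice)
```

Documentation can then call `alice.as_java()` and the private attribute is free to be renamed later without breaking user code.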

mergify bot commented Oct 23, 2025

🧪 CI Insights

Here's what we observed from your CI run for a46cb72.

🟢 All jobs passed!

But CI Insights is watching 👀

@robfrank robfrank linked an issue Oct 24, 2025 that may be closed by this pull request
@robfrank robfrank added this to the 25.10.1 milestone Oct 24, 2025
@robfrank robfrank added the enhancement New feature or request label Oct 24, 2025

tae898 commented Oct 25, 2025

Previous discussion (outdated)

@robfrank @lvca

The Python bindings are working well overall. The essential unit tests pass, and I'm now adding 9 realistic examples with larger datasets (millions of records) to test performance and scalability. I'm halfway through and plan to complete this by October 31st so it can be included in the next release.

While developing these examples, I've encountered some issues (or possibly mistakes on my part) which I've documented here:

These may be Python-specific. If so, I'll fix them in the bindings.

Repository Structure

I plan to maintain my fork (https://github.com/humemai/arcadedb-embedded-python) as a Python-focused repository. It will stay synchronized with ArcadeDB's main branch but include:

  1. Python-focused README
  2. Dedicated Python documentation (https://humemai.github.io/arcadedb-embedded-python/)
  3. Python-specific GitHub Actions workflows
  4. Python bindings and related files

The Java codebase will remain unchanged and synchronized with upstream.

PyPI Releases

I'll publish Python wheels to PyPI from my fork:

These will follow ArcadeDB's release versions with an optional revision suffix for Python-specific updates (e.g., 25.9.1.3 for the third Python revision of 25.9.1).
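
As a sketch of this numbering scheme, the helper below appends the optional Python revision to an ArcadeDB release version. The function name `python_package_version` and the strict three-part input check are assumptions made for illustration; the PR's own `extract_version.py` is not shown in this thread.

```python
import re


def python_package_version(arcadedb_version, revision=None):
    """Return the wheel version for an ArcadeDB release and optional Python revision."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", arcadedb_version):
        raise ValueError(f"expected X.Y.Z, got {arcadedb_version!r}")
    if revision is None:
        return arcadedb_version  # first Python release of this ArcadeDB version
    if revision < 1:
        raise ValueError("revision must be a positive integer")
    return f"{arcadedb_version}.{revision}"
```

A four-segment release such as `25.9.1.3` is a valid PEP 440 version and sorts after `25.9.1`, so package installers treat each Python revision as an upgrade.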

Contributing Back

Once the Python bindings are stable and well-tested, I'd like to contribute them back to the main ArcadeDB repository via PR, similar to how projects like DuckDB and Apache Arrow maintain their Python bindings in the main repo with dedicated documentation and PyPI releases.

The quality of the Python bindings can be verified by the tests that run via GitHub Actions in your repository (https://github.com/ArcadeData/arcadedb/actions/workflows/test-python-bindings.yml and https://github.com/ArcadeData/arcadedb/actions/workflows/test-python-examples.yml)

@tae898 tae898 force-pushed the python-embedded branch 2 times, most recently from 2cd71e5 to 69f4c54 on October 25, 2025 23:08
@lvca lvca requested a review from robfrank October 29, 2025 05:51
@tae898 tae898 force-pushed the python-embedded branch 2 times, most recently from f302557 to 53fbed6 on November 3, 2025 10:39
Introduce comprehensive Python bindings that enable embedded ArcadeDB usage directly from Python applications, leveraging JPype for seamless JVM integration.

Core Features:
- Embedded database operations with full CRUD support
- Document, vertex, and edge models for graph databases
- Transaction management (read, write, batch operations)
- Server mode with HTTP API support
- Vector search capabilities for AI/ML applications
- Data import from CSV/JSONL with automatic type inference
- Export to GraphML, GraphSON, JSONL, and CSV formats
- Gremlin query language support
- Async execution and batch processing utilities

Development Infrastructure:
- Multi-platform build system (Linux, macOS, Windows on x64/ARM64)
- Native build scripts with JRE bundling
- Docker-based build environment
- Comprehensive test suite with 100+ tests covering:
  * Core database operations
  * Concurrency and transactions
  * Import/export functionality
  * Server patterns and API
  * Type conversions and result handling
- CI/CD workflows for automated testing across all platforms
- Testing for examples 01-03 (verified working)

Examples:
- Simple document store with CRUD operations
- Social network graph modeling and traversal
- Vector similarity search
- CSV import with MovieLens dataset (examples 04-05 included but not CI-tested yet)

Build System:
- Platform-specific wheel generation
- JAR exclusion filtering for minimal distributions
- Version extraction from parent pom.xml
- Setup utilities for streamlined installation

This implementation provides a Pythonic interface to ArcadeDB while maintaining compatibility with the Java API and supporting all major platforms.

tae898 commented Nov 3, 2025

Python Bindings for ArcadeDB

Overview

This PR introduces comprehensive Python bindings for ArcadeDB, enabling embedded database usage directly from Python applications. The implementation leverages JPype for seamless JVM integration, providing a Pythonic interface while maintaining full compatibility with the Java API.

Related Issue

#2662

🎯 Key Highlights

Multi-Platform Support (6 Platforms)

All platforms are supported because the same platform-independent Java bytecode (JARs) runs on every JVM:

  • linux/amd64
  • linux/arm64
  • darwin/amd64 (Intel Mac)
  • darwin/arm64 (Apple Silicon)
  • windows/amd64
  • windows/arm64

Key Innovation: Instead of compiling native extensions for each platform, we ship platform-specific stripped JREs bundled with each wheel. This approach:

  • Eliminates the need for users to install Java
  • Ensures consistent behavior across all platforms
  • Simplifies the build process (no native compilation required)
  • Leverages Java's "write once, run anywhere" philosophy

Current Status

⚠️ Not Production Ready - Currently undergoing comprehensive testing across all platforms.

📦 PyPI Distribution Pending - Wheels are ready but not yet published to PyPI. Waiting for PyPI approval to push wheels larger than 100MB (current wheels include bundled JREs).

🚀 Features

Core Functionality

  • Embedded Database Operations: Full CRUD support for documents, vertices, and edges
  • Transaction Management: Read, write, and batch operations with ACID guarantees
  • Graph Database Support: Native graph modeling with traversal capabilities
  • Vector Search: AI/ML-ready vector similarity search
  • Multiple Query Languages: SQL, Gremlin, and programmatic API
  • Server Mode: HTTP API for remote access

Data Import/Export

  • Import: CSV and JSONL with automatic type inference
  • Export: GraphML, GraphSON, JSONL, and CSV formats
  • Batch Processing: Optimized bulk operations with BatchContext and AsyncExecutor
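
"Automatic type inference" on CSV import can be pictured with the sketch below. It is a simplified illustration using only the standard library, not the importer's actual rules: the order of attempts (integer, float, boolean, string) and the helper names `infer_value` and `read_csv` are assumptions.

```python
import csv
import io


def infer_value(text):
    """Try int, then float, then boolean; otherwise keep the original string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    lowered = text.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    return text


def read_csv(text):
    """Parse CSV text into a list of dicts with inferred Python types."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: infer_value(value) for key, value in row.items()} for row in reader]


rows = read_csv("name,age,score,active\nAlice,30,4.5,true\n")
```

Trying `int` before `float` keeps whole numbers as integers, which matters when the inferred type decides the property type created in the database schema.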

Development Infrastructure

  • Multi-Platform Build System: Native build scripts with JRE bundling
  • Docker Support: Docker-based build environment for Linux
  • Comprehensive Testing: 100+ tests covering core operations, concurrency, and edge cases
  • CI/CD: Automated testing across all 6 platforms via GitHub Actions
  • Examples: Working examples for common use cases (examples 01-03 verified in CI)

📦 Installation (When Available on PyPI)

pip install arcadedb-embedded

Platform-specific wheels will be automatically selected based on your system.

🔧 Build System

The build system generates platform-specific wheels with bundled JREs:

# Build for current platform
cd bindings/python
./build.sh

# Build for specific platform
./build.sh linux/amd64
./build.sh darwin/arm64
./build.sh windows/amd64

JAR Exclusion

Non-essential JARs (e.g., gRPC) are excluded to minimize wheel size, configured via jar_exclusions.txt.
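
A possible shape for this filtering step is sketched below. The format of `jar_exclusions.txt` is not shown in this thread, so the sketch assumes one glob pattern per line with `#` comments; the function names and the JAR file names in the usage lines are likewise made up for illustration.

```python
from fnmatch import fnmatchcase


def parse_exclusions(text):
    """Read one glob pattern per line, ignoring blank lines and # comments (assumed format)."""
    patterns = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            patterns.append(line)
    return patterns


def filter_jars(jar_names, patterns):
    """Keep only the JARs that match none of the exclusion patterns."""
    return [
        jar
        for jar in jar_names
        if not any(fnmatchcase(jar, pattern) for pattern in patterns)
    ]


# Hypothetical file contents and JAR names, for illustration only.
exclusions = parse_exclusions("# drop gRPC to reduce wheel size\ngrpc-*.jar\n\n")
kept = filter_jars(
    ["arcadedb-engine-25.10.1.jar", "grpc-netty-1.60.0.jar"],
    exclusions,
)
```

`fnmatchcase` is used instead of `fnmatch` so that matching is case-sensitive on every operating system, which keeps wheel contents identical across build platforms.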

📊 Testing

Test Coverage

  • ✅ Core database operations
  • ✅ Concurrency and transaction handling
  • ✅ Import/export functionality
  • ✅ Server patterns and HTTP API
  • ✅ Type conversions and result handling
  • ✅ Async execution and batch processing

CI/CD Workflows

  • test-python-bindings.yml: Unit tests across all platforms
  • test-python-examples.yml: Examples 01-03 tested on all platforms

📝 Examples

1. Simple Document Store

Basic CRUD operations with comprehensive data type support.

2. Social Network Graph

Graph modeling with vertices, edges, and traversal queries.

3. Vector Search

Vector embeddings and semantic similarity search for AI/ML applications.

🛣️ Roadmap

  • Complete testing across all platforms
  • PyPI approval for 100MB+ wheels
  • Publish wheels to PyPI
  • Add examples 04-08 to CI testing
  • Performance benchmarking and optimization
  • Expand documentation with more advanced use cases
  • Add mkdocs documentation site

🤝 Technical Details

Architecture

  • JPype Integration: In-process Python-Java interop with direct access to Java objects and no serialization layer
  • Bundled JRE: Platform-specific stripped Java Runtime Environments
  • Type Conversion: Automatic conversion between Python and Java types
  • Result Handling: Pythonic iteration over query results

@robfrank robfrank merged commit a46cb72 into ArcadeData:main Nov 5, 2025
26 of 29 checks passed
@tae898 tae898 deleted the python-embedded branch November 5, 2025 11:22

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

Support for python embedded package

2 participants