Skip to content

Python embedded#2686

Closed
tae898 wants to merge 1 commit intoArcadeData:mainfrom
humemai:python-embedded
Closed

Python embedded#2686
tae898 wants to merge 1 commit intoArcadeData:mainfrom
humemai:python-embedded

Conversation

@tae898
Copy link
Contributor

@tae898 tae898 commented Oct 21, 2025

Python Embedded Bindings for ArcadeDB

What does this PR do?

This PR introduces native Python bindings for ArcadeDB that embed the Java database engine directly in Python processes using JPype. It provides a Pythonic API to ArcadeDB's multi-model database capabilities (Graph, Document, Key/Value, Vector, Time Series) with three distribution options tailored to different use cases.

Key Additions:

  • Complete Python package (arcadedb_embedded) with ~3,200 lines of production code
  • Three distribution variants: headless (~94MB), minimal (~97MB with Studio UI), full (~158MB with Gremlin + GraphQL)
  • Comprehensive test suite: 41 tests across 6 test files, 1,847 lines of test code, 100% passing
  • Full documentation site: 63 markdown files with MkDocs, including API reference, user guides, and examples
  • Automated build system: Docker-based multi-stage builds for all three distributions
  • CI/CD workflows: Automated testing, building, PyPI publishing, and docs deployment via GitHub Actions

Motivation

ArcadeDB is a Java-based multi-model database, but Python is the dominant language in data science, AI/ML, and modern web development. This integration enables:

  1. Embedded database access: Run ArcadeDB directly in Python processes without external servers
  2. Simplified deployment: Self-contained wheels with all JARs bundled (just needs JRE 11+)
  3. AI/ML integration: Native vector storage with HNSW indexing for embeddings
  4. Developer experience: Pythonic API with context managers, type hints, and proper error handling
  5. Multi-model flexibility: Access Graph, Document, Key/Value, Vector, and Time Series models from Python

Why Three Distributions?

  • Headless (production): Core database only, minimal size, no UI dependencies
  • Minimal (development): Adds Studio web UI (~2MB overhead) for visual debugging
  • Full (Gremlin users): Adds Gremlin graph traversal language and GraphQL support

Related Issues

#2662

Architecture Decisions:

  1. Embedded vs Client/Server: Chose embedded mode as primary use case (client mode via HTTP is also supported)
  2. Three packages vs one: Allows users to choose minimal dependencies based on their needs
  3. MkDocs for docs: Material theme provides excellent UX and search functionality

Technical Overview

Package Structure

bindings/python/
├── src/arcadedb_embedded/        # Main package (316 lines core.py + 7 other modules)
│   ├── __init__.py               # Public API exports
│   ├── core.py                   # Database and DatabaseFactory
│   ├── server.py                 # ArcadeDBServer for HTTP mode (225 lines)
│   ├── results.py                # ResultSet and Result wrappers
│   ├── transactions.py           # TransactionContext manager
│   ├── vector.py                 # Vector search and HNSW indexing (142 lines)
│   ├── importer.py               # CSV, JSON, JSONL, Neo4j import (726 lines)
│   ├── exceptions.py             # ArcadeDBError exception
│   └── jvm.py                    # JVM lifecycle management
├── tests/                        # 41 tests across 6 files (1,847 lines total)
│   ├── test_core.py              # 13 tests: CRUD, transactions, queries, graphs, vectors
│   ├── test_server.py            # 6 tests: HTTP API, Studio, configuration
│   ├── test_concurrency.py       # 4 tests: File locking, thread safety, multi-process
│   ├── test_server_patterns.py   # 4 tests: Embedded + HTTP best practices
│   ├── test_importer.py          # 13 tests: CSV, JSON, JSONL, Neo4j import
│   └── test_gremlin.py           # 1 test: Gremlin query language (full only)
├── docs/                         # 63 markdown files (15,000+ lines)
│   ├── getting-started/          # Installation, quickstart, distributions
│   ├── guide/                    # User guides (core, server, vectors, import, graphs)
│   ├── api/                      # API reference for all modules
│   └── development/              # Testing, contributing, architecture, troubleshooting
├── build-all.sh                  # Unified Docker build script for all distributions
├── Dockerfile.build              # Multi-stage Docker build (177 lines)
├── setup_jars.py                 # Copies JARs to package based on distribution (172 lines)
├── extract_version.py            # Extracts version from parent pom.xml (61 lines)
├── write_version.py              # Writes _version.py during build (41 lines)
├── pyproject.toml                # Python package configuration
└── mkdocs.yml                    # Documentation site configuration

Build System

Docker-based multi-stage builds ensure reproducibility:

  1. Stage 1: Build Java components with Maven (all modules)
  2. Stage 2: Build Python wheel with specific JAR subset based on distribution
  3. Stage 3: Run pytest test suite in isolated environment
  4. Stage 4: Export built wheel for distribution

Single command builds all three distributions:

cd bindings/python && ./build-all.sh

CI/CD Workflows

Three GitHub Actions workflows added to .github/workflows/:

  1. test-python-bindings.yml: Runs pytest on every push/PR
  2. release-python-packages.yml: Builds and publishes to PyPI when release tag contains "python"
  3. deploy-python-docs.yml: Builds and deploys MkDocs to GitHub Pages

API Coverage

The bindings provide ~85% coverage of Java API features relevant to Python developers:

Feature Coverage Notes
Database CRUD ✅ 100% create, open, drop, exists
Queries ✅ 100% SQL, Cypher, Gremlin (full), MongoDB syntax
Transactions ✅ 100% Context manager pattern
Schema ✅ 100% Document types, vertex types, edge types
Indexes ✅ 90% LSM, full-text, HNSW vector
Server Mode ✅ 100% HTTP API + Studio UI
Vector Search ✅ 100% HNSW similarity search
Data Import ✅ 100% CSV, JSON, JSONL, Neo4j
Graph API ⚠️ 60% Basic operations (Python-relevant subset)
Gremlin ⚠️ 70% Query execution (full dist only)

Testing

41 tests, 100% passing across all distributions:

  • Headless: 34 passed, 7 skipped (server/Gremlin tests)
  • Minimal: 38 passed, 3 skipped (Gremlin tests)
  • Full: 41 passed, 0 skipped

Test categories:

  • Core operations: Database lifecycle, queries, transactions, schema
  • Server mode: HTTP endpoints, Studio UI, configuration
  • Concurrency: Thread safety, file locking, multi-process isolation
  • Vector search: HNSW indexing, similarity queries, distance metrics
  • Data import: CSV, JSON, JSONL, Neo4j graph import
  • Graph operations: Vertices, edges, traversals
  • Gremlin: Graph query language (full distribution only)

Additional Notes

Documentation

Comprehensive documentation site built with MkDocs (Material theme):

  • Getting Started: Installation guide, 5-minute quickstart, distribution comparison
  • User Guide: Database operations, queries, transactions, vectors, import, graphs, server mode
  • API Reference: Detailed documentation for all 8 modules
  • Development: Testing guide, architecture overview, contributing, troubleshooting
  • Java API Coverage: Comparison table showing what's implemented

Live site: https://humemai.github.io/arcadedb/latest/

Examples

Added examples/basic.py demonstrating:

  • Database creation and cleanup
  • Schema definition
  • Transactions
  • Queries with multiple languages (SQL, Cypher)
  • Graph operations (vertices, edges)
  • Vector search with HNSW
  • Data import from CSV/JSON

Dependencies

Minimal Python dependencies:

  • Required: jpype1>=1.5.0 (JVM integration)
  • Optional: numpy>=1.20.0 (for vector operations)
  • Dev: pytest, pytest-cov, black, isort, mypy

Java dependencies: All bundled in wheel (no external JARs needed)

Installation Requirements

  • Python 3.8 - 3.12
  • Java Runtime Environment (JRE)
  • That's it! Everything else is bundled.

Backward Compatibility

This PR adds a new bindings/python/ directory with no changes to existing Java code or other bindings. It's completely isolated and won't affect existing functionality.

Performance Considerations

  • Direct JVM integration: JPype provides near-native performance
  • No serialization overhead: Direct Java object access in Python
  • Transaction batching: Pythonic context managers ensure proper transaction handling
  • Lazy result iteration: ResultSet provides memory-efficient iteration over large result sets

Known Limitations

  1. Java required: Cannot run without JRE installed
  2. Single process: File-based locking prevents multiple processes accessing same database file (use server mode for multi-process)
  3. JVM startup time: First database operation incurs ~1-2 second JVM initialization
  4. Memory: JVM requires additional memory overhead (~100-200MB base)

Checklist

  • I have run the build using mvn clean package command
    • ✅ All Java modules build successfully
    • ✅ Docker-based Python build tested for all three distributions
  • My unit tests cover both failure and success scenarios
    • ✅ 41 tests covering happy path and error cases
    • ✅ Transaction rollback on errors
    • ✅ File locking edge cases
    • ✅ Invalid query handling
    • ✅ Missing JAR error handling
    • ✅ Concurrency edge cases

Additional Testing Completed

  • Distribution builds: All three distributions build cleanly via Docker
  • Test coverage: 100% test pass rate across all distributions
  • Documentation: Full docs build without errors via MkDocs
  • Example code: examples/basic.py runs successfully
  • CI workflows: GitHub Actions workflows validated (test, release, docs deploy)
  • Package metadata: PyPI metadata complete (classifiers, keywords, URLs)

@mergify
Copy link
Contributor

mergify bot commented Oct 21, 2025

🧪 CI Insights

Here's what we observed from your CI run for 408a499.

🟢 All jobs passed!

But CI Insights is watching 👀

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @tae898, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers native Python bindings for ArcadeDB, bridging the gap between the Java-based multi-model database and the Python ecosystem. The new arcadedb_embedded package allows developers to seamlessly embed ArcadeDB within their Python applications, offering a high-performance, Pythonic API for various data models including graph, document, and vector. This significantly simplifies deployment for data science, AI/ML, and web development, supported by comprehensive documentation, extensive testing, and automated build/CI processes.

Highlights

  • New Python Bindings: Introduced arcadedb_embedded, a complete Python package with ~3,200 lines of production code, enabling native Python access to ArcadeDB.
  • Flexible Distribution Options: Offers three variants (headless, minimal, full) tailored for different use cases, from lightweight production to full-featured development with Gremlin and GraphQL.
  • Comprehensive Tooling: Includes a robust test suite (41 tests, 100% passing), extensive MkDocs-based documentation (63 files), and automated Docker-based build and CI/CD workflows.
  • Embedded & AI/ML Ready: Facilitates running ArcadeDB directly within Python processes, ideal for AI/ML integration with native vector storage and HNSW indexing.
  • Pythonic API: Provides a Pythonic interface with context managers, type hints, and proper error handling for ArcadeDB's multi-model capabilities (Graph, Document, Key/Value, Vector, Time Series).
Ignored Files
  • Ignored by pattern: .github/workflows/** (3)
    • .github/workflows/deploy-python-docs.yml
    • .github/workflows/release-python-packages.yml
    • .github/workflows/test-python-bindings.yml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is an impressive and substantial pull request that introduces comprehensive Python bindings for ArcadeDB. The structure of the new bindings/python directory is well-organized, including production code, a thorough test suite, an extensive documentation site, and a robust Docker-based build system. The separation into three distributions (headless, minimal, full) is a thoughtful approach to cater to different user needs. The overall quality of the code, tests, and documentation is very high. I've identified a few medium-severity issues related to maintainability and correctness in the build scripts and documentation that would further improve the quality of this contribution.

Comment on lines +4 to +5
# Python bindings - generated version file
bindings/python/src/arcadedb_embedded/_version.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The path bindings/python/src/arcadedb_embedded/_version.py appears to be redundant and incorrect. Since this .gitignore file is located in bindings/python/, paths are relative to this directory. The specified path would incorrectly resolve to bindings/python/bindings/python/src/arcadedb_embedded/_version.py from the repository root. The entry on line 2, src/arcadedb_embedded/_version.py, is sufficient and correct.


# Extract version and copy JARs
RUN echo "📌 Building distribution: ${DISTRIBUTION}" && \
export ARCADEDB_VERSION=$(python3 extract_version.py) && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The call to extract_version.py here is likely incorrect. It's called without arguments, so the script will try to find pom.xml at ../../pom.xml relative to its own location. Inside the container, the script is at /build/extract_version.py, so it will look for /pom.xml, which doesn't exist. This will cause the ARCADEDB_VERSION variable to be empty, and the log message on the next line will be misleading. The correct path to pom.xml is /arcadedb/pom.xml, as used in a later RUN command. Please provide the correct path here to ensure the build log is accurate.

    export ARCADEDB_VERSION=$(python3 extract_version.py /arcadedb/pom.xml) && \

Comment on lines +142 to +171
RUN echo '#!/usr/bin/env python3\n\
import arcadedb_embedded as arcadedb\n\
import tempfile\n\
import shutil\n\
import os\n\
\n\
print("🎮 Testing ArcadeDB Python bindings...")\n\
print(f"📦 Version: {arcadedb.__version__}")\n\
\n\
temp_dir = tempfile.mkdtemp()\n\
db_path = os.path.join(temp_dir, "test_db")\n\
\n\
try:\n\
with arcadedb.create_database(db_path) as db:\n\
print("✅ Database created")\n\
\n\
with db.transaction():\n\
db.command("sql", "CREATE DOCUMENT TYPE TestDoc")\n\
db.command("sql", "INSERT INTO TestDoc SET name = '\''docker_test'\'', value = 123")\n\
print("✅ Transaction committed")\n\
\n\
result = db.query("sql", "SELECT FROM TestDoc")\n\
for record in result:\n\
print(f"✅ Query result: {record.get_property('\''name'\'')} = {record.get_property('\''value'\'')}")\n\
\n\
print("🎉 All tests passed!")\n\
finally:\n\
if os.path.exists(temp_dir):\n\
shutil.rmtree(temp_dir)\n\
' > /test/test_install.py && chmod +x /test/test_install.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Embedding a multi-line Python script directly into the Dockerfile using echo makes it difficult to read, maintain, and lint. For better maintainability, I recommend moving this script into a separate file, such as tests/smoke_test.py, and using a COPY instruction to add it to the Docker image. This will make the test script much easier to manage.

@lvca
Copy link
Contributor

lvca commented Oct 21, 2025

@tae898 really impressive contribution! A few questions:

@tae898
Copy link
Contributor Author

tae898 commented Oct 21, 2025

@tae898 really impressive contribution! A few questions:

Thanks!

1. This is not really a driver but bindings

The python wheel includes the JARs and can talk to the DB directly via Java APIs, instead of HTTP. It's not a simple client. I didn't base my code on other work, although I got a lot of help from VS Copilot.

2. Java API, HTTP, and gRPC are all possible.

From the PR (https://github.com/ArcadeData/arcadedb/blob/d7c9590f88181cfcaca51b5a8aa4d63b4404a76f/bindings/python/tests/test_server_patterns.py) you can see that I test managing the DB via both direct Java API calls and HTTP REST APIs, where the former is much faster than the latter, as you can expect. I think the main usage of this python bindings is for direct Java API calls, not HTTP REST APIs, but since ArcadeDB's Java supports both, I also made it that it supports both.

Once the server is up and running with

server = arcadedb.create_server(
    root_path=root_path,
    root_password="test12345"
)
server.start()

The web UI (http://localhost:2480/) is open and you can do all the UI stuff there.

I see that the gRPC support is also added here. I try to test this but I couldn't get it working, cuz AFAIK, the gRPC client is not implemented yet and its JAR doesn't exist in the docker image yet, right? Once it's up there, I can also test it, although I still think the main usage of this python bindings is direct Java API calls.

Remarks

I told @robfrank that I'll do a bit more tests, e.g., more data, import CSVs, JSONs, etc. I'll let both of you know when I am more comfortable with this PR. After the python bindings is done, I'll move on to Arcade-RAG

@tae898 tae898 force-pushed the python-embedded branch 2 times, most recently from 84e4c6d to ffbeaf7 Compare October 21, 2025 21:45
@tae898
Copy link
Contributor Author

tae898 commented Oct 21, 2025

I'll squash my commits into one commit from now on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't really want to touch the files that are not necessarily related to my python-embedded PR, but this was kinda necessary to format the files properly.

@lvca lvca requested a review from robfrank October 21, 2025 22:25
@lvca lvca added this to the 25.11.1 milestone Oct 21, 2025
@lvca lvca added the enhancement New feature or request label Oct 21, 2025
@tae898 tae898 closed this Oct 23, 2025
@tae898 tae898 deleted the python-embedded branch October 23, 2025 11:40
@tae898 tae898 mentioned this pull request Oct 23, 2025
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants