Python embedded by tae898 · Pull Request #2686 · ArcadeData/arcadedb

tae898 · 2025-10-21T12:07:29Z

Python Embedded Bindings for ArcadeDB

What does this PR do?

This PR introduces native Python bindings for ArcadeDB that embed the Java database engine directly in Python processes using JPype. It provides a Pythonic API to ArcadeDB's multi-model database capabilities (Graph, Document, Key/Value, Vector, Time Series) with three distribution options tailored to different use cases.

Key Additions:

Complete Python package (arcadedb_embedded) with ~3,200 lines of production code
Three distribution variants: headless (~94MB), minimal (~97MB with Studio UI), full (~158MB with Gremlin + GraphQL)
Comprehensive test suite: 41 tests across 6 test files, 1,847 lines of test code, 100% passing
Full documentation site: 63 markdown files with MkDocs, including API reference, user guides, and examples
Automated build system: Docker-based multi-stage builds for all three distributions
CI/CD workflows: Automated testing, building, PyPI publishing, and docs deployment via GitHub Actions

Motivation

ArcadeDB is a Java-based multi-model database, but Python is the dominant language in data science, AI/ML, and modern web development. This integration enables:

Embedded database access: Run ArcadeDB directly in Python processes without external servers
Simplified deployment: Self-contained wheels with all JARs bundled (just needs JRE 11+)
AI/ML integration: Native vector storage with HNSW indexing for embeddings
Developer experience: Pythonic API with context managers, type hints, and proper error handling
Multi-model flexibility: Access Graph, Document, Key/Value, Vector, and Time Series models from Python

Why Three Distributions?

Headless (production): Core database only, minimal size, no UI dependencies
Minimal (development): Adds Studio web UI (~2MB overhead) for visual debugging
Full (Gremlin users): Adds Gremlin graph traversal language and GraphQL support

Related Issues

#2662

Architecture Decisions:

Embedded vs Client/Server: Chose embedded mode as primary use case (client mode via HTTP is also supported)
Three packages vs one: Allows users to choose minimal dependencies based on their needs
MkDocs for docs: Material theme provides excellent UX and search functionality

Technical Overview

Package Structure

bindings/python/
├── src/arcadedb_embedded/        # Main package (316 lines core.py + 7 other modules)
│   ├── __init__.py               # Public API exports
│   ├── core.py                   # Database and DatabaseFactory
│   ├── server.py                 # ArcadeDBServer for HTTP mode (225 lines)
│   ├── results.py                # ResultSet and Result wrappers
│   ├── transactions.py           # TransactionContext manager
│   ├── vector.py                 # Vector search and HNSW indexing (142 lines)
│   ├── importer.py               # CSV, JSON, JSONL, Neo4j import (726 lines)
│   ├── exceptions.py             # ArcadeDBError exception
│   └── jvm.py                    # JVM lifecycle management
├── tests/                        # 41 tests across 6 files (1,847 lines total)
│   ├── test_core.py              # 13 tests: CRUD, transactions, queries, graphs, vectors
│   ├── test_server.py            # 6 tests: HTTP API, Studio, configuration
│   ├── test_concurrency.py       # 4 tests: File locking, thread safety, multi-process
│   ├── test_server_patterns.py   # 4 tests: Embedded + HTTP best practices
│   ├── test_importer.py          # 13 tests: CSV, JSON, JSONL, Neo4j import
│   └── test_gremlin.py           # 1 test: Gremlin query language (full only)
├── docs/                         # 63 markdown files (15,000+ lines)
│   ├── getting-started/          # Installation, quickstart, distributions
│   ├── guide/                    # User guides (core, server, vectors, import, graphs)
│   ├── api/                      # API reference for all modules
│   └── development/              # Testing, contributing, architecture, troubleshooting
├── build-all.sh                  # Unified Docker build script for all distributions
├── Dockerfile.build              # Multi-stage Docker build (177 lines)
├── setup_jars.py                 # Copies JARs to package based on distribution (172 lines)
├── extract_version.py            # Extracts version from parent pom.xml (61 lines)
├── write_version.py              # Writes _version.py during build (41 lines)
├── pyproject.toml                # Python package configuration
└── mkdocs.yml                    # Documentation site configuration

Build System

Docker-based multi-stage builds ensure reproducibility:

Stage 1: Build Java components with Maven (all modules)
Stage 2: Build Python wheel with specific JAR subset based on distribution
Stage 3: Run pytest test suite in isolated environment
Stage 4: Export built wheel for distribution
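
Stage 2's per-distribution JAR selection might look roughly like the sketch below. It is not the PR's setup_jars.py: the function name `select_jars` and the exclusion patterns are assumptions, chosen only to mirror the three distributions described earlier (headless drops Studio, Gremlin, and GraphQL; minimal drops Gremlin and GraphQL; full keeps everything).

```python
from fnmatch import fnmatch

# Assumed JAR name patterns excluded from each distribution (illustrative only).
EXCLUDED_PATTERNS = {
    "headless": ["arcadedb-studio-*.jar", "arcadedb-gremlin-*.jar", "arcadedb-graphql-*.jar"],
    "minimal": ["arcadedb-gremlin-*.jar", "arcadedb-graphql-*.jar"],
    "full": [],
}


def select_jars(jar_names, distribution):
    """Return the JAR file names that belong in the given distribution."""
    try:
        excluded = EXCLUDED_PATTERNS[distribution]
    except KeyError:
        raise ValueError(f"unknown distribution: {distribution!r}") from None
    return [
        name
        for name in jar_names
        if not any(fnmatch(name, pattern) for pattern in excluded)
    ]
```

A real build would probably also have to drop the transitive dependencies of the excluded modules, which this sketch ignores.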

Single command builds all three distributions:

cd bindings/python && ./build-all.sh

CI/CD Workflows

Three GitHub Actions workflows added to .github/workflows/:

test-python-bindings.yml: Runs pytest on every push/PR
release-python-packages.yml: Builds and publishes to PyPI when release tag contains "python"
deploy-python-docs.yml: Builds and deploys MkDocs to GitHub Pages

API Coverage

The bindings provide ~85% coverage of Java API features relevant to Python developers:

Feature	Coverage	Notes
Database CRUD	✅ 100%	create, open, drop, exists
Queries	✅ 100%	SQL, Cypher, Gremlin (full), MongoDB syntax
Transactions	✅ 100%	Context manager pattern
Schema	✅ 100%	Document types, vertex types, edge types
Indexes	✅ 90%	LSM, full-text, HNSW vector
Server Mode	✅ 100%	HTTP API + Studio UI
Vector Search	✅ 100%	HNSW similarity search
Data Import	✅ 100%	CSV, JSON, JSONL, Neo4j
Graph API	⚠️ 60%	Basic operations (Python-relevant subset)
Gremlin	⚠️ 70%	Query execution (full dist only)
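
The context manager pattern listed for transactions can be sketched as follows. This is an illustration, not the PR's transactions.py: `transaction` and `FakeDatabase` are hypothetical names, and the only assumption about the wrapped object is that it exposes begin(), commit(), and rollback() methods.

```python
from contextlib import contextmanager


@contextmanager
def transaction(db):
    """Commit if the block succeeds; roll back and re-raise if it fails."""
    db.begin()
    try:
        yield db
    except BaseException:
        db.rollback()
        raise
    else:
        db.commit()


class FakeDatabase:
    """Stand-in that records calls so the sketch runs without a JVM."""

    def __init__(self):
        self.calls = []

    def begin(self):
        self.calls.append("begin")

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")
```

With this shape, `with transaction(db): ...` leaves no transaction open whether the body returns normally or raises.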

Testing

41 tests, 100% passing across all distributions:

✅ Headless: 34 passed, 7 skipped (server/Gremlin tests)
✅ Minimal: 38 passed, 3 skipped (Gremlin tests)
✅ Full: 41 passed, 0 skipped

Test categories:

Core operations: Database lifecycle, queries, transactions, schema
Server mode: HTTP endpoints, Studio UI, configuration
Concurrency: Thread safety, file locking, multi-process isolation
Vector search: HNSW indexing, similarity queries, distance metrics
Data import: CSV, JSON, JSONL, Neo4j graph import
Graph operations: Vertices, edges, traversals
Gremlin: Graph query language (full distribution only)

Additional Notes

Documentation

Comprehensive documentation site built with MkDocs (Material theme):

Getting Started: Installation guide, 5-minute quickstart, distribution comparison
User Guide: Database operations, queries, transactions, vectors, import, graphs, server mode
API Reference: Detailed documentation for all 8 modules
Development: Testing guide, architecture overview, contributing, troubleshooting
Java API Coverage: Comparison table showing what's implemented

Live site: https://humemai.github.io/arcadedb/latest/

Examples

Added examples/basic.py demonstrating:

Database creation and cleanup
Schema definition
Transactions
Queries with multiple languages (SQL, Cypher)
Graph operations (vertices, edges)
Vector search with HNSW
Data import from CSV/JSON

Dependencies

Minimal Python dependencies:

Required: jpype1>=1.5.0 (JVM integration)
Optional: numpy>=1.20.0 (for vector operations)
Dev: pytest, pytest-cov, black, isort, mypy

Java dependencies: All bundled in wheel (no external JARs needed)

Installation Requirements

Python 3.8 - 3.12
Java Runtime Environment (JRE)
That's it! Everything else is bundled.

Backward Compatibility

This PR adds a new bindings/python/ directory with no changes to existing Java code or other bindings. It's completely isolated and won't affect existing functionality.

Performance Considerations

Direct JVM integration: JPype provides near-native performance
No serialization overhead: Direct Java object access in Python
Transaction batching: Pythonic context managers ensure proper transaction handling
Lazy result iteration: ResultSet provides memory-efficient iteration over large result sets
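
Lazy iteration over a Java result set can be sketched as a thin iterator wrapper. This shows the general shape only, not the PR's results.py: `LazyResultSet` and `FakeJavaResultSet` are hypothetical, and the wrapped object is only assumed to offer Java-iterator-style hasNext() and next() methods.

```python
class LazyResultSet:
    """Iterate a Java-style cursor one row at a time instead of loading all rows."""

    def __init__(self, java_result_set):
        self._rs = java_result_set

    def __iter__(self):
        return self

    def __next__(self):
        if not self._rs.hasNext():
            raise StopIteration
        return self._rs.next()


class FakeJavaResultSet:
    """Stand-in cursor that counts how many rows have actually been fetched."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.fetched = 0

    def hasNext(self):
        return self.fetched < len(self._rows)

    def next(self):
        row = self._rows[self.fetched]
        self.fetched += 1
        return row
```

Because rows are pulled only when the caller asks for the next one, a loop that stops early never fetches the remainder.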

Known Limitations

Java required: Cannot run without JRE installed
Single process: File-based locking prevents multiple processes from accessing the same database file (use server mode for multi-process access)
JVM startup time: First database operation incurs ~1-2 second JVM initialization
Memory: JVM requires additional memory overhead (~100-200MB base)

Checklist

I have run the build using mvn clean package command
- ✅ All Java modules build successfully
- ✅ Docker-based Python build tested for all three distributions
My unit tests cover both failure and success scenarios
- ✅ 41 tests covering happy path and error cases
- ✅ Transaction rollback on errors
- ✅ File locking edge cases
- ✅ Invalid query handling
- ✅ Missing JAR error handling
- ✅ Concurrency edge cases

Additional Testing Completed

Distribution builds: All three distributions build cleanly via Docker
Test coverage: 100% test pass rate across all distributions
Documentation: Full docs build without errors via MkDocs
Example code: examples/basic.py runs successfully
CI workflows: GitHub Actions workflows validated (test, release, docs deploy)
Package metadata: PyPI metadata complete (classifiers, keywords, URLs)

mergify · 2025-10-21T12:08:25Z

🧪 CI Insights

Here's what we observed from your CI run for 408a499.

🟢 All jobs passed!

But CI Insights is watching 👀

gemini-code-assist · 2025-10-21T12:08:39Z

Summary of Changes

Hello @tae898, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers native Python bindings for ArcadeDB, bridging the gap between the Java-based multi-model database and the Python ecosystem. The new arcadedb_embedded package allows developers to seamlessly embed ArcadeDB within their Python applications, offering a high-performance, Pythonic API for various data models including graph, document, and vector. This significantly simplifies deployment for data science, AI/ML, and web development, supported by comprehensive documentation, extensive testing, and automated build/CI processes.

Highlights

New Python Bindings: Introduced arcadedb_embedded, a complete Python package with ~3,200 lines of production code, enabling native Python access to ArcadeDB.
Flexible Distribution Options: Offers three variants (headless, minimal, full) tailored for different use cases, from lightweight production to full-featured development with Gremlin and GraphQL.
Comprehensive Tooling: Includes a robust test suite (41 tests, 100% passing), extensive MkDocs-based documentation (63 files), and automated Docker-based build and CI/CD workflows.
Embedded & AI/ML Ready: Facilitates running ArcadeDB directly within Python processes, ideal for AI/ML integration with native vector storage and HNSW indexing.
Pythonic API: Provides a Pythonic interface with context managers, type hints, and proper error handling for ArcadeDB's multi-model capabilities (Graph, Document, Key/Value, Vector, Time Series).

Ignored Files

Ignored by pattern: .github/workflows/** (3)
- .github/workflows/deploy-python-docs.yml
- .github/workflows/release-python-packages.yml
- .github/workflows/test-python-bindings.yml

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This is an impressive and substantial pull request that introduces comprehensive Python bindings for ArcadeDB. The structure of the new bindings/python directory is well-organized, including production code, a thorough test suite, an extensive documentation site, and a robust Docker-based build system. The separation into three distributions (headless, minimal, full) is a thoughtful approach to cater to different user needs. The overall quality of the code, tests, and documentation is very high. I've identified a few medium-severity issues related to maintainability and correctness in the build scripts and documentation that would further improve the quality of this contribution.

gemini-code-assist · 2025-10-21T12:12:19Z

bindings/python/.gitignore

+# Python bindings - generated version file
+bindings/python/src/arcadedb_embedded/_version.py


The path bindings/python/src/arcadedb_embedded/_version.py appears to be redundant and incorrect. Since this .gitignore file is located in bindings/python/, paths are relative to this directory. The specified path would incorrectly resolve to bindings/python/bindings/python/src/arcadedb_embedded/_version.py from the repository root. The entry on line 2, src/arcadedb_embedded/_version.py, is sufficient and correct.
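
The resolution described in this comment can be checked with a short sketch. `effective_path` is a hypothetical helper; it models only the Git rule that a pattern containing a slash is matched relative to the directory that holds the .gitignore file.

```python
from pathlib import PurePosixPath


def effective_path(gitignore_dir, pattern):
    """Path, relative to the repository root, that a slash-containing pattern targets."""
    return str(PurePosixPath(gitignore_dir) / pattern.lstrip("/"))


GITIGNORE_DIR = "bindings/python"

# The redundant entry points one directory level too deep.
print(effective_path(GITIGNORE_DIR, "bindings/python/src/arcadedb_embedded/_version.py"))
# → bindings/python/bindings/python/src/arcadedb_embedded/_version.py

# The shorter entry targets the generated file.
print(effective_path(GITIGNORE_DIR, "src/arcadedb_embedded/_version.py"))
# → bindings/python/src/arcadedb_embedded/_version.py
```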

gemini-code-assist · 2025-10-21T12:12:19Z

bindings/python/Dockerfile.build

+
+# Extract version and copy JARs
+RUN echo "📌 Building distribution: ${DISTRIBUTION}" && \
+    export ARCADEDB_VERSION=$(python3 extract_version.py) && \


The call to extract_version.py here is likely incorrect. It's called without arguments, so the script will try to find pom.xml at ../../pom.xml relative to its own location. Inside the container, the script is at /build/extract_version.py, so it will look for /pom.xml, which doesn't exist. This will cause the ARCADEDB_VERSION variable to be empty, and the log message on the next line will be misleading. The correct path to pom.xml is /arcadedb/pom.xml, as used in a later RUN command. Please provide the correct path here to ensure the build log is accurate.

export ARCADEDB_VERSION=$(python3 extract_version.py /arcadedb/pom.xml) && \
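
A version-extraction script that accepts an explicit pom.xml path and otherwise falls back to a location relative to the script might look like the sketch below. It is not the PR's extract_version.py: the function names, the fallback to the parent POM's version, and the non-zero exit on a missing file are assumptions; only the ../../pom.xml default comes from the comment above.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

# Standard Maven POM namespace.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def default_pom_path(script_path):
    """Resolve ../../pom.xml relative to the directory containing the script."""
    script_dir = Path(script_path).parent
    return script_dir.parent.parent / "pom.xml"


def read_version(pom_path):
    """Return the project's <version>, or the parent's if the project omits it."""
    root = ET.parse(pom_path).getroot()
    version = root.findtext("m:version", namespaces=POM_NS)
    if version is None:
        version = root.findtext("m:parent/m:version", namespaces=POM_NS)
    if version is None:
        raise ValueError(f"no <version> element found in {pom_path}")
    return version.strip()


def main(argv):
    """Print the version; return a process exit code."""
    if len(argv) > 1:
        pom_path = Path(argv[1])
    else:
        pom_path = default_pom_path(Path(argv[0]).resolve())
    if not pom_path.is_file():
        # Fail loudly so a shell $(...) capture cannot silently end up empty.
        print(f"pom.xml not found at {pom_path}", file=sys.stderr)
        return 1
    print(read_version(pom_path))
    return 0
```

Saved as a script, this would end with `sys.exit(main(sys.argv))` under a `__main__` guard. For a script at /build/extract_version.py, `default_pom_path` yields /pom.xml, which is the mismatch the comment describes.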

gemini-code-assist · 2025-10-21T12:12:20Z

bindings/python/Dockerfile.build

+RUN echo '#!/usr/bin/env python3\n\
+import arcadedb_embedded as arcadedb\n\
+import tempfile\n\
+import shutil\n\
+import os\n\
+\n\
+print("🎮 Testing ArcadeDB Python bindings...")\n\
+print(f"📦 Version: {arcadedb.__version__}")\n\
+\n\
+temp_dir = tempfile.mkdtemp()\n\
+db_path = os.path.join(temp_dir, "test_db")\n\
+\n\
+try:\n\
+    with arcadedb.create_database(db_path) as db:\n\
+        print("✅ Database created")\n\
+        \n\
+        with db.transaction():\n\
+            db.command("sql", "CREATE DOCUMENT TYPE TestDoc")\n\
+            db.command("sql", "INSERT INTO TestDoc SET name = '\''docker_test'\'', value = 123")\n\
+        print("✅ Transaction committed")\n\
+        \n\
+        result = db.query("sql", "SELECT FROM TestDoc")\n\
+        for record in result:\n\
+            print(f"✅ Query result: {record.get_property('\''name'\'')} = {record.get_property('\''value'\'')}")\n\
+    \n\
+    print("🎉 All tests passed!")\n\
+finally:\n\
+    if os.path.exists(temp_dir):\n\
+        shutil.rmtree(temp_dir)\n\
+' > /test/test_install.py && chmod +x /test/test_install.py


Embedding a multi-line Python script directly into the Dockerfile using echo makes it difficult to read, maintain, and lint. For better maintainability, I recommend moving this script into a separate file, such as tests/smoke_test.py, and using a COPY instruction to add it to the Docker image. This will make the test script much easier to manage.
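
As a rough illustration of that recommendation, the inline script could become a standalone file along these lines. The file name tests/smoke_test.py comes from the comment above; splitting the logic into `run_smoke_test` and `main` is an assumption made here so the function can be exercised without the wheel installed. The arcadedb calls are the ones already used in the Dockerfile snippet.

```python
"""Sketch of a standalone tests/smoke_test.py replacing the inline Dockerfile script."""
import os
import shutil
import tempfile


def run_smoke_test(arcadedb):
    """Create a throwaway database, insert one document, and read it back.

    `arcadedb` is the imported arcadedb_embedded module; passing it in keeps
    this function importable on machines without the wheel.
    """
    temp_dir = tempfile.mkdtemp()
    db_path = os.path.join(temp_dir, "test_db")
    rows = []
    try:
        with arcadedb.create_database(db_path) as db:
            with db.transaction():
                db.command("sql", "CREATE DOCUMENT TYPE TestDoc")
                db.command("sql", "INSERT INTO TestDoc SET name = 'docker_test', value = 123")
            for record in db.query("sql", "SELECT FROM TestDoc"):
                rows.append((record.get_property("name"), record.get_property("value")))
    finally:
        shutil.rmtree(temp_dir, ignore_errors=True)
    return rows


def main():
    # Imported lazily: needs the built wheel and a JRE.
    import arcadedb_embedded as arcadedb

    print(f"Version: {arcadedb.__version__}")
    rows = run_smoke_test(arcadedb)
    if not rows:
        raise SystemExit("Smoke test failed: query returned no rows")
    print("All tests passed!")
```

The Dockerfile would then need only a COPY of this file and a RUN step that calls `main()`, and the shell quoting (`'\''`) disappears.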

bindings/python/docs/guide/server.md

lvca · 2025-10-21T18:01:17Z

@tae898 really impressive contribution! A few questions:

Is this a new Python driver, or is it based on the existing one (https://github.com/stevereiner/arcadedb-python)?
Can Python users use the same API to access a remote (non-embedded) instance? How? In that case, does it use HTTP calls or gRPC?

tae898 · 2025-10-21T20:50:31Z

> @tae898 really impressive contribution! A few questions:
>
> is this a new python driver or is it based on the existing one (https://github.com/stevereiner/arcadedb-python)?
>
> can Python users use the same API to access to a remote (non embedded) instance? How? In this case is it using HTTP calls or gRPC?

Thanks!

1. This is not really a driver but bindings

The Python wheel includes the JARs and talks to the DB directly via the Java API instead of HTTP, so it's not a simple client. I didn't base my code on other work, although I got a lot of help from VS Copilot.

2. Java API, HTTP, and gRPC are all possible.

From the PR (https://github.com/ArcadeData/arcadedb/blob/d7c9590f88181cfcaca51b5a8aa4d63b4404a76f/bindings/python/tests/test_server_patterns.py) you can see that I test managing the DB via both direct Java API calls and the HTTP REST API; the former is much faster than the latter, as you would expect. I think the main use of these Python bindings is direct Java API calls, not the HTTP REST API, but since ArcadeDB's Java side supports both, I made the bindings support both as well.

Once the server is up and running with

server = arcadedb.create_server(
    root_path=root_path,
    root_password="test12345"
)
server.start()

the web UI (http://localhost:2480/) is available and you can do all the UI stuff there.
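
For the remote (non-embedded) case, a client could reach a server started this way over plain HTTP. The sketch below builds such a request with the standard library only. It assumes ArcadeDB's `/api/v1/query/{database}` endpoint with HTTP Basic authentication; the database name `mydb` and the helper name `build_query_request` are hypothetical, and the password is the one from the snippet above.

```python
import base64
import json
import urllib.request


def build_query_request(database, sql, user="root", password="test12345",
                        base_url="http://localhost:2480"):
    """Build (but do not send) a POST request that runs a read-only SQL query."""
    body = json.dumps({"language": "sql", "command": sql}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/query/{database}",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


request = build_query_request("mydb", "SELECT FROM TestDoc")
# Sending it requires a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Each such call goes through JSON encoding and an HTTP round trip, which the direct Java API path avoids.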

I see that gRPC support is also added here. I tried to test it but couldn't get it working because, AFAIK, the gRPC client is not implemented yet and its JAR doesn't exist in the Docker image yet, right? Once it's there, I can test it too, although I still think the main use of these Python bindings is direct Java API calls.

Remarks

I told @robfrank that I'll do a bit more testing, e.g., with more data, importing CSVs, JSONs, etc. I'll let both of you know when I am more comfortable with this PR. After the Python bindings are done, I'll move on to Arcade-RAG.

tae898 · 2025-10-21T22:01:39Z

I'll squash my commits into one commit from now on.

tae898 · 2025-10-21T22:03:05Z

.pre-commit-config.yaml

I didn't really want to touch the files that are not necessarily related to my python-embedded PR, but this was kinda necessary to format the files properly.

tae898 mentioned this pull request Oct 21, 2025

Support for python embedded package #2662

Closed

gemini-code-assist bot reviewed Oct 21, 2025

tae898 force-pushed the python-embedded branch 2 times, most recently from 84e4c6d to ffbeaf7 · October 21, 2025 21:45

Add comprehensive tests for ArcadeDB Python bindings

408a499

tae898 force-pushed the python-embedded branch from 95bc751 to 408a499 · October 21, 2025 22:00

lvca assigned tae898 Oct 21, 2025

lvca requested a review from robfrank October 21, 2025 22:25

lvca added this to the 25.11.1 milestone Oct 21, 2025

lvca added the enhancement New feature or request label Oct 21, 2025

tae898 closed this Oct 23, 2025

tae898 deleted the python-embedded branch October 23, 2025 11:40

tae898 mentioned this pull request Oct 23, 2025

Add Python bindings for ArcadeDB #2696

Merged

8 tasks

Python embedded #2686
tae898 wants to merge 1 commit into ArcadeData:main from humemai:python-embedded
