Add Python bindings for ArcadeDB #2696
Previous discussion (outdated):
I'm sorry to keep removing and creating PRs. I swear I won't do it again. It's just that I'm trying to find the best way to do this. The last PR (#2686) got broken somehow, so I opened this one to replace it.
Summary of Changes

Hello @tae898, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a significant new feature: native Python bindings for ArcadeDB. It allows Python developers to seamlessly integrate and interact with the powerful Java-based multi-model database directly within their Python applications. The implementation provides a Pythonic interface to ArcadeDB's diverse data models, ensuring high performance through direct JVM integration. This initiative aims to enhance the developer experience for Python users, particularly in data science and AI/ML fields, by offering flexible deployment options and robust tooling for building and managing applications.
Code Review
This is a massive and impressive pull request that introduces comprehensive Python bindings for ArcadeDB. The work is well-structured, including a robust Docker-based build system, an extensive test suite, and exceptionally detailed documentation. My review focuses on improving the maintainability of the build scripts and correcting some inconsistencies and potential issues within the documentation to ensure clarity and correctness for future users and contributors.
    - `Database` instances are **NOT thread-safe**
    - Each thread needs its own `Database` instance
    - Transactions are thread-local
The documentation incorrectly states that Database instances are not thread-safe. This contradicts the implementation shown in tests/test_concurrency.py::test_thread_safe_operations, where a single Database instance is successfully and safely shared across multiple threads. The underlying Java engine is thread-safe, and this is a key feature. The documentation should be corrected to reflect that Database instances are indeed thread-safe and can be shared across threads within the same process.
Suggested change, from:
    - `Database` instances are **NOT thread-safe**
    - Each thread needs its own `Database` instance
    - Transactions are thread-local
to:
    - `Database` instances are **thread-safe**
    - A single `Database` instance can be shared across multiple threads
    - Transactions are thread-local
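The combination the suggestion describes, one shared handle with transactions that are local to each thread, can be illustrated with a small conceptual model. This is only a sketch using the standard library: `SharedStore` and its methods are hypothetical stand-ins, not the ArcadeDB bindings or their actual implementation.

```python
import threading


class SharedStore:
    """Hypothetical stand-in for a database handle shared across threads."""

    def __init__(self):
        self._lock = threading.Lock()    # guards the committed data
        self._committed = []             # visible to every thread
        self._local = threading.local()  # per-thread transaction buffer

    def begin(self):
        # Each thread gets its own pending list, invisible to other threads.
        self._local.pending = []

    def insert(self, row):
        self._local.pending.append(row)

    def commit(self):
        # Publish this thread's pending rows under the shared lock.
        with self._lock:
            self._committed.extend(self._local.pending)
        self._local.pending = []

    def count(self):
        with self._lock:
            return len(self._committed)


def worker(store, worker_id, rows):
    store.begin()
    for i in range(rows):
        store.insert((worker_id, i))
    store.commit()


store = SharedStore()
threads = [threading.Thread(target=worker, args=(store, n, 100)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = store.count()
print(total)
```

Four threads share one `store` object, yet each thread's uncommitted rows live in its own `threading.local` slot until `commit()` runs.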
    CREATE HNSW INDEX Document.embedding
    ON Document(embedding)
    WITH m=16, ef=128, efConstruction=128
    """)

    # Insert vectors
    with db.transaction():
        db.command("sql", "INSERT INTO Document SET name = 'doc1', embedding = [1.0, 0.0, 0.0]")
        db.command("sql", "INSERT INTO Document SET name = 'doc2', embedding = [0.9, 0.1, 0.0]")
        db.command("sql", "INSERT INTO Document SET name = 'doc3', embedding = [0.0, 1.0, 0.0]")

    # Similarity search
    result = db.query("sql", """
        SELECT name, cosine_similarity(embedding, [1.0, 0.0, 0.0]) as similarity
        FROM Document
        ORDER BY similarity DESC
        LIMIT 2
    """)

    docs = list(result)
    assert docs[0].get_property("name") == "doc1"  # Closest match
    assert docs[1].get_property("name") == "doc2"  # Second closest
    ```
This documentation block demonstrates creating a vector index and searching using SQL syntax (CREATE HNSW INDEX, cosine_similarity). This is confusing and likely incorrect in the context of the Python bindings, which use the db.create_vector_index() and index.find_nearest() methods as shown in the test code. The SQL-based vector search syntax is not the primary or documented way to perform this operation via the Python API and may not be supported in the embedded engine. This section should be revised to accurately reflect the Python API's usage.
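Whichever API ends up performing the search, the ranking that the excerpt asserts (doc1 first, doc2 second) can be checked by hand. The sketch below recomputes cosine similarity in plain Python for the three vectors from the excerpt. It makes no ArcadeDB calls, and the helper `cosine_similarity` here is a local function, not the SQL function of the same name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors copied from the documentation excerpt under review.
documents = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.9, 0.1, 0.0],
    "doc3": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.0]

# Rank document names by similarity to the query, highest first.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(documents[name], query),
    reverse=True,
)
top_two = ranked[:2]
print(top_two)
```

doc1 is identical to the query, doc2 is nearly parallel to it, and doc3 is orthogonal, so a revised example that uses the Python API should still expect the same top two results.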
bindings/python/Dockerfile.build
Outdated
    RUN export ARCADEDB_VERSION=$(python3 extract_version.py --format=pep440 /arcadedb/pom.xml) && \
        echo "📦 Python package version: ${ARCADEDB_VERSION}" && \
        case ${DISTRIBUTION} in \
            headless) \
                PACKAGE_NAME="arcadedb-embedded-headless" && \
                DESCRIPTION="ArcadeDB embedded Python bindings - Headless distribution (excludes Gremlin, GraphQL, MongoDB/Redis wire protocols, and Studio)" \
                ;; \
            minimal) \
                PACKAGE_NAME="arcadedb-embedded-minimal" && \
                DESCRIPTION="ArcadeDB embedded Python bindings - Minimal distribution (excludes Gremlin, GraphQL, MongoDB/Redis wire protocols)" \
                ;; \
            full) \
                PACKAGE_NAME="arcadedb-embedded" && \
                DESCRIPTION="ArcadeDB embedded Python bindings - Full distribution (includes Gremlin, GraphQL, MongoDB/Redis wire protocols, and Studio)" \
                ;; \
            *) \
                PACKAGE_NAME="arcadedb-embedded" && \
                DESCRIPTION="ArcadeDB embedded Python bindings" \
                ;; \
        esac && \
        sed -i 's|^name = .*|name = "'"${PACKAGE_NAME}"'"|' pyproject.toml && \
        sed -i 's|^version = .*|version = "'"${ARCADEDB_VERSION}"'"|' pyproject.toml && \
        sed -i 's|^description = .*|description = "'"${DESCRIPTION}"'"|' pyproject.toml && \
        python3 -m build --wheel && \
        echo "✅ Wheel built successfully!" && \
        ls -lh dist/
This RUN command is very long and performs multiple distinct tasks (version extraction, modifying pyproject.toml with sed, and building the wheel). For better readability, maintainability, and to leverage Docker's layer caching more effectively, consider splitting this into smaller, more focused RUN commands. Alternatively, this logic could be moved into a helper shell script that is COPY'd and executed.
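One way to act on this comment is to move the `case` and `sed` logic into a small helper that the Dockerfile copies in and runs. The sketch below is an assumption about how such a helper might look, not code from the PR: the function names `set_field` and `apply_metadata` are hypothetical, and only the package names and descriptions are taken from the excerpt.

```python
import re

# Distribution -> (package name, description), copied from the Dockerfile excerpt.
DISTRIBUTIONS = {
    "headless": (
        "arcadedb-embedded-headless",
        "ArcadeDB embedded Python bindings - Headless distribution "
        "(excludes Gremlin, GraphQL, MongoDB/Redis wire protocols, and Studio)",
    ),
    "minimal": (
        "arcadedb-embedded-minimal",
        "ArcadeDB embedded Python bindings - Minimal distribution "
        "(excludes Gremlin, GraphQL, MongoDB/Redis wire protocols)",
    ),
    "full": (
        "arcadedb-embedded",
        "ArcadeDB embedded Python bindings - Full distribution "
        "(includes Gremlin, GraphQL, MongoDB/Redis wire protocols, and Studio)",
    ),
}
DEFAULT = ("arcadedb-embedded", "ArcadeDB embedded Python bindings")


def set_field(text, field, value):
    """Replace every line of the form `field = ...`, like the sed calls do."""
    pattern = re.compile(rf"^{field} = .*$", flags=re.MULTILINE)
    # A function replacement avoids backslash-escape surprises in `value`.
    return pattern.sub(lambda _match: f'{field} = "{value}"', text)


def apply_metadata(pyproject_text, distribution, version):
    """Return pyproject.toml text with name, version and description set."""
    name, description = DISTRIBUTIONS.get(distribution, DEFAULT)
    for field, value in (("name", name), ("version", version), ("description", description)):
        pyproject_text = set_field(pyproject_text, field, value)
    return pyproject_text


# Hypothetical input, used only to exercise the helper.
sample = '[project]\nname = "placeholder"\nversion = "0.0.0"\ndescription = "placeholder"\n'
updated = apply_metadata(sample, "minimal", "1.2.3")
print(updated)
```

With the metadata rewrite isolated like this, the version extraction and `python3 -m build --wheel` steps could each sit in their own `RUN` instruction, which is the layering the comment asks for.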
    RUN echo '#!/usr/bin/env python3\n\
    import arcadedb_embedded as arcadedb\n\
    import tempfile\n\
    import shutil\n\
    import os\n\
    \n\
    print("🎮 Testing ArcadeDB Python bindings...")\n\
    print(f"📦 Version: {arcadedb.__version__}")\n\
    \n\
    temp_dir = tempfile.mkdtemp()\n\
    db_path = os.path.join(temp_dir, "test_db")\n\
    \n\
    try:\n\
        with arcadedb.create_database(db_path) as db:\n\
            print("✅ Database created")\n\
    \n\
            with db.transaction():\n\
                db.command("sql", "CREATE DOCUMENT TYPE TestDoc")\n\
                db.command("sql", "INSERT INTO TestDoc SET name = '\''docker_test'\'', value = 123")\n\
            print("✅ Transaction committed")\n\
    \n\
            result = db.query("sql", "SELECT FROM TestDoc")\n\
            for record in result:\n\
                print(f"✅ Query result: {record.get_property('\''name'\'')} = {record.get_property('\''value'\'')}")\n\
    \n\
        print("🎉 All tests passed!")\n\
    finally:\n\
        if os.path.exists(temp_dir):\n\
            shutil.rmtree(temp_dir)\n\
    ' > /test/test_install.py && chmod +x /test/test_install.py
Embedding a large, multi-line script directly into the Dockerfile using echo makes it difficult to read, maintain, and lint. It would be a better practice to store this script in a separate file (e.g., docker/test_install.py) and use a COPY instruction to add it to the image. This would significantly improve the readability and maintainability of the Dockerfile.
    2. Click **Draft a new release**
    3. Click **Choose a tag** dropdown

    2. Click **Choose a tag** → Type `vX.Y.Z-python` → **Create new tag**

    # Count
    result = db.query("sql", "SELECT count(*) as total FROM Person")
    total = result[0].get_property('total')
The code result[0] suggests that the ResultSet object supports direct indexing. However, based on the API documentation and other examples, ResultSet is an iterator, so accessing it by index will raise an error. To get the first item, either convert it to a list first (list(result)[0]) or, more efficiently, call result.next().
Suggested change, from:
    total = result[0].get_property('total')
to:
    total = result.next().get_property('total')
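The difference between indexing and iterating can be reproduced without the bindings. The sketch below uses `FakeResultSet`, a hypothetical iterator-only stand-in with a `next()` method. It is not the real `ResultSet`, and the exact exception the real class raises on indexing is not confirmed here.

```python
class FakeResultSet:
    """Hypothetical stand-in: iterable once, no __getitem__."""

    def __init__(self, rows):
        self._rows = iter(rows)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._rows)

    def next(self):
        # Convenience method mirroring the `result.next()` call in the suggestion.
        return next(self)


result = FakeResultSet([{"total": 3}])

# Indexing fails because the class defines no __getitem__.
try:
    result[0]
    indexing_failed = False
except TypeError:
    indexing_failed = True

# Fetching the first row through the iterator protocol works.
total = result.next()["total"]

# The iterator is now exhausted, so converting to a list yields nothing more.
remaining = list(result)
print(indexing_failed, total, remaining)
```

The last step shows a practical consequence of the iterator design: a result set can only be consumed once, so `list(result)` should be called before any `next()` if all rows are needed.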
bindings/python/docs/guide/graphs.md
Outdated
    alice = result_alice.next()._java_result
    bob = result_bob.next()._java_result
The example for creating edges uses ._java_result, which is a private attribute. Exposing private attributes in documentation is not a good practice as it indicates a leaky abstraction and can be brittle if the internal implementation changes. A public method like as_java() on the Result object would provide a more stable and explicit API. Alternatively, the Result wrapper could proxy the newEdge method.
Suggested change, from:
    alice = result_alice.next()._java_result
    bob = result_bob.next()._java_result
to:
    alice = result_alice.next().as_java()
    bob = result_bob.next().as_java()
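A minimal sketch of the accessor proposed above might look as follows. The `Result` class here is an illustrative wrapper rather than the one in the PR, and `FakeJavaResult` is a hypothetical stand-in for the wrapped Java object so the example runs without a JVM.

```python
class FakeJavaResult:
    """Hypothetical stand-in for the Java-side result object."""

    def __init__(self, properties):
        self._properties = properties

    def getProperty(self, name):
        return self._properties[name]


class Result:
    """Illustrative Python wrapper around a Java result object."""

    def __init__(self, java_result):
        # Private attribute: callers should not depend on this name.
        self._java_result = java_result

    def as_java(self):
        """Public, documented way to reach the underlying Java object."""
        return self._java_result

    def get_property(self, name):
        # Pythonic accessor that delegates to the wrapped object.
        return self._java_result.getProperty(name)


java_obj = FakeJavaResult({"name": "Alice"})
alice = Result(java_obj)

same_object = alice.as_java() is java_obj
name = alice.get_property("name")
print(same_object, name)
```

Because callers go through `as_java()`, the wrapper is free to rename or restructure `_java_result` later without breaking documented examples.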
5990a8b to 849b4c7
Previous discussion (outdated):

The Python bindings are working well overall. The essential unit tests pass, and I'm now adding 9 realistic examples with larger datasets (millions of records) to test performance and scalability. I'm halfway through and plan to complete this by October 31st so it can be included in the next release. While developing these examples, I've encountered some issues (or possibly mistakes on my part) which I've documented here:

These may be Python-specific. If so, I'll fix them in the bindings.

Repository Structure
I plan to maintain my fork (https://github.com/humemai/arcadedb-embedded-python) as a Python-focused repository. It will stay synchronized with ArcadeDB's main branch but include:

The Java codebase will remain unchanged and synchronized with upstream.

PyPI Releases
I'll publish Python wheels to PyPI from my fork:

These will follow ArcadeDB's release versions with an optional revision suffix for Python-specific updates.

Contributing Back
Once the Python bindings are stable and well-tested, I'd like to contribute them back to the main ArcadeDB repository via PR, similar to how projects like DuckDB and Apache Arrow maintain their Python bindings in the main repo with dedicated documentation and PyPI releases. The quality of my Python bindings can be verified by the tests run through GitHub Actions from your repo (https://github.com/ArcadeData/arcadedb/actions/workflows/test-python-bindings.yml and https://github.com/ArcadeData/arcadedb/actions/workflows/test-python-examples.yml).
2cd71e5 to 69f4c54
f302557 to 53fbed6
Introduce comprehensive Python bindings that enable embedded ArcadeDB usage directly from Python applications, leveraging JPype for seamless JVM integration.

Core Features:
- Embedded database operations with full CRUD support
- Document, vertex, and edge models for graph databases
- Transaction management (read, write, batch operations)
- Server mode with HTTP API support
- Vector search capabilities for AI/ML applications
- Data import from CSV/JSONL with automatic type inference
- Export to GraphML, GraphSON, JSONL, and CSV formats
- Gremlin query language support
- Async execution and batch processing utilities

Development Infrastructure:
- Multi-platform build system (Linux, macOS, Windows on x64/ARM64)
- Native build scripts with JRE bundling
- Docker-based build environment
- Comprehensive test suite with 100+ tests covering:
  * Core database operations
  * Concurrency and transactions
  * Import/export functionality
  * Server patterns and API
  * Type conversions and result handling
- CI/CD workflows for automated testing across all platforms
- Testing for examples 01-03 (verified working)

Examples:
- Simple document store with CRUD operations
- Social network graph modeling and traversal
- Vector similarity search
- CSV import with MovieLens dataset (examples 04-05 included but not CI-tested yet)

Build System:
- Platform-specific wheel generation
- JAR exclusion filtering for minimal distributions
- Version extraction from parent pom.xml
- Setup utilities for streamlined installation

This implementation provides a Pythonic interface to ArcadeDB while maintaining compatibility with the Java API and supporting all major platforms.
53fbed6 to a46cb72
Python Bindings for ArcadeDB

Overview
This PR introduces comprehensive Python bindings for ArcadeDB, enabling embedded database usage directly from Python applications. The implementation leverages JPype for seamless JVM integration, providing a Pythonic interface while maintaining full compatibility with the Java API.

Related Issue

🎯 Key Highlights

Multi-Platform Support (6 Platforms)
✅ All platforms supported thanks to Java's JIT nature:

Key Innovation: Instead of compiling native extensions for each platform, we ship platform-specific stripped JREs bundled with each wheel. This approach:

Current Status
📦 PyPI Distribution Pending - Wheels are ready but not yet published to PyPI. Waiting for PyPI approval to push wheels larger than 100MB (current wheels include bundled JREs).

🚀 Features
Core Functionality
Data Import/Export
Development Infrastructure

📦 Installation (When Available on PyPI)
    pip install arcadedb-embedded
Platform-specific wheels will be automatically selected based on your system.

🔧 Build System
The build system generates platform-specific wheels with bundled JREs:
    # Build for current platform
    cd bindings/python
    ./build.sh
    # Build for specific platform
    ./build.sh linux/amd64
    ./build.sh darwin/arm64
    ./build.sh windows/amd64

JAR Exclusion
Non-essential JARs (e.g., gRPC) are excluded to minimize wheel size, configured via

📊 Testing
Test Coverage
CI/CD Workflows

📝 Examples
1. Simple Document Store
Basic CRUD operations with comprehensive data type support.
2. Social Network Graph
Graph modeling with vertices, edges, and traversal queries.
3. Vector Search
Vector embeddings and semantic similarity search for AI/ML applications.

🛣️ Roadmap

🤝 Technical Details
Architecture
Previous discussion (outdated):

Python Embedded Bindings for ArcadeDB

What does this PR do?
This PR introduces native Python bindings for ArcadeDB that embed the Java database engine directly in Python processes using JPype. It provides a Pythonic API to ArcadeDB's multi-model database capabilities (Graph, Document, Key/Value, Vector, Time Series) with three distribution options tailored to different use cases.

Key Additions:
arcadedb_embedded) with ~3,200 lines of production code

Motivation
ArcadeDB is a Java-based multi-model database, but Python is the dominant language in data science, AI/ML, and modern web development. This integration enables:

Why Three Distributions?

Related Issues
#2662

Architecture Decisions:

Technical Overview

Package Structure

Build System
Docker-based multi-stage builds ensure reproducibility:
Single command builds all three distributions:

CI/CD Workflows
Three GitHub Actions workflows added to .github/workflows/:
- test-python-bindings.yml: Runs pytest on every push/PR
- release-python-packages.yml: Builds and publishes to PyPI when release tag contains "python"
- deploy-python-docs.yml: Builds and deploys MkDocs to GitHub Pages

API Coverage
The bindings provide ~85% coverage of Java API features relevant to Python developers:

Testing
41 tests, 100% passing across all distributions:
Test categories:

Additional Notes

Documentation
Comprehensive documentation site built with MkDocs (Material theme):
Live site: https://humemai.github.io/arcadedb/latest/

Examples
Added examples/basic.py demonstrating:

Dependencies
Minimal Python dependencies:
- jpype1>=1.5.0 (JVM integration)
- numpy>=1.20.0 (for vector operations)
- pytest, pytest-cov, black, isort, mypy
Java dependencies: All bundled in wheel (no external JARs needed)

Installation Requirements

Backward Compatibility
This PR adds a new bindings/python/ directory with no changes to existing Java code or other bindings. It's completely isolated and won't affect existing functionality.

Performance Considerations

Known Limitations

Checklist
mvn clean package command

Additional Testing Completed
examples/basic.py runs successfully