Problem Statement
Currently, paimon-cpp uses a fixed BUNDLED approach for all third-party dependencies through CMake's ExternalProject_Add. While this ensures build reproducibility, it has several limitations:
Current Limitations
- No choice in dependency sources: All dependencies are downloaded and built from source, even when system libraries are already available
- Long build times: Building all dependencies (Arrow, ORC, Protobuf, compression libraries, etc.) from source can take significant time
- No reuse of existing installations: Users cannot leverage pre-installed libraries from system package managers (apt, yum, brew, conda, vcpkg)
- Inflexible for different environments: Different deployment scenarios (development, CI, production) may benefit from different dependency management strategies
- No per-dependency control: Cannot selectively choose BUNDLED for some dependencies and SYSTEM for others
Example Use Cases That Are Currently Difficult
- Development: A developer already has Arrow 17.0.0 installed system-wide, but the build still downloads and compiles its own copy
- CI/CD: Build containers ship with pre-installed dependencies to speed up CI pipelines, but the build cannot use them
- Custom builds: Organizations with specific library versions or patches in non-standard locations
- Conda environments: Users working within conda environments want to use conda-provided libraries
Proposed Solution
Implement a flexible dependency management system similar to the one in Apache Arrow C++, providing the following:
1. Global Dependency Source Control
Add a PAIMON_DEPENDENCY_SOURCE option:
-DPAIMON_DEPENDENCY_SOURCE=<AUTO|BUNDLED|SYSTEM|CONDA>
- AUTO (proposed default, open for discussion): Try to find system libraries first, fall back to a bundled build if not found
- BUNDLED: Always download and build dependencies from source (current behavior)
- SYSTEM: Use only system-installed libraries (fail if not found)
- CONDA: Use libraries from the $CONDA_PREFIX environment
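A minimal sketch of how the global option could be declared and validated. Only PAIMON_DEPENDENCY_SOURCE, PAIMON_ACTUAL_DEPENDENCY_SOURCE, PAIMON_PACKAGE_PREFIX and the four values come from this proposal. The helper variable name and the handling of CONDA as "SYSTEM plus a search prefix" are assumptions, and the AUTO default shown here is only the proposed one.

```cmake
# Hypothetical sketch, not existing paimon-cpp code.
set(PAIMON_DEPENDENCY_SOURCE "AUTO" CACHE STRING
    "Source of third-party dependencies: AUTO, BUNDLED, SYSTEM or CONDA")
set(_paimon_valid_sources AUTO BUNDLED SYSTEM CONDA)
# Offer the valid values as a drop-down in cmake-gui / ccmake.
set_property(CACHE PAIMON_DEPENDENCY_SOURCE PROPERTY STRINGS ${_paimon_valid_sources})

# Reject typos early. IN_LIST needs CMake >= 3.3 (policy CMP0057).
if(NOT PAIMON_DEPENDENCY_SOURCE IN_LIST _paimon_valid_sources)
  message(FATAL_ERROR
          "Unknown PAIMON_DEPENDENCY_SOURCE '${PAIMON_DEPENDENCY_SOURCE}'; "
          "expected one of: ${_paimon_valid_sources}")
endif()

if(PAIMON_DEPENDENCY_SOURCE STREQUAL "CONDA")
  if("$ENV{CONDA_PREFIX}" STREQUAL "")
    message(FATAL_ERROR
            "PAIMON_DEPENDENCY_SOURCE=CONDA requires CONDA_PREFIX to be set")
  endif()
  # Assumption: CONDA behaves like SYSTEM with $CONDA_PREFIX as search prefix.
  set(PAIMON_ACTUAL_DEPENDENCY_SOURCE "SYSTEM")
  if("${PAIMON_PACKAGE_PREFIX}" STREQUAL "")
    set(PAIMON_PACKAGE_PREFIX "$ENV{CONDA_PREFIX}")
  endif()
else()
  set(PAIMON_ACTUAL_DEPENDENCY_SOURCE "${PAIMON_DEPENDENCY_SOURCE}")
endif()
```

Resolving CONDA into PAIMON_ACTUAL_DEPENDENCY_SOURCE up front would let the rest of the toolchain deal with only three values.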
2. Per-Dependency Source Control
Allow users to override individual dependencies:
-DArrow_SOURCE=SYSTEM
-DArrow_ROOT=/usr/local/arrow-17.0.0
-Dzstd_SOURCE=BUNDLED
-Dglog_SOURCE=AUTO
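One way the per-dependency overrides could be resolved is sketched below. The dependency list is an illustrative subset, the loop is hypothetical, and it assumes PAIMON_ACTUAL_DEPENDENCY_SOURCE already holds AUTO, BUNDLED or SYSTEM.

```cmake
# Hypothetical sketch: every dependency without an explicit <name>_SOURCE
# inherits the global setting; explicit values are validated.
set(_paimon_dependencies Arrow zstd glog)  # illustrative subset only

foreach(_dep IN LISTS _paimon_dependencies)
  if("${${_dep}_SOURCE}" STREQUAL "")
    # Not given on the command line: inherit the global choice.
    set(${_dep}_SOURCE "${PAIMON_ACTUAL_DEPENDENCY_SOURCE}")
  endif()
  if(NOT "${${_dep}_SOURCE}" MATCHES "^(AUTO|BUNDLED|SYSTEM)$")
    message(FATAL_ERROR
            "Unknown ${_dep}_SOURCE '${${_dep}_SOURCE}'; "
            "expected AUTO, BUNDLED or SYSTEM")
  endif()
  message(STATUS "${_dep}_SOURCE = ${${_dep}_SOURCE}")
endforeach()
```

Printing the resolved value for each dependency at configure time makes it easy to see which ones came from an override.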
3. Unified Path Prefix
Support a common prefix for all unspecified dependencies:
-DPAIMON_PACKAGE_PREFIX=/opt/mylibs
This automatically sets Arrow_ROOT, zstd_ROOT, etc. to /opt/mylibs for every dependency whose <PACKAGE>_ROOT is not given explicitly.
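The prefix behavior could be implemented roughly as follows. The dependency list and loop are hypothetical. The sketch relies on find_package() consulting <PackageName>_ROOT variables, which CMake does from version 3.12 on (policy CMP0074).

```cmake
# Hypothetical sketch: propagate PAIMON_PACKAGE_PREFIX to every dependency
# that has no explicit <name>_ROOT.
set(_paimon_dependencies Arrow zstd glog)  # illustrative subset only

if(NOT "${PAIMON_PACKAGE_PREFIX}" STREQUAL "")
  foreach(_dep IN LISTS _paimon_dependencies)
    if("${${_dep}_ROOT}" STREQUAL "")
      # An explicit -D<name>_ROOT=... always wins over the common prefix.
      set(${_dep}_ROOT "${PAIMON_PACKAGE_PREFIX}")
    endif()
  endforeach()
endif()
```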
4. Shared vs Static Library Control
Set a global linkage preference and optionally override it per dependency:
-DPAIMON_DEPENDENCY_USE_SHARED=OFF
-DPAIMON_ARROW_USE_SHARED=ON
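As a rough illustration of how a linkage preference could be combined with a <PACKAGE>_ROOT lookup, the sketch below defines a hypothetical helper, paimon_find_library(). The helper name, its signature and the ON default are assumptions. The suffix trick only works on Linux and macOS, because Windows import libraries need different handling, and it must run after project() so the suffix variables are defined.

```cmake
# Hypothetical sketch, not existing paimon-cpp code.
option(PAIMON_DEPENDENCY_USE_SHARED
       "Prefer shared libraries for dependencies" ON)  # default is assumed

# Per-dependency override inherits the global preference unless set.
if(NOT DEFINED PAIMON_ARROW_USE_SHARED)
  set(PAIMON_ARROW_USE_SHARED ${PAIMON_DEPENDENCY_USE_SHARED})
endif()

# Find one library, honoring an optional root and the linkage preference.
# The result is stored in the cache variable named by out_var.
function(paimon_find_library out_var lib_name root use_shared)
  if(use_shared)
    set(CMAKE_FIND_LIBRARY_SUFFIXES "${CMAKE_SHARED_LIBRARY_SUFFIX}")
  else()
    set(CMAKE_FIND_LIBRARY_SUFFIXES "${CMAKE_STATIC_LIBRARY_SUFFIX}")
  endif()
  if(NOT "${root}" STREQUAL "")
    # Look only below the given root, never in default system paths.
    find_library(${out_var} NAMES ${lib_name}
                 PATHS "${root}" PATH_SUFFIXES lib lib64
                 NO_DEFAULT_PATH)
  else()
    # No root given: fall back to the normal system search.
    find_library(${out_var} NAMES ${lib_name})
  endif()
endfunction()

# Usage: look for libarrow with the Arrow-specific preference.
paimon_find_library(ARROW_LIBRARY arrow
                    "${Arrow_ROOT}" ${PAIMON_ARROW_USE_SHARED})
```

The suffix list is changed inside a function so that it does not leak into other find_library() calls.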
Implementation Approach
Following Arrow's design pattern:
- Create a resolve_dependency() macro that:
  - Checks the ${DEPENDENCY_NAME}_SOURCE variable
  - Falls back to PAIMON_ACTUAL_DEPENDENCY_SOURCE if it is not set
  - Calls find_package() for SYSTEM and AUTO, or build_dependency() for BUNDLED; under AUTO, a failed find_package() falls back to build_dependency()
- Create Find modules (e.g., FindArrowAlt.cmake) that:
  - Respect the ${PACKAGE}_ROOT CMake variable
  - Search in ${PACKAGE}_ROOT/{include,lib} with NO_DEFAULT_PATH
  - Fall back to system paths if ${PACKAGE}_ROOT is not set
  - Support both shared and static library preferences
- Update ThirdpartyToolchain.cmake:
  - Replace direct build_<dependency>() calls with resolve_dependency()
  - Set default <PACKAGE>_SOURCE values based on PAIMON_DEPENDENCY_SOURCE
- Maintain backward compatibility:
  - Default to BUNDLED, which preserves current behavior exactly, or to AUTO, which differs only when a system library is found
  - Existing build commands work without changes
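The resolve_dependency() steps listed above could look roughly like this. It is a simplified sketch rather than Arrow's actual macro: build_dependency() is a placeholder standing in for the existing ExternalProject_Add-based build_<dependency>() logic, and PAIMON_ACTUAL_DEPENDENCY_SOURCE is assumed to be set to AUTO, BUNDLED or SYSTEM beforehand.

```cmake
# Placeholder for the existing bundled build logic (assumption).
macro(build_dependency BUNDLED_NAME)
  message(STATUS "Building ${BUNDLED_NAME} from source")
endmacro()

# A macro (not a function) so that variables set by find_package(),
# such as <name>_FOUND, stay visible to the caller.
macro(resolve_dependency DEPENDENCY_NAME)
  # Per-dependency override first, global setting otherwise.
  if("${${DEPENDENCY_NAME}_SOURCE}" STREQUAL "")
    set(${DEPENDENCY_NAME}_SOURCE "${PAIMON_ACTUAL_DEPENDENCY_SOURCE}")
  endif()

  if("${${DEPENDENCY_NAME}_SOURCE}" STREQUAL "AUTO")
    # Try the system first; build from source only if that fails.
    find_package(${DEPENDENCY_NAME} QUIET)
    if(NOT ${DEPENDENCY_NAME}_FOUND)
      message(STATUS "${DEPENDENCY_NAME} not found, using bundled build")
      build_dependency(${DEPENDENCY_NAME})
    endif()
  elseif("${${DEPENDENCY_NAME}_SOURCE}" STREQUAL "SYSTEM")
    # Fail the configure step if the package is missing.
    find_package(${DEPENDENCY_NAME} REQUIRED)
  elseif("${${DEPENDENCY_NAME}_SOURCE}" STREQUAL "BUNDLED")
    build_dependency(${DEPENDENCY_NAME})
  else()
    message(FATAL_ERROR
            "Unknown ${DEPENDENCY_NAME}_SOURCE "
            "'${${DEPENDENCY_NAME}_SOURCE}'")
  endif()
endmacro()

# Usage in ThirdpartyToolchain.cmake, replacing a direct build call:
resolve_dependency(Arrow)
```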
Example Usage
Use System Libraries
cmake -B build \
  -DPAIMON_DEPENDENCY_SOURCE=SYSTEM \
  -DArrow_ROOT=/usr/local \
  -Dglog_ROOT=/usr/local \
  -Dzstd_ROOT=/usr/local
Mixed Approach (some bundled, some system)
cmake -B build \
  -DPAIMON_DEPENDENCY_SOURCE=AUTO \
  -DArrow_SOURCE=SYSTEM \
  -DArrow_ROOT=/custom/arrow \
  -Dzstd_SOURCE=BUNDLED
Conda Environment
cmake -B build \
  -DPAIMON_DEPENDENCY_SOURCE=CONDA
Benefits
- Faster builds: Reuse pre-installed libraries, especially for iterative development
- Flexible deployment: Support diverse environments (bare metal, containers, HPC clusters)
- Better CI integration: Cache dependencies across builds
- Ecosystem compatibility: Work seamlessly with conda, vcpkg, conan, system package managers
- Gradual adoption: Users can opt-in to new features without breaking existing builds
- Resource efficiency: Avoid rebuilding large dependencies like Arrow (which itself has many dependencies)
Reference Implementation
Apache Arrow C++ has successfully implemented this pattern; the relevant source links are listed at the end of this proposal.
Questions for Discussion
- Should the default be AUTO (convenient) or BUNDLED (current, most reproducible)?
- Which dependencies should support this first? (Suggestion: Start with Arrow, compression libraries)
- Should we maintain compatibility with older CMake package formats, or require modern targets?
- How should we handle transitive dependency conflicts between SYSTEM and BUNDLED libraries?
Implementation Phases
Phase 1: Core infrastructure
- Implement the resolve_dependency() macro
- Add the PAIMON_DEPENDENCY_SOURCE option
- Support <PACKAGE>_SOURCE and <PACKAGE>_ROOT variables
Phase 2: Major dependencies
- Arrow (including Parquet)
- ORC + Protobuf
- Compression libraries (Snappy, zstd, lz4, zlib)
Phase 3: Additional dependencies
- Avro
- glog, fmt, RapidJSON
- TBB
- Testing libraries (GTest)
Phase 4: Advanced features
- Conda/vcpkg integration
- Shared vs static library preferences
- Better error messages and diagnostics
Compatibility
- ✅ Backward compatible: Existing build scripts work unchanged
- ✅ Opt-in: New features are optional
- ✅ No breaking changes: Default behavior can remain BUNDLED initially
Would appreciate feedback from maintainers and community members on this proposal!
Reference links (Apache Arrow C++):
- ARROW_DEPENDENCY_SOURCE: https://github.com/apache/arrow/blob/main/cpp/cmake_modules/DefineOptions.cmake#L456-L464
- resolve_dependency() macro: https://github.com/apache/arrow/blob/main/cpp/cmake_modules/ThirdpartyToolchain.cmake#L252-L366