Code for generating the intake-ESM catalogs that provide access to datasets in NCAR's GDEX, along with examples of using the generated catalogs.
This repository contains tools and scripts for generating intake-ESM catalogs that provide unified access to diverse Earth science datasets. While intake-ESM was originally designed for Earth System Model output, we extend its use to observations, reanalysis data, and other Earth science datasets.
The primary tool is generator/create_catalog.py. It generates an intake-ESM catalog for a specific dataset directory.
Basic CLI
python generator/create_catalog.py <directory> \
[--out <output directory>] \
[--catalog_name <name>] \
[--description <description>] \
[--exclude <glob> ...] \
[--include <glob> ...] \
[--depth <int>] \
[--ignore_vars <var name> ...] \
[--var_metadata <json string|filename>] \
[--global_metadata <json string|filename>] \
[--output_format <csv_and_json|single_json>] \
[--data_format <netcdf|zarr|reference>] \
[--make_remote]
- <directory>: One or more root data directories to scan (space-separated).
- --out, -o: Destination directory for generated catalog files (default: ./).
- --catalog_name, -n: Name to use for the catalog file(s) (default: dnnnnnn-posix).
- --description: Short human-readable description for the catalog (default: N/A).
- --exclude, -e: Glob pattern(s) to exclude files or directories (can be repeated).
- --include, -ic: Glob pattern(s) to include files or directories (can be repeated).
- --depth, -d: Maximum directory recursion depth (integer, default: 0).
- --ignore_vars, -i: Variable names to ignore (can be repeated).
- --var_metadata, -vm: Per-variable metadata as a JSON string or a path to a JSON file.
- --global_metadata, -gm: Catalog-level metadata as a JSON string or a path to a JSON file.
- --output_format, -of: Output style; csv_and_json emits CSV + JSON index files, single_json emits a single JSON catalog (default: csv_and_json).
- --data_format, -df: Input data/reference type: netcdf, zarr, or reference (default: netcdf).
- --make_remote, -mr: If set, prepare remote-accessible references for https and osdf (boolean flag).
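The --var_metadata and --global_metadata options accept either an inline JSON string or a path to a JSON file. As a rough illustration of that convention (not the generator's actual parsing code), a value of this kind could be resolved as sketched below. The helper name load_json_arg and the metadata keys in the example are hypothetical.

```python
import json
import os


def load_json_arg(value: str) -> dict:
    """Resolve a CLI value that is either a JSON file path or an inline JSON string.

    Hypothetical helper for illustration; create_catalog.py may behave differently.
    """
    # Treat the value as a file only if it names an existing file on disk.
    if os.path.isfile(value):
        with open(value) as fh:
            data = json.load(fh)
    else:
        # Otherwise parse the value itself as JSON.
        data = json.loads(value)
    if not isinstance(data, dict):
        raise TypeError("metadata must be a JSON object")
    return data


# Inline JSON string; the variable name and attribute are made up for this example.
meta = load_json_arg('{"t2m": {"long_name": "2 m temperature"}}')
print(meta["t2m"]["long_name"])
```

Checking for an existing file first means a filename such as var_meta.json is never mistaken for malformed JSON, while anything else must parse as a JSON object.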
python generator/create_catalog.py \
/gdex/data/need/to/be/cataloged/ \
--data_format reference \
--out /data/path/to/store/catalog \
--output_format csv_and_json \
--catalog_name intake_catalog \
--description "reference catalog" \
--depth 0 \
--include "*.zarr" \
--exclude "*.tmp" \
--ignore_vars utc_date \
--var_metadata var_meta.json \
--global_metadata global_meta.json \
--make_remote
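When catalogs are generated from a workflow script rather than by hand, the same invocation can be assembled in Python. The sketch below uses only flags documented above; the helper name build_command is hypothetical, the paths are the placeholders from the example, and actually running the command assumes the working directory is the repository root.

```python
import shlex
import subprocess
import sys


def build_command(directory, out="./", catalog_name=None, data_format="netcdf",
                  include=None, exclude=None, make_remote=False):
    """Assemble the argument list for generator/create_catalog.py (hypothetical helper)."""
    cmd = [sys.executable, "generator/create_catalog.py", directory,
           "--out", out, "--data_format", data_format]
    if catalog_name:
        cmd += ["--catalog_name", catalog_name]
    # One glob pattern per flag here, mirroring the example invocation above.
    if include:
        cmd += ["--include", include]
    if exclude:
        cmd += ["--exclude", exclude]
    if make_remote:
        cmd.append("--make_remote")
    return cmd


cmd = build_command(
    "/gdex/data/need/to/be/cataloged/",
    out="/data/path/to/store/catalog",
    catalog_name="intake_catalog",
    data_format="reference",
    include="*.zarr",
    exclude="*.tmp",
    make_remote=True,
)
print(shlex.join(cmd))  # show the command that would be executed
# subprocess.run(cmd, check=True)  # uncomment to actually run the generator
```

Passing the arguments as a list (rather than a single shell string) keeps glob patterns such as *.zarr from being expanded by the shell before the generator sees them.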
Notes
- See the generator/create_catalog.py source for full option parsing and advanced behaviors.
We use a custom fork of ecgtools, currently pinned to commit 0b3d5b5d0082812e85c821c00c2d619eed0ae3cd, along with custom scripts to generate our catalogs. This allows us to:
- Handle diverse data formats and structures
- Implement custom parsing logic for different data sources
- Maintain consistency across various dataset types
Although intake-ESM is primarily meant for Earth System Model output, we leverage the package to generate catalogs for:
- Observations (satellite, in-situ, etc.)
- Reanalysis datasets (ERA5, JRA-3Q, etc.)
- Model output (CESM, CMIP, etc.)
- Other Earth science data
We strive to match our vocabulary (column names) with conventions used by other major data providers including:
- DKRZ (Deutsches Klimarechenzentrum)
- Copernicus Climate Data Store
- NASA data repositories
- NOAA data services
Our catalogs support different data access patterns through three main flavors:
- POSIX: Direct filesystem access for users on NCAR HPC systems (Casper, Derecho)
- HTTPS: Web-based access for remote users over standard HTTP protocols
- OSDF: Distributed access through the Open Science Data Federation for broader community access
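As a rough sketch of what consuming one of these catalogs can look like, the snippet below opens a generated catalog JSON with intake-esm and loads the matching datasets. The catalog path and the variable column used in the example call are assumptions for illustration; check the columns of the catalog you are actually using. intake is imported inside the function so the snippet can be loaded even where intake-esm is not installed.

```python
def load_datasets(catalog_json, **query):
    """Open an intake-ESM catalog and return a dict of xarray datasets matching `query`."""
    import intake  # requires the intake-esm package to be installed

    cat = intake.open_esm_datastore(catalog_json)
    # Narrow the catalog only when search criteria were given.
    subset = cat.search(**query) if query else cat
    return subset.to_dataset_dict()


# Example call (path and column name are placeholders, not real catalog entries):
# dsets = load_datasets("/path/to/intake_catalog.json", variable="t2m")
```

The same pattern should apply to each flavor; only the catalog file that is opened differs.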
For comprehensive usage examples and tutorials for the generated catalogs:
- NCAR HPC users: Visit gdex-examples
- OSDF users: Visit osdf_examples
We welcome feedback from the community! Please use GitHub issues for:
- Bug reports when something is broken
- Feature requests for new functionality or datasets
Note: While we appreciate all feature requests, please understand that we may not be able to fulfill all requests due to resource constraints and project priorities.
If you need help:
- Check the documentation and examples linked above
- Search existing GitHub issues for similar problems
- Open a new issue with detailed information about your use case
├── README.md
├── requirements.txt
├── generator/ # Core catalog generation tools
│ ├── create_catalog.py
│ └── modify_catalog.py
├── notebooks/ # Example notebooks and development work
└── test/ # Test scripts
git clone https://github.com/NCAR/gdex-intake-esm.git
cd gdex-intake-esm
pip install -r requirements.txt