Open
Labels
enhancement (New feature or request)
Description
the OCR codebase currently uses a data catalog (ocr.catalog) that points to private S3 buckets. this prevents external users from running the pipeline unless they have access to CarbonPlan's private buckets.
we have now published some of the input and output datasets to Source Coop under open licences. however, the codebase has not been updated to reference the public bucket by default.
published datasets
the data is available on Source Coop under s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr
input data
- fire risk tensor data (Icechunk & TIFF):
  - input/fire-risk/tensor/USFS/dillon-et-al-2023/
  - input/fire-risk/tensor/USFS/riley-et-al-2025/
  - input/fire-risk/tensor/USFS/scott-et-al-2024/
  - input/fire-risk/tensor/conus404-ffwi/
- vector data (Geoparquet):
  - input/fire-risk/vector/census-tiger/ (blocks, counties, tracts) - CC BY 4.0
  - input/fire-risk/vector/overture-maps/ - ODbL
output data (versioned)
- tensor: output/fire-risk/tensor/production/ (Icechunk) - CC BY 4.0
- vector: output/fire-risk/vector/production/ (Geoparquet, PMTiles, GPKG, CSV) - ODbL
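as a rough illustration of how the published layout could be addressed from code, the sketch below joins the Source Coop root with the prefixes listed above. the dictionary keys and function names are made up for this example and are not taken from ocr.catalog. the optional listing helper assumes the bucket allows anonymous reads and that the third-party s3fs package is installed.

```python
# Root and prefixes are copied from the issue text.
SOURCE_COOP_ROOT = "s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr"

# Illustrative keys only; they are not identifiers from ocr.catalog.
PUBLISHED_PREFIXES = {
    "dillon-et-al-2023": "input/fire-risk/tensor/USFS/dillon-et-al-2023/",
    "riley-et-al-2025": "input/fire-risk/tensor/USFS/riley-et-al-2025/",
    "scott-et-al-2024": "input/fire-risk/tensor/USFS/scott-et-al-2024/",
    "conus404-ffwi": "input/fire-risk/tensor/conus404-ffwi/",
    "census-tiger": "input/fire-risk/vector/census-tiger/",
    "overture-maps": "input/fire-risk/vector/overture-maps/",
    "output-tensor": "output/fire-risk/tensor/production/",
    "output-vector": "output/fire-risk/vector/production/",
}


def published_uri(name: str, root: str = SOURCE_COOP_ROOT) -> str:
    """Return the full S3 URI for one of the published prefixes."""
    return f"{root.rstrip('/')}/{PUBLISHED_PREFIXES[name]}"


def list_published(name: str) -> list:
    """List objects under a published prefix.

    Assumption: the bucket permits anonymous access. Needs network and s3fs.
    """
    import s3fs  # imported lazily so the rest of the sketch has no dependencies

    fs = s3fs.S3FileSystem(anon=True)
    return fs.ls(published_uri(name))
```

calling published_uri("census-tiger") gives the full URI for the census vector inputs; list_published is only useful once network access to the bucket is confirmed.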
what needs to change
- update catalog definitions in datasets.py to point to Source Coop paths by default
- update configuration to allow users to easily override catalog locations via environment variables
- document the catalog structure so users understand how to:
  - use the public data
  - point to alternative data sources
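one possible shape for the first two items is sketched below: the catalog root defaults to the Source Coop location and can be overridden through an environment variable. the variable name OCR_CATALOG_ROOT and both function names are hypothetical, chosen only for this example; the actual change to datasets.py may look quite different.

```python
import os

# Default taken from the issue text.
DEFAULT_CATALOG_ROOT = "s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr"

# Hypothetical variable name; the codebase may settle on a different one.
CATALOG_ROOT_ENV_VAR = "OCR_CATALOG_ROOT"


def resolve_catalog_root(environ=None) -> str:
    """Return the catalog root, preferring the environment override if set."""
    env = os.environ if environ is None else environ
    value = env.get(CATALOG_ROOT_ENV_VAR, "").strip()
    # An unset or blank variable falls back to the public default.
    return (value or DEFAULT_CATALOG_ROOT).rstrip("/")


def dataset_location(relative_path: str, environ=None) -> str:
    """Join a catalog-relative path onto the resolved root."""
    return f"{resolve_catalog_root(environ)}/{relative_path.lstrip('/')}"
```

with this layout, a user without any configuration reads from the public bucket, while exporting the variable (for example to a private bucket or a local directory) redirects every dataset at once.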
current workarounds
until this issue is resolved, users can:
- fork the repository and modify datasets.py to point to Source Coop paths
- set OCR_STORAGE_ROOT and related environment variables to reference the user's S3 bucket
- download data locally and configure local paths
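for the last workaround, a minimal sketch of mirroring one published prefix to local disk is shown below. it builds an `aws s3 sync` command with `--no-sign-request`, which assumes the AWS CLI is installed and that the bucket allows anonymous reads. the function names and the local directory are illustrative, and how the pipeline is then pointed at the local copy is not covered here.

```python
import subprocess

# Root copied from the issue text.
SOURCE_COOP_ROOT = "s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr"


def build_sync_command(prefix: str, local_dir: str, root: str = SOURCE_COOP_ROOT) -> list:
    """Build the AWS CLI argument list that mirrors one prefix to local_dir."""
    source = f"{root.rstrip('/')}/{prefix.strip('/')}/"
    # --no-sign-request skips credential signing (assumes anonymous access works).
    return ["aws", "s3", "sync", source, local_dir, "--no-sign-request"]


def mirror(prefix: str, local_dir: str) -> None:
    """Run the sync; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(build_sync_command(prefix, local_dir), check=True)
```

for example, mirror("input/fire-risk/vector/census-tiger/", "./data/census-tiger") would copy the census vector inputs; some of the tensor prefixes may be large, so it is worth checking sizes before syncing.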