use public data sources instead of private catalog #367

@andersy005

the OCR codebase currently uses a data catalog (ocr.catalog) that points to private S3 buckets. this prevents external users from running the pipeline without access to CarbonPlan's private buckets.

we have now published some of the input and output datasets to Source Coop under open licences. however, the codebase has not been updated to reference this public bucket by default.

published datasets

the data is available on Source Coop under s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr

input data

  • fire risk tensor data (Icechunk & TIFF):

    • input/fire-risk/tensor/USFS/dillon-et-al-2023/
    • input/fire-risk/tensor/USFS/riley-et-al-2025/
    • input/fire-risk/tensor/USFS/scott-et-al-2024/
    • input/fire-risk/tensor/conus404-ffwi/
  • vector data (GeoParquet):

    • input/fire-risk/vector/census-tiger/ (blocks, counties, tracts) - CC BY 4.0
    • input/fire-risk/vector/overture-maps/ - ODbL

output data (versioned)

  • tensor: output/fire-risk/tensor/production/ (Icechunk) - CC BY 4.0
  • vector: output/fire-risk/vector/production/ (GeoParquet, PMTiles, GPKG, CSV) - ODbL
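
as a rough illustration of how the listing above maps to full URIs, here is a minimal sketch. the short keys, the `PublishedDataset` class and the `uri_for` helper are hypothetical names made up for this example and are not part of ocr.catalog; only the bucket root, the relative paths and the licences come from this issue. licences are left as `None` where the issue does not state one.

```python
from dataclasses import dataclass
from typing import Optional

# root of the public bucket, as stated in this issue
SOURCE_COOP_ROOT = "s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr"


@dataclass(frozen=True)
class PublishedDataset:
    """One published dataset (hypothetical structure, for illustration only)."""

    path: str  # path relative to SOURCE_COOP_ROOT, copied from the issue
    license: Optional[str] = None  # None where the issue does not state a licence


# hypothetical short keys -> datasets listed in this issue
PUBLISHED = {
    "dillon-et-al-2023": PublishedDataset("input/fire-risk/tensor/USFS/dillon-et-al-2023/"),
    "riley-et-al-2025": PublishedDataset("input/fire-risk/tensor/USFS/riley-et-al-2025/"),
    "scott-et-al-2024": PublishedDataset("input/fire-risk/tensor/USFS/scott-et-al-2024/"),
    "conus404-ffwi": PublishedDataset("input/fire-risk/tensor/conus404-ffwi/"),
    "census-tiger": PublishedDataset("input/fire-risk/vector/census-tiger/", "CC BY 4.0"),
    "overture-maps": PublishedDataset("input/fire-risk/vector/overture-maps/", "ODbL"),
    "output-tensor": PublishedDataset("output/fire-risk/tensor/production/", "CC BY 4.0"),
    "output-vector": PublishedDataset("output/fire-risk/vector/production/", "ODbL"),
}


def uri_for(key: str, root: str = SOURCE_COOP_ROOT) -> str:
    """Join the bucket root and the relative path of a published dataset."""
    dataset = PUBLISHED[key]
    return f"{root.rstrip('/')}/{dataset.path.lstrip('/')}"


if __name__ == "__main__":
    for key in PUBLISHED:
        print(key, "->", uri_for(key))
```

the `root` argument is there so the same relative paths can be resolved against a different bucket or prefix.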

what needs to change

  • update catalog definitions in datasets.py to point to Source Coop paths by default
  • update configuration to allow users to easily override catalog locations via environment variables
  • document the catalog structure so users understand how to:
    • use the public data
    • point to alternative data sources
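
one possible shape for the first two items, sketched under assumptions: the public Source Coop root becomes the default, and OCR_STORAGE_ROOT (the variable named in the workarounds below) overrides it when set. `resolve_storage_root` and `dataset_uri` are hypothetical helpers written for this example; how datasets.py actually reads its configuration is not shown in this issue.

```python
import os
from typing import Mapping, Optional

# proposed default: the public Source Coop bucket from this issue
DEFAULT_STORAGE_ROOT = "s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr"


def resolve_storage_root(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the catalog root, preferring OCR_STORAGE_ROOT when it is set.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    override = env.get("OCR_STORAGE_ROOT", "").strip()
    # an empty or unset variable falls back to the public default
    return (override or DEFAULT_STORAGE_ROOT).rstrip("/")


def dataset_uri(relative_path: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve a catalog-relative path against the active storage root."""
    return f"{resolve_storage_root(env)}/{relative_path.lstrip('/')}"


if __name__ == "__main__":
    # with no override, paths resolve to the public bucket
    print(dataset_uri("input/fire-risk/vector/census-tiger/", env={}))
    # with an override, the same relative path resolves to the user's bucket
    print(dataset_uri("input/fire-risk/vector/census-tiger/",
                      env={"OCR_STORAGE_ROOT": "s3://my-bucket/ocr"}))
```

keeping the relative paths fixed and swapping only the root means a user who mirrors the public layout needs to set a single variable.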

current workarounds

until this issue is resolved, users can:

  • fork the repository and modify datasets.py to point to the Source Coop paths
  • set OCR_STORAGE_ROOT and related environment variables to reference their own S3 bucket
  • download data locally and configure local paths
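
for the local-download route, a small sketch of how a public URI could be mapped onto a local mirror that keeps the same directory layout. `local_path_for` and the `/data/ocr` mirror directory are hypothetical, and the sketch does not download anything or touch ocr's own configuration; it only computes where a mirrored copy would live.

```python
from pathlib import Path, PurePosixPath

# public bucket root, as stated in this issue
PUBLIC_ROOT = "s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr"


def local_path_for(uri: str, mirror_root: str, public_root: str = PUBLIC_ROOT) -> Path:
    """Map a URI under the public root to the matching path in a local mirror."""
    prefix = public_root.rstrip("/") + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"{uri!r} is not under {public_root!r}")
    # split the key on "/" regardless of the local OS path separator
    relative = PurePosixPath(uri[len(prefix):])
    return Path(mirror_root).joinpath(*relative.parts)


if __name__ == "__main__":
    uri = f"{PUBLIC_ROOT}/input/fire-risk/vector/census-tiger/"
    print(local_path_for(uri, "/data/ocr"))
```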

Labels: enhancement (New feature or request)
