use public data sources instead of private catalog

the OCR codebase currently uses a data catalog (`ocr.catalog`) that points to private S3 buckets. this prevents external users from running the pipeline without access to CarbonPlan's private buckets. 

we have now published some of the input and output datasets to [Source Coop](https://source.coop/carbonplan/carbonplan-ocr) under open licences. however, the codebase has not yet been updated to reference this public bucket by default.


## published datasets 

the data is available on Source Coop under [`s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr`](https://source.coop/carbonplan/carbonplan-ocr). all paths below are relative to this root.

### input data

- fire risk tensor data (Icechunk & TIFF):
   - `input/fire-risk/tensor/USFS/dillon-et-al-2023/`
   - `input/fire-risk/tensor/USFS/riley-et-al-2025/`
   - `input/fire-risk/tensor/USFS/scott-et-al-2024/`
   - `input/fire-risk/tensor/conus404-ffwi/` 


- vector data (GeoParquet):
   - `input/fire-risk/vector/census-tiger/` (blocks, counties, tracts) - CC BY 4.0
   - `input/fire-risk/vector/overture-maps/` - ODbL

### output data (versioned)

- tensor: `output/fire-risk/tensor/production/` (Icechunk) - CC BY 4.0
- vector: `output/fire-risk/vector/production/` (GeoParquet, PMTiles, GPKG, CSV) - ODbL
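
to show how the prefixes listed here combine with the Source Coop root, here is a minimal sketch in Python. `PUBLIC_ROOT` and `dataset_uri` are hypothetical names used only for illustration; they are not part of the `ocr` package.

```python
# Root of the public bucket, as listed on Source Coop.
PUBLIC_ROOT = "s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr"


def dataset_uri(relative_prefix: str, root: str = PUBLIC_ROOT) -> str:
    """Join a storage root and a relative dataset prefix into a full URI.

    Hypothetical helper: it only concatenates strings and does not
    check that the location exists.
    """
    return f"{root.rstrip('/')}/{relative_prefix.lstrip('/')}"


# Full URI for the Census TIGER vector inputs.
print(dataset_uri("input/fire-risk/vector/census-tiger/"))
```

keeping the root separate from the relative prefix means the same prefixes could be reused against a different bucket or a local directory.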

## what needs to change

- update catalog definitions in [datasets.py](https://github.com/carbonplan/ocr/blob/588a94a1f28ed56a9efb4ee4dc994e88379e45fa/ocr/datasets.py#L440-L642) to point to Source Coop paths by default
- update configuration to allow users to easily override catalog locations via environment variables
- document the catalog structure so users understand how to:
    - use the public data
    - point to alternative data sources

## current workarounds

until this issue is resolved, users can:

- fork the repository and modify [datasets.py](https://github.com/carbonplan/ocr/blob/588a94a1f28ed56a9efb4ee4dc994e88379e45fa/ocr/datasets.py#L440-L642) to point to Source Coop paths
- set `OCR_STORAGE_ROOT` and related environment variables to reference their own S3 bucket
- download data locally and configure local paths


