subtitle: These slides are a summarization of [Cloud-Optimized Geospatial Formats Guide](https://guide.cloudnativegeo.org/) to support presentations.
author:
- - "Authors + Credits: Aimee Barciauskas, Alex Mandel, Brianna Pagán, Vincent Sarago, Chris Holmes, Patrick Quinn, Matt Hanson, Ryan Abernathey"
+ - "Authors + Credits: Aimee Barciauskas, Alex Mandel, Brianna Pagán, Vincent Sarago, Chris Holmes, Patrick Quinn, Matt Hanson, Ryan Abernathey, Max Jones"
format:
revealjs:
incremental: true
theme: [default, custom.scss]
---

- ::: {.notes}
- These slides were generated with https://quarto.org/docs/presentations/revealjs.
Google Slides version of this content: [Cloud-Optimized Geospatial Formats](https://docs.google.com/presentation/d/1F89kcrtX9LNQPTOuwyL5FRex_8--Vlg-DA8GJNzWqGk/edit?usp=sharing).

+ ::: {.notes}
+ These slides were generated with https://quarto.org/docs/presentations/revealjs.
Based on <a href="https://ui.josiahparry.com/spatial-analysis.html#types-of-spatial-data">ui.josiahparry.com/spatial-analysis.html</a>
+ :::

# What Makes Cloud-optimized Challenging?

@@ -62,7 +65,7 @@ File formats are read-oriented to support:
## What Does Cloud-Optimized Mean?

* File metadata in one read
- * When accessing data over the internet, such as when data is in cloud storage, latency is high when compared with local storage so it is preferable to fetch lots of data in fewer reads.
+ * When accessing data over the internet, such as when data is in cloud storage, latency is high when compared with local storage so it is preferable to fetch lots of data in fewer reads. The preceding requirement of fetching all metadata in one read means that data reads can happen concurrently.
* An easy win is metadata in one read, which can be used to read a cloud-native dataset.
* A cloud-native dataset is one with small addressable chunks via files, internal tiles, or both.

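The access pattern described in this hunk (one read for the metadata, then concurrent reads for the chunks) can be sketched with the standard library alone. This is a minimal illustration, not any real format: the file layout (a length-prefixed JSON index followed by chunk bytes) and the names `build_object`, `range_read`, `read_index`, and `read_chunks` are all hypothetical, and `range_read` stands in for an HTTP range request against object storage.

```python
import json
import struct
from concurrent.futures import ThreadPoolExecutor


def build_object(chunks):
    """Pack chunks behind a JSON index (a made-up layout, for illustration only)."""
    offsets, pos = [], 0
    for chunk in chunks:
        offsets.append((pos, len(chunk)))
        pos += len(chunk)
    index = json.dumps(offsets).encode()
    # 4-byte little-endian index length, then the index, then the chunk bytes.
    return struct.pack("<I", len(index)) + index + b"".join(chunks)


def range_read(blob, start, length):
    """Stand-in for a byte-range request; a real client would issue an HTTP GET."""
    return blob[start:start + length]


def read_index(blob, first_read=1024):
    """Fetch all metadata in one read (assumes the index fits in `first_read` bytes)."""
    head = range_read(blob, 0, first_read)
    (index_len,) = struct.unpack("<I", head[:4])
    offsets = json.loads(head[4:4 + index_len])
    return offsets, 4 + index_len


def read_chunks(blob, wanted):
    """Use the index to fetch only the wanted chunks, concurrently."""
    offsets, base = read_index(blob)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(range_read, blob, base + offsets[i][0], offsets[i][1])
            for i in wanted
        ]
        return [f.result() for f in futures]


blob = build_object([b"aaaa", b"bb", b"cccccc", b"d"])
selected = read_chunks(blob, [2, 0])
```

Because every chunk offset is known after the first read, the chunk requests do not depend on each other and can be issued in parallel.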
@@ -84,7 +87,7 @@ File formats are read-oriented to support:

::::

- ::: aside
+ ::: footer
image credit: Ryan Abernathey
:::

@@ -93,7 +96,7 @@ image credit: Ryan Abernathey
| Format | Data Type | Standard Status |
|:--------|:-----------|:-----------------|
| Cloud-Optimized GeoTIFF (COG) | Raster | OGC standard for comment |
- | Zarr, Kerchunk | Multi-dimensional raster | ESDIS and OGC standards in development |
+ | Zarr, Kerchunk, Icechunk| Multi-dimensional raster | ESDIS and OGC standards in development |
| Cloud-Optimized Point Cloud (COPC), Entwine Point Tiles (EPT) | Point Clouds*| no known ESDIS or OGC standard |
| FlatGeobuf, GeoParquet, | Vector | no known ESDIS, draft OGC standard |

@@ -102,7 +105,7 @@ image credit: Ryan Abernathey
| Format | Adoption | Standard Status |
|:--------|:---------| :-----------------|
| Cloud-Optimized GeoTIFF (COG) | Widely adopted | OGC standard for comment |
- | Zarr, Kerchunk | (Less) widely adopted, especially in specific communities | ESDIS and OGC standards in development |
+ | Zarr, Kerchunk, Icechunk| (Less) widely adopted, especially in specific communities | ESDIS and OGC standards in development |
| Entwine Point Tiles (EPT), Cloud-Optimized Point Cloud (COPC) | Less common (PDAL Supported) | no known ESDIS or OGC standard |
| GeoParquet, FlatGeobuf | Less common (OGR Supported) | no known ESDIS, draft OGC standard |
*Kerchunk is a way to create Zarr metadata for archival formats, so that you can leverage the benefits of partial and parallel reads for archives in NetCDF4, HDF5, GRIB2, TIFF and FITS.
+ *Virtual Zarr stores include metadata along with references to data in archival file formats, such that you can leverage the benefits of partial and parallel reads for archives in NetCDF4, HDF5, GRIB2, TIFF and FITS. Kerchunk and Icechunk provide ways to persist Virtual Zarr stores on disk or in object stores.
* A cross-organization working group has just formed to establish a GeoZarr standards working group, organized by Brianna Pagán (NASA) and includes representatives from many other orgs in the industry.
- * The GeoZarr spec defines conventions for how geospatial data should be organized in a Zarr store. The spec details how the Zarr DataArray and DataSet metadata, and subsequent organization of data, must be in order to be conformant as GeoZarr archive.
- * There is a proposal for Zarr v3 which will address challenges in language support, and storage organization to address the issues of high-latency reads and volume of reads for the many objects stored.
- * There is recent work on a parquet alternative to JSON for indexing.
+ * V2 is widely adopted. Zarr V3 was recently released.
+ * Zarr V3 provides additional extension mechanisms and sharding, which allows high concurrency while minimizing the number of files.
+ * Zarr Python 3 supports reading and writing to Zarr format V3 while still supporting Zarr V2, with dramatic performance improvements.
+ * Brianna Pagán (formerly NASA, currently Development Seed) has organized the OGC GeoZarr standards working group (SWG) to establish standards for geospatial metadata. The SWG includes representatives from many other orgs in the industry.
+ * The GeoZarr spec defines conventions for organizing geospatial data in a Zarr store. Specifically, the spec defines conventions for Zarr DataArray and DataSet metadata and organization of associated data to be conformant as GeoZarr archive. The spec remains in development.
+ * Icechunk provides features including data version control and serializable isolation on top of Zarr. Icechunk is still under rapid development and is moving towards a v1.0 release.
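The sharding point added in this hunk (many small chunks for concurrency, few stored objects) comes down to simple arithmetic. The sketch below only counts grid cells; it does not call the Zarr library, and the array, chunk, and shard shapes are made-up values chosen for illustration.

```python
from math import ceil, prod


def grid_count(shape, block):
    """Number of blocks of size `block` needed to tile an array of size `shape`."""
    return prod(ceil(s / b) for s, b in zip(shape, block))


# Hypothetical sizes, chosen only to make the numbers easy to follow.
shape = (10_000, 10_000)   # full array
chunk = (100, 100)         # smallest independently readable unit
shard = (1_000, 1_000)     # one stored object holding many chunks

n_chunks = grid_count(shape, chunk)          # objects to store without sharding
n_shards = grid_count(shape, shard)          # objects to store with sharding
chunks_per_shard = grid_count(shard, chunk)  # chunks addressable inside one shard
```

With these assumed sizes, sharding reduces the stored object count by a factor of `chunks_per_shard` while each chunk remains separately addressable within its shard.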
* Vector data is traditionally stored as rows representing points, lines, or polygons with an attribute table
+ * Vector data is traditionally stored as rows representing points, lines, or polygons with an attribute table.
* GeoParquet defines how to store vector data in Apache Parquet, which is a columnar storage format (like many cloud data warehouses). “Give me all points with height greater than 10m”.
- *Highly compressed
- *Single-file or multi-file
- *Recent support added to geopandas as a distinct function, R support with geoarrow
- *Potential for cross language in-memory shared access
- * Specifications for spatial-indexing, projection handling, etc. are still in discussion
+ *GeoParquet is highly compressed.
+ *GeoParquet can be stored in a single- or multi-file store.
+ *GeoParquet support exists in GeoPandas and is supported in R with GeoArrow.
+ *GeoArrow provides the potential for cross language in-memory shared access.
+ * Specifications for spatial-indexing, projection handling, etc. are still in discussion.
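The row-versus-column contrast in this hunk, and the example query "Give me all points with height greater than 10m", can be illustrated in plain Python. This is only a sketch of the layout idea: it uses no Parquet or GeoParquet library, and the field names and values are invented for the example.

```python
# Row-oriented layout: one record per feature (hypothetical data).
rows = [
    {"id": 1, "height_m": 4.0, "x": -105.0, "y": 40.0},
    {"id": 2, "height_m": 12.5, "x": -104.9, "y": 40.1},
    {"id": 3, "height_m": 30.0, "x": -104.8, "y": 40.2},
]

# Column-oriented layout: one list per attribute, as a columnar format stores it.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# The filter touches only the height column; other columns are read
# afterwards, and only at the matching positions.
tall = [i for i, h in enumerate(columns["height_m"]) if h > 10]
tall_points = [(columns["x"][i], columns["y"][i]) for i in tall]
```

In a columnar file, that first step means a reader can skip the bytes of every column the query does not mention.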