Commit abf79e1

Update Zarr content in slides and tweak formatting (#160)
1 parent 3f3afdd commit abf79e1

3 files changed

+38
-37
lines changed

images/data-structures.png

216 KB

images/virtual-zarr.png

242 KB
Loading

overview.qmd

Lines changed: 38 additions & 37 deletions
@@ -2,26 +2,27 @@
22
title: Cloud-Optimized Geospatial Formats Overview
33
subtitle: These slides are a summarization of [Cloud-Optimized Geospatial Formats Guide](https://guide.cloudnativegeo.org/) to support presentations.
44
author:
5-
- "Authors + Credits: Aimee Barciauskas, Alex Mandel, Brianna Pagán, Vincent Sarago, Chris Holmes, Patrick Quinn, Matt Hanson, Ryan Abernathey"
5+
- "Authors + Credits: Aimee Barciauskas, Alex Mandel, Brianna Pagán, Vincent Sarago, Chris Holmes, Patrick Quinn, Matt Hanson, Ryan Abernathey, Max Jones"
66
format:
77
revealjs:
88
incremental: true
99
theme: [default, custom.scss]
1010
---
1111

12-
::: {.notes}
13-
These slides were generated with https://quarto.org/docs/presentations/revealjs.
14-
Source: https://github.com/cloudnativegeo/cloud-optimized-geospatial-formats-guide.
15-
:::
1612

1713
# Cloud-Optimized Geospatial Formats Overview
1814

1915
Google Slides version of this content: [Cloud-Optimized Geospatial Formats](https://docs.google.com/presentation/d/1F89kcrtX9LNQPTOuwyL5FRex_8--Vlg-DA8GJNzWqGk/edit?usp=sharing).
2016

17+
::: {.notes}
18+
These slides were generated with https://quarto.org/docs/presentations/revealjs.
19+
Source: https://github.com/cloudnativegeo/cloud-optimized-geospatial-formats-guide.
20+
:::
21+
2122
::: {.incremental}
2223
# What Makes Cloud-Optimized Challenging?
2324

24-
* No one size fits all approach
25+
* There is no one size fits all approach.
2526
* Earth observation data may be processed into raster, vector and point cloud data types and stored in a long list of data formats and structures.
2627
* Optimization depends on the user.
2728
* Users must learn new tools and which data is accessed and how may differ depending on the user.
@@ -30,9 +31,11 @@ Google Slides version of this content: [Cloud-Optimized Geospatial Formats](http
3031

3132
# What Makes Cloud-optimized Challenging?
3233

33-
![](./images/2019-points-lines-polygons.png)
34+
![](./images/data-structures.png)
3435

35-
image source: <a href="https://ui.josiahparry.com/spatial-analysis.html#types-of-spatial-data">ui.josiahparry.com/spatial-analysis.html</a>
36+
::: footer
37+
Based on <a href="https://ui.josiahparry.com/spatial-analysis.html#types-of-spatial-data">ui.josiahparry.com/spatial-analysis.html</a>
38+
:::
3639

3740
# What Makes Cloud-optimized Challenging?
3841

@@ -62,7 +65,7 @@ File formats are read-oriented to support:
6265
## What Does Cloud-Optimized Mean?
6366

6467
* File metadata in one read
65-
* When accessing data over the internet, such as when data is in cloud storage, latency is high when compared with local storage so it is preferable to fetch lots of data in fewer reads.
68+
* When accessing data over the internet, such as when data is in cloud storage, latency is high when compared with local storage so it is preferable to fetch lots of data in fewer reads. The preceding requirement of fetching all metadata in one read means that data reads can happen concurrently.
6669
* An easy win is metadata in one read, which can be used to read a cloud-native dataset.
6770
* A cloud-native dataset is one with small addressable chunks via files, internal tiles, or both.
6871

@@ -84,7 +87,7 @@ File formats are read-oriented to support:
8487

8588
::::
8689

87-
::: aside
90+
::: footer
8891
image credit: Ryan Abernathey
8992
:::
9093

@@ -93,7 +96,7 @@ image credit: Ryan Abernathey
9396
| Format | Data Type | Standard Status |
9497
|:--------|:-----------|:-----------------|
9598
| Cloud-Optimized GeoTIFF (COG) | Raster | OGC standard for comment |
96-
| Zarr, Kerchunk | Multi-dimensional raster | ESDIS and OGC standards in development |
99+
| Zarr, Kerchunk, Icechunk | Multi-dimensional raster | ESDIS and OGC standards in development |
97100
| Cloud-Optimized Point Cloud (COPC), Entwine Point Tiles (EPT) | Point Clouds* | no known ESDIS or OGC standard |
98101
| FlatGeobuf, GeoParquet, | Vector | no known ESDIS, draft OGC standard |
99102

@@ -102,7 +105,7 @@ image credit: Ryan Abernathey
102105
| Format | Adoption | Standard Status |
103106
|:--------|:---------| :-----------------|
104107
| Cloud-Optimized GeoTIFF (COG) | Widely adopted | OGC standard for comment |
105-
| Zarr, Kerchunk | (Less) widely adopted, especially in specific communities | ESDIS and OGC standards in development |
108+
| Zarr, Kerchunk, Icechunk | (Less) widely adopted, especially in specific communities | ESDIS and OGC standards in development |
106109
| Entwine Point Tiles (EPT), Cloud-Optimized Point Cloud (COPC) | Less common (PDAL Supported) | no known ESDIS or OGC standard |
107110
| GeoParquet, FlatGeobuf | Less common (OGR Supported) | no known ESDIS, draft OGC standard |
108111

@@ -122,7 +125,7 @@ image credit: Ryan Abernathey
122125

123126
::::
124127

125-
::: aside
128+
::: footer
126129
image source: https://www.kitware.com/deciphering-cloud-optimized-geotiffs/
127130
:::
128131

@@ -142,7 +145,7 @@ image source: https://www.kitware.com/deciphering-cloud-optimized-geotiffs/
142145

143146
::::
144147

145-
::: aside
148+
::: footer
146149
image source: https://medium.com/devseed/cog-talk-part-1-whats-new-941facbcd3d1
147150
:::
148151

@@ -162,43 +165,41 @@ image source: https://medium.com/devseed/cog-talk-part-1-whats-new-941facbcd3d1
162165

163166
::::
164167

165-
::: aside
168+
::: footer
166169
image source: https://xarray.dev/
167170
:::
168171

169-
# What is Kerchunk?
172+
# What is Virtual Zarr?
170173

171-
* Kerchunk is a way to create Zarr metadata for archival formats, so that you can leverage the benefits of partial and parallel reads for archives in NetCDF4, HDF5, GRIB2, TIFF and FITS.
174+
* Virtual Zarr stores include metadata along with references to data in archival file formats, such that you can leverage the benefits of partial and parallel reads for archives in NetCDF4, HDF5, GRIB2, TIFF and FITS. Kerchunk and Icechunk provide ways to persist Virtual Zarr stores on disk or in object stores.
172175

173176
. . .
174177

175-
<img src="./images/multi_refs.png" style="margin: 0px auto; display: block; width:700px;"/>
178+
<img src="./images/virtual-zarr.png" style="margin: 0px auto; display: block; width:700px;"/>
176179

177-
::: aside
178-
image source: https://fsspec.github.io/kerchunk/detail.html
179-
:::
180180

181181
## Zarr Specs in Development
182182

183-
* V2 and older specs exist, however,
184-
* A cross-organization working group has just formed to establish a GeoZarr standards working group, organized by Brianna Pagán (NASA) and includes representatives from many other orgs in the industry.
185-
* The GeoZarr spec defines conventions for how geospatial data should be organized in a Zarr store. The spec details how the Zarr DataArray and DataSet metadata, and subsequent organization of data, must be in order to be conformant as GeoZarr archive.
186-
* There is a proposal for Zarr v3 which will address challenges in language support, and storage organization to address the issues of high-latency reads and volume of reads for the many objects stored.
187-
* There is recent work on a parquet alternative to JSON for indexing.
183+
* V2 is widely adopted. Zarr V3 was recently released.
184+
* Zarr V3 provides additional extension mechanisms and sharding, which allows high concurrency while minimizing the number of files.
185+
* Zarr Python 3 supports reading and writing to Zarr format V3 while still supporting Zarr V2, with dramatic performance improvements.
186+
* Brianna Pagán (formerly NASA, currently Development Seed) has organized the OGC GeoZarr standards working group (SWG) to establish standards for geospatial metadata. The SWG includes representatives from many other orgs in the industry.
187+
* The GeoZarr spec defines conventions for organizing geospatial data in a Zarr store. Specifically, the spec defines conventions for Zarr DataArray and DataSet metadata and organization of associated data to be conformant as GeoZarr archive. The spec remains in development.
188+
* Icechunk provides features including data version control and serializable isolation on top of Zarr. Icechunk is still under rapid development and is moving towards a v1.0 release.
188189

189190
## COPC (Cloud-Optimized Point Clouds)
190191

191192
<img src="./images/copc-vlr-chunk-table-illustration.png" style="margin: 0px auto; display: block; width:900px;"/>
192193

193-
::: aside
194+
::: footer
194195
image source: https://copc.io/
195196
:::
196197

197198
* Point clouds are a set of data points in space, such as gathered from LiDAR measurements.
198199
* COPC is a valid LAZ file.
199200
* Similar to COGs but for point clouds: COPC is just one file, but data is reorganized into a clustered octree instead of regularly gridded overviews.
200201
* 2 key features:
201-
* Support for partial decompression via storage of data in a series of chunks
202+
* Support for partial decompression via storage of data in a series of chunks.
202203
* Variable-length records (VLRs) can store application-specific metadata of any kind. VLRs describe the octree structure.
203204
* Limitation: Not all attribute types are compatible.
204205

@@ -222,7 +223,7 @@ image source: https://copc.io/
222223

223224
::::
224225

225-
::: aside
226+
::: footer
226227
image source: https://worace.works/2022/02/23/kicking-the-tires-flatgeobuf/
227228
:::
228229

@@ -236,21 +237,21 @@ image source: https://worace.works/2022/02/23/kicking-the-tires-flatgeobuf/
236237

237238
::: {.column width="50%"}
238239

239-
* Vector data is traditionally stored as rows representing points, lines, or polygons with an attribute table
240+
* Vector data is traditionally stored as rows representing points, lines, or polygons with an attribute table.
240241
* GeoParquet defines how to store vector data in Apache Parquet, which is a columnar storage format (like many cloud data warehouses). “Give me all points with height greater than 10m”.
241-
* Highly compressed
242-
* Single-file or multi-file
243-
* Recent support added to geopandas as a distinct function, R support with geoarrow
244-
* Potential for cross language in-memory shared access
245-
* Specifications for spatial-indexing, projection handling, etc. are still in discussion
242+
* GeoParquet is highly compressed.
243+
* GeoParquet can be stored in a single- or multi-file store.
244+
* GeoParquet support exists in GeoPandas and is supported in R with GeoArrow.
245+
* GeoArrow provides the potential for cross language in-memory shared access.
246+
* Specifications for spatial-indexing, projection handling, etc. are still in discussion.
246247
* Learn more: [https://github.com/opengeospatial/geoparquet](https://github.com/opengeospatial/geoparquet)
247248
:::
248249

249250
::::
250251

251252
<br />
252253

253-
::: aside
254+
::: footer
254255
image source: https://www.wherobots.ai/post/spatial-data-parquet-and-apache-sedona
255256
:::
256257

@@ -260,7 +261,7 @@ image source: https://www.wherobots.ai/post/spatial-data-parquet-and-apache-sedo
260261

261262
## Not Quite
262263

263-
* These formats and their tooling are in active development
264+
* These formats and their tooling are in active development.
264265
* Some formats were not mentioned, such as EPT, geopkg, tiledb, Cloud-Optimized HDF5. This presentation was scoped to those known best by the authors.
265266
* This site will continue to be updated with new content.
266267
