Merged
Changes from 4 commits
4 changes: 2 additions & 2 deletions flatgeobuf/environment.yml
@@ -3,8 +3,8 @@ channels:
   - conda-forge
 dependencies:
   - python=3.11
-  - geopandas==0.13.2
-  - pyogrio==0.6.0
+  - geopandas==1.0.1
+  - pyogrio==0.11.0
   - ipykernel
   - jupyterlab
   - pyarrow
86 changes: 46 additions & 40 deletions flatgeobuf/flatgeobuf.ipynb
@@ -15,15 +15,13 @@
"source": [
"The primary way to interact with FlatGeobuf in Python is via bindings to GDAL, as there is no pure-Python implementation of FlatGeobuf.\n",
"\n",
-"There are two different Python libraries for interacting between Python and GDAL's vector support: `fiona` and `pyogrio`. Both of these are integrated into [`geopandas.read_file`](https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html) via the `engine` keyword, but `pyogrio` is much faster. Set `engine=\"pyogrio\"` when using `read_file` or [`GeoDataFrame.to_file`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_file.html) to speed up reading and writing significantly. We also suggest passing `use_arrow=True` when reading for a slight extra speedup (this is not supported when writing).\n",
+"There are two different Python libraries for interacting between Python and GDAL's vector support: `fiona` and `pyogrio`. Both of these are integrated into [`geopandas.read_file`](https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html) via the `engine` keyword, but `pyogrio` is much faster. From `geopandas` version 1.0.0, the default is `pyogrio`. For prior `geopandas` versions, set `engine=\"pyogrio\"` when using `read_file` or [`GeoDataFrame.to_file`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_file.html) to speed up reading and writing significantly. We also suggest passing `use_arrow=True` when reading for a slight extra speedup (this is not supported when writing).\n",
"\n",
"::: {.callout-note}\n",
"\n",
-"[`fiona`](https://github.com/Toblerity/Fiona) is the default engine for `geopandas.read_file`. It provides full-featured bindings to GDAL but does not implement _vectorized_ operations. [Vectorization](https://wesmckinney.com/book/numpy-basics#ndarray_binops) refers to operating on whole arrays of data at once rather than operating on individual values using a Python for loop. `fiona`'s non-vectorized approach means that each row of the source file is read individually with Python, and a Python for loop. In contrast, [`pyogrio`](https://github.com/geopandas/pyogrio)'s vectorized implementation reads all rows in C before passing the data to Python, allowing it to achieve vast speedups (up to 40x) over `fiona`.\n",
+"[`fiona`](https://github.com/Toblerity/Fiona) was the default engine for `geopandas.read_file` prior to version 1.0.0. It provides full-featured bindings to GDAL but does not implement _vectorized_ operations. [Vectorization](https://wesmckinney.com/book/numpy-basics#ndarray_binops) refers to operating on whole arrays of data at once rather than operating on individual values using a Python for loop. `fiona`'s non-vectorized approach means that each row of the source file is read individually with Python, and a Python for loop. In contrast, [`pyogrio`](https://github.com/geopandas/pyogrio)'s vectorized implementation reads all rows in C before passing the data to Python, allowing it to achieve vast speedups (up to 40x) over `fiona`.\n",
"\n",
-"You can opt in to using `pyogrio` with `geopandas.read_file` by passing `engine=\"pyogrio\"`.\n",
-"\n",
-"Additionally, if you're using GDAL version 3.6 or higher (usually the case when using pyogrio), you can pass `use_arrow=True` to `geopandas.read_file` to use `pyogrio`'s support for [GDAL's RFC 86](https://gdal.org/development/rfc/rfc86_column_oriented_api.html), which speeds up data reading even more.\n",
+"If you're using GDAL version 3.6 or higher (usually the case when using pyogrio), you can pass `use_arrow=True` to `geopandas.read_file` to use `pyogrio`'s support for [GDAL's RFC 86](https://gdal.org/development/rfc/rfc86_column_oriented_api.html), which speeds up data reading even more.\n",
"\n",
":::"
]
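The version-dependent default described in this hunk can be sketched as a small helper. This is an illustration only: `read_file_kwargs` is a hypothetical name, not part of geopandas, and the sketch assumes `pyogrio` is installed so that geopandas 1.0.0 and later actually pick it as the default engine.

```python
def read_file_kwargs(geopandas_version: str, use_arrow: bool = True) -> dict:
    """Build keyword arguments for geopandas.read_file (hypothetical helper).

    Assumption: for geopandas < 1.0.0 the engine must be requested
    explicitly, while 1.0.0+ already defaults to pyogrio when it is installed.
    """
    major = int(geopandas_version.split(".")[0])
    kwargs = {}
    if major < 1:
        # Older geopandas defaults to fiona, so opt in to pyogrio.
        kwargs["engine"] = "pyogrio"
    if use_arrow:
        # Only meaningful for reading; the notebook text says it is not supported when writing.
        kwargs["use_arrow"] = True
    return kwargs


# Intended use (requires geopandas and pyogrio, so it is left as a comment):
#   import geopandas as gpd
#   gdf = gpd.read_file(path, **read_file_kwargs(gpd.__version__))
```

With the versions pinned in this PR the helper adds only `use_arrow=True`; with the previous pins it also adds `engine="pyogrio"`.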
@@ -60,7 +58,13 @@
"Alternatively, you can install the versions of `pyogrio` and `geopandas` used in this notebook with pip:\n",
"\n",
"```bash\n",
-"pip install pyogrio==0.6.0 geopandas==0.13.2\n",
+"pip install pyogrio==0.11.0 geopandas==1.0.1\n",
+"```\n",
+"\n",
+"Additionally, to reproduce the comparisons between `fiona` and `pyogrio` below, you would need to install `fiona` separately as well, but if you prefer to only run the (faster) `pyogrio` examples, you can skip the `fiona` install and examples altogether.\n",
+"\n",
+"```bash\n",
+"pip install fiona==1.10.1\n",
"```"
]
},
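One way to confirm that an environment matches the pins in this hunk is to compare installed versions against them with the standard library. The sketch below is a suggestion rather than part of the notebook; `PINS`, `installed_version`, and `check_pins` are hypothetical names, and the `fiona` entry only matters if you plan to run the comparison cells.

```python
from importlib import metadata

# Versions pinned by this PR (fiona is optional per the notebook text).
PINS = {"geopandas": "1.0.1", "pyogrio": "0.11.0", "fiona": "1.10.1"}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (pinned, found)} for every package that does not match."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `check_pins(PINS)` returns an empty dict when every pinned package is installed at the expected version.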
@@ -115,7 +119,7 @@
"source": [
"In each of the cases below, we use `geopandas.read_file` to read the file into a `GeoDataFrame`.\n",
"\n",
-"First we'll show that reading this file with `engine=\"fiona\"` (the default) is slower. Taking an extra 500 milliseconds might not seem like a lot, but this file contains only 3,000 rows, so this difference gets magnified with larger files."
+"First we'll show that reading this file with `engine=\"fiona\"` is slower. Taking an extra 500 milliseconds might not seem like a lot, but this file contains only 3,000 rows, so this difference gets magnified with larger files."
]
},
{
@@ -127,8 +131,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
-"CPU times: user 565 ms, sys: 16.9 ms, total: 582 ms\n",
-"Wall time: 685 ms\n"
+"CPU times: user 1.85 s, sys: 34.7 ms, total: 1.88 s\n",
+"Wall time: 2.35 s\n"
]
}
],
@@ -140,7 +144,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Passing `engine=\"pyogrio\"` speeds up loading by 18x here!"
+"Using the (since version 1.0.0 default) `engine=\"pyogrio\"` speeds up loading by 22x here!"
]
},
{
@@ -152,8 +156,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
-"CPU times: user 25.3 ms, sys: 6.84 ms, total: 32.1 ms\n",
-"Wall time: 31.3 ms\n"
+"CPU times: user 69.6 ms, sys: 15.9 ms, total: 85.5 ms\n",
+"Wall time: 89.6 ms\n"
]
}
],
@@ -165,7 +169,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Using `use_arrow=True` often makes loading slightly faster again! We're now 21x faster than using fiona."
+"Using `use_arrow=True` often makes loading slightly faster again! We're now 24x faster than using fiona."
]
},
{
@@ -177,8 +181,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
-"CPU times: user 19.7 ms, sys: 10.1 ms, total: 29.7 ms\n",
-"Wall time: 29.1 ms\n"
+"CPU times: user 48 ms, sys: 30.7 ms, total: 78.8 ms\n",
+"Wall time: 118 ms\n"
]
}
],
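The speedup figures quoted in the markdown cells can be recomputed from the three updated `%time` outputs. The sketch below is one way to do that; `parse_time_output` and `speedup` are hypothetical helper names, and the parser only handles the simple `<number> <unit>` form seen in these outputs. The quoted 22x and 24x appear to correspond to the CPU `total` ratio, while the wall-clock ratios differ somewhat.

```python
import re

# Conversion factors to milliseconds for the units that appear in %time output.
_MS_PER_UNIT = {"ns": 1e-6, "us": 1e-3, "µs": 1e-3, "ms": 1.0, "s": 1e3}


def _to_ms(value: str, unit: str) -> float:
    return float(value) * _MS_PER_UNIT[unit]


def parse_time_output(text: str) -> dict:
    """Extract the CPU total and wall time (both in ms) from one %time output."""
    cpu = re.search(r"total: ([\d.]+) (\S+)", text)
    wall = re.search(r"Wall time: ([\d.]+) (\S+)", text)
    return {"cpu_ms": _to_ms(*cpu.groups()), "wall_ms": _to_ms(*wall.groups())}


def speedup(slow: str, fast: str) -> dict:
    """Ratio of slow to fast, for both CPU total and wall time."""
    s, f = parse_time_output(slow), parse_time_output(fast)
    return {"cpu": s["cpu_ms"] / f["cpu_ms"], "wall": s["wall_ms"] / f["wall_ms"]}


# Outputs copied from the updated cells in this diff.
FIONA = "CPU times: user 1.85 s, sys: 34.7 ms, total: 1.88 s\nWall time: 2.35 s\n"
PYOGRIO = "CPU times: user 69.6 ms, sys: 15.9 ms, total: 85.5 ms\nWall time: 89.6 ms\n"
PYOGRIO_ARROW = "CPU times: user 48 ms, sys: 30.7 ms, total: 78.8 ms\nWall time: 118 ms\n"

print(speedup(FIONA, PYOGRIO))        # cpu is about 22, wall is about 26
print(speedup(FIONA, PYOGRIO_ARROW))  # cpu is about 24, wall is about 20
```

In absolute terms, the difference in CPU total between `fiona` and `pyogrio` in this run is roughly 1.8 seconds.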
@@ -206,13 +210,13 @@
"name": "stdout",
"output_type": "stream",
"text": [
-"CPU times: user 362 ms, sys: 44.4 ms, total: 407 ms\n",
-"Wall time: 418 ms\n"
+"CPU times: user 875 ms, sys: 53.2 ms, total: 928 ms\n",
+"Wall time: 944 ms\n"
]
}
],
"source": [
-"%time gdf.to_file(f\"{tmpdir.name}/out_fiona.fgb\")"
+"%time gdf.to_file(f\"{tmpdir.name}/out_fiona.fgb\", engine=\"fiona\")"
]
},
{
@@ -224,8 +228,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
-"CPU times: user 60.8 ms, sys: 23.4 ms, total: 84.2 ms\n",
-"Wall time: 83.5 ms\n"
+"CPU times: user 62.9 ms, sys: 13.9 ms, total: 76.8 ms\n",
+"Wall time: 76.8 ms\n"
]
}
],
@@ -250,7 +254,7 @@
"metadata": {},
"outputs": [],
"source": [
-"url = \"https://data.source.coop/cholmes/eurocrops/unprojected/flatgeobuf/FR_2018_EC21.fgb\""
+"url = \"https://s3.us-west-2.amazonaws.com/us-west-2.opendata.source.coop/cholmes/eurocrops/unprojected/flatgeobuf/FR_2018_EC21.fgb\""
]
},
{
@@ -268,19 +272,28 @@
{
"data": {
"text/plain": [
-"{'crs': 'EPSG:4326',\n",
+"{'layer_name': 'FR_2018_EC21',\n",
+" 'crs': 'EPSG:4326',\n",
" 'encoding': 'UTF-8',\n",
" 'fields': array(['ID_PARCEL', 'SURF_PARC', 'CODE_CULTU', 'CODE_GROUP', 'CULTURE_D1',\n",
" 'CULTURE_D2', 'EC_org_n', 'EC_trans_n', 'EC_hcat_n', 'EC_hcat_c'],\n",
" dtype=object),\n",
" 'dtypes': array(['object', 'float64', 'object', 'object', 'object', 'object',\n",
" 'object', 'object', 'object', 'object'], dtype=object),\n",
+" 'fid_column': '',\n",
+" 'geometry_name': '',\n",
" 'geometry_type': 'MultiPolygon',\n",
" 'features': 9517874,\n",
+" 'total_bounds': (-6.047022416643922,\n",
+" -3.916364769838749,\n",
+" 68.89050422648864,\n",
+" 51.075100624023094),\n",
" 'driver': 'FlatGeobuf',\n",
-" 'capabilities': {'random_read': 1,\n",
-" 'fast_set_next_by_index': 0,\n",
-" 'fast_spatial_filter': 1},\n",
+" 'capabilities': {'random_read': True,\n",
+" 'fast_set_next_by_index': False,\n",
+" 'fast_spatial_filter': True,\n",
+" 'fast_feature_count': True,\n",
+" 'fast_total_bounds': True},\n",
" 'layer_metadata': None,\n",
" 'dataset_metadata': None}"
]
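The updated output shows the `capabilities` flags changing from `1`/`0` to `True`/`False` and two new flags appearing. Code that needs to handle both output shapes could normalize them as sketched below; `normalize_capabilities` is a hypothetical helper, and the two dicts are abbreviated copies of the old and new outputs shown in this hunk.

```python
def normalize_capabilities(info: dict) -> dict:
    """Return the capabilities of a read_info-style dict as plain booleans.

    Flags absent from the input are simply absent from the result, so callers
    should use .get(name, False) for flags that only newer outputs include.
    """
    return {name: bool(flag) for name, flag in info.get("capabilities", {}).items()}


# Abbreviated from the old (pyogrio 0.6.0) output in this diff.
old_info = {
    "driver": "FlatGeobuf",
    "capabilities": {"random_read": 1, "fast_set_next_by_index": 0, "fast_spatial_filter": 1},
}

# Abbreviated from the new (pyogrio 0.11.0) output in this diff.
new_info = {
    "driver": "FlatGeobuf",
    "capabilities": {
        "random_read": True,
        "fast_set_next_by_index": False,
        "fast_spatial_filter": True,
        "fast_feature_count": True,
        "fast_total_bounds": True,
    },
}

old_caps = normalize_capabilities(old_info)
new_caps = normalize_capabilities(new_info)
print(old_caps.get("fast_total_bounds", False), new_caps.get("fast_total_bounds", False))
```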
@@ -328,13 +341,13 @@
"name": "stdout",
"output_type": "stream",
"text": [
-"CPU times: user 144 ms, sys: 21.4 ms, total: 165 ms\n",
-"Wall time: 6 s\n"
+"CPU times: user 195 ms, sys: 20.6 ms, total: 216 ms\n",
+"Wall time: 2.51 s\n"
]
}
],
"source": [
-"%time crops_gdf = gpd.read_file(url, bbox=bounds)"
+"%time crops_gdf = gpd.read_file(url, bbox=bounds, engine=\"fiona\")"
]
},
{
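The `bbox=bounds` argument in this cell limits the read to features that fall within a query rectangle. As a rough sketch of the rectangle test such a filter relies on, the snippet below checks whether two `(minx, miny, maxx, maxy)` boxes intersect. `boxes_intersect` is a hypothetical helper, and because `bounds` is defined outside this hunk, the query windows are made-up placeholders. Only `TOTAL_BOUNDS` is taken from the diff.

```python
def boxes_intersect(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    aminx, aminy, amaxx, amaxy = a
    bminx, bminy, bmaxx, bmaxy = b
    return aminx <= bmaxx and bminx <= amaxx and aminy <= bmaxy and bminy <= amaxy


# total_bounds of the remote file, as reported in the read_info output above.
TOTAL_BOUNDS = (-6.047022416643922, -3.916364769838749, 68.89050422648864, 51.075100624023094)

# Placeholder query windows (assumptions, not the notebook's actual `bounds`).
inside_window = (2.0, 48.5, 2.5, 49.0)
outside_window = (100.0, 0.0, 101.0, 1.0)

print(boxes_intersect(TOTAL_BOUNDS, inside_window))
print(boxes_intersect(TOTAL_BOUNDS, outside_window))
```

A window that does not intersect the file's total bounds cannot match any feature, so it is a cheap check to run before issuing a remote read.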
@@ -353,8 +366,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
-"CPU times: user 26.9 ms, sys: 2.98 ms, total: 29.9 ms\n",
-"Wall time: 490 ms\n"
+"CPU times: user 26.4 ms, sys: 20 ms, total: 46.4 ms\n",
+"Wall time: 1.91 s\n"
]
}
],
@@ -565,8 +578,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
-"CPU times: user 25 ms, sys: 2.47 ms, total: 27.4 ms\n",
-"Wall time: 706 ms\n"
+"CPU times: user 12.9 ms, sys: 0 ns, total: 12.9 ms\n",
+"Wall time: 155 ms\n"
]
}
],
@@ -671,13 +684,6 @@
"source": [
"crops_gdf.head()"
]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
}
],
"metadata": {
@@ -696,7 +702,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.11.4"
+"version": "3.13.1"
},
"orig_nbformat": 4
},
Binary file added images/data-structures.png
Binary file added images/virtual-zarr.png
Contributor

This is a great diagram as-is, so please ignore this if you don't agree, as it may just be a matter of preference. That said, I feel like most input/output diagrams of this type have inputs -> outputs going from left to right.
