Skip to content

Conversation

@rhugonnet
Copy link
Member

@rhugonnet rhugonnet commented Jan 24, 2024

This PR adds the rst Xarray accessor mirroring the Raster class.

The accessor allows to access all attributes and run all methods already implemented for rasters in GeoUtils from a xarray.DataArray object (e.g., ds.rst.reproject()), and thereby easily access the rest of the Xarray functionalities through ds (such as plotting) and other low-level behaviour (such as implicit loading, cached loading, chunked loading).

It also opens up the opportunity to easily add Dask support to run our functionalities out-of-memory. This requires our functionalities to support a da.Array as input. This is not mandatory (if the functionality only runs with NumPy array, it will simply load the da.Array in memory immediately), but is very practical when it is supported.
We already did a lot of work to implement the most complex Dask functionalities in #537 (namely reproject(), subsample() and interp_points()). Most other functionalities are much easier to support (using dask.map_blocks or equivalent).

Resolves #383
Resolves #567
Resolves #791
Resolves #834
Resolves #825
Resolves #823
Resolves #836
Resolves #840
Resolves #839
Resolves #841
Resolves #842

Facilitated by recent code re-structuring

Recent code re-structuring moved out methods out of the Raster or Vector class into separate modules. In some cases, this changed the argument of those (non-public) wrapper methods to accept base inputs (array, transform, crs) instead of the class object itself. Among other things, this facilitates the transition to using our functions with an accessor that has a slightly different object type (classic NumPy array instead of a NumPy masked-array).
See #624.

Summary of changes

Most content of the Raster class was moved into a non-public RasterBase parent class, containing all attributes and methods shared by the Raster and rst accessor classes.
The Raster and rst are subclasses of RasterBase, and only implement method specific to their object type (such as set_mask() for Raster that uses NumPy masked-array, or its __array_interface__ specific to masked arrays).

Remaining in Raster are only functionalities specific to the Raster object itself:

  • Methods related to object creation, instantiation and loading: __init__, load(), is_loaded, from_array(), copy(),
  • Methods related to NumPy masked-array: set_mask(), etc,
  • Methods related to array interfacing/arithmetic: __array_interface__, __add__, etc.

Added in the rst accessor are only functionalities specific to the accessor: __init__, from_array(), copy().

A new _is_xr boolean attribute identifies if the RasterBase.data is an Xarray object or not.
This allows to make choices where necessary, which is only used to return the main attributes stored in the object itself: data, crs, transform and nodata.

All methods returning a raster object (like reproject() or crop()) now use from_array() that is overridden in Raster and rst to ensure they return the same type as the input: a Raster returns a Raster input, and a xarray.DataArray returns a xarray.DataArray input.
All other attributes and methods return exactly the same non-raster input.

A new method geoutils.open_raster is added to open a raster as a xr.DataArray (built on top of rioxarray.open_rasterio()). The difference is that our open_raster forces the data type to be float32 at minimum, and replaces nodata values to NaNs to natively support most NumPy array operations with nodata propagation.
This seemed required because Rioxarray does not mask nodata values while preserving the nodata value in its metadata, which is incompatible with the behaviour we need. (To give an clear example: With Rioxarray, either you load the array with -9999 in it and the ds.rio.nodata is -9999, or you load the array with NaNs in it, and the ds.rio.nodata is NaN).
I did not find another way to do this here...

New tests

Adding new tests is simple: We simply need to check that all functions give the exact same result for a raster opened as a Raster, or as a xr.DataArray.

For this, the new tests introduce a function to check the equality of a Raster and xarray.DataArray.
Then, they check that all common attributes and methods of RasterBase run and return exactly the same output (or equal to the other object type when output is a Raster/xarray.DataArray).

Discussion of core differences

The problem with the rst accessor object is that, if I'm not mistaken, we won't have access to functionalities that are not explicit such as __array_interface__. Thus, we likely cannot mirror the entire behaviour of the Raster class (for instance, no overloading_check to verify that the georeferencing is the same during an array or arithmetic operation). We can look more into it to be sure, but I don't think it is possible...

Thankfully, Xarray generally has similar behaviour as our Raster class, from the implicit loading to array-interfacing. We might want to adjust our functionalities to ensure we mirror that behaviour when possible, so that the code is written the same.

The main difference is that Xarray won't natively support nodata in its operations for integer arrays (no masked-array support in Xarray), and thus those need to be converted to NaN-arrays to do so, which increases RAM usage significantly for datasets of integer type. Here again, thankfully, chunked Dask-support can compensate for this, and run any NaN-array size.

So there are pros and cons to using the Raster or the rst accessor. We can try to reconcile differences where possible, and for those that are structural to the data objects, we should simply explain them clearly on a documentation page and leave the choice to users! 😄

TO-DO

Code

  • Add equivalent of Raster.from_array() to RasterAccessor class (or RasterBase class?) and individual setting operations (for transform, crs, nodata, and area_or_point) to make all methods (reproject, etc) naturally work on both Raster and ds.rst,
  • Ensure dual support of masked-array & NaN arrays for all methods (_reproject, etc),
  • Link to existing delayed function (for now just _reproject until Re-organize multiprocessing and chunked modules #830 is done) by detecting automatically if input array is a Dask array.
  • Add tests specific to RasterBase,
  • Add tests specific to RasterAccessor functionalities (comparing to ds.rio),
  • Add tests specific to chunked operations,

Documentation (later PR)

  • Add rst accessor to "The georeferenced raster" page,
  • Update all feature pages to list Xarray accessor option: Raster.reproject() or ds.rst.reproject(),
  • Add page "Using Xarray and Pandas accessors" explaining pros and cons in "Fundamentals"?,
  • Add page "Dask support" listing all functionalities supported out-of-memory, with tips,
  • Update API with rst accessor.

Other Dask support to add (later PRs)

  • The reduce_points function can copy the same logic as interp_points,
  • The crop function using isel of Rioxarray,
  • The rasterize/polygonize functions should be straightforward,
  • But the proximity function would be a bit of work...

@rhugonnet rhugonnet marked this pull request as draft January 24, 2024 12:34
@rhugonnet rhugonnet changed the title Add Xarray accessor mirroring Raster API Add Xarray accessor mirroring Raster class Jan 24, 2024
@rhugonnet rhugonnet marked this pull request as ready for review December 7, 2024 02:03
@rhugonnet
Copy link
Member Author

@adehecq @atedstone @erikmannerfelt This PR is also ready for your first review! It is not finalized, but at a good stage to hear your feedback, questions, recommendations, and then move forward to finalize it.
I have described the general concepts in the description, and the tests comparing both types of outputs should give you a good idea of how this works in practice in the code.

Once you are done with this one, you should look at the one for the dem accessor in xDEM (which depends on this PR).
Much less changes are needed, but it has a slightly more complex class structure: GlacioHack/xdem#656

@adehecq
Copy link
Member

adehecq commented Feb 6, 2025

This is a giant step forward!! 🎉
Thanks for the effort you put into this over the past months and not only in this PR, which is just the final push! 🙌 🙌 🙌

I don't have many comments, as I'm not so familiar with the xarray accessor, so I just have to trust you that it works. I'm looking forward to see it explained in the documentation and test it myself, but at the moment it is still abstract 😉
My main comment is regarding the new Dask dependency. Is it really needed or could it be optional? Does it add a lot of inherited dependencies?

I can also only vaguely imagine the limitations you mention in the description about not accessing the hidden functions. I guess we'll figure out the issues by using and practicing. But the first step is to get it out!

@rhugonnet
Copy link
Member Author

rhugonnet commented Jan 26, 2026

@belletva @remi-braun @guillaumeeb @adehecq @atedstone This PR is getting close to finalized! 😄

Note for clarity: In the following, "Implicit loading" and "Laziness" are not the same thing.
"Implicit loading" designates when the object (DataArray, or Raster) loads the data implicitly when .data is called (as it only reads metadata by default), which works with any backend (for instance NumPy only),
"Laziness" designates delayed Dask (or other engines) operations that call .data, but wait to do calculations until we use compute() on the output array or writer.

As a summary, I have fully reconciled the Raster class and new rst accessor behaviour:

  • All properties and methods yield the exact same output (same object/value, or in-place modifications),
  • All properties and methods trigger the exact same implicit loading mechanism (using our own logic in Raster, and using existing internal logic for Xarray), and is exactly the one we expect for this operation (for example, our crop() does not load neither input nor output array, as does Xarray's sel()).
  • Chunked methods implemented through Dask respect laziness and yield the exact same result as without chunks (array full in-memory) and as with other backends using chunks (e.g. using Multiprocessing as backend through the Raster class). This is only implemented for reproject() for now, as we need to do a bit of restructuring before adding other chunked functionalities, see details in Re-organize multiprocessing and chunked modules #830. This restructuring will be easy enough (just moving files), but would be better in a separate PR to avoid enormous architectural changes here mixed with functional ones.

The three above aspects are all tested pretty thoroughly in test_base.py , which took most of my time (If you want to review one file, I think this test module is the most useful): https://github.com/rhugonnet/geoutils/blob/ccb730d22d4a1192a35a1f6e83ebb7de76ac64af/tests/test_raster/test_base.py#L155.
The tests pass on all of our test data (a 3-band raster, an integer-raster, and a float32 raster with NaNs).
As the tests check all functionalities are mirrored exactly, they directly expand the existing tests we had through the Raster class, so we don't need to add that many tests overall. 🙂

Apart from this, I had to slightly adjust or deprecate some Raster functionalities that didn't match fully the behaviour of Xarray. Those were minimal, often remnants of old code and not public API, see #834 #836 #839 #840 #841 #842. I also changed the initial structure using _is_xr to differentiate behaviour within the base class in favor of @abstractmethods everywhere, which is cleaner (thanks @guillaumeeb!).

We still need to write up the documentation changes. As functionalities mirror almost exactly the Raster class, it won't be too hard. It's mostly about adding relevant pages on the accessor + scalibility features with Dask. I think this should wait until after #830 and the addition of other Dask features (interp_points, subsample, and many that can be simply map_block/map_overlap.)

Also, I bumped into something counter-intuitive when implementing some functionalities:
Most functionalities (e.g. _feature() functions) live in their own separate module, and are wrapped in the RasterBase class for both the rst accessor and Raster class. They return a DataArray in the case of a DataArray input (for example using isel()). Once this DataArray is returned, we can't simply manipulate it using ds.transform, ds.crs, because it's just a DataArray, not a class/subclass object... So we need to use our rst accessor to call ds.rst.transform, etc... which feels quite weird because we're calling the accessor within its own RasterBase parent class!
I searched online to see if this was "legal" practice (i.e. won't break things in the long run), and it seems it is for accessors (because they aren't actually linked to the DataArray object, only accessing it, so no circularity issue there). Rioxarray is doing the same thing, and our code running fine confirms this. 😉
On a related topic, we'll also need to start to use duck typing to avoid potential circularity to implement functionalities out of the class yet have proper typing: #827.

Happy for your comments if you have the time! Otherwise could be at later stage (documentation?).
I think I'll merge soon, then work on the quicker steps left in separate PRs: restructuring for chunked ops + linking to other Dask funcs + write up the couple new doc pages.

@rhugonnet
Copy link
Member Author

rhugonnet commented Jan 26, 2026

P.S.: I have not implemented an accessor for an xarray.Dataset yet. As you commented, it does look fairly straightforward: wrapping the operations in loops across variables.
To make it work with our current structure, we'd need to split the RasterBase class into two (a parent RasterBase and a child RasterArray, keeping the exact same methods only in RasterArray and generalizing the properties in RasterBase to work on a Dataset where necessary), like this:

RasterBase
├── RasterArray
│   ├── Raster
│   └── DataArray accessor
└── Dataset accessor

@atedstone
Copy link
Member

@rhugonnet this looks immense, I am looking forward to trying out this big new functionality properly. Given the decent test suite coverage and my lack of deep technical expertise on accessors, I would be content to see this merged into the main dev branch and then to get a bit more involved once we have the skeleton of documentation available as you suggest.

@rhugonnet
Copy link
Member Author

Thanks @atedstone! @belletva and @adehecq share the same opinion after a meeting yesterday.
So I'll merge this soon, then address #830 and #827, link the remaining Dask functionalities, write the doc skeleton, and once this is done I will ping you all again for review 😉

@rhugonnet
Copy link
Member Author

I've now finalized the tests that check that Dask out-of-memory reprojection gives 100% the same result as in-memory (only possible since we added the tolerance argument in Rasterio, which officially came out in 1.5, and set it at "0" by default).
Merging!

@rhugonnet rhugonnet merged commit 233e9bf into GlacioHack:main Jan 29, 2026
33 checks passed
@rhugonnet rhugonnet deleted the add_xarray_accessor branch January 29, 2026 00:26
@remi-braun
Copy link

Amazing! I'll test this accessor asap!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

5 participants