Merged
45 commits
f231393
Initial version of appending to zarr store
Jan 24, 2019
f14f3b7
Added docs
Jan 24, 2019
928440d
Resolve PEP8 incompliances
Jan 24, 2019
442e938
Added write and append test for mode 'a'
Jan 26, 2019
389ba43
Resolved conflicts with master
Jan 26, 2019
6097da2
Merge branch 'master' of https://github.com/pydata/xarray into append…
Jan 29, 2019
390a792
Merged repaired master
Jan 29, 2019
da9a962
Resolved pep8 issue
Jan 29, 2019
6756b8f
Put target store encoding in appended variable
davidbrochart Apr 3, 2019
95d5782
Merge with master
davidbrochart Apr 4, 2019
a750a92
Rewrite test with appending along time dimension
davidbrochart Apr 4, 2019
295084b
Add chunk_size parameter for rechunking appended coordinate
davidbrochart Apr 22, 2019
cc353e1
Merge remote-tracking branch 'upstream/master' into HEAD
davidbrochart Apr 22, 2019
e56a210
Merge remote-tracking branch 'upstream/master' into HEAD
davidbrochart May 21, 2019
519b398
Add chunk_dim test
davidbrochart May 21, 2019
c85aa98
Merge remote-tracking branch 'upstream/master' into append_zarr
Jun 4, 2019
3adfd49
Add type check and tests for it.
Jun 17, 2019
608813b
Add documentation
Jun 17, 2019
2078838
Add test for compute=False and commented it out
Jun 17, 2019
7a90ce8
Merge master
Jun 17, 2019
b8af5bd
Remove python 3.7 string formatting
Jun 17, 2019
5bee0dc
Fix PEP8 incompliance
Jun 17, 2019
7ed77ad
Merge branch 'append_zarr' of https://github.com/jendrikjoe/xarray in…
Jun 17, 2019
b4ff1c7
Add missing whitespaces
Jun 18, 2019
5b3f8ea
allowed for compute=False when appending to a zarr store
Jun 20, 2019
7564329
Fixed empty array data error
Jun 20, 2019
93be790
flake8 fixes
Jun 20, 2019
58c4b78
removed chunk_dim argument to to_zarr function
Jun 21, 2019
5316593
implemented requested changes
Jun 24, 2019
ad08c73
Update xarray/backends/api.py
shikharsg Jun 25, 2019
4d2122b
added contributors and example of using append to zarr
Jun 25, 2019
105ed39
Merge branch 'append_zarr' of https://github.com/jendrikjoe/xarray in…
Jun 25, 2019
af4a5a5
fixed docs fail
Jun 25, 2019
62d4f52
fixed docs
Jun 25, 2019
9558811
Merge branch 'master' into append_zarr
shikharsg Jun 25, 2019
9d70e02
removed unnecessary condition
Jun 25, 2019
a6ff494
attempt at clean string encoding and variable length strings
Jun 26, 2019
34b700f
implemented suggestions
Jun 26, 2019
3e54cb9
* append_dim does not need to be specified if creating a new array wi…
Jun 27, 2019
97ed25b
Merge remote-tracking branch 'upstream/master' into append_zarr
Jun 27, 2019
beb12e5
raise ValueError when append_dim is not a valid dimension
Jun 27, 2019
41a6ca3
flake8 fix
Jun 27, 2019
321aec1
removed unused comment
Jun 27, 2019
2b130ff
* raise error when appending with encoding provided for existing vari…
Jun 29, 2019
58de86d
Merge remote-tracking branch 'upstream/master' into append_zarr
Jun 29, 2019
8 changes: 8 additions & 0 deletions doc/io.rst
Expand Up @@ -604,6 +604,14 @@ store is already present at that path, an error will be raised, preventing it
from being overwritten. To override this behavior and overwrite an existing
store, add ``mode='w'`` when invoking ``to_zarr``.

It is also possible to append to an existing store. For that, add ``mode='a'``
[Review comment — Contributor] Could we possibly get an example of using append here?

and set ``append_dim`` to the name of the dimension along which to append.
The data must contain only types that are subclasses of ``np.number`` or
``np.string_``. Note also that the size of the ``np.string_`` values in the
first chunk sets the maximum string size for all subsequent chunks. To encode
the data, consider using :py:func:`~xarray.core.api.encode_utf8`.

To read back a zarr dataset that has been created this way, we use the
:py:func:`~xarray.open_zarr` method:

Expand Down
1 change: 1 addition & 0 deletions doc/whats-new.rst
Expand Up @@ -188,6 +188,7 @@ Other enhancements
report showing what exactly differs between the two objects (dimensions /
coordinates / variables / attributes) (:issue:`1507`).
By `Benoit Bovy <https://github.com/benbovy>`_.
- Added append capability to the zarr store.
[Review comment — Contributor] Need to credit all the contributors to this PR.

- Add ``tolerance`` option to ``resample()`` methods ``bfill``, ``pad``,
``nearest``. (:issue:`2695`)
By `Hauke Schulz <https://github.com/observingClouds>`_.
Expand Down
34 changes: 32 additions & 2 deletions xarray/backends/api.py
Expand Up @@ -4,16 +4,21 @@
from io import BytesIO
from numbers import Number
from pathlib import Path
import re

import numpy as np
import pandas as pd

from .. import Dataset, backends, conventions
from ..core import indexing
from ..core.combine import (
_CONCAT_DIM_DEFAULT, _auto_combine, _infer_concat_order_from_positions)
from ..core.utils import close_on_error, is_grib_path, is_remote_uri
from ..core.variable import Variable
from .common import ArrayWriter
from .locks import _get_scheduler
from ..coding.variables import safe_setitem, unpack_for_encoding
from ..coding.strings import encode_string_array

DATAARRAY_NAME = '__xarray_dataarray_name__'
DATAARRAY_VARIABLE = '__xarray_dataarray_variable__'
Expand Down Expand Up @@ -1003,8 +1008,30 @@ def save_mfdataset(datasets, paths, mode='w', format=None, groups=None,
for w, s in zip(writes, stores)])


def encode_utf8(var, string_max_length):
[Review comment — Member] I don't love the idea of publicly exposing this helper function. I understand the motivation for it, but it's inconsistent with how string encoding works in the rest of xarray, via the encoding keyword argument or variable attribute. The implementation itself is also not so user friendly: it stores strings as np.string_/bytes (which is necessary), but that data doesn't get read back into a NumPy array that is compatible with Python's str type.

It would be better to support explicitly setting something like dtype='S123' in encoding to indicate a desired fixed width, and to continue doing string encoding along the lines of encode_zarr_variable().

For now, perhaps it is better to leave this part out...
dims, data, attrs, encoding = unpack_for_encoding(var)
missing = pd.isnull(data)
data[missing] = ""
data = encode_string_array(data, 'utf-8')
data = data.astype(np.dtype("S{}".format(string_max_length * 2)))
return Variable(dims, data, attrs, encoding)
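What the helper above does can be sketched without the xarray internals; the doubling of the width is the same heuristic used here, and the input values are illustrative:

```python
import numpy as np
import pandas as pd

# Fill missing values, encode each string to UTF-8 bytes, then cast to a
# fixed-width bytestring dtype (2x the character length, leaving room for
# multi-byte UTF-8 sequences).
data = np.array(["café", None, "tea"], dtype=object)
string_max_length = 4
data[pd.isnull(data)] = ""
encoded = np.array([s.encode("utf-8") for s in data])
fixed = encoded.astype(np.dtype("S{}".format(string_max_length * 2)))
print(fixed.dtype)  # |S8
```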


def _validate_datatypes_for_zarr_append(dataset):
"""DataArray.name and Dataset keys must be a string or None"""
def check_dtype(var):
if (not np.issubdtype(var.dtype, np.number)
and not np.issubdtype(var.dtype, np.string_)):
# and not re.match('^bytes[1-9]+$', var.dtype.name)):
raise ValueError('Invalid dtype for DataVariable: {} '
'dtype must be a subtype of number or '
'a fixed sized string'.format(var))
for k in dataset.data_vars.values():
check_dtype(k)
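The dtype rule enforced above can be exercised in isolation (a sketch mirroring the check, not the function itself; note that `np.string_` is the NumPy 1.x alias for `np.bytes_`):

```python
import numpy as np

# np.string_ was removed in NumPy 2.0; it was an alias for np.bytes_.
STRING_DTYPE = getattr(np, "string_", np.bytes_)

def is_appendable(dtype):
    # Mirrors _validate_datatypes_for_zarr_append: only numeric and
    # fixed-width bytestring dtypes may be appended to a zarr store.
    return (np.issubdtype(dtype, np.number)
            or np.issubdtype(dtype, STRING_DTYPE))

print(is_appendable(np.dtype("float64")))  # True
print(is_appendable(np.dtype("S10")))      # True  (fixed-width bytes)
print(is_appendable(np.dtype("U10")))      # False (unicode is rejected)
print(is_appendable(np.dtype("O")))        # False (object arrays rejected)
```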


def to_zarr(dataset, store=None, mode='w-', synchronizer=None, group=None,
encoding=None, compute=True, consolidated=False):
encoding=None, compute=True, consolidated=False, append_dim=None):
"""This function creates an appropriate datastore for writing a dataset to
a zarr store

Expand All @@ -1019,11 +1046,14 @@ def to_zarr(dataset, store=None, mode='w-', synchronizer=None, group=None,
_validate_dataset_names(dataset)
_validate_attrs(dataset)

if mode == "a":
[Review comment — Member] Can we raise an error if encoding was passed explicitly into to_zarr() and specifies encoding for any existing variables? I.e., ds.to_zarr(..., mode='a', encoding={'existing_variable': ....}). I think this would always indicate a programming mistake, given that we throw away these variable encodings anyways.

_validate_datatypes_for_zarr_append(dataset)

zstore = backends.ZarrStore.open_group(store=store, mode=mode,
synchronizer=synchronizer,
group=group,
consolidate_on_close=consolidated)

zstore.append_dim = append_dim
writer = ArrayWriter()
# TODO: figure out how to properly handle unlimited_dims
dump_to_store(dataset, zstore, writer, encoding=encoding)
Expand Down
17 changes: 14 additions & 3 deletions xarray/backends/common.py
Expand Up @@ -158,24 +158,35 @@ class ArrayWriter:
def __init__(self, lock=None):
self.sources = []
self.targets = []
self.regions = []
self.lock = lock

def add(self, source, target):
def add(self, source, target, region=None):
if isinstance(source, dask_array_type):
self.sources.append(source)
self.targets.append(target)
if region:
self.regions.append(region)
[Review comment — Member] It would be nice to be able to rule out the possibility of these lists getting out of sync by adding some arrays with a region and others without (e.g., if some variables already exist and some don't). This suggests that it would be better to always append a region, even if it's None.

else:
target[...] = source
if region:
target[region] = source
else:
target[...] = source
[Review comment — Contributor] 👍 This seems like a perfect (and rather general) way to handle appends at the dask level.


def sync(self, compute=True):
if self.sources:
import dask.array as da
# TODO: consider wrapping targets with dask.delayed, if this makes
            # for any discernible difference in performance, e.g.,
# targets = [dask.delayed(t) for t in self.targets]

if not self.regions:
regions = None
else:
regions = self.regions
delayed_store = da.store(self.sources, self.targets,
lock=self.lock, compute=compute,
flush=True)
flush=True, regions=regions)
[Review comment — Member] self.regions is always non-empty, so you should probably just use it directly (and delete the conditional above):

Suggested change:
-                                     flush=True, regions=regions)
+                                     flush=True, regions=self.regions)

self.sources = []
self.targets = []
[Review comment — Member] can we reset self.regions = [] for consistency?

return delayed_store
Expand Down
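The region-based write that `ArrayWriter.sync` delegates to `da.store` can be sketched with dask alone; a plain NumPy array stands in for the resized zarr array, and the shapes are illustrative:

```python
import dask.array as da
import numpy as np

# Pretend this is the on-disk array after zarr_array.resize(): the first
# 4 values already exist, the last 4 are the freshly allocated region.
target = np.zeros(8)
target[:4] = np.arange(4.0)

# Write the appended block only into the new region, as the writer does.
source = da.from_array(np.arange(4.0, 8.0), chunks=(4,))
da.store([source], [target], regions=[(slice(4, None),)], lock=False)

print(target.tolist())  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```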
149 changes: 118 additions & 31 deletions xarray/backends/zarr.py
Expand Up @@ -8,7 +8,8 @@
from ..core import indexing
from ..core.pycompat import integer_types
from ..core.utils import FrozenOrderedDict, HiddenKeyDict
from .common import AbstractWritableDataStore, BackendArray
from .common import AbstractWritableDataStore, BackendArray, \
_encode_variable_name

# need some special secret attributes to tell us the dimensions
_DIMENSION_KEY = '_ARRAY_DIMENSIONS'
Expand Down Expand Up @@ -257,6 +258,7 @@ def __init__(self, zarr_group, consolidate_on_close=False):
self._synchronizer = self.ds.synchronizer
self._group = self.ds.path
self._consolidate_on_close = consolidate_on_close
self.append_dim = None

def open_store_variable(self, name, zarr_array):
data = indexing.LazilyOuterIndexedArray(ZarrArrayWrapper(name, self))
Expand Down Expand Up @@ -313,40 +315,125 @@ def encode_variable(self, variable):
def encode_attribute(self, a):
return _encode_zarr_attr_value(a)

def prepare_variable(self, name, variable, check_encoding=False,
unlimited_dims=None):

attrs = variable.attrs.copy()
dims = variable.dims
dtype = variable.dtype
shape = variable.shape

fill_value = attrs.pop('_FillValue', None)
if variable.encoding == {'_FillValue': None} and fill_value is None:
variable.encoding = {}

encoding = _extract_zarr_variable_encoding(
variable, raise_on_invalid=check_encoding)

encoded_attrs = OrderedDict()
# the magic for storing the hidden dimension data
encoded_attrs[_DIMENSION_KEY] = dims
for k, v in attrs.items():
encoded_attrs[k] = self.encode_attribute(v)

zarr_array = self.ds.create(name, shape=shape, dtype=dtype,
fill_value=fill_value, **encoding)
zarr_array.attrs.put(encoded_attrs)

return zarr_array, variable.data

def store(self, variables, attributes, *args, **kwargs):
AbstractWritableDataStore.store(self, variables, attributes,
*args, **kwargs)
def store(self, variables, attributes, check_encoding_set=frozenset(),
writer=None, unlimited_dims=None):
"""
Top level method for putting data on this store, this method:
- encodes variables/attributes
- sets dimensions
- sets variables

Parameters
----------
variables : dict-like
Dictionary of key/value (variable name / xr.Variable) pairs
attributes : dict-like
Dictionary of key/value (attribute name / attribute) pairs
check_encoding_set : list-like
List of variables that should be checked for invalid encoding
values
writer : ArrayWriter
unlimited_dims : list-like
List of dimension names that should be treated as unlimited
dimensions.
"""

existing_variables = set([vn for vn in variables
if _encode_variable_name(vn) in self.ds])
new_variables = set(variables) - existing_variables
variables_without_encoding = OrderedDict([(vn, variables[vn])
for vn in new_variables])
variables_encoded, attributes = self.encode(
variables_without_encoding, attributes)

if len(existing_variables) > 0:
# there are variables to append
# their encoding must be the same as in the store
[Review comment — Member] Are there any unit tests that verify that encoding is kept consistent? This would be nice to add, if not.

Probably a good example would be a dataset saved with scale/offset encoding, where the new dataset to be appended does not have any encoding provided. We could verify that properly scaled values are read back from disk.

ds = open_zarr(self.ds.store, auto_chunk=False)
variables_with_encoding = OrderedDict()
for vn in existing_variables:
variables_with_encoding[vn] = variables[vn]
variables_with_encoding[vn].encoding = ds[vn].encoding
[Review comment — Member] This modifies an argument that was passed into the function in-place, which in general should be avoided due to unexpected side effects in other parts of the code. It would be better to shallow copy the Variable before overriding encoding, e.g., with variables[vn].copy(deep=False).

variables_with_encoding, _ = self.encode(variables_with_encoding,
{})
variables_encoded.update(variables_with_encoding)

self.set_attributes(attributes)
self.set_dimensions(variables_encoded, unlimited_dims=unlimited_dims)
self.set_variables(variables_encoded, check_encoding_set, writer,
unlimited_dims=unlimited_dims)

def sync(self):
pass

def set_variables(self, variables, check_encoding_set, writer,
unlimited_dims=None):
"""
This provides a centralized method to set the variables on the data
store.

Parameters
----------
variables : dict-like
Dictionary of key/value (variable name / xr.Variable) pairs
check_encoding_set : list-like
List of variables that should be checked for invalid encoding
values
writer :
unlimited_dims : list-like
List of dimension names that should be treated as unlimited
dimensions.
"""
for vn, v in variables.items():
name = _encode_variable_name(vn)
check = vn in check_encoding_set
attrs = v.attrs.copy()
dims = v.dims
dtype = v.dtype
shape = v.shape

fill_value = attrs.pop('_FillValue', None)
if v.encoding == {'_FillValue': None} and fill_value is None:
v.encoding = {}
if name in self.ds:
# append to existing variable
zarr_array = self.ds[name]
if self.append_dim is None:
raise ValueError(
'dimension being appended is unknown; '
'did you forget to call to_zarr with append_dim '
'argument?')
[Review comment — Member] It would be nice to add a little more context here, e.g., "variable X already exists, but append_dim was not set".

if self.append_dim in dims:
# this is the DataArray that has append_dim as a
# dimension
append_axis = dims.index(self.append_dim)
new_shape = list(zarr_array.shape)
new_shape[append_axis] += v.shape[append_axis]
new_region = [slice(None)] * len(new_shape)
new_region[append_axis] = slice(
zarr_array.shape[append_axis],
None
)
zarr_array.resize(new_shape)
writer.add(v.data, zarr_array,
region=tuple(new_region))
else:
# new variable
encoding = _extract_zarr_variable_encoding(
v, raise_on_invalid=check)
encoded_attrs = OrderedDict()
# the magic for storing the hidden dimension data
encoded_attrs[_DIMENSION_KEY] = dims
for k2, v2 in attrs.items():
encoded_attrs[k2] = self.encode_attribute(v2)
[Review comment — Contributor] What if we pulled this attribute encoding out before the try block. Then we check encoded_attrs against zarr_array.attrs before appending.

zarr_array = self.ds.create(name, shape=shape, dtype=dtype,
fill_value=fill_value, **encoding)
zarr_array.attrs.put(encoded_attrs)
writer.add(v.data, zarr_array)

def close(self):
if self._consolidate_on_close:
import zarr
Expand Down
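The resize-and-region arithmetic in the append branch of `set_variables` can be checked on plain shapes (a standalone sketch with illustrative values, not the method itself):

```python
def grow_for_append(existing_shape, append_shape, dims, append_dim):
    # Mirrors the append branch of ZarrStore.set_variables: extend the
    # target along append_dim and compute the region the new data fills.
    append_axis = dims.index(append_dim)
    new_shape = list(existing_shape)
    new_shape[append_axis] += append_shape[append_axis]
    region = [slice(None)] * len(new_shape)
    region[append_axis] = slice(existing_shape[append_axis], None)
    return tuple(new_shape), tuple(region)

# Appending 3 new time steps to a (10, 4) array along 'time':
shape, region = grow_for_append((10, 4), (3, 4), ("time", "x"), "time")
print(shape)   # (13, 4)
print(region)  # (slice(10, None, None), slice(None, None, None))
```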
16 changes: 10 additions & 6 deletions xarray/core/dataset.py
Expand Up @@ -1322,7 +1322,8 @@ def to_netcdf(self, path=None, mode='w', format=None, group=None,
compute=compute)

def to_zarr(self, store=None, mode='w-', synchronizer=None, group=None,
encoding=None, compute=True, consolidated=False):
encoding=None, compute=True, consolidated=False,
append_dim=None):
"""Write dataset contents to a zarr group.

.. note:: Experimental
Expand All @@ -1333,9 +1334,10 @@ def to_zarr(self, store=None, mode='w-', synchronizer=None, group=None,
----------
store : MutableMapping or str, optional
Store or path to directory in file system.
mode : {'w', 'w-'}
mode : {'w', 'w-', 'a'}
Persistence mode: 'w' means create (overwrite if exists);
'w-' means create (fail if exists).
'w-' means create (fail if exists);
'a' means append (create if does not exist).
synchronizer : object, optional
Array synchronizer
        group : str, optional
Expand All @@ -1350,21 +1352,23 @@ def to_zarr(self, store=None, mode='w-', synchronizer=None, group=None,
consolidated: bool, optional
If True, apply zarr's `consolidate_metadata` function to the store
after writing.
append_dim: str, optional
If mode='a', the dimension on which the data will be appended.

References
----------
https://zarr.readthedocs.io/
"""
if encoding is None:
encoding = {}
if mode not in ['w', 'w-']:
# TODO: figure out how to handle 'r+' and 'a'
if mode not in ['w', 'w-', 'a']:
# TODO: figure out how to handle 'r+'
raise ValueError("The only supported options for mode are 'w' "
"and 'w-'.")
[Review comment — Member] and 'a' now!

from ..backends.api import to_zarr
return to_zarr(self, store=store, mode=mode, synchronizer=synchronizer,
group=group, encoding=encoding, compute=compute,
consolidated=consolidated)
consolidated=consolidated, append_dim=append_dim)

def __repr__(self):
return formatting.dataset_repr(self)
Expand Down