Releases: mllam/mllam-data-prep

v0.7.0

13 Jan 09:52
5349f2b

This release adds support for cropping a dataset using the convex hull of the lat/lon coordinates of another dataset, which can be used for creating boundary data in Limited Area Modelling setups.
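As a rough sketch, this cropping might be configured in the output section of the config along the following lines; the key names domain_cropping, margin_degrees and interior_dataset_config_path are illustrative assumptions, not confirmed fields of the released schema:

    output:
      domain_cropping:
        # crop the output domain to the convex hull of the lat/lon
        # coordinates of the dataset described by this second config
        # (key names are hypothetical)
        margin_degrees: 0.2
        interior_dataset_config_path: interior_domain.yaml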

v0.6.1

20 May 07:44
af5eb65

This release contains bugfixes that update the tests to use a newer version of pre-commit and the correct Python version, and remove uses of incompatible typing notation.

Fixes

  • use old Union typing notation compatible with all required Python versions #77, @SimonKamuk

v0.6.0

27 Mar 13:39
ce95c76

This release adds the ability to slice input data by any coordinate, derive variables from input datasets, and store the config in created datasets. It also adds support for zarr 3.0.0 and above, and a mypy typing action to the pre-commit hooks. In addition, a number of bugs were fixed related to unwanted dimensions being added to the dataset, chunk size estimates, and derived functions. The release also includes a number of maintenance updates, including updating the DANRA test dataset to v0.2.0 (which is smaller, leading to faster test execution) and updating the dataclass-wizard dependency to at least v0.29.2.

Added

  • add functionality to slice input data by any coordinate #55, @matschreiner
  • add ability to derive variables from input datasets (see the config sketch after this list) #34, @ealerskans, @mafdmi
  • add github PR template to guide development process on github #44, @leifdenby
  • add support for zarr 3.0.0 and above #51, @kashif
  • warn if the user tries to load a non-YAML file #50, @j6k4m8
  • add mypy typing action to pre-commit hooks #67, @observingClouds
  • add support for storing the config in created datasets and an option to only overwrite the zarr dataset if the config changes #64, @leifdenby
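A minimal sketch of what deriving a variable in the inputs section might look like; the dataset name, path, function path and kwargs below are hypothetical, for illustration only:

    inputs:
      danra_surface:
        path: /path/to/danra_surface.zarr  # hypothetical input dataset
        derived_variables:
          # hypothetical example: derive top-of-atmosphere radiation
          # from the latitude, longitude and time coordinates
          toa_radiation:
            function: mllam_data_prep.ops.derive_variable.calculate_toa_radiation
            kwargs:
              lat: lat
              lon: lon
              time: time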

Fixes

  • fix bug which adds unwanted dimensions to the dataset #60, @ealerskans, @observingClouds
  • correct chunk size estimate #59, @ealerskans
  • fix bug arising when variables provided to derived functions are renamed #56, @leifdenby
  • ensure config fields defaulting to None are typed as Optional and fields defaulting to {} are given a default-factory so that serialization with default values works correctly #63, @leifdenby
  • fix reading of exported config files #67, @observingClouds

Maintenance

  • update DANRA test dataset to v0.2.0 which uses a smaller cropped domain #62, @leifdenby
  • update dataclass-wizard dependency to at least v0.29.2, allowing for use of Union types together with a check for unmatched keys in the config yaml #73, @leifdenby

v0.5.0

20 Nov 19:14
86aa6c1

This release adds support for an optional extra section in the config file (for user-defined extra information that is ignored by mllam-data-prep) and fixes a few minor issues. Note that to use the extra section, the schema version in the config file must be increased to v0.5.0.

Added

  • Add optional section called extra to config file to allow for user-defined extra information that is ignored by mllam-data-prep but can be used by downstream applications. #18, @leifdenby
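For example, a config might carry map-projection details for downstream plotting or modelling code; everything under extra below is an arbitrary, user-chosen example and has no meaning to mllam-data-prep itself:

    schema_version: v0.5.0
    extra:
      # free-form, user-defined content: ignored by mllam-data-prep,
      # but available to downstream applications that read the config
      projection:
        class_name: LambertConformal
        kwargs:
          central_longitude: 25.0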

Changed

  • remove f-string from name_format in config examples #35
  • replace global config for dataclass_wizard on mllam_data_prep.config.Config with config specific to that dataclass (to avoid conflicts with other uses of dataclass_wizard) #36
  • Schema version bumped to v0.5.0 to match release version that supports optional extra section in config #18

v0.4.0

18 Nov 17:00
8e7a5bc

This release adds support for defining the output path in the command line interface and addresses bugs around optional dependencies for dask.distributed.

Added

  • add optional output path argument to the command-line parser #26

Changed

  • fix bug by making the distributed dependency optional #27
  • change config example to call validation split val instead of validation #28
  • fix typo in the name of the distributed install dependency #20
  • add missing psutil requirement #21

v0.3.0

12 Aug 14:03
e49b8fc

Added

  • add support for parallel processing using dask.distributed with command line flags --dask-distributed-local-core-fraction and --dask-distributed-local-memory-fraction to control the number of cores and memory to use on the local machine. #16
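As a usage example (assuming the package is invoked as python -m mllam_data_prep; the config file name here is illustrative and the flag names are as quoted above), giving the local dask cluster half of the machine's cores and memory might look like:

    python -m mllam_data_prep example.danra.yaml \
      --dask-distributed-local-core-fraction 0.5 \
      --dask-distributed-local-memory-fraction 0.5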

v0.2.0

05 Aug 13:01
3297c75

Added

  • add support for creating dataset splits (e.g. train, validation, test) through the output.splitting section in the config file, and support for optionally computing statistics for a given split (with output.splitting.splits.{split_name}.compute_statistics); see the sketch after this list #28

  • include units and long_name attributes for all stacked variables as {output_variable}_units and {output_variable}_long_name #11

  • include version of mllam-data-prep in output #12
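A minimal sketch of such a splitting section; the dim, start, end, ops and dims keys, the split names and the date values are illustrative assumptions, not taken from this release's documentation:

    output:
      splitting:
        # dimension along which the dataset is split (hypothetical layout)
        dim: time
        splits:
          train:
            start: 1990-09-03T00:00
            end: 2018-12-31T23:00
            # optionally compute statistics for this split
            compute_statistics:
              ops: [mean, std]
              dims: [grid_index, time]
          val:
            start: 2019-01-01T00:00
            end: 2019-12-31T23:00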

Changed

  • split dataset creation and storage to zarr into separate functions mllam_data_prep.create_dataset(...) and
    mllam_data_prep.create_dataset_zarr(...) respectively #7

  • changes to spec from v0.1.0:

    • the architecture section has been renamed to output to make it clearer that this section defines the properties of the output of mllam-data-prep
    • sampling_dim has been removed from the output (previously architecture) section of the spec, as it is not needed to create the training data
    • the variables (and their dimensions) of the output definition have been renamed from architecture.input_variables to output.variables
    • coordinate value ranges for the dimensions of the output (i.e. what the architecture expects as input) have been renamed from architecture.input_ranges to output.coord_ranges to make their use clearer
    • selection on variable coordinate values is now set with inputs.{dataset_name}.variables.{variable_name}.values rather than inputs.{dataset_name}.variables.{variable_name}.sel
    • when the dimension-mapping method stack_variables_by_var_name is used, the formatting string for the new variable is now called name_format rather than name
    • when dimension-mapping is done by simply renaming a dimension, this now needs to be configured by providing the named method (rename) explicitly through the method key, i.e. rather than {to_dim}: {from_dim} it is now {to_dim}: {method: rename, dim: {from_dim}}, matching the signature of the other dimension-mapping methods (see the example after this list)
    • the inputs.{dataset_name}.name attribute has been removed; with the key dataset_name it is superfluous
  • relax minimum Python version requirement to >3.8 to simplify downstream usage #13
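To illustrate the rename change quoted above with concrete (hypothetical) dimension names, and assuming the dim_mapping key under inputs.{dataset_name}, a mapping that previously read

    dim_mapping:
      time: analysis_time

is now written as

    dim_mapping:
      time:
        method: rename
        dim: analysis_time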

v0.1.0

22 May 16:26
e1cf669

First tagged release of mllam-data-prep, which includes functionality to declaratively describe (in a yaml-config file) how the variables and coordinates of a set of zarr-based source datasets are mapped to a new set of variables with new coordinates to form a single training dataset, and to write this resulting dataset to a new zarr dataset. This explicit mapping gives the flexibility to target different model architectures (which may require inputs with different shapes).