Merged
100 commits
981d676
First attempt at adding derived forcings
ealerskans Oct 28, 2024
79a94db
Re-structure approach
ealerskans Nov 6, 2024
f37161c
Add derivation of cyclic encoded hour of day and day of year
ealerskans Nov 6, 2024
71afd3a
Add derivation of cyclic encoded time of year
ealerskans Nov 6, 2024
abb626b
Update and add docstrings
ealerskans Nov 6, 2024
8b1f18e
Remove time_of_year
ealerskans Nov 12, 2024
7854013
Provide the full namespace of the function
ealerskans Nov 12, 2024
7fa90bf
Rename the module with derived variables
ealerskans Nov 12, 2024
48c9e3e
Rename the function used for deriving variables
ealerskans Nov 12, 2024
8de9404
Redefine the config file for derived variables and how they are calcu…
ealerskans Nov 15, 2024
ffc030c
Remove derived variables from 'load_and_subset_dataset'
ealerskans Nov 15, 2024
692cdd3
Add try/except for derived variables when loading the dataset
ealerskans Nov 15, 2024
c0cd875
Chunk the input data with the defined output chunks
ealerskans Dec 5, 2024
55224f3
Update toa_radiation function name
ealerskans Dec 5, 2024
678ea52
Correct kwargs usage, add back dropped coordinates and return correct…
ealerskans Dec 5, 2024
9d2db07
Prepare for hour_of_day and day_of_year
ealerskans Dec 5, 2024
26455bc
Add optional 'attributes' to the config of 'derived_variables' and ch…
ealerskans Dec 6, 2024
fbb6065
Add dummy function for getting lat,lon (preparation for #33)
ealerskans Dec 9, 2024
3a12f48
Add function for chunking data and checking the chunk size
ealerskans Dec 9, 2024
3ace219
Add back coordinates on the subset instead of for each derived variab…
ealerskans Dec 9, 2024
a6b61b0
Add 'hour_of_day' to example config
ealerskans Dec 9, 2024
1814297
Merge branch 'main' into feature/derive_forcings
ealerskans Dec 9, 2024
9dcace6
Rename derived variables dataset section in the example config
ealerskans Dec 9, 2024
aba6757
Remove f-string from 'name_format'
ealerskans Dec 10, 2024
143edb6
Update README
ealerskans Dec 10, 2024
6aad6d7
Merge branch 'main' into feature/derive_forcings
ealerskans Dec 11, 2024
12e0575
Update CHANGELOG
ealerskans Dec 11, 2024
000ce92
Make functions for deriving toa_radiation and datetime forcings actua…
ealerskans Dec 11, 2024
0af6319
Update docstring and variable names in 'cyclic_encoding'
ealerskans Dec 11, 2024
284db91
Add ranges to lat and lon in docstring
ealerskans Dec 12, 2024
ba161d2
Add github username to CHANGELOG entry
ealerskans Dec 12, 2024
e3d590c
Update DerivedVariable attributes to be Dict[str, str]
ealerskans Dec 12, 2024
f8cae4f
Add missing attribute to docstring
ealerskans Dec 12, 2024
8470c82
Change var names in 'calculate_toa_radiation'
ealerskans Dec 12, 2024
69afdd3
Remove unnecessary 'or None'
ealerskans Dec 12, 2024
e17ed8b
Use var name 'dim' instead of 'd'
ealerskans Dec 12, 2024
23b119f
Use var names 'key, val' instead of 'k, v'
ealerskans Dec 12, 2024
2ce53c7
Move '_check_dataset_attributes' outside if statement
ealerskans Dec 12, 2024
f1e3d77
Set '{}' as default for 'attributes' and 'chunking'
ealerskans Dec 12, 2024
2afbb35
Make types more explicit
ealerskans Dec 13, 2024
75797a2
Rename 'ds_subset' to 'ds_derived_vars' and update comment for 'ds_in…
ealerskans Dec 13, 2024
31578e8
Add 'Optional[...]' to optional attributes
ealerskans Dec 13, 2024
90e4cf2
Move loading of dataset to a separate function
ealerskans Dec 13, 2024
717c6a5
Simplify if loops
ealerskans Dec 13, 2024
2856c6b
Update '_get_derived_variable_function'
ealerskans Dec 13, 2024
98673ee
Simplify checks of the derived fields
ealerskans Dec 13, 2024
8940e82
Issue warning saying that we assume coordinates are named 'lat' and '…
ealerskans Dec 13, 2024
e12e328
Update README to make it clear that 'attributes' is associated with '…
ealerskans Dec 13, 2024
ecdea30
Indicate that 'variables' and 'derived_variables' are mutually exclusive
ealerskans Dec 13, 2024
e3c0f22
Update docstring of 'InputDataset' class
ealerskans Dec 13, 2024
e907a6d
Correct types in '_check_attributes' docstring
ealerskans Dec 13, 2024
bb9be13
Use 'rpartition' to get 'module_name' and 'function_name'
ealerskans Dec 13, 2024
49de0b3
Add some initial tests for 'derived_variables'
ealerskans Dec 13, 2024
b268f01
Update docstrings and rename 'DerivedVariable.attributes' to 'Derived…
ealerskans Dec 17, 2024
dbd5bfd
Do not add 'attributes' to docstring
ealerskans Dec 17, 2024
474a83d
Remove unnecessary exception handling
ealerskans Dec 17, 2024
1da66e2
Move 'subset_dataset' to 'ops.subsetting'
ealerskans Dec 17, 2024
dc7dc5e
Move 'derived_variables' to 'ops'
ealerskans Dec 17, 2024
c9e96af
Move chunk size check to 'chunking' module
ealerskans Dec 17, 2024
47b8411
Add module docstring
ealerskans Dec 17, 2024
5ae772f
Update tests
ealerskans Dec 17, 2024
2c0bdf8
Add global REQUIRED_FIELD_ATTRIBUTES var and updated check for requir…
ealerskans Dec 18, 2024
f1ce6d1
Update long name for toa_radiation
ealerskans Dec 18, 2024
58d8af6
Update README
ealerskans Dec 18, 2024
f87b954
Return dropped coordinates to the data-arrays instead
ealerskans Dec 19, 2024
80cf058
Adds dims to the dataset to make it work with derived variables that …
ealerskans Dec 19, 2024
da0c171
Add ability to have 'variables' and 'derived_variables' in the same
ealerskans Dec 19, 2024
f61a3b6
Update README
ealerskans Dec 19, 2024
554f869
Add 'load_config' function, which wraps 'from_yaml_file' and checks t…
ealerskans Dec 20, 2024
085aae3
Update README
ealerskans Dec 20, 2024
980e511
Move 'chunk_dataset' to the chunking module
ealerskans Jan 8, 2025
b6e80d5
Update error message for when missing both 'variables' and 'derived_v…
ealerskans Jan 8, 2025
d6c1b36
Move the deriving-functions to a separate module
ealerskans Jan 8, 2025
f1e67bc
Update tests
ealerskans Jan 8, 2025
89e9ad8
Rename (and move): 'mllam_data_prep/ops/derived_variables.py' -> 'mll…
ealerskans Jan 8, 2025
bdf3466
Use the __post_init__() method to validate the config
ealerskans Jan 15, 2025
d3c8693
Loop over 'variables' in 'create_dataset'
ealerskans Jan 15, 2025
0fc31bf
Update file structure
ealerskans Jan 15, 2025
6a7a1e3
Add comment as to why chunking of coordinates is needed
ealerskans Jan 15, 2025
92ad379
Loop over 'derived_variables' in 'create_dataset'
ealerskans Jan 15, 2025
d95c031
Add 'extra_args' to 'derived_variables' to allow functions to have ar…
ealerskans Jan 15, 2025
e158a6c
Update 'calculate_day_of_year' to only return one component (sin or cos)
ealerskans Jan 15, 2025
ff9acc7
Do not modify the arguments in the function for checking (and now get…
ealerskans Jan 15, 2025
dc3f200
Update tests for functions for deriving toa_radiation and time compon…
ealerskans Jan 16, 2025
9093534
Update the config version to v0.6.0
ealerskans Jan 16, 2025
233206c
Raise an error if 'component' is neither 'cos' nor 'sin'.
ealerskans Jan 16, 2025
b716c13
Update docstring and rename variable
ealerskans Jan 16, 2025
8519da4
Update error message since we now only support xr.DataArrays
ealerskans Jan 17, 2025
79b6e46
Update README
ealerskans Jan 17, 2025
3baa1c0
Tests for _check_and_get_required_attributes and _get_derived_variabl…
mfroelund Jan 20, 2025
245e97b
Restructured test data into fixtures and indirect parametrizations. S…
mfroelund Jan 21, 2025
325866a
Adjusted docstrings
mfroelund Jan 22, 2025
92ae991
Merge pull request #1 from mafdmi/feature/test_derive_forcings
ealerskans Jan 23, 2025
c25993d
Merge branch 'main' into feature/derive_forcings
ealerskans Jan 23, 2025
c633462
Add mafdmi as contributor
ealerskans Jan 23, 2025
ceb0d21
Linting
ealerskans Jan 23, 2025
0ecfcca
Prefix 'function' arguments from the dataset with 'ds_input.'
ealerskans Jan 24, 2025
97ee6dd
Minor updates according to review
ealerskans Jan 24, 2025
14beca8
Change back example in README
ealerskans Jan 27, 2025
209a8d8
Update docstring for 'get_latlon_coords_for_input'
ealerskans Jan 27, 2025
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -11,6 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- add ability to derive variables from input datasets [\#34](https://github.com/mllam/mllam-data-prep/pull/34), @ealerskans, @mafdmi
- add github PR template to guide development process on github [\#44](https://github.com/mllam/mllam-data-prep/pull/44), @leifdenby
- add support for zarr 3.0.0 and above [\#51](https://github.com/mllam/mllam-data-prep/pull/51), @kashif

87 changes: 80 additions & 7 deletions README.md
@@ -112,7 +112,7 @@ ds = mdp.create_dataset(config=config)
A full example configuration file is given in [example.danra.yaml](example.danra.yaml), and reproduced here for completeness:

```yaml
schema_version: v0.5.0
schema_version: v0.6.0
dataset_version: v0.1.0

output:
@@ -175,6 +175,24 @@ inputs:
    variables:
      # use surface incoming shortwave radiation as forcing
      - swavr0m
    derived_variables:
      # derive variables to be used as forcings
      toa_radiation:
        kwargs:
          time: ds_input.time
          lat: ds_input.lat
          lon: ds_input.lon
        function: mllam_data_prep.ops.derive_variable.physical_field.calculate_toa_radiation
      hour_of_day_sin:
        kwargs:
          time: ds_input.time
          component: sin
        function: mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day
      hour_of_day_cos:
        kwargs:
          time: ds_input.time
          component: cos
        function: mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day
    dim_mapping:
      time:
        method: rename
@@ -286,15 +304,32 @@ inputs:
      grid_index:
        method: stack
        dims: [x, y]
    target_architecture_variable: state
    target_output_variable: state

  danra_surface:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    dims: [time, x, y]
    variables:
      # shouldn't really be using sea-surface pressure as "forcing", but don't
      # have radiation varibles in danra yet
      - pres_seasurface
      # use surface incoming shortwave radiation as forcing
      - swavr0m
    derived_variables:
      # derive variables to be used as forcings
      toa_radiation:
        kwargs:
          time: ds_input.time
          lat: ds_input.lat
          lon: ds_input.lon
        function: mllam_data_prep.ops.derive_variable.physical_field.calculate_toa_radiation
      hour_of_day_sin:
        kwargs:
          time: ds_input.time
          component: sin
        function: mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day
      hour_of_day_cos:
        kwargs:
          time: ds_input.time
          component: cos
        function: mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day
    dim_mapping:
      time:
        method: rename
@@ -305,7 +340,7 @@ inputs:
      forcing_feature:
        method: stack_variables_by_var_name
        name_format: "{var_name}"
    target_architecture_variable: forcing
    target_output_variable: forcing

...
```
@@ -315,11 +350,49 @@ The `inputs` section defines the source datasets to extract data from. Each sour
- `path`: the path to the source dataset. This can be a local path or a URL to e.g. a zarr dataset or netCDF file, anything that can be read by `xarray.open_dataset(...)`.
- `dims`: the dimensions that the source dataset is expected to have. This is used to check that the source dataset has the expected dimensions and also makes it clearer in the config file what the dimensions of the source dataset are.
- `variables`: selects which variables to extract from the source dataset. This may either be a list of variable names, or a dictionary where each key is the variable name and the value defines a dictionary of coordinates to do selection on. When doing selection you may also optionally define the units of the variable to check that the units of the variable match the units of the variable in the model architecture.
- `target_architecture_variable`: the variable in the model architecture that the source dataset should be mapped to.
- `target_output_variable`: the variable in the model architecture that the source dataset should be mapped to.
- `dim_mapping`: defines how the dimensions of the source dataset should be mapped to the dimensions of the model architecture. This is done by defining a method to apply to each dimension. The methods are:
  - `rename`: simply rename the dimension to the new name
  - `stack`: stack the listed dimension to create the dimension in the output
  - `stack_variables_by_var_name`: stack the dimension into the new dimension, and also stack the variable name into the new variable name. This is useful when you have multiple variables with the same dimensions that you want to stack into a single variable.
- `derived_variables`: defines the variables to be derived from the variables available in the source dataset. This should be a dictionary where each key is the name of the variable to be derived and the value is a dictionary with the following information. See also the 'Derived Variables' section for more details.
  - `function`: the function used to derive a variable. This should be a string with the full namespace of the function, e.g. `mllam_data_prep.ops.derive_variable.physical_field.calculate_toa_radiation`.
  - `kwargs`: arguments to `function`. This is a dictionary where each key is a named argument of `function` and each value is the input passed for that argument. Values are either extracted/selected from the input dataset or supplied directly by the user. Values to be extracted from the input dataset must be prefixed with "ds_input." to distinguish them from other arguments. See the 'Derived Variables' section for more details.
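
The `function` and `kwargs` conventions described above can be illustrated with a minimal sketch. This is not the package's implementation: the helper names `resolve_function` and `split_kwargs` are hypothetical, and a standard-library function stands in for a real deriving function. It only shows how a full-namespace string could be imported and how "ds_input."-prefixed values could be told apart from user-supplied ones.

```python
import importlib
from typing import Any, Callable, Dict, Tuple

DS_INPUT_PREFIX = "ds_input."  # prefix described in the README text above


def resolve_function(dotted_path: str) -> Callable[..., Any]:
    """Import a function given its full namespace, e.g. 'package.module.func'."""
    # Split on the last dot: everything before is the module, the rest the function
    module_name, _, function_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a full namespace, got '{dotted_path}'")
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


def split_kwargs(kwargs: Dict[str, Any]) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Separate kwargs that name dataset variables from user-supplied values."""
    from_dataset: Dict[str, str] = {}
    user_supplied: Dict[str, Any] = {}
    for key, val in kwargs.items():
        if isinstance(val, str) and val.startswith(DS_INPUT_PREFIX):
            # Keep only the variable name, e.g. 'ds_input.time' -> 'time'
            from_dataset[key] = val[len(DS_INPUT_PREFIX):]
        else:
            user_supplied[key] = val
    return from_dataset, user_supplied


# 'math.sqrt' is only a stand-in for a deriving function
sqrt = resolve_function("math.sqrt")
from_dataset, user_supplied = split_kwargs(
    {"time": "ds_input.time", "component": "cos"}
)
```

In this sketch `from_dataset` holds the names to look up in the input dataset and `user_supplied` holds the values passed through unchanged.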

#### Derived Variables
Variables that are not part of the source dataset but can be derived from variables in the source dataset can also be included. They should be defined in their own section, called `derived_variables`, as illustrated in the example config above and in the example config file [example.danra.yaml](example.danra.yaml).

To derive a variable, the function to be used (`function`) and the arguments to this function (`kwargs`) need to be specified, as explained above. Arguments whose values come from the input dataset must be distinguished from arguments supplied by the user. The example below illustrates how to derive the cosine component of the cyclically encoded hour of day variable:

```yaml
derived_variables:
  hour_of_day_cos:
    kwargs:
      time: ds_input.time
      component: cos
    function: mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day
    attrs:
      units: 1
      long_name: cos component of cyclically encoded hour of day
```

The function `mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day` takes two arguments: `time` and `component`. The `time` argument is taken from the `time` variable of the input dataset and is therefore prefixed with "ds_input." to distinguish it from arguments that should not be extracted from the source dataset. The `component` argument, on the other hand, is a string (either "sin" or "cos") that decides whether the returned derived variable is the sine or cosine component of the cyclically encoded hour of day.
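
As a rough illustration of what "cyclically encoded" means here, the sketch below maps an hour onto the unit circle with a 24-hour period and returns one component. It is written against plain `datetime` objects rather than `xr.DataArray`s, and the names `cyclic_encode` and `hour_of_day` are hypothetical; the package's own `calculate_hour_of_day` may differ in detail (for example in how it treats minutes).

```python
import math
from datetime import datetime


def cyclic_encode(value: float, period: float, component: str) -> float:
    """Return the sine or cosine of `value` mapped onto a circle of length `period`."""
    angle = 2 * math.pi * value / period
    if component == "sin":
        return math.sin(angle)
    if component == "cos":
        return math.cos(angle)
    # Mirror the README: only 'sin' and 'cos' are valid components
    raise ValueError(f"component must be 'sin' or 'cos', got '{component}'")


def hour_of_day(time: datetime, component: str) -> float:
    """Encode the hour of `time` with an assumed 24-hour period."""
    return cyclic_encode(time.hour, 24, component)


midnight_cos = hour_of_day(datetime(2024, 1, 1, 0), "cos")
morning_sin = hour_of_day(datetime(2024, 1, 1, 6), "sin")
```

Using both components together gives each hour a unique position on the circle, so that 23:00 and 00:00 end up close to each other, which a raw hour value would not achieve.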

In addition, an optional section called `attrs` can be added to set attributes on the derived variable, as illustrated in the example above. Note that the attributes `units` and `long_name` are **required**: if the function used to derive a variable does not set them, they must be set in the config file. For functions defined in `mllam_data_prep.ops.derive_variable`, the `attrs` section is optional because the required attributes should already be defined; in that case, any `units` or `long_name` given under `attrs` in the config file will **overwrite** the values set by the function. Other attributes can be set by adding them under `attrs` in the same way as shown for `units` and `long_name` in the example above.
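
The precedence rule for attributes can be sketched as follows. This is an assumption-laden illustration, not the package's code: `merge_attributes` is a hypothetical helper, and the contents of `REQUIRED_FIELD_ATTRIBUTES` are taken from the README text (`units` and `long_name`).

```python
from typing import Dict, Optional

# Required attributes as stated in the README text above
REQUIRED_FIELD_ATTRIBUTES = ("units", "long_name")


def merge_attributes(
    function_attrs: Dict[str, str],
    config_attrs: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Combine attributes set by the deriving function with those from the config."""
    merged = dict(function_attrs)
    # Config values win over values set by the function
    merged.update(config_attrs or {})
    missing = [name for name in REQUIRED_FIELD_ATTRIBUTES if name not in merged]
    if missing:
        raise ValueError(f"Missing required attribute(s): {', '.join(missing)}")
    return merged


merged = merge_attributes(
    {"units": "1", "long_name": "hour of day"},
    {"long_name": "cos component of cyclically encoded hour of day"},
)
```

Here `units` is kept from the function while `long_name` is replaced by the config value; calling `merge_attributes({}, {"units": "1"})` would raise because `long_name` is set nowhere.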

Currently, the following derived variables are included as part of `mllam-data-prep`:
- `toa_radiation`:
  - Top-of-atmosphere incoming radiation
  - function: `mllam_data_prep.ops.derive_variable.physical_field.calculate_toa_radiation`
  - arguments: `lat`, `lon`, `time`
- `hour_of_day_[sin/cos]`:
  - Sine or cosine part of cyclically encoded hour of day
  - function: `mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day`
  - arguments: `time`, `component`
- `day_of_year_[sin/cos]`:
  - Sine or cosine part of cyclically encoded day of year
  - function: `mllam_data_prep.ops.derive_variable.time_components.calculate_day_of_year`
  - arguments: `time`, `component`


### Config schema versioning
20 changes: 19 additions & 1 deletion example.danra.yaml
@@ -1,4 +1,4 @@
schema_version: v0.5.0
schema_version: v0.6.0
dataset_version: v0.1.0

output:
@@ -61,6 +61,24 @@ inputs:
    variables:
      # use surface incoming shortwave radiation as forcing
      - swavr0m
    derived_variables:
      # derive variables to be used as forcings
      toa_radiation:
        kwargs:
          time: ds_input.time
          lat: ds_input.lat
          lon: ds_input.lon
        function: mllam_data_prep.ops.derive_variable.physical_field.calculate_toa_radiation
      hour_of_day_sin:
        kwargs:
          time: ds_input.time
          component: sin
        function: mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day
      hour_of_day_cos:
        kwargs:
          time: ds_input.time
          component: cos
        function: mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day
    dim_mapping:
      time:
        method: rename
96 changes: 87 additions & 9 deletions mllam_data_prep/config.py
@@ -9,6 +9,50 @@ class InvalidConfigException(Exception):
    pass


def validate_config(config_inputs):
    """
    Validate that, in the config:
    - at least one of `variables` and `derived_variables` is present for each
      input dataset
    - if both `variables` and `derived_variables` are present, that they don't
      add the same variables to the dataset

    Parameters
    ----------
    config_inputs: Dict[str, InputDataset]

    Returns
    -------
    None
    """

    for input_dataset_name, input_dataset in config_inputs.items():
        if not input_dataset.variables and not input_dataset.derived_variables:
            raise InvalidConfigException(
                f"Input dataset '{input_dataset_name}' is missing the keys `variables` and/or"
                " `derived_variables`. Make sure that you update the config so that the input"
                f" dataset '{input_dataset_name}' contains at least either a `variables` or"
                " `derived_variables` section."
            )
        elif input_dataset.variables and input_dataset.derived_variables:
            # Check that there are no overlapping variables
            if isinstance(input_dataset.variables, list):
                variable_vars = input_dataset.variables
            elif isinstance(input_dataset.variables, dict):
                variable_vars = input_dataset.variables.keys()
            else:
                raise TypeError(
                    f"Expected an instance of list or dict, but got {type(input_dataset.variables)}."
                )
            derived_variable_vars = input_dataset.derived_variables.keys()
            common_vars = list(set(variable_vars) & set(derived_variable_vars))
            if len(common_vars) > 0:
                raise InvalidConfigException(
                    "Both `variables` and `derived_variables` include the following variable name(s):"
                    f" '{', '.join(common_vars)}'. This is not allowed. Make sure that there"
                    " are no overlapping variable names between `variables` and `derived_variables`,"
                    f" either by renaming or removing '{', '.join(common_vars)}' from one of them."
                )


@dataclass
class Range:
    """
@@ -52,6 +96,32 @@ class ValueSelection:
    units: str = None


@dataclass
class DerivedVariable:
    """
    Defines a derived variable, where the function (for calculating the variable) and
    the kwargs (arguments to function) are specified. kwargs can contain both arguments
    that extract/select data from the input dataset, which should have the "ds_input."
    prefix, and other arguments that should not be extracted from the dataset (e.g. a
    string to indicate if the sine or cosine component should be extracted).

    Optionally, attributes to the derived variable can be specified in `attrs`, e.g.
    {"attrs": {"units": "W*m**-2", "long_name": "top-of-the-atmosphere radiation"}}.
    In case a function does not return an `xr.DataArray` with the required attributes
    (`units` and `long_name`) set, these have to be specified in `attrs`.

    Attributes:
        kwargs: Variables required for calculating the derived variable.
        function: Function used to calculate the derived variable.
        attrs: Attributes (e.g. `units` and `long_name`) to set for the derived variable.
    """

    kwargs: Dict[str, str]
    function: str
    attrs: Optional[Dict[str, str]] = field(default_factory=dict)


@dataclass
class DimMapping:
    """
@@ -120,7 +190,8 @@ class InputDataset:
    1) the path to the dataset,
    2) the expected dimensions of the dataset,
    3) the variables to select from the dataset (and optionally subsection
       along the coordinates for each variable) and finally
       along the coordinates for each variable) or the variables to derive
       from the dataset, and finally
    4) the method by which the dimensions and variables of the dataset are
       mapped to one of the output variables (this includes stacking of all
       the selected variables into a new single variable along a new coordinate,
@@ -134,11 +205,6 @@ class InputDataset:
    dims: List[str]
        List of the expected dimensions of the dataset. E.g. `["time", "x", "y"]`.
        These will be checked to ensure consistency of the dataset being read.
    variables: Union[List[str], Dict[str, Dict[str, ValueSelection]]]
        List of the variables to select from the dataset. E.g. `["temperature", "precipitation"]`
        or a dictionary where the keys are the variable names and the values are dictionaries
        defining the selection for each variable. E.g. `{"temperature": {"levels": {"values": [1000, 950, 900]}}}`
        would select the "temperature" variable and only the levels 1000, 950, and 900.
    dim_mapping: Dict[str, DimMapping]
        Mapping of the variables and dimensions in the input dataset to the dimensions of the
        output variable (`target_output_variable`). The key is the name of the output dimension to map to
@@ -151,14 +217,23 @@ class InputDataset:
        (e.g. two datasets that coincide in space and time will only differ in the feature dimension,
        so the two will be combined by concatenating along the feature dimension).
        If a single shared coordinate cannot be found then an exception will be raised.
    variables: Union[List[str], Dict[str, Dict[str, ValueSelection]]]
        List of the variables to select from the dataset. E.g. `["temperature", "precipitation"]`
        or a dictionary where the keys are the variable names and the values are dictionaries
        defining the selection for each variable. E.g. `{"temperature": {"levels": {"values": [1000, 950, 900]}}}`
        would select the "temperature" variable and only the levels 1000, 950, and 900.
    derived_variables: Dict[str, DerivedVariable]
        Dictionary of variables to derive from the dataset, where the keys are the names the derived
        variables will be given and the values are `DerivedVariable` definitions that specify how to
        derive each variable.
    """

    path: str
    dims: List[str]
    variables: Union[List[str], Dict[str, Dict[str, ValueSelection]]]
    dim_mapping: Dict[str, DimMapping]
    target_output_variable: str
    attributes: Dict[str, Any] = None
    variables: Optional[Union[List[str], Dict[str, Dict[str, ValueSelection]]]] = None
    derived_variables: Optional[Dict[str, DerivedVariable]] = None
    attributes: Optional[Dict[str, Any]] = field(default_factory=dict)


@dataclass
@@ -258,7 +333,7 @@ class Output:

    variables: Dict[str, List[str]]
    coord_ranges: Dict[str, Range] = None
    chunking: Dict[str, int] = None
    chunking: Dict[str, int] = field(default_factory=dict)
    splitting: Splitting = None


@@ -298,6 +373,9 @@ class Config(dataclass_wizard.JSONWizard, dataclass_wizard.YAMLWizard):
    dataset_version: str
    extra: Dict[str, Any] = None

    def __post_init__(self):
        validate_config(self.inputs)

    class _(JSONWizard.Meta):
        raise_on_unknown_json_key = True
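
The effect of running the validation from `__post_init__` can be seen in a small standalone sketch built only on the standard library's `dataclasses`. The classes `ToyInput` and `ToyConfig`, the exception `InvalidConfigError`, and the variable name `t2m` are hypothetical stand-ins, not part of `mllam_data_prep`; the point is only that an invalid config fails at construction time rather than later during dataset creation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


class InvalidConfigError(Exception):
    """Hypothetical stand-in for `InvalidConfigException`."""


@dataclass
class ToyInput:
    variables: Optional[Union[List[str], Dict[str, dict]]] = None
    derived_variables: Optional[Dict[str, dict]] = None


@dataclass
class ToyConfig:
    inputs: Dict[str, ToyInput] = field(default_factory=dict)

    def __post_init__(self):
        # Runs automatically right after the generated __init__
        for name, dataset in self.inputs.items():
            if not dataset.variables and not dataset.derived_variables:
                raise InvalidConfigError(
                    f"Input dataset '{name}' needs `variables` or `derived_variables`."
                )
            # set() over a list gives its items, over a dict gives its keys
            overlap = set(dataset.variables or []) & set(dataset.derived_variables or {})
            if overlap:
                raise InvalidConfigError(
                    f"Input dataset '{name}' defines {sorted(overlap)} twice."
                )


valid = ToyConfig(
    inputs={"surface": ToyInput(variables=["swavr0m"], derived_variables={"toa_radiation": {}})}
)

try:
    ToyConfig(inputs={"surface": ToyInput(variables=["t2m"], derived_variables={"t2m": {}})})
    message = ""
except InvalidConfigError as err:
    message = str(err)
```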
