
tests: add unit tests for functions without direct test coverage #94

Open
arnavsharma990 wants to merge 2 commits into mllam:main from arnavsharma990:tests/add-missing-unit-tests

Conversation

@arnavsharma990 commented Feb 20, 2026

Describe your changes

This PR adds unit tests for several functions that previously lacked direct test coverage in the dataset processing utilities.

The goal is to improve reliability and maintainability of the codebase by ensuring that key dataset validation and chunking functions are properly tested, including edge cases and error handling.

Tests were added for the following functions:

  • check_point_in_dataset

    • verifies behavior when a point exists, does not exist, or is None
  • check_step

    • tests constant step matching and mismatching cases
    • tests non-constant steps and edge cases
  • load_input_dataset

    • validates correct loading of datasets from Zarr and NetCDF formats
    • tests error handling for invalid inputs
  • check_chunk_size

    • tests warning behavior for small and large chunk sizes
    • verifies behavior when dimensions are missing
  • chunk_dataset

    • tests successful dataset chunking
    • verifies error handling scenarios

These tests improve test coverage and help ensure dataset validation utilities behave correctly across different conditions.
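To illustrate the shape of these tests, here is a minimal sketch of the `check_point_in_dataset` cases. The helper below is a hypothetical stand-in operating on plain sequences, since the real function (which takes an xarray dataset) is not shown in this PR:

```python
# Hypothetical stand-in for check_point_in_dataset: the real function in
# this PR operates on an xarray dataset; plain sequences are used here so
# the sketch stays self-contained.
def check_point_in_dataset(point, values):
    if point is None:
        return False
    return point in values


def test_point_exists():
    assert check_point_in_dataset(3, [1, 2, 3])


def test_point_missing():
    assert not check_point_in_dataset(7, [1, 2, 3])


def test_point_none():
    assert not check_point_in_dataset(None, [1, 2, 3])
```

Each test pins down one of the three behaviors listed above (exists, does not exist, `None`), which keeps failures easy to localize.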

Dependencies:
No new dependencies were introduced.


Issue Link

Improves test coverage for dataset validation utilities.


Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the documentation to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form
  • I have requested a reviewer and an assignee

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section reflecting type of change:
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog
  • Once the PR is ready to be merged, squash commits and merge the PR.

Add focused unit tests for:
- check_point_in_dataset: test point exists/not exists/None cases
- check_step: test constant step matching/mismatching, non-constant step, edge cases
- load_input_dataset: test zarr/netCDF loading and error handling
- check_chunk_size: test warning behavior for small/large chunks, missing dimensions
- chunk_dataset: test successful chunking and error handling

These tests follow existing test patterns and provide minimal but complete
coverage for previously untested helper functions.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings February 20, 2026 08:11

Copilot AI left a comment


Pull request overview

Adds new unit tests to cover several helper/ops functions that previously lacked direct test coverage, focusing on selection validation, dataset loading, and dataset chunking behavior.

Changes:

  • Add tests for check_point_in_dataset and check_step edge cases and error paths.
  • Add tests for load_input_dataset for Zarr and NetCDF inputs plus missing-file handling.
  • Add tests for check_chunk_size warning behavior and chunk_dataset success/error handling.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
tests/test_selection_helpers.py Adds coverage for selection helper validation and step-size checking edge cases.
tests/test_loading.py Adds coverage for loading datasets from Zarr/NetCDF and missing-path behavior.
tests/test_chunking.py Adds coverage for chunk-size checking and dataset chunking error handling.


Comment on lines +23 to +35
```python
@pytest.fixture
def large_dataset():
    """Create a dataset that will exceed chunk size warning."""
    # Create dataset with large chunks that exceed 1GB warning
    # Using float64 (8 bytes), need > 1GB / 8 = 134217728 elements
    # For simplicity, create a smaller but still large dataset
    size = 5000
    return xr.Dataset(
        {
            "large_var": (["x", "y"], np.random.random((size, size))),
        },
        coords={"x": range(size), "y": range(size)},
    )
```

Copilot AI Feb 20, 2026


large_dataset allocates a 5000x5000 float64 array (~200MB) during test collection/execution, which is likely to slow down or OOM CI. You can trigger the chunk-size warning without a huge dataset (the implementation only uses chunks and dtype), so this fixture should be removed or made tiny.

Comment on lines +46 to +53
```python
def test_check_chunk_size_large_chunks(large_dataset, caplog):
    """Test check_chunk_size with large chunks (should warn)."""
    # Use chunks that will create large memory usage
    chunks = {"x": 1000, "y": 1000}
    check_chunk_size(large_dataset, chunks)
    # Should log a warning
    assert len(caplog.records) > 0
    assert "exceeds" in caplog.records[0].message.lower()
```

Copilot AI Feb 20, 2026


This warning test is currently inconsistent with the implementation:

  • check_chunk_size computes memory_usage from the requested chunk sizes (the product of the values in chunks) and the dtype, so {"x": 1000, "y": 1000} is only ~8 MB and should not exceed the 1 GB threshold.
  • check_chunk_size logs via loguru.logger, which caplog does not capture by default, so caplog.records will remain empty.

Adjust the test to (1) use chunk sizes whose product exceeds the threshold and (2) capture Loguru output using a Loguru sink (or a pytest plugin that bridges Loguru to stdlib logging).
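The arithmetic behind the first point can be checked directly. This sketch assumes, as the comment describes, that the estimated chunk memory is simply the product of the requested chunk sizes times the dtype's item size:

```python
ITEMSIZE_FLOAT64 = 8  # bytes per float64 element
ONE_GIB = 1024 ** 3


def chunk_nbytes(chunks, itemsize=ITEMSIZE_FLOAT64):
    """Estimated memory of one chunk: product of chunk sizes times item size."""
    n = 1
    for size in chunks.values():
        n *= size
    return n * itemsize


small = chunk_nbytes({"x": 1000, "y": 1000})    # 8_000_000 bytes, ~8 MB
large = chunk_nbytes({"x": 12000, "y": 12000})  # 1_152_000_000 bytes, over 1 GiB
```

So `{"x": 1000, "y": 1000}` stays well under the threshold, while something on the order of 12000 x 12000 exceeds it.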

Comment on lines +4 to +6
```python
import tempfile
from pathlib import Path
```

Copilot AI Feb 20, 2026


tempfile and Path are imported but never used in this test module; please remove them to keep the test suite clean (and avoid failing linting if enabled).

Suggested change:

```diff
-import tempfile
-from pathlib import Path
```
Comment on lines +33 to +37
```python
def test_load_input_dataset_netcdf(sample_dataset, tmp_path):
    """Test load_input_dataset with netCDF format."""
    nc_path = tmp_path / "test.nc"
    sample_dataset.to_netcdf(nc_path)
```

Copilot AI Feb 20, 2026


sample_dataset.to_netcdf(...) requires an optional NetCDF engine (typically scipy, netCDF4, or h5netcdf). The project dependencies don’t appear to include any of these, so this test may fail in CI depending on the environment. Consider using pytest.importorskip(...) for the chosen engine and specifying it explicitly (e.g., engine="scipy"), or adding an explicit test dependency to ensure NetCDF support is available.

"""
Unit tests for helper functions in ops.selection module.
"""
import datetime

Copilot AI Feb 20, 2026


datetime is imported but unused in this test file; please remove it to avoid unused-import warnings / lint failures.

Suggested change:

```diff
-import datetime
```

Comment on lines +72 to +84
```python
def test_check_step_single_point_coordinate():
    """Test check_step with single point coordinate (edge case - will raise IndexError)."""
    # Create dataset with single time point (diff will be empty array)
    time_values = pd.date_range("2020-01-01", periods=1, freq="3H")
    ds = xr.Dataset(
        {"var": (["time"], [1])},
        coords={"time": time_values},
    )
    requested_step = pd.Timedelta(hours=3)
    # This will raise IndexError when trying to access all_steps[0] on empty array
    # This documents current behavior - could be improved to raise more descriptive error
    with pytest.raises(IndexError):
        check_step(requested_step, "time", ds)
```

Copilot AI Feb 20, 2026


This test locks in an IndexError for the single-point coordinate case, which is an implementation accident (coming from indexing all_steps[0] on an empty diff array). It would be more maintainable to update check_step to raise a descriptive ValueError when ds[coord] has fewer than 2 points, and assert that error type/message here instead.
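A guard along the lines the reviewer suggests could look like this. It is a simplified sketch: the real `check_step` takes a coordinate name and a dataset, which is reduced here to a plain sequence of values so the example stays self-contained.

```python
def check_step(requested_step, coord_values):
    """Hypothetical reworked check with an explicit length guard."""
    if len(coord_values) < 2:
        # Descriptive error instead of an accidental IndexError on the
        # empty diff array.
        raise ValueError(
            f"cannot infer step size from coordinate with "
            f"{len(coord_values)} point(s); at least 2 are required"
        )
    all_steps = [b - a for a, b in zip(coord_values, coord_values[1:])]
    if all_steps[0] != requested_step:
        raise ValueError(
            f"step {all_steps[0]!r} does not match requested {requested_step!r}"
        )
```

The test can then assert on the `ValueError` and its message, which keeps the contract explicit rather than locking in an implementation accident.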

- test_chunking.py:
  - Remove large_dataset fixture (inefficient 5000x5000 allocation)
  - Fix test_check_chunk_size_large_chunks to use chunk sizes that actually
    exceed 1GB threshold (12000x12000) instead of 1000x1000
  - Use loguru handler to capture logs instead of caplog (which doesn't
    capture loguru output)
  - Remove caplog from test_check_chunk_size_small_chunks

- test_loading.py:
  - Remove unused imports: tempfile and Path
  - Fix test_load_input_dataset_netcdf to use pytest.importorskip for
    netCDF4 engine and specify engine explicitly

- test_selection_helpers.py:
  - Remove unused datetime import
  - Update test_check_step_single_point_coordinate to expect ValueError
    instead of IndexError

- ops/selection.py:
  - Fix check_step to raise descriptive ValueError when coordinate has
    fewer than 2 points, instead of allowing IndexError

Co-authored-by: Cursor <cursoragent@cursor.com>
@leifdenby (Member)

Please revert to using the default PR template here. Once you've done that you can tag me for a review :) thank you!

@arnavsharma990 (Author)

@leifdenby I've updated the PR to follow the default template.
Please let me know if anything else should be adjusted. Thanks!
