WIP: Tracked segmentation masks I/O via Zarr and Dask#936

Draft
egoistpizza wants to merge 3 commits into neuroinformatics-unit:main from egoistpizza:feature/octron-mask-io

Conversation

@egoistpizza

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other: Draft / Proof of Concept for mask I/O integration

Why is this PR needed?

As tracking models like SAM 2 and OCTRON become more widely used, there is a growing need to handle dense segmentation masks alongside standard bounding boxes. However, loading high-resolution instance masks directly into memory creates severe RAM bottlenecks. A memory-efficient, lossless approach is needed to bring this data into the movement ecosystem.

What does this PR do?

This is a draft PR to lay the early architectural groundwork for a hybrid I/O approach, following up on recent community discussions. It introduces two main components to movement.io:

  1. load_octron_bboxes: A CSV loader that handles standard bounding box coordinates but includes an extra_data_vars toggle. This allows users to selectively load heavy metrics (like eccentricity or solidity) only when explicitly required.
  2. load_masks_from_zarr: A loader that lazily references instance masks using dask.array. It casts the data to boolean and structures it into an xarray.DataArray with (time, individuals, x, y) dimensions.

References

How has this PR been tested?

I wrote a local mock script that simulates an OCTRON CSV and a dummy Zarr array. This confirms the behavior we discussed on Zulip (specifically "Option 2"): the extra_data_vars append correctly, and the masks remain lazy boolean Dask arrays that are never fully materialized in RAM.

Click to see the test script and output

test_poc.py

"""Test script for the proof-of-concept mask and bounding box loaders."""

import shutil
from pathlib import Path

import numpy as np
import pandas as pd
import zarr

from movement.io.load_masks import load_masks_from_zarr, load_octron_bboxes


def create_mock_data(csv_path: str, zarr_path: str) -> None:
    """Create temporary mock data for testing."""
    # 1. Create mock OCTRON CSV data
    df = pd.DataFrame({
        "x": [10.5, 11.0, 11.5],
        "y": [20.0, 21.0, 22.0],
        "width": [5.0, 5.0, 5.0],
        "height": [10.0, 10.0, 10.0],
        "eccentricity": [0.80, 0.81, 0.82],
        "solidity": [0.90, 0.90, 0.91]
    })
    df.to_csv(csv_path, index=False)

    # 2. Create mock Zarr mask data (3 frames, 100x100 pixels)
    mock_mask_data = np.random.choice([0, 1], size=(3, 100, 100))
    zarr.save(zarr_path, mock_mask_data)


def clean_up(csv_path: str, zarr_path: str) -> None:
    """Remove temporary mock data after tests are complete."""
    Path(csv_path).unlink(missing_ok=True)
    if Path(zarr_path).exists():
        shutil.rmtree(zarr_path)


def main() -> None:
    """Run the IO tests."""
    csv_path = "mock_octron.csv"
    zarr_path = "mock_mask.zarr"

    try:
        print("Creating mock data...")
        create_mock_data(csv_path, zarr_path)

        # Test 1: Bounding boxes with extra data variables
        print("\n--- Testing load_octron_bboxes ---")
        ds_bboxes = load_octron_bboxes(csv_path, extra_data_vars=True, fps=30.0)
        print(ds_bboxes)
        
        # Simple assertion to ensure extra variables are loaded
        assert "eccentricity" in ds_bboxes.data_vars, "Extra data vars failed to load."

        # Test 2: Lazy mask loading with Dask and Zarr
        print("\n--- Testing load_masks_from_zarr ---")
        zarr_dict = {"ind_0": zarr_path}
        da_masks = load_masks_from_zarr(zarr_dict)
        print(da_masks)

        # Verify mask properties
        print("\nVerifying mask properties:")
        print(f"Data Type (Expected: bool): {da_masks.dtype}")
        print(f"Dimensions: {da_masks.dims}")
        print(f"Is Dask Array?: {'dask' in str(type(da_masks.data)).lower()}")

    finally:
        # Ensure cleanup runs even if a test fails
        print("\nCleaning up mock data...")
        clean_up(csv_path, zarr_path)
        print("Done.")


if __name__ == "__main__":
    main()

Output:

Creating mock data...

--- Testing load_octron_bboxes ---
<xarray.Dataset> Size: 284B
Dimensions:       (time: 3, individuals: 1, features: 4)
Coordinates:
  * time          (time) float64 24B 0.0 0.03333 0.06667
  * individuals   (individuals) <U5 20B 'ind_0'
  * features      (features) <U6 96B 'x' 'y' 'width' 'height'
Data variables:
    bboxes        (time, individuals, features) float64 96B 10.5 20.0 ... 10.0
    eccentricity  (time, individuals) float64 24B 0.8 0.81 0.82
    solidity      (time, individuals) float64 24B 0.9 0.9 0.91

--- Testing load_masks_from_zarr ---
<xarray.DataArray 'segmentation_masks' (time: 3, individuals: 1, x: 100, y: 100)> Size: 30kB
dask.array<chunksize=(3, 1, 100, 100), meta=np.ndarray>
Coordinates:
  * individuals  (individuals) <U5 20B 'ind_0'
Dimensions without coordinates: time, x, y

Verifying mask properties:
Data Type (Expected: bool): bool
Dimensions: ('time', 'individuals', 'x', 'y')
Is Dask Array?: True

Cleaning up mock data...
Done.

(Note: Formal pytest suites will be added as the implementation matures).

Is this a breaking change?

No. This is purely additive and introduces new experimental loaders to the movement.io module without altering the existing DeepLabCut or SLEAP pipelines.

Does this PR require an update to the documentation?

Yes. Once the architecture is finalized, the I/O tutorials will need updates to show users how to utilize the extra_data_vars argument and how to work with the lazy Dask mask arrays.
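To give a sense of what those tutorial updates might cover, here is a hypothetical snippet showing that reductions on a lazy mask array only build a task graph, and pixel data is touched only at `.compute()`. The `masks` array below is a random stand-in for the loader's output, not actual OCTRON data.

```python
"""Illustration of lazy downstream analysis on a Dask-backed mask array."""

import dask.array as da
import xarray as xr

# Stand-in for the DataArray load_masks_from_zarr would return
# (hypothetical shapes, random content, purely for illustration)
masks = xr.DataArray(
    da.random.random((3, 1, 100, 100)) > 0.5,
    dims=("time", "individuals", "x", "y"),
    coords={"individuals": ["ind_0"]},
    name="segmentation_masks",
)

# Reductions stay lazy: this builds a task graph, reads nothing yet
areas = masks.sum(dim=("x", "y"))

# Only .compute() materializes the (small) per-frame pixel counts
print(areas.compute())
```

This pattern is what makes the hybrid approach memory-efficient: analyses reduce full-resolution masks to small summaries without ever holding all frames in RAM.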

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

