
Conversation

@LogicFan
Collaborator

@LogicFan LogicFan commented Nov 21, 2025

This should close #332.

LogicFan and others added 30 commits November 21, 2025 02:12
Co-authored-by: John Wu <[email protected]>
@LogicFan LogicFan changed the title Mem 5 [Memory] Better Memory Utilization for PyHealth Nov 23, 2025
Collaborator

@jhnwu3 jhnwu3 left a comment

Hey, do you know which Dask version you used? It would probably be good to pin it in pyproject.toml.

Will look deeper into the CI and other implementation details later; I have to go wander Japan for a bit 😂

Comment on lines +42 to +44
"dask[complete]~=2025.11.0",
"litdata~=0.2.58",
"xxhash~=3.6.0",
Collaborator Author

New dependencies are specified here.

Collaborator

Ah, I'm blind.

event_type: str
timestamp: datetime
attr_dict: Mapping[str, any] = field(default_factory=dict)
attr_dict: Mapping[str, Any] = field(default_factory=dict)
Collaborator Author

Small type hint fix here: `any` is a builtin function over iterables, while `Any` is the correct type hint.
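To illustrate the distinction, here is a trimmed sketch of the dataclass (the real `Event` also carries a `timestamp`): `typing.Any` is the static wildcard type, whereas the builtin `any` is a runtime function.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping

@dataclass
class Event:
    event_type: str
    # `Any` (capital A) from typing, not the builtin `any`
    attr_dict: Mapping[str, Any] = field(default_factory=dict)

print(Event("labevents").attr_dict)  # → {}
print(any([False, True]))            # → True: `any` is a function, not a type
```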

"""

def __init__(self, patient_id: str, data_source: pl.DataFrame) -> None:
def __init__(self, patient_id: str, data_source: pd.DataFrame) -> None:
Collaborator Author

Polars spawns its own worker processes. Since this code will be run inside litdata.optimize, which is already a multi-process environment, the nested multiprocessing can cause a hang. Thus, we change this to pandas.

Comment on lines 172 to +183

def get_events_py(self, **kwargs) -> List[Event]:
    """Type-safe wrapper for get_events."""
    return self.get_events(**kwargs, return_df=False)  # type: ignore

def get_events_df(self, **kwargs) -> pd.DataFrame:
    """DataFrame wrapper for get_events."""
    return self.get_events(**kwargs, return_df=True)  # type: ignore

Collaborator Author

Added for type-hinting purposes.
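A standalone sketch of the pattern (the real methods live on `Patient`; `Records` here is a hypothetical stand-in): a boolean-flag method has a union return type, so thin wrappers give callers a precise static type without `isinstance` checks at every call site.

```python
from typing import Any, Dict, List, Union
import pandas as pd

class Records:
    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self._rows = rows

    def get_events(self, return_df: bool = False) -> Union[List[Dict[str, Any]], pd.DataFrame]:
        # The union return type forces callers to narrow the result themselves.
        return pd.DataFrame(self._rows) if return_df else self._rows

    def get_events_py(self, **kwargs) -> List[Dict[str, Any]]:
        """Type-safe wrapper: always a list."""
        return self.get_events(**kwargs, return_df=False)  # type: ignore

    def get_events_df(self, **kwargs) -> pd.DataFrame:
        """Type-safe wrapper: always a DataFrame."""
        return self.get_events(**kwargs, return_df=True)  # type: ignore

r = Records([{"event_type": "labevents"}])
print(type(r.get_events_df()).__name__)  # → DataFrame
```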

Comment on lines +190 to 194
labevents_df = patient.get_events_df(
    event_type="labevents",
    start=admission_time,
    end=admission_dischtime,
)
Collaborator Author

It returns a pandas DataFrame here; any task that uses get_events_df (or get_events(..., return_df=True)) will need to update its code to be compatible with pandas.
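For task authors, the migration is mostly mechanical; a hedged before/after sketch (the column names and values are illustrative, not from the PR):

```python
import pandas as pd

# Hypothetical lab-events frame, for illustration only.
labevents_df = pd.DataFrame({
    "itemid": ["50912", "50971", "50912"],
    "valuenum": [1.0, 4.1, 2.3],
})

# polars style (previous API):  labevents_df.filter(pl.col("itemid") == "50912")
# pandas style (new API): boolean-mask indexing instead
creatinine = labevents_df[labevents_df["itemid"] == "50912"]
print(len(creatinine))  # → 2
```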

output_schema: Dict[str, Union[str, Type]]

def pre_filter(self, df: pl.LazyFrame) -> pl.LazyFrame:
def pre_filter(self, df: dd.DataFrame) -> dd.DataFrame:
Collaborator Author

It becomes a Dask DataFrame here; any task that overrides this needs to be updated.

Comment on lines +335 to 341
dataset.set_shuffle(shuffle)
dataloader = DataLoader(
dataset,
batch_size=batch_size,
shuffle=shuffle,
collate_fn=collate_fn_dict_with_padding,
)

Collaborator Author

The native PyTorch DataLoader does not support shuffle on an IterableDataset; shuffling must happen at the dataset level.
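A quick demonstration of the constraint: `shuffle=True` makes DataLoader build a random sampler, which needs length and random access that iterable-style datasets do not provide, so PyTorch rejects the combination outright.

```python
from torch.utils.data import DataLoader, IterableDataset

class Stream(IterableDataset):
    def __iter__(self):
        yield from range(4)

try:
    DataLoader(Stream(), shuffle=True)
    print("accepted")
except ValueError:
    # DataLoader raises ValueError for shuffle on iterable-style datasets
    print("shuffle rejected")  # → shuffle rejected
```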

Comment on lines +39 to +41
train_dataset = SampleSubset(dataset, train_index) # type: ignore
val_dataset = SampleSubset(dataset, val_index) # type: ignore
test_dataset = SampleSubset(dataset, test_index) # type: ignore
Collaborator Author

The torch-native Subset does not support iterable datasets, so we use a custom-defined iterable dataset here.

Comment on lines +198 to +205
class SampleSubset(IterableDataset):
"""A subset of the SampleDataset.
Args:
sample_dataset (SampleDataset): The original SampleDataset.
indices (List[int]): List of indices to include in the subset.
"""

Collaborator Author

Add a new SampleSubset to support creating subsets of a SampleDataset for the train/val/test split.
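A minimal sketch of what such a subset has to do, assuming the underlying dataset is index-addressable (the real class wraps a SampleDataset; a plain list stands in here):

```python
from torch.utils.data import IterableDataset

class SampleSubset(IterableDataset):
    """Iterate only the selected indices of an index-addressable dataset."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __iter__(self):
        # Yield samples in the order the split indices dictate.
        for i in self.indices:
            yield self.dataset[i]

    def __len__(self):
        return len(self.indices)

subset = SampleSubset(["a", "b", "c", "d"], [0, 2])
print(list(subset))  # → ['a', 'c']
```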

Comment on lines +216 to +220
def _build_subset_dataset(
self, base_dataset: StreamingDataset, indices: Sequence[int]
) -> Tuple[StreamingDataset, int]:
"""Create a StreamingDataset restricted to the provided indices."""
if len(base_dataset.subsampled_files) != len(base_dataset.region_of_interest):
Collaborator Author

Comment on lines +214 to +217
cache_dir: str | Path | None = None,
num_workers: int = 1,
mem_per_worker: str | int = "8GB",
compute: bool = True,
Collaborator Author

Any other dataset added later will need to accept these new args.

Comment on lines +220 to +229
super().__init__(
root=f"{str(ehr_root)},{str(note_root)},{str(cxr_root)}",
tables=(ehr_tables or []) + (note_tables or []) + (cxr_tables or []),
dataset_name=dataset_name,
cache_dir=cache_dir,
num_workers=num_workers,
mem_per_worker=mem_per_worker,
compute=False, # defer compute to later, we need to aggregate all sub-datasets first
dev=dev,
)
Collaborator Author

It is important to assign root, tables, dataset_name, and dev so that the correct default cache path is calculated.

Comment on lines +290 to 291
def load_data(self) -> dd.DataFrame:
"""
Collaborator Author

One should override load_data if custom logic is required, since there is no .global_event_df available. Holding a Dask DataFrame in a class field would cause a multi-process failure, making it impossible to use multiple workers to process the samples.

Comment on lines 96 to 123
def _pickle(datum: dict[str, Any]) -> dict[str, bytes]:
    return {k: pickle.dumps(v) for k, v in datum.items()}


def _unpickle(datum: dict[str, bytes]) -> dict[str, Any]:
    return {k: pickle.loads(v) for k, v in datum.items()}


def _patient_bucket(patient_id: str, n_partitions: int) -> int:
    """Hash patient_id to a bucket number."""
    return int(xxhash.xxh64_intdigest(patient_id) % n_partitions)


def _transform_fn(
    input: tuple[int, str, BaseTask],
) -> Iterator[Dict[str, Any]]:
    (bucket_id, merged_cache, task) = input
    path = f"{merged_cache}/bucket={bucket_id}"
    # This is more efficient than reading patient by patient
    grouped = pd.read_parquet(path).groupby("patient_id")

    for patient_id, patient_df in grouped:
        patient = Patient(patient_id=str(patient_id), data_source=patient_df)
        for sample in task(patient):
            # Schema is too complex to be handled by LitData, so we pickle the sample here
            yield _pickle(sample)

Collaborator Author

It's important to define these functions at module level (outside of any class) to avoid pickling issues in a multi-process environment.

Comment on lines +197 to +210
id_str = json.dumps(
{
"root": self.root,
"tables": sorted(self.tables),
"dataset_name": self.dataset_name,
"dev": self.dev,
},
sort_keys=True,
)
cache_dir = Path(platformdirs.user_cache_dir(appname="pyhealth")) / str(
uuid.uuid5(uuid.NAMESPACE_DNS, id_str)
)
print(f"No cache_dir provided. Using default cache dir: {cache_dir}")
self._cache_dir = cache_dir
Collaborator Author

Computes the default cache dir. I think this should be unique enough?
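The scheme is deterministic: uuid5 hashes the namespace plus the canonical JSON string, so identical (root, tables, dataset_name, dev) inputs always map to the same cache directory, and `sorted(tables)` makes table order irrelevant.

```python
import json
import uuid

def cache_key(root, tables, dataset_name, dev):
    # Canonical JSON: sorted keys plus sorted tables give a stable string.
    id_str = json.dumps(
        {"root": root, "tables": sorted(tables), "dataset_name": dataset_name, "dev": dev},
        sort_keys=True,
    )
    return uuid.uuid5(uuid.NAMESPACE_DNS, id_str)

a = cache_key("/data/mimic", ["labevents", "admissions"], "mimic4", False)
b = cache_key("/data/mimic", ["admissions", "labevents"], "mimic4", False)
print(a == b)  # → True: table order does not change the cache dir
```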

Comment on lines +320 to +323
bucket = global_event_df["patient_id"].apply(
lambda pid: _patient_bucket(pid, n_partitions),
meta=("patient_id", "int"),
)
Collaborator Author

Split the dataframe into buckets based on patient id, enabling faster processing downstream.
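The bucketing is a plain hash-mod partition; the same idea sketched with stdlib hashlib standing in for xxhash (the PR uses xxh64 for speed, but the property that matters is determinism):

```python
import hashlib

def patient_bucket(patient_id: str, n_partitions: int) -> int:
    # Deterministic hash → every row for a patient lands in the same bucket,
    # so a single bucket file holds all of that patient's events.
    digest = hashlib.md5(patient_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions

print(patient_bucket("p001", 16) == patient_bucket("p001", 16))  # → True
print(0 <= patient_bucket("p001", 16) < 16)                      # → True
```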

Comment on lines +485 to 497

path = self._joined_cache()
with open(f"{path}/index.json", "rb") as f:
n_partitions = json.load(f)["n_partitions"]
bucket = _patient_bucket(patient_id, n_partitions)
path = f"{path}/bucket={bucket}"
df = pd.read_parquet(path)
patient = Patient(
patient_id=patient_id,
data_source=df[df["patient_id"] == patient_id],
)
return patient

Collaborator Author

Read from the relevant bucket only; this is much faster for larger datasets.

@LogicFan LogicFan marked this pull request as ready for review November 24, 2025 00:02


Development

Successfully merging this pull request may close these issues.

Improve memory usage of BaseDataset.load_data()

2 participants