Skip to content

Provide base class for dataset loaders#59

Merged
benedekrozemberczki merged 2 commits intoAstraZeneca:mainfrom
cthoyt:update-dataset
Jan 21, 2022
Merged

Provide base class for dataset loaders#59
benedekrozemberczki merged 2 commits intoAstraZeneca:mainfrom
cthoyt:update-dataset

Conversation

@cthoyt
Copy link
Contributor

@cthoyt cthoyt commented Jan 21, 2022

Summary

This PR abstracts the essential components of the dataset loader into a base class to allow for future implementations of eager datasets (e.g., all parts of the dataset are already in memory) and for other lazy local dataset loaders.

  • Code passes all tests
  • Unit tests provided for these changes
  • Documentation and docstrings added for these changes

Changes

  • Create abstract base class with unimplemented methods for getting drugs, contexts, and labeled triples
  • Rename mid-level class to RemoteDatasetLoader

Next steps

The following shows an implementation of an eager dataset, which might be more useful for local datasets.

@dataclass
class EagerDatasetLoader(DatasetLoader):
    """An eager dataset."""

    context_feature_set: ContextFeatureSet
    drug_feature_set: DrugFeatureSet
    labeled_triples: LabeledTriples

    def get_labeled_triples(self) -> LabeledTriples:
        """Get the labeled triples file from the storage."""
        return self.labeled_triples

    def get_context_features(self) -> ContextFeatureSet:
        """Get the context feature set."""
        return self.context_feature_set

    def get_drug_features(self):
        """Get the drug feature set."""
        return self.drug_feature_set

This will allow for the future addition of an eager dataset loader
Copy link
Contributor

@benedekrozemberczki benedekrozemberczki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @cthoyt !

@benedekrozemberczki benedekrozemberczki merged commit 47af8b4 into AstraZeneca:main Jan 21, 2022
@cthoyt cthoyt deleted the update-dataset branch January 21, 2022 16:16
@aminemosbah
Copy link

any chance we would load , train and essentially test our own dataset ?

@cthoyt
Copy link
Contributor Author

cthoyt commented Feb 15, 2022

@aminemosbah I would suggest looking at https://chemicalx.readthedocs.io/en/latest/api/chemicalx.data.LocalDatasetLoader.html#chemicalx.data.LocalDatasetLoader for loading your own dataset that's already in the right format in a given directory

@aminemosbah
Copy link

thx, but i have smiles to predict in a csv file to predict locally , any quick snippet ?

@cthoyt
Copy link
Contributor Author

cthoyt commented Feb 15, 2022

@aminemosbah I have in mind a solution for what you want (which is the obvious realistic use case) but this it is blocked #50 and #58. @benedekrozemberczki would love to get your input on #50 ;)

@aminemosbah
Copy link

need to hack the dataloader to make it work for local data

@aminemosbah
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants