
Domain specific dataset discovery on the Hugging Face hub  #4702

@davanstrien

Description


Is your feature request related to a problem? Please describe.

The problem

The datasets hub currently has 8,239 datasets. These datasets span a wide range of different modalities and tasks (currently with a bias towards textual data).

There are various ways of identifying datasets that may be relevant for a particular use case:

  • searching
  • various filters

Currently, however, there isn't an easy way to identify datasets belonging to a specific domain. For example, I want to browse machine learning datasets related to 'social science' or 'climate change research'.

The ability to identify datasets relating to a specific domain has come up in discussions around the BigLAM datasets hackathon bigscience-workshop/lam#31 (comment). As part of the hackathon, we're currently collecting datasets related to Libraries, Archives and Museums and making them available via the hub. We currently do this under a Hugging Face organization (https://huggingface.co/biglam). However, going forward, I can see some of these datasets being migrated to sit under an organization that is the custodian of the dataset (for example, a national library the data was originally from). At this point, it becomes more difficult to quickly identify datasets from this domain without relying on search.

This is also related to some existing GitHub issues about metadata on the hub.

Describe the solution you'd like

Some possible solutions that may help with this:

Enable domain tags (from a controlled vocabulary)

  • This would add a metadata field to the YAML for the domain a dataset relates to
  • Advantages:
    • the list is controlled, allowing it to be more easily integrated into the datasets tag app (https://huggingface.co/space/huggingface/datasets-tagging)
    • the controlled vocabulary could align with an existing controlled vocabulary
    • this additional metadata can be used to perform filtering by domain
  • Disadvantages:
    • choosing the best controlled vocabulary may be difficult
    • many datasets are likely to fit only the generic 'machine learning' domain (i.e. there is a long tail of datasets that don't belong to a more specific domain)
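As a rough sketch, a controlled domain tag could sit alongside the existing YAML metadata in a dataset card. Note that the `domain` field name and its values here are hypothetical, not an existing hub feature:

```yaml
# Dataset card YAML front matter (sketch).
# `language`, `license` and `task_categories` are existing metadata fields;
# the `domain` field is a hypothetical controlled-vocabulary addition.
language:
- en
license: cc-by-4.0
task_categories:
- text-classification
domain:
- libraries-archives-museums
```

A controlled field like this could then be surfaced as a filter in the hub UI and validated by the datasets-tagging app, in the same way existing fields are.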

Enable topic tags (user-generated)

Enable 'free form' topic tags for datasets and models. This would be closer to GitHub's repository topics, which can be chosen from a curated list (https://github.com/topics/) but can also be more user/org specific. This could also be useful for organizations managing their own models and datasets as the number they hold in their org grows. For example, they may create 'topic tags' for a specific project, so it's clearer which datasets/models are related to that project.
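The hub's existing free-form `tags:` YAML field hints at how this could look; project- or domain-level topic tags might be expressed similarly (the tag values below are invented for illustration):

```yaml
# Sketch: free-form topic tags in a dataset card's YAML front matter
# (tag values are invented for illustration)
tags:
- historic-newspapers
- biglam-hackathon
```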

Collections

This solution would likely be the biggest shift and may require significant changes to the hub frontend. Collections could work in several different ways but would include:

Users can curate particular datasets, models, spaces, etc., into a collection. For example, they may create a collection of 'historic newspapers suitable for training language models'. These collections would not be mutually exclusive, i.e. a dataset can belong to zero, one or many collections. Collections can also potentially be nested under other collections.

This is fairly common on other data repositories. For example, the following collections:

[Screenshot of example collections in the British Library research repository]

all belong under a higher-level collection (https://bl.iro.bl.uk/collections/353c908d-b495-4413-b047-87236d2573e3?locale=en).

There are different models one could use for how these collections could be created:

  • only within an org
  • for any dataset/model
  • the owner of a dataset/model has to agree to be added to a collection
  • a collection owner can have people suggest additions to their collection
  • other models....

These collections could be thematic, relate to particular training approaches, curate models with particular inference properties, etc. Whilst some of these features may duplicate current or future tag filters on the hub, they have the advantage of being flexible and not requiring us to predict upfront what users will want to do.

There is also potential for automating the creation of these collections based on existing metadata. For example, if we had a collection of 'historic newspapers suitable for training language models' containing 30 datasets, we could create another collection, 'historic newspaper language models', containing any model on the hub whose metadata says it was trained on one or more of those 30 datasets.
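The automation step above could be sketched as a simple metadata join. The snippet below is a minimal illustration, assuming each model's card metadata lists the datasets it was trained on (as the hub's YAML `datasets:` field does); the model records and ids are invented stand-ins for real hub API responses:

```python
# Sketch: auto-populating a model collection from a dataset collection,
# assuming each model record carries a list of training datasets.
# All ids below are invented for illustration.

def models_for_dataset_collection(models, dataset_collection):
    """Return ids of models trained on any dataset in the collection."""
    wanted = set(dataset_collection)
    return [
        m["id"]
        for m in models
        if wanted.intersection(m.get("datasets", []))
    ]

models = [
    {"id": "biglam/newspaper-lm", "datasets": ["biglam/historic-newspapers"]},
    {"id": "other/unrelated-model", "datasets": ["squad"]},
]

collection = ["biglam/historic-newspapers", "biglam/chronicling-america"]
print(models_for_dataset_collection(models, collection))
# prints ['biglam/newspaper-lm']
```

In practice, the model records could come from the hub API (e.g. filtering models by their dataset tags), but the matching logic would be essentially this.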

There is also the option of exploring ML approaches to suggest models/datasets that may be relevant to a particular collection.

This approach is likely to be quite difficult to implement well and would require significant thought. There is also likely to be a benefit in doing quite a bit of upfront work in curating useful collections to demonstrate the benefits of collections.

Describe alternatives you've considered

It is possible to collate this information externally, i.e. one could link back to the relevant models/datasets from an external platform.

Additional context

I'm cc'ing others involved in the BigLAM hackathon who may also have thoughts @cakiki @clancyoftheoverflow @albertvillanova

Metadata

Labels: enhancement (New feature or request)