Skip to content

Add a metadata field for when source data was produced  #3625

@davanstrien

Description

@davanstrien

Is your feature request related to a problem? Please describe.
The current problem is that information about when source data was produced is not easily visible. Though there are a variety of metadata fields available in the dataset viewer, time period information is not included. This feature request suggests making metadata relating to the time that the underlying source data was produced more prominent and outlines why this specific information is of particular importance, both in domain-specific historic research and more broadly.

Describe the solution you'd like

There are a variety of metadata fields exposed in the dataset viewer (license, task categories, etc.) These fields make this metadata more prominent both for human users and as potentially machine-actionable information (for example, through the API). I would propose to add a metadata field that says when some underlying data was produced. For example, a dataset would be labelled as being produced between 1800-1900.

Describe alternatives you've considered
This information is sometimes available in the Datacard or a paper describing the dataset. However, it's often not that easy to identify or extract this information, particularly if you want to use this field as a filter to identify relevant datasets.

Additional context

I believe this feature is relevant for a number of reasons:

  • Increasingly, there is an interest in using historical data for training language models (for example, https://huggingface.co/dbmdz/bert-base-historic-dutch-cased), and datasets to support this task (for example, https://huggingface.co/datasets/bnl_newspapers). For these datasets, indicating the time periods covered is particularly relevant.
  • More broadly, time is likely a common source of domain drift. Datasets of movie reviews from the 90s may not work well for recent movie reviews. As the documentation and long-term management of ML data become more of a priority, quickly understanding the time when the underlying text (or other data types) is arguably more important.
  • time-series data: datasets are adding more support for time series data. Again, the periods covered might be particularly relevant here.

open questions

  • I think some of my points above apply not only to the underlying data but also to annotations. As a result, there could also be an argument for encoding this information somewhere. However, I would argue (but could be persuaded otherwise) that this is probably less important for filtering. This type of context is already addressed in the datasheets template and often requires more narrative to discuss.
  • what level of granularity would make sense for this? e.g. assigning a decade, century or year?
  • how to encode this information? What formatting makes sense
  • what specific time to encode; a data range? (mean, modal, min, max value?)

This is a slightly amorphous feature request - I would be happy to discuss further/try and propose a more concrete solution if this seems like something that could be worth considering. I realise this might also touch on other parts of the 🤗 hubs ecosystem.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions