Add a metadata field for when source data was produced 

**Is your feature request related to a problem? Please describe.**
Information about when source data was produced is not easily visible. Although a variety of metadata fields are available in the dataset viewer, time period information is not among them. This feature request suggests making metadata about when the underlying *source* data was produced more prominent, and outlines why this information is particularly important, both in domain-specific historical research and more broadly.

**Describe the solution you'd like**

There are a variety of metadata fields exposed in the dataset viewer (license, task categories, etc.). These fields make the metadata more prominent both for human users and as potentially machine-actionable information (for example, through the API). I propose adding a metadata field that states when the underlying data was produced. For example, a dataset could be labelled as having source data produced between `1800-1900`.
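To make the idea concrete, here is a minimal sketch of how such a field could be parsed and used as a filter. Everything in it is an assumption for illustration: the field name `source_period`, the `start-end` year format, the `SourcePeriod` class, and the dataset ids are all hypothetical and not part of any existing Hub or `datasets` API. It only handles plain positive years.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcePeriod:
    """Hypothetical inclusive year range for when source data was produced."""

    start: int
    end: int

    @classmethod
    def parse(cls, value: str) -> "SourcePeriod":
        # Assumes a simple "YYYY-YYYY" string such as "1800-1900".
        start_text, end_text = value.split("-")
        start, end = int(start_text), int(end_text)
        if start > end:
            raise ValueError(f"start year {start} is after end year {end}")
        return cls(start, end)

    def overlaps(self, other: "SourcePeriod") -> bool:
        # Two inclusive ranges overlap when each starts before the other ends.
        return self.start <= other.end and other.start <= self.end


# Made-up records standing in for dataset metadata.
datasets = [
    {"id": "example/historic-newspapers", "source_period": "1800-1900"},
    {"id": "example/movie-reviews", "source_period": "1990-1999"},
]


def filter_by_period(records, wanted: str):
    """Return ids of records whose source period overlaps the wanted range."""
    target = SourcePeriod.parse(wanted)
    return [
        record["id"]
        for record in records
        if SourcePeriod.parse(record["source_period"]).overlaps(target)
    ]


matches = filter_by_period(datasets, "1850-1860")
```

With a field like this, identifying datasets that cover a given period becomes a simple overlap check rather than a manual read through each dataset card.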

**Describe alternatives you've considered**
This information is sometimes available in the dataset card or in a paper describing the dataset. However, it is often not easy to identify or extract, particularly if you want to use it as a filter to find relevant datasets.

**Additional context**

I believe this feature is relevant for a number of reasons:
- Increasingly, there is an interest in using historical data for training language models (for example, https://huggingface.co/dbmdz/bert-base-historic-dutch-cased), and datasets to support this task (for example, https://huggingface.co/datasets/bnl_newspapers). For these datasets, indicating the time periods covered is particularly relevant. 
- More broadly, time is likely a common source of domain drift. Datasets of movie reviews from the 90s may not work well for recent movie reviews. As the documentation and long-term management of ML data become more of a priority, quickly understanding when the underlying text (or other data type) was produced is arguably becoming more important.
- Time-series data: datasets are adding more support for time series data. Again, the periods covered might be particularly relevant here.

**Open questions**

- I think some of my points above apply not only to the underlying data but also to annotations. As a result, there could also be an argument for encoding this information somewhere. However, I would argue (but could be persuaded otherwise) that this is probably less important for filtering. This type of context is already addressed in the datasheets template and often requires more narrative to discuss. 
- What level of granularity would make sense for this? For example, assigning a decade, century or year?
- How should this information be encoded? What formatting makes sense?
- What specific time should be encoded: a date range, or a single value (mean, modal, min, max)?
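To help compare these options, here is a rough sketch that derives each candidate encoding from a list of per-document production years. The function name `summarise_years` and the example years are made up for illustration; this is not a proposal for a final format, only a way to see what each choice would look like for the same data.

```python
from collections import Counter
from statistics import mean


def summarise_years(years):
    """Compute candidate encodings of a dataset's source period.

    `years` is a non-empty list of production years, one per document.
    """
    if not years:
        raise ValueError("at least one year is required")
    counts = Counter(years)
    return {
        "min": min(years),
        "max": max(years),
        "mean": round(mean(years)),
        # Most frequent year; ties resolve to the first one encountered.
        "mode": counts.most_common(1)[0][0],
        # Decade granularity, e.g. 1841 -> 1840.
        "decades": sorted({year // 10 * 10 for year in years}),
        # Century granularity, e.g. 1841 -> 19 (the 19th century).
        "centuries": sorted({(year - 1) // 100 + 1 for year in years}),
    }


# Made-up years for a hypothetical newspaper collection.
summary = summarise_years([1841, 1841, 1850, 1878, 1899])
```

A min/max pair keeps the full range but hides gaps, while the decade list shows which periods are actually covered at the cost of a longer value.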

This is a slightly amorphous feature request. I would be happy to discuss it further or try to propose a more concrete solution if this seems like something worth considering. I realise this might also touch on other parts of the 🤗 Hub ecosystem.
