2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@
🤗 Datasets is a lightweight library providing **two** main features:

- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("rajpurkar/squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
-- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
+- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, HDF5, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.

[🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Share a dataset on the Hub**](https://huggingface.co/docs/datasets/share)

11 changes: 11 additions & 0 deletions docs/source/loading.mdx
@@ -178,6 +178,17 @@ The cache directory to store intermediate processing results will be the Arrow f

For now only the Arrow streaming format is supported. The Arrow IPC file format (also known as Feather V2) is not supported.

## HDF5 files

[HDF5](https://www.hdfgroup.org/solutions/hdf5/) files are commonly used for storing large amounts of numerical data in scientific computing and machine learning. Loading HDF5 files with 🤗 Datasets is similar to loading CSV files:

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("hdf5", data_files="data.h5")
```

Note that the HDF5 loader assumes the first dimension of every dataset in the file is the row dimension. Groups are flattened into column names, i.e., what you would access as `h5py.File("data.h5")["group"]["key"]` becomes `load_dataset("hdf5", data_files="data.h5")["group/key"]`.
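To make the flattening concrete, here is a minimal sketch that writes a tiny file with `h5py` and loads it back (the file layout, dataset names, and exact column order below are illustrative assumptions, not part of the loader's contract):

```py
>>> import h5py
>>> import numpy as np
>>> # A small file: one top-level dataset and one inside a group,
>>> # both with 4 rows along their first dimension (hypothetical names)
>>> with h5py.File("data.h5", "w") as f:
...     f.create_dataset("values", data=np.arange(4, dtype=np.float32))
...     f.create_dataset("group/key", data=np.ones((4, 3), dtype=np.int64))
>>> from datasets import load_dataset
>>> dataset = load_dataset("hdf5", data_files="data.h5", split="train")
>>> dataset.column_names  # the group hierarchy is flattened into the column name
['values', 'group/key']
```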

### SQL

Read database contents with [`~datasets.Dataset.from_sql`] by specifying the URI to connect to your database. You can read both table names and queries:
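For example, with a local SQLite database (the file name `my_database.db` and table name `data_table` below are placeholder assumptions):

```py
>>> from datasets import Dataset
>>> # Read an entire table by name...
>>> dataset = Dataset.from_sql("data_table", "sqlite:///my_database.db")
>>> # ...or only the rows returned by a query
>>> dataset = Dataset.from_sql("SELECT text FROM data_table WHERE length(text) > 100;", "sqlite:///my_database.db")
```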
12 changes: 12 additions & 0 deletions docs/source/tabular_load.mdx
@@ -4,6 +4,7 @@ A tabular dataset is a generic dataset used to describe any data stored in rows

- CSV files
- Pandas DataFrames
- HDF5 files
- Databases

## CSV files
@@ -63,6 +64,17 @@ Use the `splits` parameter to specify the name of the dataset split:

If the dataset doesn't look as expected, you should explicitly [specify your dataset features](loading#specify-features). A [pandas.Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) may not always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length `0` or if the Series only contains `None/NaN` objects, the type is set to `null`.
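For instance, a minimal sketch of that failure mode and the fix (the column name `text` is arbitrary):

```py
>>> import pandas as pd
>>> from datasets import Dataset, Features, Value
>>> df = pd.DataFrame({"text": [None, None]})
>>> # An all-None column carries no type information, so declare it explicitly
>>> features = Features({"text": Value("string")})
>>> dataset = Dataset.from_pandas(df, features=features)
```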

## HDF5 files

[HDF5](https://www.hdfgroup.org/solutions/hdf5/) files are commonly used for storing large amounts of numerical data in scientific computing and machine learning. Loading HDF5 files with 🤗 Datasets is similar to loading CSV files:

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("hdf5", data_files="data.h5")
```

Note that the HDF5 loader assumes the first dimension of every dataset in the file is the row dimension. Groups are flattened into column names, i.e., what you would access as `h5py.File("data.h5")["group"]["key"]` becomes `dataset["group/key"]`.
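For example, assuming `data.h5` stores a dataset at `group/key` (a hypothetical layout), the flattened column is addressed directly by its full path:

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("hdf5", data_files="data.h5", split="train")
>>> dataset["group/key"][0]  # first row of the flattened column
```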

## Databases

Datasets stored in databases are typically accessed with SQL queries. With 🤗 Datasets, you can connect to a database, query for the data you need, and create a dataset out of it. Then you can use all the processing features of 🤗 Datasets to prepare your dataset for training.
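A minimal sketch of that workflow, assuming a local SQLite file `my_database.db` with a hypothetical `data_table` table:

```py
>>> from datasets import Dataset
>>> # Query just the rows you need, then process them like any other dataset
>>> dataset = Dataset.from_sql("SELECT text FROM data_table WHERE label = 1;", "sqlite:///my_database.db")
>>> dataset = dataset.map(lambda example: {"text": example["text"].lower()})
```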