4 changes: 4 additions & 0 deletions docs/source/_toctree.yml
@@ -73,6 +73,10 @@
  - local: nlp_process
    title: Process text data
  title: "Text"
- sections:
  - local: tabular_load
    title: Load tabular data
  title: "Tabular"
- sections:
  - local: share
    title: Share
3 changes: 2 additions & 1 deletion docs/source/how_to.md
@@ -10,12 +10,13 @@ Interested in learning more? Take a look at [Chapter 5](https://huggingface.co/c

</Tip>

The guides are organized into five sections:
The guides are organized into six sections:

- <span class="underline decoration-sky-400 decoration-2 font-semibold">General usage</span>: Functions for general dataset loading and processing. The functions shown in this section are applicable across all dataset modalities.
- <span class="underline decoration-pink-400 decoration-2 font-semibold">Audio</span>: How to load, process, and share audio datasets.
- <span class="underline decoration-yellow-400 decoration-2 font-semibold">Vision</span>: How to load, process, and share image datasets.
- <span class="underline decoration-green-400 decoration-2 font-semibold">Text</span>: How to load, process, and share text datasets.
- <span class="underline decoration-orange-400 decoration-2 font-semibold">Tabular</span>: How to load, process, and share tabular datasets.
- <span class="underline decoration-indigo-400 decoration-2 font-semibold">Dataset repository</span>: How to share and upload a dataset to the <a href="https://huggingface.co/datasets">Hub</a>.

If you have any questions about 🤗 Datasets, feel free to join and ask the community on our [forum](https://discuss.huggingface.co/c/datasets/10).
54 changes: 12 additions & 42 deletions docs/source/loading.mdx
@@ -106,39 +106,18 @@ Datasets can be loaded from local files stored on your computer and from remote

### CSV

🤗 Datasets can read a dataset made up of one or several CSV files:
🤗 Datasets can read a dataset made up of one or several CSV files (in this case, pass your CSV files as a list):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")
```

If you have more than one CSV file:

```py
>>> dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])
```

You can also map the training and test splits to specific CSV files:

```py
>>> dataset = load_dataset("csv", data_files={"train": ["my_train_file_1.csv", "my_train_file_2.csv"], "test": "my_test_file.csv"})
```

To load remote CSV files via HTTP, pass the URLs instead:

```py
>>> base_url = "https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/"
>>> dataset = load_dataset('csv', data_files={'train': base_url + 'train.csv', 'test': base_url + 'test.csv'})
```
To load zipped CSV files:

```py
>>> url = "https://domain.org/train_data.zip"
>>> data_files = {"train": url}
>>> dataset = load_dataset("csv", data_files=data_files)
```

<Tip>

For more details, check out the [how to load tabular datasets from CSV files](tabular_load#csv-files) guide.

</Tip>

### JSON

@@ -198,28 +177,19 @@ To load remote Parquet files via HTTP, pass the URLs instead:

### SQL

Read database contents with [`Dataset.from_sql`]. Both table names and queries are supported.

For example, a table from a SQLite file can be loaded with:
Read database contents with [`~datasets.Dataset.from_sql`] by specifying the URI to connect to your database. You can read both table names and queries:

```py
>>> from datasets import Dataset
>>> dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
```

Use a query for a more precise read:

```py
>>> from sqlite3 import connect
>>> con = connect(":memory:")
>>> # db writes ...
>>> from datasets import Dataset
>>> dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)
```

```py
# load entire table
>>> dataset = Dataset.from_sql("data_table_name", con="sqlite:///sqlite_file.db")
# load from query
>>> dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con="sqlite:///sqlite_file.db")
```

<Tip>

You can specify [`Dataset.from_sql#con`] as a [URI string](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) for the 🤗 Datasets caching to work across sessions.
For more details, check out the [how to load tabular datasets from SQL databases](tabular_load#databases) guide.

</Tip>

@@ -273,9 +243,9 @@ Load Pandas DataFrames with [`~Dataset.from_pandas`]:
>>> dataset = Dataset.from_pandas(df)
```

<Tip warning={true}>
<Tip>

An object data type in [pandas.Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) doesn't always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length `0` or the Series only contains `None/NaN` objects, the type is set to `null`. Avoid potential errors by constructing an explicit schema with [`Features`] using the `from_dict` or `from_pandas` methods. See the [troubleshoot](./loading#specify-features) section for more details on how to explicitly specify your own features.
For more details, check out the [how to load tabular datasets from Pandas DataFrames](tabular_load#pandas-dataframes) guide.

</Tip>

139 changes: 139 additions & 0 deletions docs/source/tabular_load.mdx
@@ -0,0 +1,139 @@
# Load tabular data

A tabular dataset is a generic dataset used to describe any data stored in rows and columns, where each row represents an example and each column represents a feature (which can be continuous or categorical). These datasets are commonly stored in CSV files, Pandas DataFrames, and database tables. This guide will show you how to load and create a tabular dataset from:

- CSV files
- Pandas DataFrames
- Databases

## CSV files

🤗 Datasets can read CSV files by passing the generic `csv` dataset script name to the [`~datasets.load_dataset`] method. To load more than one CSV file, pass them as a list to the `data_files` parameter:

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")

# load multiple CSV files
>>> dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])
```
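
The `csv` script also accepts common pandas-style parsing arguments such as `sep`. As a minimal sketch, assuming a hypothetical semicolon-delimited file:

```py
>>> from datasets import load_dataset

# pass a custom delimiter through to the CSV parser
>>> dataset = load_dataset("csv", data_files="my_file.csv", sep=";")
```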

You can also map specific CSV files to the train and test splits:

```py
>>> dataset = load_dataset("csv", data_files={"train": ["my_train_file_1.csv", "my_train_file_2.csv"], "test": "my_test_file.csv"})
```

To load remote CSV files, pass the URLs instead:

```py
>>> base_url = "https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/"
>>> dataset = load_dataset("csv", data_files={"train": base_url + "train.csv", "test": base_url + "test.csv"})
```

To load zipped CSV files:

```py
>>> url = "https://domain.org/train_data.zip"
>>> data_files = {"train": url}
>>> dataset = load_dataset("csv", data_files=data_files)
```
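
Since `data_files` also accepts glob patterns, you can load every CSV file in a directory without listing each one. A small sketch, assuming a hypothetical local `data/` folder:

```py
>>> from datasets import load_dataset

# wildcard patterns are resolved against the matching files
>>> dataset = load_dataset("csv", data_files="data/*.csv")
```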

## Pandas DataFrames

🤗 Datasets also supports loading datasets from [Pandas DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) with the [`~datasets.Dataset.from_pandas`] method:

```py
>>> from datasets import Dataset
>>> import pandas as pd

# create a Pandas DataFrame (pd.read_csv already returns one)
>>> df = pd.read_csv("https://huggingface.co/datasets/imodels/credit-card/raw/main/train.csv")
# load a Dataset from the Pandas DataFrame
>>> dataset = Dataset.from_pandas(df)
```

Use the `split` parameter to specify the name of the dataset split:

```py
>>> train_ds = Dataset.from_pandas(train_df, split="train")
>>> test_ds = Dataset.from_pandas(test_df, split="test")
```

If the dataset doesn't look as expected, you should explicitly [specify your dataset features](loading#specify-features). A [pandas.Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) may not always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length `0` or if the Series only contains `None/NaN` objects, the type is set to `null`.
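
For example, here is a minimal sketch of passing an explicit [`Features`] schema to [`~datasets.Dataset.from_pandas`] so Arrow doesn't have to guess (the column names and types below are hypothetical):

```py
>>> from datasets import Dataset, Features, Value
>>> import pandas as pd

# an empty DataFrame gives Arrow nothing to infer column types from
>>> df = pd.DataFrame({"text": [], "label": []})
# spell out the schema explicitly instead
>>> features = Features({"text": Value("string"), "label": Value("int64")})
>>> dataset = Dataset.from_pandas(df, features=features)
```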

## Databases

Datasets stored in databases are typically accessed with SQL queries. With 🤗 Datasets, you can connect to a database, query for the data you need, and create a dataset out of it. Then you can use all the processing features of 🤗 Datasets to prepare your dataset for training.

### SQLite

SQLite is a small, lightweight database that is fast and easy to set up. You can use an existing database if you'd like, or follow along and start from scratch.

Start by creating a quick SQLite database with this [Covid-19 data](https://github.com/nytimes/covid-19-data/blob/master/us-states.csv) from the New York Times:

```py
>>> import sqlite3
>>> import pandas as pd

>>> conn = sqlite3.connect("us_covid_data.db")
>>> df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
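>>> # note: to_sql also writes the DataFrame index as an extra "index" column by default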
>>> df.to_sql("states", conn, if_exists="replace")
```

This creates a `states` table in the `us_covid_data.db` database, which you can now load into a dataset.

To connect to the database, you'll need the [URI string](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) that identifies your database. Connecting to a database with a URI caches the returned dataset. The URI string differs for each database dialect, so be sure to check the [Database URLs](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) for whichever database you're using.

For SQLite, it is:

```py
>>> uri = "sqlite:///us_covid_data.db"
```

Load the table by passing the table name and URI to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql("states", uri)
>>> ds
Dataset({
features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
num_rows: 54382
})
```

Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`]:

```py
>>> ds.filter(lambda x: x["state"] == "California")
```

You can also load a dataset from a SQL query instead of an entire table, which is useful for querying and joining multiple tables.

Load the dataset by passing your query and URI to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql("SELECT * FROM states WHERE state='California';", uri)
>>> ds
Dataset({
features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
num_rows: 1019
})
```

Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`]:

```py
>>> ds.filter(lambda x: x["cases"] > 10000)
```

### PostgreSQL

You can also connect and load a dataset from a PostgreSQL database, but we won't demonstrate it directly in the documentation because the example is only meant to be run in a notebook. Instead, take a look at how to install and set up a PostgreSQL server in this [notebook](https://colab.research.google.com/github/nateraw/huggingface-hub-examples/blob/main/sql_with_huggingface_datasets.ipynb#scrollTo=d83yGQMPHGFi)!

After you've set up your PostgreSQL database, you can use the [`~datasets.Dataset.from_sql`] method to load a dataset from a table or query.
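
As a rough sketch, assuming a SQLAlchemy-compatible driver such as `psycopg2` is installed (the credentials, host, database, and table name below are all placeholders):

```py
>>> from datasets import Dataset

# hypothetical PostgreSQL connection URI; replace with your own server's details
>>> uri = "postgresql://username:password@localhost:5432/my_database"
# load an entire table
>>> ds = Dataset.from_sql("states", uri)
# or load from a query
>>> ds = Dataset.from_sql("SELECT * FROM states WHERE cases > 10000;", uri)
```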