4 changes: 4 additions & 0 deletions docs/source/_toctree.yml
@@ -73,6 +73,10 @@
- local: nlp_process
title: Process text data
title: "Text"
- sections:
- local: tabular_load
title: Load tabular data
title: "Tabular"
- sections:
- local: share
title: Share
3 changes: 2 additions & 1 deletion docs/source/how_to.md
@@ -10,12 +10,13 @@ Interested in learning more? Take a look at [Chapter 5](https://huggingface.co/c

</Tip>

The guides are organized into six sections:

- <span class="underline decoration-sky-400 decoration-2 font-semibold">General usage</span>: Functions for general dataset loading and processing. The functions shown in this section are applicable across all dataset modalities.
- <span class="underline decoration-pink-400 decoration-2 font-semibold">Audio</span>: How to load, process, and share audio datasets.
- <span class="underline decoration-yellow-400 decoration-2 font-semibold">Vision</span>: How to load, process, and share image datasets.
- <span class="underline decoration-green-400 decoration-2 font-semibold">Text</span>: How to load, process, and share text datasets.
- <span class="underline decoration-orange-400 decoration-2 font-semibold">Tabular</span>: How to load, process, and share tabular datasets.
- <span class="underline decoration-indigo-400 decoration-2 font-semibold">Dataset repository</span>: How to share and upload a dataset to the <a href="https://huggingface.co/datasets">Hub</a>.

If you have any questions about 🤗 Datasets, feel free to join and ask the community on our [forum](https://discuss.huggingface.co/c/datasets/10).
201 changes: 201 additions & 0 deletions docs/source/tabular_load.mdx
@@ -0,0 +1,201 @@
# Load tabular data

Many real-world datasets are stored in databases, which are typically accessed with SQL queries. With 🤗 Datasets, you can connect to a database, query for the data you need, and create a dataset out of it. Then you can use all the processing features of 🤗 Datasets to prepare your dataset for training.

This guide will show you how to connect to SQLite and PostgreSQL and:

- Load an entire table.
- Load from a SQL query.

Check out this [notebook](https://colab.research.google.com/github/nateraw/huggingface-hub-examples/blob/main/sql_with_huggingface_datasets.ipynb) for a hands-on example!

## SQLite

SQLite is a small, lightweight database that is fast and easy to set up. You can use an existing database if you'd like, or follow along and start from scratch.

Start by creating a quick SQLite database with this [Covid-19 data](https://github.com/nytimes/covid-19-data/blob/master/us-states.csv) from the New York Times:

```py
>>> import sqlite3
>>> import pandas as pd

>>> conn = sqlite3.connect("us_covid_data.db")
>>> df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
>>> df.to_sql("states", conn, if_exists="replace")
```

This creates a `states` table in the `us_covid_data.db` database which you can now load into a dataset.
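
You can sanity-check the new table with a plain SQL count before loading it. The snippet below is a minimal sketch against a tiny in-memory stand-in with the same columns, so it doesn't depend on the full CSV download:

```python
import sqlite3

import pandas as pd

# Tiny in-memory stand-in for the NYT table, with the same columns.
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02"],
    "state": ["California", "Texas"],
    "fips": [6, 48],
    "cases": [100, 50],
    "deaths": [1, 0],
})
df.to_sql("states", conn, if_exists="replace")

# Verify the table exists and count its rows.
n_rows = conn.execute("SELECT COUNT(*) FROM states").fetchone()[0]
```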

To connect to the database, you'll need the [URI string](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) that identifies your database. The URI string differs for each database dialect, so be sure to check the [Database URLs](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) for whichever database you're using.

For SQLite, it is:

```py
>>> uri = "sqlite:///us_covid_data.db"
```

Load the table by passing the table name and URI to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql("states", uri)
>>> ds
Dataset({
features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
num_rows: 54382
})
```

Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`]:

```py
>>> ds.filter(lambda x: x["state"] == "California")
```

You can also load a dataset from a SQL query instead of an entire table, which is useful for querying and joining multiple tables.

Load the dataset by passing your query and URI to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql("SELECT * FROM states WHERE state='California';", uri)
>>> ds
Dataset({
features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
num_rows: 1019
})
```

Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`]:

```py
>>> ds.filter(lambda x: x["cases"] > 10000)
```
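
For a quick sanity check outside 🤗 Datasets, you can run the same query with `pandas.read_sql` against the connection. A small sketch with an in-memory stand-in table, since the real `us_covid_data.db` may not be present:

```python
import sqlite3

import pandas as pd

# Stand-in `states` table so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "state": ["California", "Texas", "California"],
    "cases": [12000, 8000, 500],
}).to_sql("states", conn, index=False, if_exists="replace")

# Same WHERE clause as the query passed to `from_sql` above.
result = pd.read_sql("SELECT * FROM states WHERE state='California';", conn)
```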

## PostgreSQL

<Tip warning={true}>

This example is designed to run only in Google Colab. Be careful if you want to run the server commands locally!

</Tip>

PostgreSQL is a popular open-source database. You can use an existing database if you'd like, or follow along and start from scratch.

Start by installing the PostgreSQL server, then set up an empty database and a password:

```py
# Install postgresql server
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql
!sudo service postgresql start

# Set up a password `postgres` for the username `postgres`
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"

# Set up a database named `hfds_demo`
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS hfds_demo;'
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE hfds_demo;'
```

Set up the environment variables:

```py
%env POSTGRES_DB_NAME=hfds_demo
%env POSTGRES_DB_HOST=localhost
%env POSTGRES_DB_PORT=5432
%env POSTGRES_DB_USER=postgres
%env POSTGRES_DB_PASS=postgres
```
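
The `%env` magics only work in a notebook. Outside one, the plain-Python equivalent is to set the same variables through `os.environ` (the values are the placeholders used above):

```python
import os

# Plain-Python equivalent of the %env magics, for use outside a notebook.
os.environ["POSTGRES_DB_NAME"] = "hfds_demo"
os.environ["POSTGRES_DB_HOST"] = "localhost"
os.environ["POSTGRES_DB_PORT"] = "5432"  # environment variables are always strings
os.environ["POSTGRES_DB_USER"] = "postgres"
os.environ["POSTGRES_DB_PASS"] = "postgres"
```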

Then you can load the [Air Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Air+Quality) from the UCI Machine Learning Repository into your newly created database:

```py
!curl -s -OL https://github.com/tensorflow/io/raw/master/docs/tutorials/postgresql/AirQualityUCI.sql

!PGPASSWORD=$POSTGRES_DB_PASS psql -q -h $POSTGRES_DB_HOST -p $POSTGRES_DB_PORT -U $POSTGRES_DB_USER -d $POSTGRES_DB_NAME -f AirQualityUCI.sql
```

To connect to the database, you'll need the [URI string](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) that identifies your database. For PostgreSQL, the dialect and connection parameters differ from SQLite, so check the [Database URLs](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) documentation for the exact format.

For PostgreSQL, it is:

```py
>>> import os

>>> postgres_uri = "postgresql://{}:{}@{}?port={}&dbname={}".format(
... os.environ['POSTGRES_DB_USER'],
... os.environ['POSTGRES_DB_PASS'],
... os.environ['POSTGRES_DB_HOST'],
... os.environ['POSTGRES_DB_PORT'],
... os.environ['POSTGRES_DB_NAME'],
... )
>>> postgres_uri
'postgresql://postgres:postgres@localhost?port=5432&dbname=hfds_demo'
```
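
If the password contains characters with special meaning in a URL (such as `@` or `/`), percent-encode it before interpolating it into the URI. A small sketch using the standard library, with a hypothetical password:

```python
from urllib.parse import quote_plus

password = "p@ss/word"  # hypothetical password containing URL-unsafe characters
safe_password = quote_plus(password)

uri = "postgresql://postgres:{}@localhost?port=5432&dbname=hfds_demo".format(safe_password)
```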

The Air Quality Data Set table can't be loaded directly because 🤗 Datasets can't figure out how to cast some of the columns to their correct underlying feature types. You can fix this issue by [specifying your own features](loading#specify-features):

```py
>>> from datasets import Value, Features

>>> features = Features({
... 'date': Value('date32'),
... 'time': Value('string'),
... 'co': Value('float32'),
... 'pt08s1': Value('int32'),
... 'nmhc': Value('float32'),
... 'c6h6': Value('float32'),
... 'pt08s2': Value('int32'),
... 'nox': Value('float32'),
... 'pt08s3': Value('int32'),
... 'no2': Value('float32'),
... 'pt08s4': Value('int32'),
... 'pt08s5': Value('int32'),
... 't': Value('float32'),
... 'rh': Value('float32'),
... 'ah': Value('float32'),
... })
```

Now load the table by passing the table name, URI, and features to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql("airqualityuci", postgres_uri, features=features)
>>> ds
Dataset({
features: ['date', 'time', 'co', 'pt08s1', 'nmhc', 'c6h6', 'pt08s2', 'nox', 'pt08s3', 'no2', 'pt08s4', 'pt08s5', 't', 'rh', 'ah'],
num_rows: 9357
})
```

Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`]:

```py
>>> ds.filter(lambda x: x["co"] > 3)
```

You can also load a dataset from a SQL query instead of an entire table, which is useful for querying and joining multiple tables.

Load the dataset by passing your query and URI to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql('SELECT date, co FROM AirQualityUCI WHERE co > 3;', postgres_uri)
>>> ds
Dataset({
features: ['date', 'co'],
num_rows: 1715
})
```

Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`]:

```py
>>> ds.filter(lambda x: x["co"] > 5)
```