4 changes: 4 additions & 0 deletions docs/source/_toctree.yml
@@ -73,6 +73,10 @@
- local: nlp_process
title: Process text data
title: "Text"
- sections:
- local: tabular_load
title: Load tabular data
title: "Tabular"
- sections:
- local: share
title: Share
3 changes: 2 additions & 1 deletion docs/source/how_to.md
@@ -10,12 +10,13 @@ Interested in learning more? Take a look at [Chapter 5](https://huggingface.co/c

</Tip>

The guides are organized into six sections:

- <span class="underline decoration-sky-400 decoration-2 font-semibold">General usage</span>: Functions for general dataset loading and processing. The functions shown in this section are applicable across all dataset modalities.
- <span class="underline decoration-pink-400 decoration-2 font-semibold">Audio</span>: How to load, process, and share audio datasets.
- <span class="underline decoration-yellow-400 decoration-2 font-semibold">Vision</span>: How to load, process, and share image datasets.
- <span class="underline decoration-green-400 decoration-2 font-semibold">Text</span>: How to load, process, and share text datasets.
- <span class="underline decoration-orange-400 decoration-2 font-semibold">Tabular</span>: How to load, process, and share tabular datasets.
- <span class="underline decoration-indigo-400 decoration-2 font-semibold">Dataset repository</span>: How to share and upload a dataset to the <a href="https://huggingface.co/datasets">Hub</a>.

If you have any questions about 🤗 Datasets, feel free to join and ask the community on our [forum](https://discuss.huggingface.co/c/datasets/10).
201 changes: 201 additions & 0 deletions docs/source/tabular_load.mdx
@@ -0,0 +1,201 @@
# Load tabular data

Many real-world datasets are stored in databases, which are typically accessed with SQL queries. With 🤗 Datasets, you can connect to a database, query for the data you need, and create a dataset out of it. Then you can use all the processing features of 🤗 Datasets to prepare your dataset for training.

This guide will show you how to connect to SQLite and PostgreSQL and:

- Load an entire table.
- Load from a SQL query.

Check out this [notebook](https://colab.research.google.com/github/nateraw/huggingface-hub-examples/blob/main/sql_with_huggingface_datasets.ipynb) for a hands-on example!

## SQLite

SQLite is a small, lightweight database that is fast and easy to set up. You can use an existing database if you'd like, or follow along and start from scratch.

Start by creating a quick SQLite database with this [Covid-19 data](https://github.com/nytimes/covid-19-data/blob/master/us-states.csv) from the New York Times:

```py
>>> import sqlite3
>>> import pandas as pd

>>> conn = sqlite3.connect("us_covid_data.db")
>>> df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
>>> df.to_sql("states", conn, if_exists="replace")
```

This creates a `states` table in the `us_covid_data.db` database which you can now load into a dataset.
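
You can sanity-check the new table with a plain SQL count before loading it. The snippet below is a minimal sketch against a tiny in-memory stand-in with the same columns, so it doesn't depend on the full CSV download:

```python
import sqlite3

import pandas as pd

# Tiny in-memory stand-in for the NYT table, with the same columns.
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02"],
    "state": ["California", "Texas"],
    "fips": [6, 48],
    "cases": [100, 50],
    "deaths": [1, 0],
})
df.to_sql("states", conn, if_exists="replace")

# Verify the table exists and count its rows.
n_rows = conn.execute("SELECT COUNT(*) FROM states").fetchone()[0]
```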

To connect to the database, you'll need the [URI string](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) that identifies your database. The URI string differs for each database dialect, so be sure to check the [Database URLs](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) for whichever database you're using.

For SQLite, it is:

```py
>>> uri = "sqlite:///us_covid_data.db"
```

Load the table by passing the table name and URI to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql("states", uri)
>>> ds
Dataset({
features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
num_rows: 54382
})
```

Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`]:

```py
>>> ds.filter(lambda x: x["state"] == "California")
```

You can also load a dataset from a SQL query instead of an entire table, which is useful for querying and joining multiple tables.

Load the dataset by passing your query and URI to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql("SELECT * FROM states WHERE state='California';", uri)
>>> ds
Dataset({
features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
num_rows: 1019
})
```

Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`]:

```py
>>> ds.filter(lambda x: x["cases"] > 10000)
```
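
For a quick sanity check outside 🤗 Datasets, you can run the same query with `pandas.read_sql` against the connection. A small sketch with an in-memory stand-in table, since the real `us_covid_data.db` may not be present:

```python
import sqlite3

import pandas as pd

# Stand-in `states` table so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "state": ["California", "Texas", "California"],
    "cases": [12000, 8000, 500],
}).to_sql("states", conn, index=False, if_exists="replace")

# Same WHERE clause as the query passed to `from_sql` above.
result = pd.read_sql("SELECT * FROM states WHERE state='California';", conn)
```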

## PostgreSQL

<Tip warning={true}>

This example is designed to run only in Google Colab. Be careful if you want to run the server commands locally!

</Tip>

PostgreSQL is a popular open-source database. You can use an existing database if you'd like, or follow along and start from scratch.

Start by installing the PostgreSQL server, then set up an empty database and a password:

```py
# Install postgresql server
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql
!sudo service postgresql start

# Set up a password `postgres` for the username `postgres`
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"

# Set up a database named `hfds_demo`
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS hfds_demo;'
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE hfds_demo;'
```

Set up the environment variables:

```py
%env POSTGRES_DB_NAME=hfds_demo
%env POSTGRES_DB_HOST=localhost
%env POSTGRES_DB_PORT=5432
%env POSTGRES_DB_USER=postgres
%env POSTGRES_DB_PASS=postgres
```
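
The `%env` magics only work in a notebook. Outside one, the plain-Python equivalent is to set the same variables through `os.environ` (the values are the placeholders used above):

```python
import os

# Plain-Python equivalent of the %env magics, for use outside a notebook.
os.environ["POSTGRES_DB_NAME"] = "hfds_demo"
os.environ["POSTGRES_DB_HOST"] = "localhost"
os.environ["POSTGRES_DB_PORT"] = "5432"  # environment variables are always strings
os.environ["POSTGRES_DB_USER"] = "postgres"
os.environ["POSTGRES_DB_PASS"] = "postgres"
```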

Then you can load the [Air Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Air+Quality) from the UCI Machine Learning Repository into your newly created database:

```py
!curl -s -OL https://github.com/tensorflow/io/raw/master/docs/tutorials/postgresql/AirQualityUCI.sql

!PGPASSWORD=$POSTGRES_DB_PASS psql -q -h $POSTGRES_DB_HOST -p $POSTGRES_DB_PORT -U $POSTGRES_DB_USER -d $POSTGRES_DB_NAME -f AirQualityUCI.sql
```

To connect to the database, you'll need the [URI string](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) that identifies your database. For PostgreSQL, the dialect and connection parameters differ from SQLite, so check the [Database URLs](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) documentation for the exact format.

For PostgreSQL, it is:

```py
>>> import os

>>> postgres_uri = "postgresql://{}:{}@{}?port={}&dbname={}".format(
... os.environ['POSTGRES_DB_USER'],
... os.environ['POSTGRES_DB_PASS'],
... os.environ['POSTGRES_DB_HOST'],
... os.environ['POSTGRES_DB_PORT'],
... os.environ['POSTGRES_DB_NAME'],
... )
>>> postgres_uri
'postgresql://postgres:postgres@localhost?port=5432&dbname=hfds_demo'
```
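
If the password contains characters with special meaning in a URL (such as `@` or `/`), percent-encode it before interpolating it into the URI. A small sketch using the standard library, with a hypothetical password:

```python
from urllib.parse import quote_plus

password = "p@ss/word"  # hypothetical password containing URL-unsafe characters
safe_password = quote_plus(password)

uri = "postgresql://postgres:{}@localhost?port=5432&dbname=hfds_demo".format(safe_password)
```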

The Air Quality Data Set table can't be loaded directly because 🤗 Datasets can't figure out how to cast some of the columns to their correct underlying feature types. You can fix this issue by [specifying your own features](loading#specify-features):

```py
>>> from datasets import Value, Features

>>> features = Features({
... 'date': Value('date32'),
... 'time': Value('string'),
... 'co': Value('float32'),
... 'pt08s1': Value('int32'),
... 'nmhc': Value('float32'),
... 'c6h6': Value('float32'),
... 'pt08s2': Value('int32'),
... 'nox': Value('float32'),
... 'pt08s3': Value('int32'),
... 'no2': Value('float32'),
... 'pt08s4': Value('int32'),
... 'pt08s5': Value('int32'),
... 't': Value('float32'),
... 'rh': Value('float32'),
... 'ah': Value('float32'),
... })
```

Now load the table by passing the table name, URI, and features to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql("airqualityuci", postgres_uri, features=features)
>>> ds
Dataset({
features: ['date', 'time', 'co', 'pt08s1', 'nmhc', 'c6h6', 'pt08s2', 'nox', 'pt08s3', 'no2', 'pt08s4', 'pt08s5', 't', 'rh', 'ah'],
num_rows: 9357
})
```

Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`]:

```py
>>> ds.filter(lambda x: x["co"] > 3)
```

You can also load a dataset from a SQL query instead of an entire table, which is useful for querying and joining multiple tables.

Load the dataset by passing your query and URI to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql('SELECT date, co FROM AirQualityUCI WHERE co > 3;', postgres_uri)
>>> ds
Dataset({
features: ['date', 'co'],
num_rows: 1715
})
```

Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`]:

```py
>>> ds.filter(lambda x: x["co"] > 5)
```