Add SQL guide #5223
Merged
# Load tabular data

Many real-world datasets are stored in databases, which are typically accessed with SQL queries. With 🤗 Datasets, you can connect to a database, query for the data you need, and create a dataset out of it. Then you can use all the processing features of 🤗 Datasets to prepare your dataset for training.
This guide will show you how to connect to SQLite and PostgreSQL and:

- Load an entire table.
- Load from a SQL query.

Check out this [notebook](https://colab.research.google.com/github/nateraw/huggingface-hub-examples/blob/main/sql_with_huggingface_datasets.ipynb) for a hands-on example!
## SQLite

SQLite is a small, lightweight database that is fast and easy to set up. You can use an existing database if you'd like, or follow along and start from scratch.

Start by creating a quick SQLite database with this [Covid-19 data](https://github.com/nytimes/covid-19-data/blob/master/us-states.csv) from the New York Times:
```py
>>> import sqlite3
>>> import pandas as pd

>>> conn = sqlite3.connect("us_covid_data.db")
>>> df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
>>> df.to_sql("states", conn, if_exists="replace")
```
This creates a `states` table in the `us_covid_data.db` database, which you can now load into a dataset.
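The snippet above needs a network download to run. As an offline sanity check, a table with the same shape can be built and inspected with only the standard-library `sqlite3` module — the rows below are made up for illustration, not real Covid-19 data:

```py
import sqlite3

# Made-up rows mirroring the columns of the NYT us-states.csv file
rows = [
    ("2021-01-01", "California", "06", 100, 2),
    ("2021-01-01", "Texas", "48", 80, 1),
]

# An in-memory database stands in for us_covid_data.db
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE states (date TEXT, state TEXT, fips TEXT, cases INTEGER, deaths INTEGER)"
)
conn.executemany("INSERT INTO states VALUES (?, ?, ?, ?, ?)", rows)

# Confirm the table exists and holds the expected rows
count = conn.execute("SELECT COUNT(*) FROM states").fetchone()[0]
print(count)  # → 2
```

The `states` table name and column layout here match what `Dataset.from_sql` would see after the real `to_sql` call.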
To connect to the database, you'll need the [URI string](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) that identifies your database. The URI string differs for each database dialect, so be sure to check the [Database URLs](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) documentation for whichever database you're using.

For SQLite, it is:

```py
>>> uri = "sqlite:///us_covid_data.db"
```
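These URIs share the `dialect://user:password@host:port/database` shape, which the standard-library `urllib.parse` can take apart — the credentials and host below are placeholders, not values from this guide:

```py
from urllib.parse import urlsplit

# Placeholder credentials and host, for illustration only
uri = "postgresql://user:secret@dbhost:5432/mydb"

parts = urlsplit(uri)
print(parts.scheme)    # → postgresql
print(parts.hostname)  # → dbhost
print(parts.port)      # → 5432
print(parts.path)      # → /mydb
```

Note that `sqlite:///...` has an empty host section, which is why its URI starts with three slashes.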
Load the table by passing the table name and URI to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql("states", uri)
>>> ds
Dataset({
    features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
    num_rows: 54382
})
```
Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`], for example:

```py
>>> ds.filter(lambda x: x["state"] == "California")
```
You can also load a dataset from a SQL query instead of an entire table, which is useful for querying and joining multiple tables.

Load the dataset by passing your query and URI to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql('SELECT * FROM states WHERE state="California";', uri)
>>> ds
Dataset({
    features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
    num_rows: 1019
})
```
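Because `from_sql` accepts arbitrary SQL, the query can also join tables. Here is a minimal offline sketch of such a join using the standard-library `sqlite3` module and two made-up tables — with 🤗 Datasets you would pass the same query string and a URI to `Dataset.from_sql` instead:

```py
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE states (state TEXT, cases INTEGER);
    CREATE TABLE regions (state TEXT, region TEXT);
    INSERT INTO states VALUES ('California', 1019), ('Texas', 800);
    INSERT INTO regions VALUES ('California', 'West'), ('Texas', 'South');
""")

# Join the two tables and keep only high-case rows
query = """
    SELECT s.state, s.cases, r.region
    FROM states AS s
    JOIN regions AS r ON s.state = r.state
    WHERE s.cases > 900;
"""
result = conn.execute(query).fetchall()
print(result)  # → [('California', 1019, 'West')]
```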
Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`], for example:

```py
>>> ds.filter(lambda x: x["cases"] > 10000)
```
## PostgreSQL

<Tip warning={true}>

This example is designed to run only in a Google Colab notebook. Be careful if you want to run the server commands locally!

</Tip>
PostgreSQL is a popular open-source database. You can use an existing database if you'd like, or follow along and start from scratch.

Start by installing the PostgreSQL server and setting up an empty database with a password:

```py
# Install the PostgreSQL server
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql
!sudo service postgresql start

# Set the password `postgres` for the username `postgres`
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"

# Set up a database named `hfds_demo`
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS hfds_demo;'
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE hfds_demo;'
```
Set up the environment variables:

```py
%env POSTGRES_DB_NAME=hfds_demo
%env POSTGRES_DB_HOST=localhost
%env POSTGRES_DB_PORT=5432
%env POSTGRES_DB_USER=postgres
%env POSTGRES_DB_PASS=postgres
```
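Outside a notebook, the same settings can be read back in plain Python with `os.environ`, falling back to defaults when a variable is unset — the fallbacks here mirror the `%env` lines above and are only illustrative:

```py
import os

# Read each variable, defaulting to the values the %env cells set
db_config = {
    "name": os.environ.get("POSTGRES_DB_NAME", "hfds_demo"),
    "host": os.environ.get("POSTGRES_DB_HOST", "localhost"),
    "port": int(os.environ.get("POSTGRES_DB_PORT", "5432")),
    "user": os.environ.get("POSTGRES_DB_USER", "postgres"),
    "password": os.environ.get("POSTGRES_DB_PASS", "postgres"),
}
print(sorted(db_config))
```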
Then you can load the [Air Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Air+Quality) from the UCI Machine Learning Repository into your newly created database:

```py
!curl -s -OL https://github.com/tensorflow/io/raw/master/docs/tutorials/postgresql/AirQualityUCI.sql

!PGPASSWORD=$POSTGRES_DB_PASS psql -q -h $POSTGRES_DB_HOST -p $POSTGRES_DB_PORT -U $POSTGRES_DB_USER -d $POSTGRES_DB_NAME -f AirQualityUCI.sql
```
Again, you'll need the [URI string](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) that identifies your database. For PostgreSQL, it is:

```py
>>> import os

>>> postgres_uri = "postgresql://{}:{}@{}?port={}&dbname={}".format(
...     os.environ['POSTGRES_DB_USER'],
...     os.environ['POSTGRES_DB_PASS'],
...     os.environ['POSTGRES_DB_HOST'],
...     os.environ['POSTGRES_DB_PORT'],
...     os.environ['POSTGRES_DB_NAME'],
... )
>>> postgres_uri
'postgresql://postgres:postgres@localhost?port=5432&dbname=hfds_demo'
```
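One caveat with building the URI by hand: a password containing reserved characters such as `@` or `:` would break it. Percent-encoding the credentials with the standard-library `urllib.parse.quote_plus` avoids this — the password below is made up for illustration:

```py
from urllib.parse import quote_plus

password = "p@ss:word"  # made-up password containing URI-reserved characters
safe_password = quote_plus(password)
print(safe_password)  # → p%40ss%3Aword

# The encoded form slots safely into the URI template used above
uri = f"postgresql://postgres:{safe_password}@localhost?port=5432&dbname=hfds_demo"
```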
The Air Quality Data Set table can't be loaded directly because 🤗 Datasets can't figure out how to cast some of the columns to their correct underlying feature types. You can fix this issue by [specifying your own features](loading#specify-features):

```py
>>> from datasets import Value, Features

>>> features = Features({
...     'date': Value('date32'),
...     'time': Value('string'),
...     'co': Value('float32'),
...     'pt08s1': Value('int32'),
...     'nmhc': Value('float32'),
...     'c6h6': Value('float32'),
...     'pt08s2': Value('int32'),
...     'nox': Value('float32'),
...     'pt08s3': Value('int32'),
...     'no2': Value('float32'),
...     'pt08s4': Value('int32'),
...     'pt08s5': Value('int32'),
...     't': Value('float32'),
...     'rh': Value('float32'),
...     'ah': Value('float32'),
... })
```
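Conceptually, a `Features` mapping is a per-column cast table: each column name maps to the type its values should be decoded as. A plain-Python sketch of the same idea on two made-up rows (an illustration of the concept, not how 🤗 Datasets implements it):

```py
# Map a few column names to cast functions, mirroring the Features dict above
casts = {"time": str, "co": float, "pt08s1": int}

# Made-up raw rows, as strings the way a loosely typed dump might deliver them
raw_rows = [
    {"time": "18.00.00", "co": "2.6", "pt08s1": "1360"},
    {"time": "19.00.00", "co": "2.0", "pt08s1": "1292"},
]

typed_rows = [{col: casts[col](val) for col, val in row.items()} for row in raw_rows]
print(typed_rows[0])  # → {'time': '18.00.00', 'co': 2.6, 'pt08s1': 1360}
```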
Now load the table by passing the table name, URI, and features to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql("airqualityuci", postgres_uri, features=features)
>>> ds
Dataset({
    features: ['date', 'time', 'co', 'pt08s1', 'nmhc', 'c6h6', 'pt08s2', 'nox', 'pt08s3', 'no2', 'pt08s4', 'pt08s5', 't', 'rh', 'ah'],
    num_rows: 9357
})
```
Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`], for example:

```py
>>> ds.filter(lambda x: x["co"] > 3)
```
As with SQLite, you can also load a dataset from a SQL query instead of an entire table, which is useful for querying and joining multiple tables.

Load the dataset by passing your query and URI to [`~datasets.Dataset.from_sql`]:

```py
>>> from datasets import Dataset

>>> ds = Dataset.from_sql('SELECT date, co FROM AirQualityUCI WHERE co > 3;', postgres_uri)
>>> ds
Dataset({
    features: ['date', 'co'],
    num_rows: 1715
})
```
Then you can use all of 🤗 Datasets' processing features, like [`~datasets.Dataset.filter`], for example:

```py
>>> ds.filter(lambda x: x["co"] > 5)
```