
Commit 7273c2d

Docs improvements (#132)
* Fixed a bug in function references in docs
* More details on the dask-sql internals
1 parent bdc518e commit 7273c2d

8 files changed: 129 additions & 21 deletions


docs/pages/cmd.rst

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ or by running these lines of code
     cmd_loop()
 
 Some options can be set, e.g. to preload some testdata.
-Have a look into :func:`dask_sql.cmd_loop` or call
+Have a look into :func:`~dask_sql.cmd_loop` or call
 
 .. code-block:: bash
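The preloading mentioned in this hunk can be sketched as follows, assuming :func:`~dask_sql.cmd_loop` accepts a prepared context as the surrounding docs suggest (the table name is illustrative):

.. code-block:: python

    from dask.datasets import timeseries
    from dask_sql import Context, cmd_loop

    c = Context()
    c.create_table("timeseries", timeseries())  # preload some test data

    cmd_loop(context=c)  # start the interactive SQL prompt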

docs/pages/custom.rst

Lines changed: 2 additions & 2 deletions
@@ -11,7 +11,7 @@ Scalar Functions
 ----------------
 
 A scalar function (such as :math:`x \to x^2`) turns a given column into another column of the same length.
-It can be registered for usage in SQL with the :func:`dask_sql.Context.register_function` method.
+It can be registered for usage in SQL with the :func:`~dask_sql.Context.register_function` method.
 
 Example:
@@ -38,7 +38,7 @@ Aggregation Functions
 
 Aggregation functions run on a single column and turn them into a single value.
 This means they can only be used in ``GROUP BY`` aggregations.
-They can be registered with the :func:`dask_sql.Context.register_aggregation` method.
+They can be registered with the :func:`~dask_sql.Context.register_aggregation` method.
 This time however, an instance of a :class:`dask.dataframe.Aggregation` needs to be passed
 instead of a plain function.
 More information on dask aggregations can be found in the
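For reference, a minimal sketch of the two registration calls touched in this hunk (the function bodies, names and dtypes are illustrative):

.. code-block:: python

    import numpy as np
    import dask.dataframe as dd
    from dask_sql import Context

    c = Context()

    # Scalar function: maps a column to another column of the same length
    def squared(x):
        return x ** 2

    c.register_function(squared, "squared", [("x", np.float64)], np.float64)

    # Aggregation: a dask.dataframe.Aggregation instance instead of a plain function
    my_sum = dd.Aggregation("my_sum", lambda s: s.sum(), lambda s: s.sum())
    c.register_aggregation(my_sum, "my_sum", [("x", np.float64)], np.float64)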

docs/pages/data_input.rst

Lines changed: 6 additions & 6 deletions
@@ -3,14 +3,14 @@
 Data Loading and Input
 ======================
 
-Before data can be queried with ``dask-sql``, it needs to be loaded into the dask cluster (or local instance) and registered with the :class:`dask_sql.Context`.
+Before data can be queried with ``dask-sql``, it needs to be loaded into the dask cluster (or local instance) and registered with the :class:`~dask_sql.Context`.
 For this, ``dask-sql`` uses the wide field of possible `input formats <https://docs.dask.org/en/latest/dataframe-create.html>`_ of ``dask``, plus some additional formats only suitable for `dask-sql`.
 You have multiple possibilities to load input data in ``dask-sql``:
 
 1. Load it via python
 -------------------------------
 
-You can either use already created dask dataframes or create one by using the :func:`create_table` function.
+You can either use already created dask dataframes or create one by using the :func:`~dask_sql.Context.create_table` function.
 Chances are high, there exists already a function to load your favorite format or location (e.g. s3 or hdfs).
 See below for all formats understood by ``dask-sql``.
 Make sure to install required libraries both on the driver and worker machines.
@@ -58,7 +58,7 @@ In ``dask``, you can publish datasets with names into the cluster memory.
 This allows to reuse the same data from multiple clients/users in multiple sessions.
 
 For example, you can publish your data using the ``client.publish_dataset`` function of the ``distributed.Client``,
-and then later register it in the :class:`dask_sql.Context` via SQL:
+and then later register it in the :class:`~dask_sql.Context` via SQL:
 
 .. code-block:: python
@@ -93,7 +93,7 @@ Input Formats
 * All formats and locations mentioned in `the Dask docu <https://docs.dask.org/en/latest/dataframe-create.html>`_, including csv, parquet, json.
   Just pass in the location as string (and possibly the format, e.g. "csv" if it is not clear from the file extension).
   The data can be from local disc or many remote locations (S3, hdfs, Azure Filesystem, http, Google Filesystem, ...) - just prefix the path with the matching protocol.
-  Additional arguments passed to :func:`create_table` or ``CREATE TABLE`` are given to the ``read_<format>`` calls.
+  Additional arguments passed to :func:`~dask_sql.Context.create_table` or ``CREATE TABLE`` are given to the ``read_<format>`` calls.
 
   Example:
@@ -113,7 +113,7 @@ Input Formats
     )
 
 * If your data is already in Pandas (or Dask) DataFrames format, you can just use it as it is via the Python API
-  by giving it to :ref:`create_table` directly.
+  by giving it to :func:`~dask_sql.Context.create_table` directly.
 * You can connect ``dask-sql`` to an `intake <https://intake.readthedocs.io/en/latest/index.html>`_ catalog and
   use the data registered there. Assuming you have an intake catalog stored in "catalog.yaml" (can also be
   the URL of an intake server), you can read in a stored table "data_table" either via Python
@@ -161,7 +161,7 @@ Input Formats
     c.create_table("my_data", cursor, hive_table_name="the_name_in_hive")
 
 Again, ``hive_table_name`` is optional and defaults to the table name in ``dask-sql``.
-You can also control the database used in Hive via the ``hive_schema_name```parameter.
+You can also control the database used in Hive via the ``hive_schema_name`` parameter.
 Additional arguments are pushed to the internally called ``read_<format>`` functions.
 
 .. note::
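A minimal sketch of the string-based loading path described in this file (the bucket path and separator are illustrative; extra keyword arguments are forwarded to the underlying ``read_csv``):

.. code-block:: python

    from dask_sql import Context

    c = Context()

    # Location given as a string; the format is inferred from the file
    # extension (or passed explicitly), extra kwargs go to dask's read_csv
    c.create_table("my_data", "s3://my-bucket/data/*.csv", format="csv", sep=";")

    df = c.sql("SELECT COUNT(*) FROM my_data")
    print(df.compute())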

docs/pages/how_does_it_work.rst

Lines changed: 113 additions & 5 deletions
@@ -7,8 +7,116 @@ At the core, ``dask-sql`` does two things:
   which is specified as a tree of java objects - similar to many other SQL engines (Hive, Flink, ...)
 - convert this description of the query from java objects into dask API calls (and execute them) - returning a dask dataframe.
 
-For the first step, Apache Calcite needs to know about the columns and types of the dask dataframes,
-therefore some java classes to store this information for dask dataframes are defined in ``planner``.
-After the translation to a relational algebra is done (using ``RelationalAlgebraGenerator.getRelationalAlgebra``),
-the python methods defined in ``dask_sql.physical`` turn this into a physical dask execution plan by converting
-each piece of the relational algebra one-by-one.
+The following example explains this in quite some technical detail.
+For most users, this level of technical understanding is not needed.
+
+1. SQL enters the library
+-------------------------
+
+No matter whether it comes in via the Python API (:ref:`api`), the command line client (:ref:`cmd`) or the server (:ref:`server`), the SQL statement of the user will eventually end up as a string in the function :func:`~dask_sql.Context.sql`.
+
+2. SQL is parsed
+----------------
+
+This function will first give the SQL string to the implemented Java classes (especially :class:`RelationalAlgebraGenerator`) via the ``jpype`` library.
+Inside this class, Apache Calcite is used to first parse the SQL string and then turn it into a relational algebra.
+For this, Apache Calcite uses the SQL language description specified in the Calcite library itself and the additional definitions in the ``.ftl`` files in the ``dask-sql`` repository.
+They specify custom language features, such as the ``CREATE MODEL`` statement.
+
+.. note::
+
+    ``.ftl`` stands for FreeMarker Template Language and is one of the standard templating languages used in the Java ecosystem.
+    Each of the "functions" defined in the documents defines a part of the (extended) SQL language in ``javacc`` format.
+    FreeMarker is used to combine these parser definitions with the ones from Apache Calcite. Have a look into the ``config.fmpp`` file for more information.
+
+For example, the following ``javacc`` code
+
+.. code-block::
+
+    SqlNode SqlShowTables() :
+    {
+        final Span s;
+        final SqlIdentifier schema;
+    }
+    {
+        <SHOW> { s = span(); } <TABLES> <FROM>
+        schema = CompoundIdentifier()
+        {
+            return new SqlShowTables(s.end(this), schema);
+        }
+    }
+
+describes a parser rule, which understands SQL statements such as
+
+.. code-block:: sql
+
+    SHOW TABLES FROM "schema"
+
+While parsing the SQL, such statements are turned into an instance of the Java class :class:`SqlShowTables` (which is also defined in this project).
+The :class:`Span` is used internally in Apache Calcite to store the position in the parsed SQL statement (e.g. for better error output).
+The ``SqlShowTables`` javacc function (not the Java class SqlShowTables) is listed in ``config.fmpp`` as one of the ``statementParserMethods``, which makes it parsable as a main SQL statement (similar to any normal ``SELECT ...`` statement).
+All Java classes used as parser return values inherit from the Calcite class :class:`SqlNode` or a derived subclass (where it makes sense). Those classes are merely containers to store the information from the parsed SQL statements (such as the schema name in the example above) and do not have any business logic by themselves.
+
+3. SQL is (maybe) optimized
+---------------------------
+
+Once the SQL string is parsed into an instance of a :class:`SqlNode` (or a subclass of it), Apache Calcite can convert it into a relational algebra and optimize it. As this is only implemented for Calcite-own classes (and not for the custom classes such as :class:`SqlCreateModel`), this conversion and optimization is not triggered for all SQL statements (have a look into :func:`Context._get_ral`).
+
+After optimization, the resulting Java instance will be a class of any of the :class:`Logical*` classes in Apache Calcite (such as :class:`LogicalJoin`). Each of those can contain other instances as "inputs", creating a tree of the different steps in the SQL statement (see below for an example).
+
+In the end, the result is either an optimized tree of steps in the relational algebra (represented by instances of the :class:`Logical*` classes) or an instance of a :class:`SqlNode` (sub)class.
+
+4. Translation to Dask API calls
+--------------------------------
+
+Depending on which type the resulting java class has, it is converted into calls to python functions using different python "converters". For each Java class, there exists a converter class in the ``dask_sql.physical.rel`` folder, which is registered at the :class:`dask_sql.physical.rel.convert.RelConverter` class.
+The job of these converters is to use the information stored in the java class instances and turn it into calls to python functions (see the example below for more information).
+
+As many SQL statements contain calculations using literals and/or columns, these are split into their own functionality (``dask_sql.physical.rex``) following a similar plugin-based converter system.
+Have a look into the specific classes to understand how the conversion of a specific SQL language feature is implemented.
+
+5. Result
+---------
+
+The result of each of the conversions is a :class:`dask.DataFrame`, which is given to the user. In case of the command line tool or the SQL server, it is evaluated immediately - otherwise it can be used for further calculations by the user.
+
+Example
+-------
+
+Let's walk through the steps above using the example SQL statement
+
+.. code-block:: sql
+
+    SELECT x + y FROM timeseries WHERE x > 0
+
+assuming the table "timeseries" is already registered.
+If you want to follow along with the steps outlined in the following, start the command line tool in debug mode
+
+.. code-block:: bash
+
+    dask-sql --load-test-data --startup --log-level DEBUG
+
+and enter the SQL statement above.
+
+First, the SQL is parsed by Apache Calcite and (as it is not a custom statement) transformed into a tree of relational algebra objects.
+
+.. code-block:: none
+
+    LogicalProject(EXPR$0=[+($3, $4)])
+      LogicalFilter(condition=[>($3, 0)])
+        LogicalTableScan(table=[[schema, timeseries]])
+
+The tree output above means that the outer instance (:class:`LogicalProject`) needs as input the output of the previous instance (:class:`LogicalFilter`) etc.
+
+Therefore, the conversion to python API calls happens recursively (depth-first). First, the :class:`LogicalTableScan` is converted using the :class:`rel.logical.table_scan.LogicalTableScanPlugin` plugin. It will just get the correct :class:`dask.DataFrame` from the dictionary of already registered tables of the context.
+Next, the :class:`LogicalFilter` (having the dataframe as input) is converted via the :class:`rel.logical.filter.LogicalFilterPlugin`.
+The filter expression ``>($3, 0)`` is converted into ``df["x"] > 0`` using a combination of REX plugins (have a look into the debug output to learn more) and applied to the dataframe.
+The resulting dataframe is then passed to the converter :class:`rel.logical.project.LogicalProjectPlugin` for the :class:`LogicalProject`.
+This will calculate the expression ``df["x"] + df["y"]`` (after having converted it via the :class:`RexCallPlugin` plugin) and return the final result to the user.
+
+.. code-block:: python
+
+    df_table_scan = context.tables["timeseries"]
+    df_filter = df_table_scan[df_table_scan["x"] > 0]
+    df_project = df_filter.assign(col=df_filter["x"] + df_filter["y"])
+    return df_project[["col"]]
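Seen from the user's side, the whole pipeline above is triggered by a single call. A minimal sketch using dask's bundled test data (the table and column names follow the example above):

.. code-block:: python

    from dask.datasets import timeseries
    from dask_sql import Context

    c = Context()
    c.create_table("timeseries", timeseries())

    # Parsing, optimization and conversion (steps 1-4) all happen inside sql()
    df = c.sql("SELECT x + y FROM timeseries WHERE x > 0")

    # The result is a lazy dask DataFrame (step 5); compute() triggers execution
    print(df.compute())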

docs/pages/machine_learning.rst

Lines changed: 2 additions & 2 deletions
@@ -19,7 +19,7 @@ Please also see :ref:`ml` for more information on the SQL statements used on thi
 -------------------------------------------------------------
 
 If you are familiar with Python and the ML ecosystem in Python, this one is probably
-the simplest possibility. You can use the :func:`Context.sql` call as described
+the simplest possibility. You can use the :func:`~dask_sql.Context.sql` call as described
 before to extract the data for your training or ML prediction.
 The result will be a Dask dataframe, which you can either directly feed into your model
 or convert to a pandas dataframe with `.compute()` before.
@@ -49,7 +49,7 @@ automatically. The syntax is similar to the `BigQuery Predict Syntax <https://cl
 This call will first collect the data from the inner ``SELECT`` call (which can be any valid
 ``SELECT`` call, including ``JOIN``, ``WHERE``, ``GROUP BY``, custom tables and views etc.)
 and will then apply the model with the name "my_model" for prediction.
-The model needs to be registered at the context before using :func:`register_model`.
+The model needs to be registered at the context before using :func:`~dask_sql.Context.register_model`.
 
 .. code-block:: python
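A minimal sketch of the Python-side workflow described here, assuming a scikit-learn estimator and illustrative table/column names (the exact :func:`~dask_sql.Context.register_model` keyword arguments are an assumption):

.. code-block:: python

    from sklearn.ensemble import GradientBoostingClassifier
    from dask_sql import Context

    c = Context()
    # ... create_table calls for "my_data" happen before this point ...

    # Extract the training data via SQL and pull it into pandas
    training_df = c.sql("SELECT x, y, target FROM my_data").compute()

    model = GradientBoostingClassifier()
    model.fit(training_df[["x", "y"]], training_df["target"])

    # Make the fitted model available to SQL PREDICT queries
    c.register_model("my_model", model, training_columns=["x", "y"])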

docs/pages/quickstart.rst

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ Read more on the data input part in :ref:`data_input`.
 --------------------
 
 If we want to work with the data in SQL, we need to give the data frame a unique name.
-We do this by registering the data at an instance of a :class:`dask_sql.Context`.
+We do this by registering the data at an instance of a :class:`~dask_sql.Context`.
 Typically, you only have a single context per application.
 
 .. code-block:: python
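As a minimal sketch of that registration step (``df`` stands for any dask or pandas dataframe created earlier in the quickstart):

.. code-block:: python

    from dask_sql import Context

    c = Context()                  # typically a single context per application
    c.create_table("my_data", df)  # the dataframe is now queryable as "my_data"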

docs/pages/server.rst

Lines changed: 2 additions & 2 deletions
@@ -32,7 +32,7 @@ or by using the created docker image
     docker run --rm -it -p 8080:8080 nbraun/dask-sql
 
 This will spin up a server on port 8080 (by default).
-The port and bind interfaces can be controlled with the ``--port`` and ``--host`` command line arguments (or options to :func:`dask_sql.run_server`).
+The port and bind interfaces can be controlled with the ``--port`` and ``--host`` command line arguments (or options to :func:`~dask_sql.run_server`).
 
 The running server looks similar to a normal presto database to any presto client and can therefore be used
 with any library, e.g. the `presto CLI client <https://prestosql.io/docs/current/installation/cli.html>`_ or
@@ -68,7 +68,7 @@ commands.
 Preregister your own data sources
 ---------------------------------
 
-The python function :func:`dask_sql.run_server` accepts an already created :class:`dask_sql.Context`.
+The python function :func:`~dask_sql.run_server` accepts an already created :class:`~dask_sql.Context`.
 This means you can preload your data sources and register them with a context before starting your server.
 By this, your server will already have data to query:
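A minimal sketch of such a preregistration (the table name and bind options are illustrative):

.. code-block:: python

    from dask.datasets import timeseries
    from dask_sql import Context, run_server

    c = Context()
    c.create_table("timeseries", timeseries())  # preload data before startup

    # The server now answers presto-protocol queries against "timeseries"
    run_server(context=c, host="0.0.0.0", port=8080)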

docs/pages/sql/ml.rst

Lines changed: 2 additions & 2 deletions
@@ -13,7 +13,7 @@ As all SQL statements in ``dask-sql`` are eventually converted to Python calls,
 any custom Python function and library, e.g. Machine Learning libraries. Although it would be possible to
 register custom functions (see :ref:`custom`) for this and use them, it is much more convenient if this functionality
 is already included in the core SQL language.
-These three statements help in training and using models. Every :class:`Context` has a registry for models, which
+These three statements help in training and using models. Every :class:`~dask_sql.Context` has a registry for models, which
 can be used for training or prediction.
 For a full example, see :ref:`machine_learning`.
@@ -128,7 +128,7 @@ Predict the target using the given model and dataframe from the ``SELECT`` query
 The return value is the input dataframe with an additional column named
 "target", which contains the predicted values.
 The model needs to be registered at the context before using it in this function,
-either by calling :func:`Context.register_model` explicitly or by training
+either by calling :func:`~dask_sql.Context.register_model` explicitly or by training
 a model using the ``CREATE MODEL`` SQL statement above.
 
 A model can be anything which has a ``predict`` function.
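A minimal sketch of issuing such a ``PREDICT`` query from Python, assuming a context ``c`` with a registered model "my_model" as described above (column and table names are illustrative):

.. code-block:: python

    # Runs the inner SELECT, applies the registered model "my_model" and
    # returns the input columns plus the additional "target" column
    result = c.sql("""
        SELECT * FROM PREDICT (
            MODEL my_model,
            SELECT x, y FROM timeseries
        )
    """)
    print(result.compute())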
