docs/pages/data_input.rst
Lines changed: 6 additions & 6 deletions
@@ -3,14 +3,14 @@
 Data Loading and Input
 ======================

-Before data can be queried with ``dask-sql``, it needs to be loaded into the dask cluster (or local instance) and registered with the :class:`dask_sql.Context`.
+Before data can be queried with ``dask-sql``, it needs to be loaded into the dask cluster (or local instance) and registered with the :class:`~dask_sql.Context`.
 For this, ``dask-sql`` uses the wide field of possible `input formats <https://docs.dask.org/en/latest/dataframe-create.html>`_ of ``dask``, plus some additional formats only suitable for `dask-sql`.
 You have multiple possibilities to load input data in ``dask-sql``:

 1. Load it via python
 -------------------------------

-You can either use already created dask dataframes or create one by using the :func:`create_table` function.
+You can either use already created dask dataframes or create one by using the :func:`~dask_sql.Context.create_table` function.
 Chances are high, there exists already a function to load your favorite format or location (e.g. s3 or hdfs).
 See below for all formats understood by ``dask-sql``.
 Make sure to install required libraries both on the driver and worker machines.
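A minimal sketch of the Python route described in this hunk; the table name ``my_table`` and the CSV path are made up for illustration:

.. code-block:: python

    import dask.dataframe as dd
    from dask_sql import Context

    c = Context()

    # load the data with any of dask's read_* functions ...
    df = dd.read_csv("s3://my-bucket/data-*.csv")

    # ... and register it with the context so it can be queried via SQL
    c.create_table("my_table", df)

    c.sql("SELECT COUNT(*) FROM my_table")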
@@ -58,7 +58,7 @@ In ``dask``, you can publish datasets with names into the cluster memory.
 This allows to reuse the same data from multiple clients/users in multiple sessions.

 For example, you can publish your data using the ``client.publish_dataset`` function of the ``distributed.Client``,
-and then later register it in the :class:`dask_sql.Context` via SQL:
+and then later register it in the :class:`~dask_sql.Context` via SQL:

 .. code-block:: python

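The code block that follows in the file lies outside the hunk's context window. As a rough sketch of the workflow this paragraph describes, publishing a dataset and later registering it via SQL might look like the following; the dataset name and the ``memory`` format/location keys are assumptions for illustration, not taken from the diff:

.. code-block:: python

    import dask.dataframe as dd
    from dask.distributed import Client
    from dask_sql import Context

    client = Client()  # or Client("scheduler-address:8786") for a remote cluster
    df = dd.read_parquet("s3://my-bucket/data/")  # illustrative source

    # publish the dataframe into the cluster memory under a name
    client.publish_dataset(my_dataset=df)

    # later, possibly from another client/session, register it with dask-sql via SQL
    c = Context()
    c.sql("""
        CREATE TABLE my_table WITH (
            format = 'memory',
            location = 'my_dataset'
        )
    """)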
@@ -93,7 +93,7 @@ Input Formats
 * All formats and locations mentioned in `the Dask docu <https://docs.dask.org/en/latest/dataframe-create.html>`_, including csv, parquet, json.
   Just pass in the location as string (and possibly the format, e.g. "csv" if it is not clear from the file extension).
   The data can be from local disc or many remote locations (S3, hdfs, Azure Filesystem, http, Google Filesystem, ...) - just prefix the path with the matching protocol.
-  Additional arguments passed to :func:`create_table` or ``CREATE TABLE`` are given to the ``read_<format>`` calls.
+  Additional arguments passed to :func:`~dask_sql.Context.create_table` or ``CREATE TABLE`` are given to the ``read_<format>`` calls.

   Example:

@@ -113,7 +113,7 @@ Input Formats
 )

 * If your data is already in Pandas (or Dask) DataFrames format, you can just use it as it is via the Python API
-  by giving it to :ref:`create_table` directly.
+  by giving it to :func:`~dask_sql.Context.create_table` directly.
 * You can connect ``dask-sql`` to an `intake <https://intake.readthedocs.io/en/latest/index.html>`_ catalog and
   use the data registered there. Assuming you have an intake catalog stored in "catalog.yaml" (can also be
   the URL of an intake server), you can read in a stored table "data_table" either via Python
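The hunk ends mid-sentence here. The Python route it starts to describe could look roughly like this sketch; the catalog path and table name come from the surrounding text, the rest is illustrative and assumes the intake source can be read as a dask dataframe:

.. code-block:: python

    import intake
    from dask_sql import Context

    catalog = intake.open_catalog("catalog.yaml")

    # read the intake source as a dask dataframe and register it with the context
    c = Context()
    c.create_table("data_table", catalog.data_table.to_dask())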
docs/pages/how_does_it_work.rst
Lines changed: 113 additions & 5 deletions
@@ -7,8 +7,116 @@ At the core, ``dask-sql`` does two things:
   which is specified as a tree of java objects - similar to many other SQL engines (Hive, Flink, ...)
 - convert this description of the query from java objects into dask API calls (and execute them) - returning a dask dataframe.

-For the first step, Apache Calcite needs to know about the columns and types of the dask dataframes,
-therefore some java classes to store this information for dask dataframes are defined in ``planner``.
-After the translation to a relational algebra is done (using ``RelationalAlgebraGenerator.getRelationalAlgebra``),
-the python methods defined in ``dask_sql.physical`` turn this into a physical dask execution plan by converting
-each piece of the relational algebra one-by-one.
+The following example explains this in quite some technical detail.
+For most users, this level of technical understanding is not needed.
+
+1. SQL enters the library
+-------------------------
+
+No matter whether it arrives via the Python API (:ref:`api`), the command line client (:ref:`cmd`) or the server (:ref:`server`), the user's SQL statement eventually ends up as a string in the function :func:`~dask_sql.Context.sql`.
+
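As a quick illustration of this entry point (the query itself is arbitrary):

.. code-block:: python

    from dask_sql import Context

    c = Context()

    # the SQL statement enters dask-sql as a plain string ...
    ddf = c.sql("SELECT 1 + 1 AS two")

    # ... and comes back as a (lazy) dask dataframe
    print(ddf.compute())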
+2. SQL is parsed
+----------------
+
+This function will first give the SQL string to the implemented Java classes (especially :class:`RelationalAlgebraGenerator`) via the ``jpype`` library.
+Inside this class, Apache Calcite is used to first parse the SQL string and then turn it into a relational algebra.
+For this, Apache Calcite uses the SQL language description specified in the Calcite library itself and the additional definitions in the ``.ftl`` files in the ``dask-sql`` repository.
+They specify custom language features, such as the ``CREATE MODEL`` statement.
+
+.. note::
+
+    ``.ftl`` stands for FreeMarker Template Language and is one of the standard templating languages used in the Java ecosystem.
+    Each of the "functions" defined in the documents defines a part of the (extended) SQL language in ``javacc`` format.
+    FreeMarker is used to combine these parser definitions with the ones from Apache Calcite. Have a look into the ``config.fmpp`` file for more information.
+
+For example, the following ``javacc`` code
+
+.. code-block::
+
+    SqlNode SqlShowTables() :
+    {
+        final Span s;
+        final SqlIdentifier schema;
+    }
+    {
+        <SHOW> { s = span(); } <TABLES> <FROM>
+        schema = CompoundIdentifier()
+        {
+            return new SqlShowTables(s.end(this), schema);
+        }
+    }
+
+describes a parser rule, which understands SQL statements such as
+
+.. code-block:: sql
+
+    SHOW TABLES FROM "schema"
+
+While parsing the SQL, such statements are turned into an instance of the Java class :class:`SqlShowTables` (which is also defined in this project).
+The :class:`Span` is used internally in Apache Calcite to store the position in the parsed SQL statement (e.g. for better error output).
+The ``SqlShowTables`` javacc function (not the Java class SqlShowTables) is listed in ``config.fmpp`` under ``statementParserMethods``, which makes it parsable as a main SQL statement (similar to any normal ``SELECT ...`` statement).
+All Java classes used as parser return values inherit from the Calcite class :class:`SqlNode` or a derived subclass (where it makes sense). Those classes are merely containers to store the information from the parsed SQL statements (such as the schema name in the example above) and do not have any business logic by themselves.
+
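Such a custom statement is issued through the same entry point as any other query. A small sketch, assuming the default schema is named "schema" (as the table-scan output in the example further below suggests):

.. code-block:: python

    from dask_sql import Context

    c = Context()

    # the custom SHOW TABLES statement defined above, sent through Context.sql
    tables = c.sql('SHOW TABLES FROM "schema"')
    print(tables.compute())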
+3. SQL is (maybe) optimized
+---------------------------
+
+Once the SQL string is parsed into an instance of a :class:`SqlNode` (or a subclass of it), Apache Calcite can convert it into a relational algebra and optimize it. As this is only implemented for Calcite-own classes (and not for the custom classes such as :class:`SqlCreateModel`), this conversion and optimization is not triggered for all SQL statements (have a look into :func:`Context._get_ral`).
+
+After optimization, the resulting Java instance will be one of the :class:`Logical*` classes in Apache Calcite (such as :class:`LogicalJoin`). Each of those can contain other instances as "inputs", creating a tree of the different steps in the SQL statement (see below for an example).
+
+In the end, the result is either an optimized tree of steps in the relational algebra (represented by instances of the :class:`Logical*` classes) or an instance of a :class:`SqlNode` (sub)class.
+
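If you want to look at this tree yourself from Python, a sketch like the following can help; it assumes the :func:`~dask_sql.Context.explain` helper, which returns the relational algebra as a string (check the API reference of your version):

.. code-block:: python

    # assuming ``c`` is a Context with the "timeseries" table registered
    print(c.explain("SELECT x + y FROM timeseries WHERE x > 0"))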
+4. Translation to Dask API calls
+--------------------------------
+
+Depending on the type of the resulting Java class, it is converted into calls to python functions using different python "converters". For each Java class, there exists a converter class in the ``dask_sql.physical.rel`` folder, which is registered at the :class:`dask_sql.physical.rel.convert.RelConverter` class.
+Their job is to use the information stored in the java class instances and turn it into calls to python functions (see the example below for more information).
+
+As many SQL statements contain calculations using literals and/or columns, these are split out into their own functionality (``dask_sql.physical.rex``) following a similar plugin-based converter system.
+Have a look into the specific classes to understand how the conversion of a specific SQL language feature is implemented.
+
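The converter plugins roughly follow a pattern like the sketch below. This is a simplified, hypothetical illustration of the idea, not the exact dask-sql plugin interface; the class name and the ``convert_input``/``convert_rex`` helpers are made up to show the shape of such a converter:

.. code-block:: python

    class MyFilterPlugin:
        """Hypothetical converter: turns a LogicalFilter java node into dask calls."""

        # the java class this plugin is responsible for
        class_name = "org.apache.calcite.rel.logical.LogicalFilter"

        def convert(self, rel, context):
            # recursively convert the input of this node first (depth-first)
            df = context.convert_input(rel.getInput(0))

            # turn the filter condition (a REX expression) into a boolean dask series
            condition = context.convert_rex(rel.getCondition(), df)

            # apply it and hand the result to the next converter up the tree
            return df[condition]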
+5. Result
+---------
+
+The result of each of the conversions is a :class:`dask.DataFrame`, which is given to the user. In the case of the command line tool or the SQL server, it is evaluated immediately - otherwise it can be used for further calculations by the user.
+
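In practice this means the returned object behaves like any other dask dataframe. A small sketch, reusing the query from the example below and assuming the "timeseries" table is registered:

.. code-block:: python

    # ``c`` is a Context with the "timeseries" table registered
    result = c.sql("SELECT x + y FROM timeseries WHERE x > 0")

    # nothing has been computed yet - ``result`` is a lazy dask dataframe
    print(result.compute())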
+Example
+-------
+
+Let's walk through the steps above using the example SQL statement
+
+.. code-block:: sql
+
+    SELECT x + y FROM timeseries WHERE x > 0
+
+assuming the table "timeseries" is already registered.
+If you want to follow along with the steps outlined in the following, start the command line tool in debug mode.
+
+First, the SQL is parsed by Apache Calcite and (as it is not a custom statement) transformed into a tree of relational algebra objects.
+
+.. code-block:: none
+
+    LogicalProject(EXPR$0=[+($3, $4)])
+      LogicalFilter(condition=[>($3, 0)])
+        LogicalTableScan(table=[[schema, timeseries]])
+
+The tree output above means that the outer instance (:class:`LogicalProject`) needs as input the output of the inner instance (:class:`LogicalFilter`), and so on.
+
+Therefore the conversion to python API calls is done recursively (depth-first). First, the :class:`LogicalTableScan` is converted using the :class:`rel.logical.table_scan.LogicalTableScanPlugin` plugin. It will just get the correct :class:`dask.DataFrame` from the dictionary of already registered tables of the context.
+Next, the :class:`LogicalFilter` (having the dataframe as input) is converted via the :class:`rel.logical.filter.LogicalFilterPlugin`.
+The filter expression ``>($3, 0)`` is converted into ``df["x"] > 0`` using a combination of REX plugins (have a look into the debug output to learn more) and applied to the dataframe.
+The resulting dataframe is then passed to the converter :class:`rel.logical.project.LogicalProjectPlugin` for the :class:`LogicalProject`.
+This will calculate the expression ``df["x"] + df["y"]`` (after having converted it via the :class:`RexCallPlugin` plugin) and return the final result to the user.
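Put together, the recursive conversion described above corresponds roughly to the following dask operations; ``registered_tables`` is a hypothetical stand-in for the context's internal dictionary of registered tables:

.. code-block:: python

    # LogicalTableScan: look up the already registered dask dataframe
    df = registered_tables["timeseries"]

    # LogicalFilter: >($3, 0) becomes a boolean mask on column "x"
    df = df[df["x"] > 0]

    # LogicalProject: +($3, $4) becomes the computed expression column
    result = (df["x"] + df["y"]).to_frame(name="EXPR$0")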