docs: exhaustive overview of statements & best practices

wprzytula · wprzytula · commit 349bc79686af · 2024-08-29T13:07:04.000+02:00
In order to avoid API misuse, a lot of knowledge is now presented in
structured tables, and best practices are described to aid users.
diff --git a/docs/source/queries/paged.md b/docs/source/queries/paged.md
@@ -2,9 +2,31 @@
 Sometimes query results might be so big that one prefers not to fetch them all at once,
 e.g. to reduce latency and/or memory footprint.
 Paged queries allow to receive the whole result page by page, with a configurable page size.
+In fact, most SELECT queries should be paged, to avoid a big load on the cluster and a large memory footprint.
 
-`Session::query_iter` and `Session::execute_iter` take a [simple query](simple.md)
-or a [prepared query](prepared.md) and return an `async` iterator over result `Rows`.
+> ***Warning***\
+> Issuing unpaged SELECTs (`Session::query_unpaged` or `Session::execute_unpaged`)
+> may have dramatic performance consequences! **BEWARE!**\
+> If the result set is big (or, e.g., there are a lot of tombstones), the following atrocities can happen:
+> - the cluster may experience high load,
+> - queries may time out,
+> - the driver may devour a lot of RAM,
+> - latency will likely spike.
+>
+> Stay safe. Page your SELECTs.
+
+## `RowIterator`
+
+The automated way to page results is `RowIterator`. It always fetches and enables access to one page,
+while prefetching the next one. This limits latency and is a convenient abstraction.
+
+> ***Note***\
+> `RowIterator` is quite heavy machinery, introducing considerable overhead. Therefore,
+> don't use it for statements that do not benefit from paging. In particular, avoid using it
+> for non-SELECTs.
+
+At the API level, `Session::query_iter` and `Session::execute_iter` take a [simple query](simple.md)
+or a [prepared query](prepared.md), respectively, and return an `async` iterator over result `Rows`.
 
 > ***Warning***\
 > In case of unprepared variant (`Session::query_iter`) if the values are not empty
@@ -22,7 +44,6 @@ Use `query_iter` to perform a [simple query](simple.md) with paging:
 # use scylla::Session;
 # use std::error::Error;
 # async fn check_only_compiles(session: &Session) -> Result<(), Box<dyn Error>> {
-use scylla::IntoTypedRows;
 use futures::stream::StreamExt;
 
 let mut rows_stream = session
@@ -45,7 +66,6 @@ Use `execute_iter` to perform a [prepared query](prepared.md) with paging:
 # use scylla::Session;
 # use std::error::Error;
 # async fn check_only_compiles(session: &Session) -> Result<(), Box<dyn Error>> {
-use scylla::IntoTypedRows;
 use scylla::prepared_statement::PreparedStatement;
 use futures::stream::StreamExt;
 
@@ -106,10 +126,10 @@ let _ = session.execute_iter(prepared, &[]).await?; // ...
 # }
 ```
 
-### Passing the paging state manually
-It's possible to fetch a single page from the table, extract the paging state
-from the result and manually pass it to the next query. That way, the next
-query will start fetching the results from where the previous one left off.
+## Manual paging
+It's possible to fetch a single page from the table and manually pass the paging state
+to the next query. That way, the next query will start fetching the results
+from where the previous one left off.
 
 On a `Query`:
 ```rust
@@ -197,5 +217,18 @@ loop {
 ```
 
 ### Performance
-Performance is the same as in non-paged variants.\
-For the best performance use [prepared queries](prepared.md).
+For the best performance use [prepared queries](prepared.md).
+See [query types overview](queries.md).
+
+## Best practices
+
+| Query result fetching   | Unpaged                                                                                                                 | Paged manually                                                                                       | Paged automatically                                                                               |
+|-------------------------|-------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
+| Exposed Session API     | `{query,execute}_unpaged`                                                                                               | `{query,execute}_single_page`                                                                        | `{query,execute}_iter`                                                                            |
+| Working                 | get all results in a single CQL frame, into a single Rust struct                                                        | get one page of results in a single CQL frame, into a single Rust struct                             | upon high-level iteration, fetch consecutive CQL frames and transparently iterate over their rows |
+| Cluster load            | potentially **HIGH** for large results, beware!                                                                         | normal                                                                                               | normal                                                                                            |
+| Driver overhead         | low - simple frame fetch                                                                                                | low - simple frame fetch                                                                             | considerable - `RowIteratorWorker` is a separate tokio task                                       |
+| Feature limitations     | none                                                                                                                    | none                                                                                                 | speculative execution not supported                                                               |
+| Driver memory footprint | potentially **BIG** - all results have to be stored at once!                                                            | small - only one page stored at a time                                                               | small - at most a constant number of pages stored at a time                                       |
+| Latency                 | potentially **BIG** - all results have to be generated at once!                                                         | considerable on page boundary - new page needs to be fetched                                         | small - next page is always pre-fetched in background                                             |
+| Suitable operations     | - in general: operations with empty result set (non-SELECTs)<br/> - as possible optimisation: SELECTs with LIMIT clause | - for advanced users who prefer more control over paging, with less overhead of `RowIteratorWorker`  | - in general: all SELECTs                                                                         |
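Note on the "Paged manually" column above: the following is a minimal, driver-free sketch of the manual paging loop, assuming a paging state can be modelled as a plain row offset. `Cursor`, `MockTable`, `single_page` and `fetch_all_manually` are hypothetical names invented for this illustration and are not part of the driver; with the real driver, the role of `single_page` would be played by `{query,execute}_single_page`.

```rust
use std::ops::ControlFlow;

/// Hypothetical stand-in for a paging state: here it is just a row offset.
#[derive(Debug, Clone, Copy)]
struct Cursor(usize);

/// Hypothetical stand-in for a table that is served page by page.
struct MockTable {
    rows: Vec<i32>,
    page_size: usize,
}

impl MockTable {
    /// Models one single-page request: returns one page of rows and either
    /// the cursor to resume from or a signal that no more pages are left.
    fn single_page(&self, cursor: Cursor) -> (Vec<i32>, ControlFlow<(), Cursor>) {
        let start = cursor.0.min(self.rows.len());
        // Guard against a zero page size, which would never make progress.
        let end = (start + self.page_size.max(1)).min(self.rows.len());
        let page = self.rows[start..end].to_vec();
        let next = if end < self.rows.len() {
            ControlFlow::Continue(Cursor(end))
        } else {
            ControlFlow::Break(())
        };
        (page, next)
    }
}

/// Manual paging loop: each request resumes where the previous one stopped.
/// Returns all rows together with the number of pages that were fetched.
fn fetch_all_manually(table: &MockTable) -> (Vec<i32>, usize) {
    let mut cursor = Cursor(0);
    let mut all_rows = Vec::new();
    let mut pages = 0;
    loop {
        let (page, next) = table.single_page(cursor);
        pages += 1;
        all_rows.extend(page);
        match next {
            ControlFlow::Break(()) => break,
            ControlFlow::Continue(new_cursor) => cursor = new_cursor,
        }
    }
    (all_rows, pages)
}
```

Only one page is held by `single_page` at a time; the caller decides whether to keep, process, or drop each page before asking for the next one, which is the extra control (and the extra per-page round trip) that the table attributes to manual paging.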
diff --git a/docs/source/queries/queries.md b/docs/source/queries/queries.md
@@ -1,26 +1,80 @@
-# Making queries
-
-This driver supports all query types available in Scylla:
-* [Simple queries](simple.md)
-    * Easy to use
-    * Poor performance
-    * Primitive load balancing
-* [Prepared queries](prepared.md)
-    * Need to be prepared before use
-    * Fast
-    * Properly load balanced
-* [Batch statements](batch.md)
-    * Run multiple queries at once
-    * Can be prepared for better performance and load balancing
-* [Paged queries](paged.md)
-    * Allows to read result in multiple pages when it might be so big that one
-      prefers not to fetch it all at once
-    * Can be prepared for better performance and load balancing
-
-Additionally there is special functionality to enable `USE KEYSPACE` queries:
-[USE keyspace](usekeyspace.md)
-
-Queries are fully asynchronous - you can run as many of them in parallel as you wish.
+# Making queries - best practices
+
+The driver supports all kinds of statements supported by ScyllaDB. The following tables aim to bridge DB concepts and the driver's API.
+They include recommendations on which API to use in which cases.
+
+## Kinds of CQL statements (from the CQL protocol point of view):
+
+| Kind of CQL statement | Single              | Batch                                    |
+|-----------------------|---------------------|------------------------------------------|
+| Prepared              | `PreparedStatement` | `Batch` filled with `PreparedStatement`s |
+| Unprepared            | `Query`             | `Batch` filled with `Query`s             |
+
+This is **NOT** strictly related to the content of the CQL query string.
+
+> ***Interesting note***\
+> In fact, any kind of CQL statement could contain any CQL query string.
+> Yet, some such combinations don't make sense and will be rejected by the DB.
+> For example, SELECTs in a Batch are nonsense.
+
+### [Unprepared](simple.md) vs [Prepared](prepared.md)
+
+> ***GOOD TO KNOW***\
+> Each time a statement is executed by sending a query string to the DB, it needs to be parsed. The driver does not parse CQL, so it sees query strings as opaque.\
+> There is an option to *prepare* a statement, i.e. have the DB parse it once and associate it with an ID. After preparation, it's enough for the driver to send the ID
+> and the DB already knows what operation to perform - no more expensive parsing necessary! Moreover, upon preparation the driver receives valuable data for load balancing,
+> enabling advanced load balancing (so better performance!) of all further executions of that prepared statement.\
+> ***Key takeaway:*** always prepare statements that you are going to execute multiple times.
+
+| Statement comparison | Unprepared                                | Prepared                                                                                                        |
+|----------------------|-------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
+| Exposed Session API  | `query_*`                                 | `execute_*`                                                                                                     |
+| Usability            | execute CQL statement string directly     | need to be separately prepared before use, in-background repreparations if statement falls off the server cache |
+| Performance          | poor (statement parsed each time)         | good (statement parsed only upon preparation)                                                                   |
+| Load balancing       | primitive (random choice of a node/shard) | advanced (proper node/shard, optimisations for LWT statements)                                                  |
+| Suitable operations  | one-shot operations                       | repeated operations                                                                                             |
+
+### Single vs [Batch](batch.md)
+
+| Statement comparison | Single                                                | Batch                                                                                                                                                                                |
+|----------------------|-------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Exposed Session API  | `query_*`, `execute_*`                                | `batch`                                                                                                                                                                              |
+| Usability            | simple setup                                          | need to aggregate statements; binding values to each is more cumbersome                                                                                                              |
+| Performance          | good (DB is optimised for handling single statements) | good for small batches, may be worse for larger (also: higher risk of request timeout due to big portion of work)                                                                    |
+| Load balancing       | advanced if prepared, else primitive                  | advanced if prepared **and ALL** statements in the batch target the same partition, else primitive                                                                                   |
+| Suitable operations  | most operations                                       | - a list of operations that need to be executed atomically (batch LightWeight Transaction)<br/> - a batch of operations targeting the same partition (as an advanced optimisation)   |
+
+## CQL statements - operations (based on what the CQL string contains):
+
+| CQL data manipulation statement                | Recommended statement kind                                                                                                               | Recommended Session operation                                                                               |
+|------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------|
+| SELECT                                         | `PreparedStatement` if repeated, `Query` if once                                                                                         | `{query,execute}_iter` (or `{query,execute}_single_page` in a manual loop for performance / more control)   |
+| INSERT, UPDATE                                 | `PreparedStatement` if repeated, `Query` if once, `Batch` if multiple statements are to be executed atomically (LightWeight Transaction) | `{query,execute}_unpaged` (paging is irrelevant, because the result set of such operation is empty)         |
+| CREATE/DROP {KEYSPACE, TABLE, TYPE, INDEX,...} | `Query`, `Batch` if multiple statements are to be executed atomically (LightWeight Transaction)                                          | `query_unpaged` (paging is irrelevant, because the result set of such operation is empty)                   |
+
+### [Paged](paged.md) vs Unpaged query
+
+> ***GOOD TO KNOW***\
+> SELECT statements return a [result set](result.md), possibly a large one. Therefore, paging is available to fetch it in chunks, relieving load on the cluster and lowering latency.\
+> ***Key takeaways:***\
+> For SELECTs you had better **avoid unpaged queries**.\
+> For non-SELECTs, the unpaged API is preferred.
+
+| Query result fetching | Unpaged                                                                                                                 | Paged                                                                                                                                                                |
+|-----------------------|-------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Exposed Session API   | `{query,execute}_unpaged`                                                                                               | `{query,execute}_single_page`, `{query,execute}_iter`                                                                                                                |
+| Usability             | get all results in a single CQL frame, so into a [single Rust struct](result.md)                                        | need to fetch multiple CQL frames and iterate over them - using driver's abstractions (`{query,execute}_iter`) or manually (`{query,execute}_single_page` in a loop) |
+| Performance           | - for large results, puts **high load on the cluster**<br/> - for small results, the same as paged                      | - for large results, relieves the cluster<br/> - for small results, the same as unpaged                                                                              |
+| Memory footprint      | potentially big - all results have to be stored at once                                                                 | small - at most a constant number of pages are stored by the driver at the same time                                                                                 |
+| Latency               | potentially big - all results have to be generated at once                                                              | small - at most one chunk of data must be generated at once, so latency of each chunk is small                                                                       |
+| Suitable operations   | - in general: operations with empty result set (non-SELECTs)<br/> - as possible optimisation: SELECTs with LIMIT clause | - in general: all SELECTs                                                                                                                                            |
+
+For a more detailed comparison and further best practices, see the [doc page about paging](paged.md).
+
+### Queries are fully asynchronous - you can run as many of them in parallel as you wish.
+
+## `USE KEYSPACE`:
+There is special functionality to enable [USE keyspace](usekeyspace.md) queries.
 
 ```{eval-rst}
 .. toctree::