diff --git a/docs/source/contributor-guide/architecture.md b/docs/source/contributor-guide/architecture.md index 48c065f5b73a..ef20644eafc8 100644 --- a/docs/source/contributor-guide/architecture.md +++ b/docs/source/contributor-guide/architecture.md @@ -20,7 +20,8 @@ # Architecture DataFusion's code structure and organization is described in the -[Crate Documentation], to keep it as close to the source as -possible. +[crates.io documentation], to keep it as close to the source as +possible. You can find the most up-to-date version in the [source code]. -[crate documentation]: https://docs.rs/datafusion/latest/datafusion/index.html#code-organization +[crates.io documentation]: https://docs.rs/datafusion/latest/datafusion/index.html#code-organization +[source code]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/lib.rs diff --git a/docs/source/index.rst b/docs/source/index.rst index 79fbf498f4c5..83c517faf0e2 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -37,10 +37,9 @@ community. :maxdepth: 1 :caption: Links - Issue tracker + GitHub and Issue Tracker crates.io - API Docs - Github + API Docs Code of conduct .. _toc.guide: @@ -50,22 +49,17 @@ community. user-guide/introduction user-guide/example-usage - user-guide/users - user-guide/comparison - user-guide/integration - user-guide/library user-guide/cli user-guide/dataframe user-guide/expressions user-guide/sql/index user-guide/configs user-guide/faq - Rust Crate Documentation .. _toc.contributor-guide: .. toctree:: - :maxdepth: 2 + :maxdepth: 1 :caption: Contributor Guide contributor-guide/index diff --git a/docs/source/user-guide/cli.md b/docs/source/user-guide/cli.md index ef65561f28f0..afe3821b2d23 100644 --- a/docs/source/user-guide/cli.md +++ b/docs/source/user-guide/cli.md @@ -17,7 +17,7 @@ under the License. 
--> -# DataFusion Command-line SQL Utility +# `datafusion-cli` The DataFusion CLI is a command-line interactive SQL utility for executing queries against any supported data files. It is a convenient way to diff --git a/docs/source/user-guide/comparison.md b/docs/source/user-guide/comparison.md deleted file mode 100644 index 2cb13f326afb..000000000000 --- a/docs/source/user-guide/comparison.md +++ /dev/null @@ -1,52 +0,0 @@ - - -# Comparisons to Other Projects - -When compared to similar systems, DataFusion typically is: - -1. Targeted at developers, rather than end users / data scientists. -2. Designed to be embedded, rather than a complete file based SQL system. -3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than a single company or individual. -4. Implemented in `Rust`, rather than `C/C++` - -Here is a comparison with similar projects that may help understand -when DataFusion might be be suitable and unsuitable for your needs: - -- [DuckDB](http://www.duckdb.org) is an open source, in process analytic database. - Like DataFusion, it supports very fast execution, both from its custom file format - and directly from parquet files. Unlike DataFusion, it is written in C/C++ and it - is primarily used directly by users as a serverless database and query system rather - than as a library for building such database systems. - -- [Polars](http://pola.rs): Polars is one of the fastest DataFrame - libraries at the time of writing. Like DataFusion, it is also - written in Rust and uses the Apache Arrow memory model, but unlike - DataFusion it does not provide SQL nor as many extension points. - -- [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) - is an execution engine. Like DataFusion, Velox aims to - provide a reusable foundation for building database-like systems. Unlike DataFusion, - it is written in C/C++ and does not include a SQL frontend or planning /optimization - framework. 
- -- [Databend](https://github.com/datafuselabs/databend) is a complete - database system. Like DataFusion it is also written in Rust and - utilizes the Apache Arrow memory model, but unlike DataFusion it - targets end-users rather than developers of other database systems. diff --git a/docs/source/user-guide/example-usage.md b/docs/source/user-guide/example-usage.md index a2cd109a61ef..fd3c4cf1833c 100644 --- a/docs/source/user-guide/example-usage.md +++ b/docs/source/user-guide/example-usage.md @@ -26,7 +26,7 @@ In this example some simple processing is performed on the [`example.csv`](../.. Add the following to your `Cargo.toml` file: ```toml -datafusion = "11.0" +datafusion = "22" tokio = "1.0" ``` @@ -81,7 +81,7 @@ async fn main() -> datafusion::error::Result<()> { +---+--------+ ``` -# Identifiers and Capitalization +## Identifiers and Capitalization Please be aware that all identifiers are effectively made lower-case in SQL, so if your csv file has capital letters (ex: `Name`) you must put your column name in double quotes or the examples won't work. @@ -141,3 +141,60 @@ async fn main() -> datafusion::error::Result<()> { | 1 | 2 | +---+--------+ ``` + +## Extensibility + +DataFusion is designed to be extensible at all points. To that end, you can provide your own custom: + +- [x] User Defined Functions (UDFs) +- [x] User Defined Aggregate Functions (UDAFs) +- [x] User Defined Table Source (`TableProvider`) for tables +- [x] User Defined `Optimizer` passes (plan rewrites) +- [x] User Defined `LogicalPlan` nodes +- [x] User Defined `ExecutionPlan` nodes + +## Rust Version Compatibility + +This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler. + +## Optimized Configuration + +For an optimized build several steps are required. First, use the below in your `Cargo.toml`. 
It is +worth noting that using the settings in the `[profile.release]` section will significantly increase the build time. + +```toml +[dependencies] +datafusion = { version = "22.0" , features = ["simd"]} +tokio = { version = "^1.0", features = ["rt-multi-thread"] } +snmalloc-rs = "0.2" + +[profile.release] +lto = true +codegen-units = 1 +``` + +Then, in `main.rs`, update the memory allocator with the below after your imports: + +```rust +use datafusion::prelude::*; + +#[global_allocator] +static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc; + +async fn main() -> datafusion::error::Result<()> { + Ok(()) +} +``` + +Finally, building with the `simd` feature requires a nightly Rust toolchain. + +```shell +rustup toolchain install nightly +``` + +Based on the instruction set architecture you are building on, you will want to configure the `target-cpu` as well, ideally +with `native` or at least `avx2`. + +``` +RUSTFLAGS='-C target-cpu=native' cargo +nightly run --release +``` diff --git a/docs/source/user-guide/expressions.md b/docs/source/user-guide/expressions.md index 339bfadfe911..dbca3d01ca0e 100644 --- a/docs/source/user-guide/expressions.md +++ b/docs/source/user-guide/expressions.md @@ -17,7 +17,7 @@ under the License. --> -# Expressions +# Expression API DataFrame methods such as `select` and `filter` accept one or more logical expressions and there are many functions available for creating logical expressions. These are documented below. diff --git a/docs/source/user-guide/faq.md b/docs/source/user-guide/faq.md index 16a8873fff38..18f0acfa4d91 100644 --- a/docs/source/user-guide/faq.md +++ b/docs/source/user-guide/faq.md @@ -29,3 +29,37 @@ model and computational kernels. It is designed to run within a single process, for parallel query execution. [Ballista](https://github.com/apache/arrow-ballista) is a distributed compute platform built on DataFusion. + +# How does DataFusion Compare with `XYZ`? 
+ +When compared to similar systems, DataFusion typically is: + +1. Targeted at developers, rather than end users / data scientists. +2. Designed to be embedded, rather than a complete file based SQL system. +3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than a single company or individual. +4. Implemented in `Rust`, rather than `C/C++` + +Here is a comparison with similar projects that may help you understand +when DataFusion might be suitable and unsuitable for your needs: + +- [DuckDB](https://www.duckdb.org) is an open source, in-process analytic database. + Like DataFusion, it supports very fast execution, both from its custom file format + and directly from parquet files. Unlike DataFusion, it is written in C/C++ and it + is primarily used directly by users as a serverless database and query system rather + than as a library for building such database systems. + +- [Polars](http://pola.rs): Polars is one of the fastest DataFrame + libraries at the time of writing. Like DataFusion, it is also + written in Rust and uses the Apache Arrow memory model, but unlike + DataFusion it is not designed with as many extension points. + +- [Facebook Velox](https://github.com/facebookincubator/velox) + is an execution engine. Like DataFusion, Velox aims to + provide a reusable foundation for building database-like systems. Unlike DataFusion, + it is written in C/C++ and does not include a SQL frontend or planning / optimization + framework. + +- [Databend](https://github.com/datafuselabs/databend) is a complete + database system. Like DataFusion it is also written in Rust and + utilizes the Apache Arrow memory model, but unlike DataFusion it + targets end-users rather than developers of other database systems. 
diff --git a/docs/source/user-guide/integration.md b/docs/source/user-guide/integration.md deleted file mode 100644 index bffa6b189390..000000000000 --- a/docs/source/user-guide/integration.md +++ /dev/null @@ -1,35 +0,0 @@ - - -# Integrations and Extensions - -There are a number of community projects that extend DataFusion or -provide integrations with other systems. - -## Language Bindings - -- [datafusion-c](https://github.com/datafusion-contrib/datafusion-c) -- [datafusion-python](https://github.com/apache/arrow-datafusion-python) -- [datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby) -- [datafusion-java](https://github.com/datafusion-contrib/datafusion-java) - -## Integrations - -- [datafusion-bigtable](https://github.com/datafusion-contrib/datafusion-bigtable) -- [datafusion-catalogprovider-glue](https://github.com/datafusion-contrib/datafusion-catalogprovider-glue) diff --git a/docs/source/user-guide/introduction.md b/docs/source/user-guide/introduction.md index f906eac78c13..62cebd5145c6 100644 --- a/docs/source/user-guide/introduction.md +++ b/docs/source/user-guide/introduction.md @@ -17,7 +17,7 @@ under the License. --> -# Features, and Usecases +# Introduction DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in [Rust](http://rustlang.org), @@ -66,6 +66,72 @@ features, and avoid reimplementing general (but still necessary) features such as an expression representation, standard optimizations, execution plans, file format support, etc. 
+## Known Users + +Here are some of the projects known to use DataFusion: + +- [Ballista](https://github.com/apache/arrow-ballista) Distributed SQL Query Engine +- [Blaze](https://github.com/blaze-init/blaze) Spark accelerator with DataFusion at its core +- [CeresDB](https://github.com/CeresDB/ceresdb) Distributed Time-Series Database +- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust) +- [CnosDB](https://github.com/cnosdb/cnosdb) Open Source Distributed Time Series Database +- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust) +- [Dask SQL](https://github.com/dask-contrib/dask-sql) Distributed SQL query engine in Python +- [datafusion-tui](https://github.com/datafusion-contrib/datafusion-tui) Text UI for DataFusion +- [delta-rs](https://github.com/delta-io/delta-rs) Native Rust implementation of Delta Lake +- [Flock](https://github.com/flock-lab/flock) +- [GreptimeDB](https://github.com/GreptimeTeam/greptimedb) Open Source & Cloud Native Distributed Time Series Database +- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database +- [Kamu](https://github.com/kamu-data/kamu-cli/) Planet-scale streaming data pipeline +- [Parseable](https://github.com/parseablehq/parseable) Log storage and observability platform +- [qv](https://github.com/timvw/qv) Quickly view your data +- [ROAPI](https://github.com/roapi/roapi) +- [Seafowl](https://github.com/splitgraph/seafowl) CDN-friendly analytical database +- [Synnada](https://synnada.ai/) Streaming-first framework for data products +- [Tensorbase](https://github.com/tensorbase/tensorbase) +- [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar +- [ZincObserve](https://github.com/zinclabs/zincobserve) Distributed cloud native observability platform + +[ballista]: https://github.com/apache/arrow-ballista +[blaze]: https://github.com/blaze-init/blaze +[ceresdb]: https://github.com/CeresDB/ceresdb +[cloudfuse 
buzz]: https://github.com/cloudfuse-io/buzz-rust +[cnosdb]: https://github.com/cnosdb/cnosdb +[cube store]: https://github.com/cube-js/cube.js/tree/master/rust +[dask sql]: https://github.com/dask-contrib/dask-sql +[datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui +[delta-rs]: https://github.com/delta-io/delta-rs +[flock]: https://github.com/flock-lab/flock +[kamu]: https://github.com/kamu-data/kamu-cli +[greptime db]: https://github.com/GreptimeTeam/greptimedb +[influxdb iox]: https://github.com/influxdata/influxdb_iox +[parseable]: https://github.com/parseablehq/parseable +[prql-query]: https://github.com/prql/prql-query +[qv]: https://github.com/timvw/qv +[roapi]: https://github.com/roapi/roapi +[seafowl]: https://github.com/splitgraph/seafowl +[synnada]: https://synnada.ai/ +[tensorbase]: https://github.com/tensorbase/tensorbase +[vegafusion]: https://vegafusion.io/ +[zincobserve]: https://github.com/zinclabs/zincobserve "if you know of another project, please submit a PR to add a link!" + +## Integrations and Extensions + +There are a number of community projects that extend DataFusion or +provide integrations with other systems. + +### Language Bindings + +- [datafusion-c](https://github.com/datafusion-contrib/datafusion-c) +- [datafusion-python](https://github.com/apache/arrow-datafusion-python) +- [datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby) +- [datafusion-java](https://github.com/datafusion-contrib/datafusion-java) + +### Integrations + +- [datafusion-bigtable](https://github.com/datafusion-contrib/datafusion-bigtable) +- [datafusion-catalogprovider-glue](https://github.com/datafusion-contrib/datafusion-catalogprovider-glue) + ## Why DataFusion? - _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast. 
diff --git a/docs/source/user-guide/library.md b/docs/source/user-guide/library.md deleted file mode 100644 index c7cc1ec425ef..000000000000 --- a/docs/source/user-guide/library.md +++ /dev/null @@ -1,127 +0,0 @@ - - -# Using DataFusion as a library - -## Create a new project - -```shell -cargo new hello_datafusion -``` - -```shell -$ cd hello_datafusion -$ tree . -. -├── Cargo.toml -└── src - └── main.rs - -1 directory, 2 files -``` - -## Default Configuration - -DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/). - -To get started, add the following to your `Cargo.toml` file: - -```toml -[dependencies] -datafusion = "11.0" -``` - -## Create a main function - -Update the main.rs file with your first datafusion application based on [Example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html) - -```rust -use datafusion::prelude::*; - -#[tokio::main] -async fn main() -> datafusion::error::Result<()> { - // register the table - let ctx = SessionContext::new(); - ctx.register_csv("test", "", CsvReadOptions::new()).await?; - - // create a plan to run a SQL query - let df = ctx.sql("SELECT * FROM test").await?; - - // execute and print results - df.show().await?; - Ok(()) -} -``` - -## Extensibility - -DataFusion is designed to be extensible at all points. To that end, you can provide your own custom: - -- [x] User Defined Functions (UDFs) -- [x] User Defined Aggregate Functions (UDAFs) -- [x] User Defined Table Source (`TableProvider`) for tables -- [x] User Defined `Optimizer` passes (plan rewrites) -- [x] User Defined `LogicalPlan` nodes -- [x] User Defined `ExecutionPlan` nodes - -## Rust Version Compatibility - -This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler. - -## Optimized Configuration - -For an optimized build several steps are required. 
First, use the below in your `Cargo.toml`. It is -worth noting that using the settings in the `[profile.release]` section will significantly increase the build time. - -```toml -[dependencies] -datafusion = { version = "11.0" , features = ["simd"]} -tokio = { version = "^1.0", features = ["rt-multi-thread"] } -snmalloc-rs = "0.2" - -[profile.release] -lto = true -codegen-units = 1 -``` - -Then, in `main.rs.` update the memory allocator with the below after your imports: - -```rust -use datafusion::prelude::*; - -#[global_allocator] -static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc; - -async fn main() -> datafusion::error::Result<()> { - Ok(()) -} -``` - -Finally, in order to build with the `simd` optimization `cargo nightly` is required. - -```shell -rustup toolchain install nightly -``` - -Based on the instruction set architecture you are building on you will want to configure the `target-cpu` as well, ideally -with `native` or at least `avx2`. - -``` -RUSTFLAGS='-C target-cpu=native' cargo +nightly run --release -``` diff --git a/docs/source/user-guide/sql/ddl.md b/docs/source/user-guide/sql/ddl.md index 29a156bd01b1..8de29b4e50ff 100644 --- a/docs/source/user-guide/sql/ddl.md +++ b/docs/source/user-guide/sql/ddl.md @@ -98,7 +98,7 @@ WITH ORDER (sort_expression1 [ASC | DESC] [NULLS { FIRST | LAST }] [, sort_expression2 [ASC | DESC] [NULLS { FIRST | LAST }] ...]) ``` -#### Cautions When Using the WITH ORDER Clause +### Cautions When Using the WITH ORDER Clause - It's important to understand that using the `WITH ORDER` clause in the `CREATE EXTERNAL TABLE` statement only specifies the order in which the data should be read from the external file. If the data in the file is not already sorted according to the specified order, then the results may not be correct. 
diff --git a/docs/source/user-guide/users.md b/docs/source/user-guide/users.md deleted file mode 100644 index 0d259c8de3e2..000000000000 --- a/docs/source/user-guide/users.md +++ /dev/null @@ -1,67 +0,0 @@ - - -# Known Users - -Here are some of the projects known to use DataFusion: - -- [Ballista](https://github.com/apache/arrow-ballista) Distributed SQL Query Engine -- [Blaze](https://github.com/blaze-init/blaze) Spark accelerator with DataFusion at its core -- [CeresDB](https://github.com/CeresDB/ceresdb) Distributed Time-Series Database -- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust) -- [CnosDB](https://github.com/cnosdb/cnosdb) Open Source Distributed Time Series Database -- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust) -- [Dask SQL](https://github.com/dask-contrib/dask-sql) Distributed SQL query engine in Python -- [datafusion-tui](https://github.com/datafusion-contrib/datafusion-tui) Text UI for DataFusion -- [delta-rs](https://github.com/delta-io/delta-rs) Native Rust implementation of Delta Lake -- [Flock](https://github.com/flock-lab/flock) -- [GreptimeDB](https://github.com/GreptimeTeam/greptimedb) Open Source & Cloud Native Distributed Time Series Database -- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database -- [Kamu](https://github.com/kamu-data/kamu-cli/) Planet-scale streaming data pipeline -- [Parseable](https://github.com/parseablehq/parseable) Log storage and observability platform -- [qv](https://github.com/timvw/qv) Quickly view your data -- [ROAPI](https://github.com/roapi/roapi) -- [Seafowl](https://github.com/splitgraph/seafowl) CDN-friendly analytical database -- [Synnada](https://synnada.ai/) Streaming-first framework for data products -- [Tensorbase](https://github.com/tensorbase/tensorbase) -- [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar -- [ZincObserve](https://github.com/zinclabs/zincobserve) 
Distributed cloud native observability platform - -[ballista]: https://github.com/apache/arrow-ballista -[blaze]: https://github.com/blaze-init/blaze -[ceresdb]: https://github.com/CeresDB/ceresdb -[cloudfuse buzz]: https://github.com/cloudfuse-io/buzz-rust -[cnosdb]: https://github.com/cnosdb/cnosdb -[cube store]: https://github.com/cube-js/cube.js/tree/master/rust -[dask sql]: https://github.com/dask-contrib/dask-sql -[datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui -[delta-rs]: https://github.com/delta-io/delta-rs -[flock]: https://github.com/flock-lab/flock -[kamu]: https://github.com/kamu-data/kamu-cli -[greptime db]: https://github.com/GreptimeTeam/greptimedb -[influxdb iox]: https://github.com/influxdata/influxdb_iox -[parseable]: https://github.com/parseablehq/parseable -[prql-query]: https://github.com/prql/prql-query -[qv]: https://github.com/timvw/qv -[roapi]: https://github.com/roapi/roapi -[seafowl]: https://github.com/splitgraph/seafowl -[synnada]: https://synnada.ai/ -[tensorbase]: https://github.com/tensorbase/tensorbase -[vegafusion]: https://vegafusion.io/ -[zincobserve]: https://github.com/zinclabs/zincobserve "if you know of another project, please submit a PR to add a link!"