Merged
7 changes: 4 additions & 3 deletions docs/source/contributor-guide/architecture.md
@@ -20,7 +20,8 @@
# Architecture

DataFusion's code structure and organization is described in the
[Crate Documentation], to keep it as close to the source as
possible.
[crates.io documentation], to keep it as close to the source as
possible. You can find the most up-to-date version in the [source code].
Member

What do you think about hosting the latest documentation generated from the source code on GitHub Pages (or another static page host)? Like greptimedb.rs, which is generated from https://github.com/GreptimeTeam/greptimedb/deployments/activity_log?environment=github-pages

Contributor Author

Filed #5981


[crate documentation]: https://docs.rs/datafusion/latest/datafusion/index.html#code-organization
[crates.io documentation]: https://docs.rs/datafusion/latest/datafusion/index.html#code-organization
[source code]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/lib.rs
12 changes: 3 additions & 9 deletions docs/source/index.rst
@@ -37,10 +37,9 @@ community.
:maxdepth: 1
:caption: Links

Issue tracker <https://github.com/apache/arrow-datafusion/issues>
Github and Issue Tracker <https://github.com/apache/arrow-datafusion>
crates.io <https://crates.io/crates/datafusion>
API Docs <https://docs.rs/datafusion/21.1.0/datafusion/>
Github <https://github.com/apache/arrow-datafusion>
API Docs <https://docs.rs/datafusion/latest/datafusion/>
Code of conduct <https://github.com/apache/arrow-datafusion/blob/main/CODE_OF_CONDUCT.md>

.. _toc.guide:
@@ -50,22 +49,17 @@ community.

user-guide/introduction
user-guide/example-usage
user-guide/users
Contributor Author

I consolidated the content of these pages into other pages

user-guide/comparison
user-guide/integration
user-guide/library
user-guide/cli
user-guide/dataframe
user-guide/expressions
user-guide/sql/index
user-guide/configs
user-guide/faq
Rust Crate Documentation <https://docs.rs/crate/datafusion/>
Contributor Author

This was redundant with the crates.io link above


.. _toc.contributor-guide:

.. toctree::
:maxdepth: 2
Contributor Author

This stops listing H2 headings (`##`) in the main table of contents

:maxdepth: 1
:caption: Contributor Guide

contributor-guide/index
2 changes: 1 addition & 1 deletion docs/source/user-guide/cli.md
@@ -17,7 +17,7 @@
under the License.
-->

# DataFusion Command-line SQL Utility
# `datafusion-cli`

The DataFusion CLI is a command-line interactive SQL utility for executing
queries against any supported data files. It is a convenient way to
52 changes: 0 additions & 52 deletions docs/source/user-guide/comparison.md

This file was deleted.

61 changes: 59 additions & 2 deletions docs/source/user-guide/example-usage.md
@@ -26,7 +26,7 @@ In this example some simple processing is performed on the [`example.csv`](../..
Add the following to your `Cargo.toml` file:

```toml
datafusion = "11.0"
datafusion = "22"
tokio = "1.0"
```

@@ -81,7 +81,7 @@ async fn main() -> datafusion::error::Result<()> {
+---+--------+
```

# Identifiers and Capitalization
## Identifiers and Capitalization

Please be aware that unquoted identifiers are effectively made lower-case in SQL, so if your CSV file has column names with capital letters (e.g. `Name`) you must put the column name in double quotes or the examples won't work.
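
The folding rule described above can be sketched in plain Rust. This is only an illustration of the behavior, not DataFusion's parser: `normalize_ident` and `quote_ident` are hypothetical helper names introduced here, and the sketch assumes the usual SQL convention that a double quote inside a quoted identifier is written as two double quotes.

```rust
/// Hypothetical helper (not part of DataFusion): mimics how a SQL parser
/// treats an identifier. Unquoted names are folded to lower case; names
/// wrapped in double quotes keep their exact spelling.
fn normalize_ident(raw: &str) -> String {
    let is_quoted = raw.len() >= 2 && raw.starts_with('"') && raw.ends_with('"');
    if is_quoted {
        // Strip the surrounding quotes and undo the doubling of embedded quotes.
        raw[1..raw.len() - 1].replace("\"\"", "\"")
    } else {
        // Unquoted identifiers lose their capitalization.
        raw.to_lowercase()
    }
}

/// Hypothetical helper: wraps a column name in double quotes so that its
/// capitalization survives, doubling any embedded quote characters.
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}
```

With these helpers, `normalize_ident("Name")` yields `name`, while `normalize_ident(&quote_ident("Name"))` yields `Name`, which is why a column called `Name` has to be written as `"Name"` in a query.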

@@ -141,3 +141,60 @@ async fn main() -> datafusion::error::Result<()> {
| 1 | 2 |
+---+--------+
```

## Extensibility

DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:

- [x] User Defined Functions (UDFs)
- [x] User Defined Aggregate Functions (UDAFs)
- [x] User Defined Table Source (`TableProvider`) for tables
- [x] User Defined `Optimizer` passes (plan rewrites)
- [x] User Defined `LogicalPlan` nodes
- [x] User Defined `ExecutionPlan` nodes

## Rust Version Compatibility

This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler.

## Optimized Configuration

An optimized build requires several steps. First, add the following to your `Cargo.toml`. Note that
the settings in the `[profile.release]` section significantly increase build time.

```toml
[dependencies]
datafusion = { version = "22.0", features = ["simd"] }
tokio = { version = "^1.0", features = ["rt-multi-thread"] }
snmalloc-rs = "0.2"

[profile.release]
lto = true
codegen-units = 1
```

Then, in `main.rs`, set the global memory allocator with the following after your imports:

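```rust
use datafusion::prelude::*;

// Use snmalloc as the global allocator for the whole program.
#[global_allocator]
static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;

// `async fn main` needs an async runtime; tokio provides one here.
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    Ok(())
}
```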

Finally, building with the `simd` optimization requires the nightly Rust toolchain:

```shell
rustup toolchain install nightly
```

Based on the instruction set architecture you are building on you will want to configure the `target-cpu` as well, ideally
with `native` or at least `avx2`.

```shell
RUSTFLAGS='-C target-cpu=native' cargo +nightly run --release
```
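
To check whether `avx2` is a sensible choice on a given machine, a small standard-library-only check can help. This is a minimal sketch and not part of the DataFusion docs: `host_supports_avx2` and `built_with_avx2` are hypothetical names, and the check only applies to x86/x86_64 hosts.

```rust
/// Hypothetical helper: reports whether the CPU running this program
/// supports AVX2 (runtime detection from the standard library).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn host_supports_avx2() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// On other architectures AVX2 does not exist, so the answer is always false.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn host_supports_avx2() -> bool {
    false
}

/// Hypothetical helper: reports whether the compiler was allowed to emit
/// AVX2 instructions for this binary, for example because it was built
/// with `-C target-cpu=native` on an AVX2-capable machine.
fn built_with_avx2() -> bool {
    cfg!(target_feature = "avx2")
}
```

If `host_supports_avx2()` returns true but `built_with_avx2()` returns false, the binary was likely compiled without the `RUSTFLAGS` setting shown above.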
2 changes: 1 addition & 1 deletion docs/source/user-guide/expressions.md
@@ -17,7 +17,7 @@
under the License.
-->

# Expressions
# Expression API

DataFrame methods such as `select` and `filter` accept one or more logical expressions and there are many functions
available for creating logical expressions. These are documented below.
34 changes: 34 additions & 0 deletions docs/source/user-guide/faq.md
@@ -29,3 +29,37 @@ model and computational kernels. It is designed to run within a single process,
for parallel query execution.

[Ballista](https://github.com/apache/arrow-ballista) is a distributed compute platform built on DataFusion.

# How does DataFusion Compare with `XYZ`?

When compared to similar systems, DataFusion typically is:

1. Targeted at developers, rather than end users / data scientists.
2. Designed to be embedded, rather than a complete file-based SQL system.
3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than a single company or individual.
4. Implemented in `Rust`, rather than `C/C++`.

Here is a comparison with similar projects that may help you understand
when DataFusion might be suitable or unsuitable for your needs:

- [DuckDB](https://www.duckdb.org) is an open source, in-process analytic database.
Like DataFusion, it supports very fast execution, both from its custom file format
and directly from Parquet files. Unlike DataFusion, it is written in C/C++ and it
is primarily used directly by users as a serverless database and query system rather
than as a library for building such database systems.

- [Polars](http://pola.rs): Polars is one of the fastest DataFrame
libraries at the time of writing. Like DataFusion, it is also
written in Rust and uses the Apache Arrow memory model, but unlike
DataFusion it is not designed with as many extension points.

- [Facebook Velox](https://github.com/facebookincubator/velox)
is an execution engine. Like DataFusion, Velox aims to
provide a reusable foundation for building database-like systems. Unlike DataFusion,
it is written in C/C++ and does not include a SQL frontend or planning / optimization
framework.

- [Databend](https://github.com/datafuselabs/databend) is a complete
database system. Like DataFusion it is also written in Rust and
utilizes the Apache Arrow memory model, but unlike DataFusion it
targets end-users rather than developers of other database systems.
35 changes: 0 additions & 35 deletions docs/source/user-guide/integration.md

This file was deleted.

68 changes: 67 additions & 1 deletion docs/source/user-guide/introduction.md
@@ -17,7 +17,7 @@
under the License.
-->

# Features, and Usecases
# Introduction

DataFusion is a very fast, extensible query engine for building
high-quality data-centric systems in [Rust](http://rustlang.org),
@@ -66,6 +66,72 @@ features, and avoid reimplementing general (but still necessary)
features such as an expression representation, standard optimizations,
execution plans, file format support, etc.

## Known Users

Here are some of the projects known to use DataFusion:

- [Ballista](https://github.com/apache/arrow-ballista) Distributed SQL Query Engine
- [Blaze](https://github.com/blaze-init/blaze) Spark accelerator with DataFusion at its core
- [CeresDB](https://github.com/CeresDB/ceresdb) Distributed Time-Series Database
- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
- [CnosDB](https://github.com/cnosdb/cnosdb) Open Source Distributed Time Series Database
- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
- [Dask SQL](https://github.com/dask-contrib/dask-sql) Distributed SQL query engine in Python
- [datafusion-tui](https://github.com/datafusion-contrib/datafusion-tui) Text UI for DataFusion
- [delta-rs](https://github.com/delta-io/delta-rs) Native Rust implementation of Delta Lake
- [Flock](https://github.com/flock-lab/flock)
- [GreptimeDB](https://github.com/GreptimeTeam/greptimedb) Open Source & Cloud Native Distributed Time Series Database
- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
- [Kamu](https://github.com/kamu-data/kamu-cli/) Planet-scale streaming data pipeline
- [Parseable](https://github.com/parseablehq/parseable) Log storage and observability platform
- [qv](https://github.com/timvw/qv) Quickly view your data
- [ROAPI](https://github.com/roapi/roapi)
- [Seafowl](https://github.com/splitgraph/seafowl) CDN-friendly analytical database
- [Synnada](https://synnada.ai/) Streaming-first framework for data products
- [Tensorbase](https://github.com/tensorbase/tensorbase)
- [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar
- [ZincObserve](https://github.com/zinclabs/zincobserve) Distributed cloud native observability platform

[ballista]: https://github.com/apache/arrow-ballista
[blaze]: https://github.com/blaze-init/blaze
[ceresdb]: https://github.com/CeresDB/ceresdb
[cloudfuse buzz]: https://github.com/cloudfuse-io/buzz-rust
[cnosdb]: https://github.com/cnosdb/cnosdb
[cube store]: https://github.com/cube-js/cube.js/tree/master/rust
[dask sql]: https://github.com/dask-contrib/dask-sql
[datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui
[delta-rs]: https://github.com/delta-io/delta-rs
[flock]: https://github.com/flock-lab/flock
[kamu]: https://github.com/kamu-data/kamu-cli
[greptime db]: https://github.com/GreptimeTeam/greptimedb
[influxdb iox]: https://github.com/influxdata/influxdb_iox
[parseable]: https://github.com/parseablehq/parseable
[prql-query]: https://github.com/prql/prql-query
[qv]: https://github.com/timvw/qv
[roapi]: https://github.com/roapi/roapi
[seafowl]: https://github.com/splitgraph/seafowl
[synnada]: https://synnada.ai/
[tensorbase]: https://github.com/tensorbase/tensorbase
[vegafusion]: https://vegafusion.io/
[zincobserve]: https://github.com/zinclabs/zincobserve "if you know of another project, please submit a PR to add a link!"

## Integrations and Extensions

There are a number of community projects that extend DataFusion or
provide integrations with other systems.

### Language Bindings

- [datafusion-c](https://github.com/datafusion-contrib/datafusion-c)
- [datafusion-python](https://github.com/apache/arrow-datafusion-python)
- [datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby)
- [datafusion-java](https://github.com/datafusion-contrib/datafusion-java)

### Integrations

- [datafusion-bigtable](https://github.com/datafusion-contrib/datafusion-bigtable)
- [datafusion-catalogprovider-glue](https://github.com/datafusion-contrib/datafusion-catalogprovider-glue)

## Why DataFusion?

- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.