Skip to content

Commit d059dd3

Browse files
tshauckalamb
andauthored
docs: Add Expr library developer page (#7359)
* docs: fill out expr page * fix: fixup a few issues * docs: links to examples * fix: impv variable name * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * fix: feedback updates * docs: update w/ feedback --------- Co-authored-by: Andrew Lamb <[email protected]>
1 parent d8ca0c2 commit d059dd3

File tree

2 files changed

+266
-2
lines changed

2 files changed

+266
-2
lines changed

docs/source/library-user-guide/adding-udfs.md

Lines changed: 103 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -19,4 +19,106 @@
1919

2020
# Adding User Defined Functions: Scalar/Window/Aggregate
2121

22-
Coming Soon
22+
User Defined Functions (UDFs) are functions that can be used in the context of DataFusion execution.
23+
24+
This page covers how to add UDFs to DataFusion. In particular, it covers how to add Scalar, Window, and Aggregate UDFs.
25+
26+
| UDF Type | Description | Example |
27+
| --------- | ---------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
28+
| Scalar | A function that takes a row of data and returns a single value. | [simple_udf.rs](../../../datafusion-examples/examples/simple_udf.rs) |
29+
| Window | A function that takes a row of data and returns a single value, but also has access to the rows around it. | [simple_udwf.rs](../../../datafusion-examples/examples/simple_udwf.rs) |
30+
| Aggregate | A function that takes a group of rows and returns a single value. | [simple_udaf.rs](../../../datafusion-examples/examples/simple_udaf.rs) |
31+
32+
First we'll talk about adding an Scalar UDF end-to-end, then we'll talk about the differences between the different types of UDFs.
33+
34+
## Adding a Scalar UDF
35+
36+
A Scalar UDF is a function that takes a row of data and returns a single value. For example, this function takes a single i64 and returns a single i64 with 1 added to it:
37+
38+
```rust
39+
use std::sync::Arc;
40+
41+
use arrow::array::{ArrayRef, Int64Array};
42+
use datafusion::common::Result;
43+
44+
use datafusion::common::cast::as_int64_array;
45+
46+
pub fn add_one(args: &[ArrayRef]) -> Result<ArrayRef> {
47+
// Error handling omitted for brevity
48+
49+
let i64s = as_int64_array(&args[0])?;
50+
51+
let new_array = i64s
52+
.iter()
53+
.map(|array_elem| array_elem.map(|value| value + 1))
54+
.collect::<Int64Array>();
55+
56+
Ok(Arc::new(new_array))
57+
}
58+
```
59+
60+
For brevity, we'll skipped some error handling, but e.g. you may want to check that `args.len()` is the expected number of arguments.
61+
62+
This "works" in isolation, i.e. if you have a slice of `ArrayRef`s, you can call `add_one` and it will return a new `ArrayRef` with 1 added to each value.
63+
64+
```rust
65+
let input = vec![Some(1), None, Some(3)];
66+
let input = Arc::new(Int64Array::from(input)) as ArrayRef;
67+
68+
let result = add_one(&[input]).unwrap();
69+
let result = result.as_any().downcast_ref::<Int64Array>().unwrap();
70+
71+
assert_eq!(result, &Int64Array::from(vec![Some(2), None, Some(4)]));
72+
```
73+
74+
The challenge however is that DataFusion doesn't know about this function. We need to register it with DataFusion so that it can be used in the context of a query.
75+
76+
### Registering a Scalar UDF
77+
78+
To register a Scalar UDF, you need to wrap the function implementation in a `ScalarUDF` struct and then register it with the `SessionContext`. DataFusion provides the `create_udf` and `make_scalar_function` helper functions to make this easier.
79+
80+
```rust
81+
let udf = create_udf(
82+
"add_one",
83+
vec![DataType::Int64],
84+
Arc::new(DataType::Int64),
85+
Volatility::Immutable,
86+
make_scalar_function(add_one),
87+
);
88+
```
89+
90+
A few things to note:
91+
92+
- The first argument is the name of the function. This is the name that will be used in SQL queries.
93+
- The second argument is a vector of `DataType`s. This is the list of argument types that the function accepts. I.e. in this case, the function accepts a single `Int64` argument.
94+
- The third argument is the return type of the function. I.e. in this case, the function returns an `Int64`.
95+
- The fourth argument is the volatility of the function. In short, this is used to determine if the function's performance can be optimized in some situations. In this case, the function is `Immutable` because it always returns the same value for the same input. A random number generator would be `Volatile` because it returns a different value for the same input.
96+
- The fifth argument is the function implementation. This is the function that we defined above.
97+
98+
That gives us a `ScalarUDF` that we can register with the `SessionContext`:
99+
100+
```rust
101+
let mut ctx = SessionContext::new();
102+
103+
ctx.register_udf(udf);
104+
```
105+
106+
At this point, you can use the `add_one` function in your query:
107+
108+
```rust
109+
let sql = "SELECT add_one(1)";
110+
111+
let df = ctx.sql(&sql).await.unwrap();
112+
```
113+
114+
## Adding a Window UDF
115+
116+
Scalar UDFs are functions that take a row of data and return a single value. Window UDFs are similar, but they also have access to the rows around them. Access to the the proximal rows is helpful, but adds some complexity to the implementation.
117+
118+
Body coming soon.
119+
120+
## Adding an Aggregate UDF
121+
122+
Aggregate UDFs are functions that take a group of rows and return a single value. These are akin to SQL's `SUM` or `COUNT` functions.
123+
124+
Body coming soon.

docs/source/library-user-guide/working-with-exprs.md

Lines changed: 163 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -19,4 +19,166 @@
1919

2020
# Working with Exprs
2121

22-
Coming Soon
22+
<!-- https://github.com/apache/arrow-datafusion/issues/7304 -->
23+
24+
`Expr` is short for "expression". It is a core abstraction in DataFusion for representing a computation, and follows the standard "expression tree" abstraction found in most compilers and databases.
25+
26+
For example, the SQL expression `a + b` would be represented as an `Expr` with a `BinaryExpr` variant. A `BinaryExpr` has a left and right `Expr` and an operator.
27+
28+
As another example, the SQL expression `a + b * c` would be represented as an `Expr` with a `BinaryExpr` variant. The left `Expr` would be `a` and the right `Expr` would be another `BinaryExpr` with a left `Expr` of `b` and a right `Expr` of `c`. As a classic expression tree, this would look like:
29+
30+
```text
31+
┌────────────────────┐
32+
│ BinaryExpr │
33+
│ op: + │
34+
└────────────────────┘
35+
▲ ▲
36+
┌───────┘ └────────────────┐
37+
│ │
38+
┌────────────────────┐ ┌────────────────────┐
39+
│ Expr::Col │ │ BinaryExpr │
40+
│ col: a │ │ op: * │
41+
└────────────────────┘ └────────────────────┘
42+
▲ ▲
43+
┌────────┘ └─────────┐
44+
│ │
45+
┌────────────────────┐ ┌────────────────────┐
46+
│ Expr::Col │ │ Expr::Col │
47+
│ col: b │ │ col: c │
48+
└────────────────────┘ └────────────────────┘
49+
```
50+
51+
As the writer of a library, you may want to use or create `Expr`s to represent computations that you want to perform. This guide will walk you through how to make your own scalar UDF as an `Expr` and how to rewrite `Expr`s to inline the simple UDF.
52+
53+
There are also executable examples for working with `Expr`s:
54+
55+
- [rewrite_expr.rs](../../../datafusion-examples/examples/catalog.rs)
56+
- [expr_api.rs](../../../datafusion-examples/examples/expr_api.rs)
57+
58+
## A Scalar UDF Example
59+
60+
We'll use a `ScalarUDF` expression as our example. This necessitates implementing an actual UDF, and for ease we'll use the same example from the [adding UDFs](./adding-udfs.md) guide.
61+
62+
So assuming you've written that function, you can use it to create an `Expr`:
63+
64+
```rust
65+
let add_one_udf = create_udf(
66+
"add_one",
67+
vec![DataType::Int64],
68+
Arc::new(DataType::Int64),
69+
Volatility::Immutable,
70+
make_scalar_function(add_one), // <-- the function we wrote
71+
);
72+
73+
// make the expr `add_one(5)`
74+
let expr = add_one_udf.call(vec![lit(5)]);
75+
76+
// make the expr `add_one(my_column)`
77+
let expr = add_one_udf.call(vec![col("my_column")]);
78+
```
79+
80+
If you'd like to learn more about `Expr`s, before we get into the details of creating and rewriting them, you can read the [expression user-guide](./../user-guide/expressions.md).
81+
82+
## Rewriting Exprs
83+
84+
Rewriting Expressions is the process of taking an `Expr` and transforming it into another `Expr`. This is useful for a number of reasons, including:
85+
86+
- Simplifying `Expr`s to make them easier to evaluate
87+
- Optimizing `Expr`s to make them faster to evaluate
88+
- Converting `Expr`s to other forms, e.g. converting a `BinaryExpr` to a `CastExpr`
89+
90+
In our example, we'll use rewriting to update our `add_one` UDF, to be rewritten as a `BinaryExpr` with a `Literal` of 1. We're effectively inlining the UDF.
91+
92+
### Rewriting with `transform`
93+
94+
To implement the inlining, we'll need to write a function that takes an `Expr` and returns a `Result<Expr>`. If the expression is _not_ to be rewritten `Transformed::No` is used to wrap the original `Expr`. If the expression _is_ to be rewritten, `Transformed::Yes` is used to wrap the new `Expr`.
95+
96+
```rust
97+
fn rewrite_add_one(expr: Expr) -> Result<Expr> {
98+
expr.transform(&|expr| {
99+
Ok(match expr {
100+
Expr::ScalarUDF(scalar_fun) if scalar_fun.fun.name == "add_one" => {
101+
let input_arg = scalar_fun.args[0].clone();
102+
let new_expression = input_arg + lit(1i64);
103+
104+
Transformed::Yes(new_expression)
105+
}
106+
_ => Transformed::No(expr),
107+
})
108+
})
109+
}
110+
```
111+
112+
### Creating an `OptimizerRule`
113+
114+
In DataFusion, an `OptimizerRule` is a trait that supports rewriting`Expr`s that appear in various parts of the `LogicalPlan`. It follows DataFusion's general mantra of trait implementations to drive behavior.
115+
116+
We'll call our rule `AddOneInliner` and implement the `OptimizerRule` trait. The `OptimizerRule` trait has two methods:
117+
118+
- `name` - returns the name of the rule
119+
- `try_optimize` - takes a `LogicalPlan` and returns an `Option<LogicalPlan>`. If the rule is able to optimize the plan, it returns `Some(LogicalPlan)` with the optimized plan. If the rule is not able to optimize the plan, it returns `None`.
120+
121+
```rust
122+
struct AddOneInliner {}
123+
124+
impl OptimizerRule for AddOneInliner {
125+
fn name(&self) -> &str {
126+
"add_one_inliner"
127+
}
128+
129+
fn try_optimize(
130+
&self,
131+
plan: &LogicalPlan,
132+
config: &dyn OptimizerConfig,
133+
) -> Result<Option<LogicalPlan>> {
134+
// Map over the expressions and rewrite them
135+
let new_expressions = plan
136+
.expressions()
137+
.into_iter()
138+
.map(|expr| rewrite_add_one(expr))
139+
.collect::<Result<Vec<_>>>()?;
140+
141+
let inputs = plan.inputs().into_iter().cloned().collect::<Vec<_>>();
142+
143+
let plan = plan.with_new_exprs(&new_expressions, &inputs);
144+
145+
plan.map(Some)
146+
}
147+
}
148+
```
149+
150+
Note the use of `rewrite_add_one` which is mapped over `plan.expressions()` to rewrite the expressions, then `plan.with_new_exprs` is used to create a new `LogicalPlan` with the rewritten expressions.
151+
152+
We're almost there. Let's just test our rule works properly.
153+
154+
## Testing the Rule
155+
156+
Testing the rule is fairly simple, we can create a SessionState with our rule and then create a DataFrame and run a query. The logical plan will be optimized by our rule.
157+
158+
```rust
159+
use datafusion::prelude::*;
160+
161+
let rules = Arc::new(AddOneInliner {});
162+
let state = ctx.state().with_optimizer_rules(vec![rules]);
163+
164+
let ctx = SessionContext::with_state(state);
165+
ctx.register_udf(add_one);
166+
167+
let sql = "SELECT add_one(1) AS added_one";
168+
let plan = ctx.sql(sql).await?.logical_plan();
169+
170+
println!("{:?}", plan);
171+
```
172+
173+
This results in the following output:
174+
175+
```text
176+
Projection: Int64(1) + Int64(1) AS added_one
177+
EmptyRelation
178+
```
179+
180+
I.e. the `add_one` UDF has been inlined into the projection.
181+
182+
## Conclusion
183+
184+
In this guide, we've seen how to create `Expr`s programmatically and how to rewrite them. This is useful for simplifying and optimizing `Expr`s. We've also seen how to test our rule to ensure it works properly.

0 commit comments

Comments
 (0)