Contributor

@alamb alamb commented Oct 20, 2021

Which issue does this PR close?

Closes #1070.

Rationale for this change

See #1070

This lets DataFusion evaluate constant expressions like the following once at plan time, rather than once per row:

  • 5 < 10 --> true
  • substr('foobar', 1, 2) --> 'fo'

This is useful when predicates are created by some other system (e.g. a GUI) or other optimization passes

It works for all expression types that can be evaluated, including simple (e.g. +), complex (e.g. CASE), and user-defined functions.
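As a rough illustration of what "evaluating at plan time" means, here is a minimal sketch on a toy expression tree. The `Expr` enum and `fold` function below are hypothetical stand-ins, not DataFusion's types, and the sketch hand-codes two operators, whereas this PR reuses the existing expression evaluator instead of per-operator rules.

```rust
// Toy expression tree; a hypothetical stand-in for DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Int(i64),
    Bool(bool),
    Column(String),
    Add(Box<Expr>, Box<Expr>),
    Lt(Box<Expr>, Box<Expr>),
}

/// Fold constant subexpressions bottom-up. Anything that depends on a
/// column is rebuilt with its (possibly folded) children and left in place.
fn fold(expr: Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => match (fold(*l), fold(*r)) {
            (Expr::Int(a), Expr::Int(b)) => match a.checked_add(b) {
                Some(sum) => Expr::Int(sum),
                // On overflow, leave the expression unfolded.
                None => Expr::Add(Box::new(Expr::Int(a)), Box::new(Expr::Int(b))),
            },
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        Expr::Lt(l, r) => match (fold(*l), fold(*r)) {
            (Expr::Int(a), Expr::Int(b)) => Expr::Bool(a < b),
            (l, r) => Expr::Lt(Box::new(l), Box::new(r)),
        },
        // Literals and columns are already as simple as they can be.
        other => other,
    }
}
```

With this sketch, `5 < 10` folds to `true`, while `a < 1 + 2` becomes `a < 3` because only the right-hand side is constant.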

Check out some examples within the datafusion-cli:

> explain select 1 + 2;
+---------------+--------------------------------------+
| plan_type     | plan                                 |
+---------------+--------------------------------------+
| logical_plan  | Projection: Int64(3)                 |
|               |   EmptyRelation                      |
| physical_plan | ProjectionExec: expr=[3 as Int64(3)] |
|               |   EmptyExec: produce_one_row=true    |
|               |                                      |
+---------------+--------------------------------------+
2 rows in set. Query took 0.005 seconds.
> explain select case when true then 'Monday' else 'Friday' end;
+---------------+-------------------------------------------------+
| plan_type     | plan                                            |
+---------------+-------------------------------------------------+
| logical_plan  | Projection: Utf8("Monday")                      |
|               |   EmptyRelation                                 |
| physical_plan | ProjectionExec: expr=[Monday as Utf8("Monday")] |
|               |   EmptyExec: produce_one_row=true               |
|               |                                                 |
+---------------+-------------------------------------------------+
2 rows in set. Query took 0.002 seconds.

Note there is also prior work from @pjmore in #1128, which is partially included in this PR (see #1128 (comment))

What changes are included in this PR?

  1. ConstEvaluator expression rewriter
  2. Integrate ConstEvaluator into ConstantFolding optimizer pass
  3. Additional test coverage

Follow on work

Are there any user-facing changes?

  1. New ConstEvaluator in optimizer utilities
  2. More expressions are folded

@alamb alamb force-pushed the alamb/constant_eval branch from 2fdadd9 to 55b6c1e Compare October 21, 2021 13:56
.query_execution_start_time
.timestamp_nanos(),
))),
Expr::ScalarFunction {
Contributor Author

The rewrite of to_timestamp no longer needs to be special cased -- it is now handled, along with other functions, by ConstEvaluator

impl<'a> ExprRewriter for Simplifier<'a> {
    /// rewrite the expression simplifying any constant expressions
    fn mutate(&mut self, expr: Expr) -> Result<Expr> {
        let new_expr = match expr {
Contributor Author

The code to handle expressions like true != true should also be handled by ConstEvaluator, but it turns out support for such degenerate expressions is missing from the Expr evaluation code. I will fix this, but plan to do so as a follow-on PR to keep this PR reasonable to review

.build()
.unwrap();

let expected = "Projection: totimestamp(Utf8(\"I\'M NOT A TIMESTAMP\"))\
Contributor Author

Previously such plans would error at runtime, now they error at plan time which I think is a reasonable change in behavior (earlier errors are better errors?)

It might have unfortunate edge cases where a plan like SELECT to_timestamp('FOOO') from table_with_no_rows previously would not error (because no rows were ever evaluated) but now will error at plan time.

However, I don't think this is an important difference
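The empty-table edge case can be illustrated with a small sketch. Everything here is hypothetical: `parse_number` stands in for a scalar function such as to_timestamp that can fail on bad input, and the two `eval_*` functions model per-row evaluation versus plan-time folding in the simplest possible way.

```rust
// Hypothetical stand-in for a scalar function that can fail on bad input.
fn parse_number(s: &str) -> Result<i64, String> {
    s.parse::<i64>()
        .map_err(|e| format!("cannot parse {:?}: {}", s, e))
}

/// Per-row evaluation: the function runs once per input row, so an
/// empty input never reaches the failing call.
fn eval_per_row(arg: &str, num_rows: usize) -> Result<Vec<i64>, String> {
    (0..num_rows).map(|_| parse_number(arg)).collect()
}

/// Plan-time folding: the constant call is evaluated once up front,
/// regardless of how many rows the input has.
fn eval_folded(arg: &str, num_rows: usize) -> Result<Vec<i64>, String> {
    let folded = parse_number(arg)?;
    Ok(vec![folded; num_rows])
}
```

For valid input both strategies return the same values; they differ only for invalid input over zero rows, where `eval_per_row` succeeds with an empty result and `eval_folded` reports the error.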

}

#[test]
fn to_timestamp_expr_no_arg() {
Contributor Author

Previously this plan would panic at runtime (because to_timestamp expects a single argument). Now it panics at plan time. Since to_timestamp() is ill-formed, I didn't think this was a valuable test, so I removed it

}

impl ExprRewriter for ConstEvaluator {
fn pre_visit(&mut self, expr: &Expr) -> Result<RewriteRecursion> {
Contributor Author

Here is the guts of the logic for expression evaluation. I am quite pleased with how concise it is
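One way such a pre_visit / mutate rewriter can work is sketched below on a toy tree. This is an illustration under assumptions, not the code in this PR: `Expr`, `ConstEval`, its `can_evaluate` stack, and the `evaluate` helper are all hypothetical. The idea is that pre_visit pushes a flag per node and clears the flags of every ancestor when it meets something that cannot be evaluated, and mutate pops the flag and replaces the node with a literal if it is still set.

```rust
// Toy expression tree; a hypothetical stand-in for DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

/// Hypothetical rewriter: one flag per node on the current path,
/// recording whether that node's whole subtree is constant.
struct ConstEval {
    can_evaluate: Vec<bool>,
}

impl ConstEval {
    /// Called on the way down the tree.
    fn pre_visit(&mut self, expr: &Expr) {
        self.can_evaluate.push(true);
        if matches!(expr, Expr::Column(_)) {
            // A column poisons this node and all of its ancestors.
            for flag in self.can_evaluate.iter_mut().rev() {
                if !*flag {
                    break; // everything above is already marked
                }
                *flag = false;
            }
        }
    }

    /// Called on the way back up, after the children were rewritten.
    fn mutate(&mut self, expr: Expr) -> Expr {
        let ok = self.can_evaluate.pop().expect("pre_visit pushed a flag");
        if ok {
            Expr::Literal(evaluate(&expr))
        } else {
            expr
        }
    }

    /// Drive the traversal: pre_visit, recurse into children, mutate.
    fn rewrite(&mut self, expr: Expr) -> Expr {
        self.pre_visit(&expr);
        let expr = match expr {
            Expr::Add(l, r) => {
                Expr::Add(Box::new(self.rewrite(*l)), Box::new(self.rewrite(*r)))
            }
            leaf => leaf,
        };
        self.mutate(expr)
    }
}

/// Stand-in for the real expression evaluator; only ever called on
/// subtrees that contain no columns.
fn evaluate(expr: &Expr) -> i64 {
    match expr {
        Expr::Literal(v) => *v,
        Expr::Add(l, r) => evaluate(l) + evaluate(r),
        Expr::Column(name) => panic!("column {} has no value at plan time", name),
    }
}
```

Rewriting `(1 + 2) + a` with this sketch folds the left subtree to a literal and leaves the outer addition in place, since the column clears the root's flag.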

let ctx_state = ExecutionContextState::new();
let input_schema = DFSchema::empty();

// The dummy column name is unused and doesn't matter as only scalar
Contributor Author

This comes directly from @pjmore ❤️

Expr::Alias(..) => false,
Expr::AggregateFunction { .. } => false,
Expr::AggregateUDF { .. } => false,
// TODO handle in constant propagator pass
Contributor Author

Will file a follow-on ticket

Comment on lines +754 to +755
// parenthesization matters: can't rewrite
// (rand() + 1) + 2 --> (rand() + 1) + 2
Contributor Author

This is an interesting one -- and something perhaps the Simplifier should do: use associative rules to rewrite (rand() + 1) + 2 into rand() + (1 + 2) and then let the evaluator constant fold it

Contributor Author

Tracked this improvement in #1162
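The reassociation idea from this thread could look roughly like the sketch below. It is hypothetical and not part of this PR: the toy `Expr` and `reassociate` only handle the single shape `(x + c1) + c2` with literal `c1` and `c2`. Note that reassociation is not always safe in general, for example with floating point addition or where overflow behavior matters.

```rust
// Toy expression tree; a hypothetical stand-in for DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Rand, // stands in for a volatile call such as rand()
    Add(Box<Expr>, Box<Expr>),
}

/// Rewrite `(x + c1) + c2` into `x + (c1 + c2)` when `c1` and `c2` are
/// literals, so that a later constant folding pass can evaluate `c1 + c2`.
/// Any other shape is returned unchanged.
fn reassociate(expr: Expr) -> Expr {
    match expr {
        Expr::Add(left, right) => match (*left, *right) {
            (Expr::Add(x, c1), Expr::Literal(c2)) => match *c1 {
                Expr::Literal(c1) => Expr::Add(
                    x,
                    Box::new(Expr::Add(
                        Box::new(Expr::Literal(c1)),
                        Box::new(Expr::Literal(c2)),
                    )),
                ),
                // Inner right operand is not a literal: rebuild as it was.
                other => Expr::Add(
                    Box::new(Expr::Add(x, Box::new(other))),
                    Box::new(Expr::Literal(c2)),
                ),
            },
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

After this rewrite the constant part `1 + 2` sits in its own subtree, which a constant folder can then replace with `3` without touching the volatile call.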

};
use datafusion::{execution::context::ExecutionContext, physical_plan::displayable};

/// A macro to assert that one string is contained within another with
Contributor Author

moved into test_utils

@alamb alamb changed the title from "Generic constant expression evaluation (WIP)" to "Generic constant expression evaluation" Oct 21, 2021
@alamb alamb marked this pull request as ready for review October 21, 2021 14:00

// TODO constant folder should be able to run again and fold
// this whole thing down
// TODO add ticket
Contributor Author

Suggested change
- // TODO add ticket
+ // https://github.com/apache/arrow-datafusion/issues/1160

Member

should we commit this issue into the PR branch?

@alamb
Contributor Author

alamb commented Oct 23, 2021

FYI @rdettai @pjmore @Dandandan @houqp this PR is ready for review

@alamb alamb removed the request for review from houqp October 23, 2021 11:44
@pjmore
Contributor

pjmore commented Oct 25, 2021

@rdettai made a good point on my initial implementation about how stable functions should be evaluatable in a constant context. Maybe the constant folding optimizer should have a flag that allows evaluating stable functions in a constant context when execution properties are provided? This might be useful even if it only makes the option available to the partition pruning logic. Doing this would allow the special case for now() in the expression simplifier to be removed. This doesn't matter much at the moment because now() is the only stable function, but it might start to matter if more stable built-ins are added. Besides the question of constant-evaluating stable functions, this looks good to me.

@houqp houqp requested a review from Dandandan October 25, 2021 07:06
fn volatility_ok(volatility: Volatility) -> bool {
    match volatility {
        Volatility::Immutable => true,
        // To evaluate stable functions, need ExecutionProps, see
Contributor

What is the reason not to provide ExecutionProps?

Contributor Author

My rationale was to minimize the changes to datafusion/src/optimizer/constant_folding.rs and keep this PR smaller to review. I agree it makes sense to do now() constant folding in the simplifier.

Filed #1175 to track

Contributor Author

In #1176 (built on this PR)
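The suggestion in this thread, allowing stable functions to be folded when execution properties are available, might be sketched as follows. This is a hypothetical illustration, not the change made in #1176: the toy `ExecutionProps` with its `query_start_nanos` field, the `Stable` and `Volatile` variants, and the extra `props` parameter are all assumptions for the example.

```rust
/// How a function's result may vary (toy version for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Volatility {
    Immutable, // same inputs always give the same output
    Stable,    // constant within one query, e.g. now()
    Volatile,  // may change on every call, e.g. rand()
}

/// Hypothetical per-query state; the field name is an assumption.
struct ExecutionProps {
    query_start_nanos: i64,
}

/// A stable function: returns the same value for the whole query.
fn now(props: &ExecutionProps) -> i64 {
    props.query_start_nanos
}

/// Decide whether a function with the given volatility may be folded.
/// Stable functions are only foldable when per-query state is supplied.
fn volatility_ok(volatility: Volatility, props: Option<&ExecutionProps>) -> bool {
    match volatility {
        Volatility::Immutable => true,
        Volatility::Stable => props.is_some(),
        Volatility::Volatile => false,
    }
}
```

Passing `None` reproduces the conservative behavior shown in the snippet under review, where only immutable functions are folded; passing `Some(&props)` additionally allows calls such as `now(&props)` to be replaced by a literal.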

@alamb
Contributor Author

alamb commented Oct 25, 2021

@pjmore

@rdettai made a good point on my initial implementation about how stable functions should be evaluatable in a constant context.

I think you are both right -- I will do so as a follow on PR under the aegis of #1175

@alamb
Contributor Author

alamb commented Oct 27, 2021

@houqp / @rdettai would you be ok with my merging this PR (it does not yet have any approvals) as I have a few others backed up waiting on it to merge?

Comment on lines +839 to +853
let plan = LogicalPlanBuilder::from(table_scan)
    .filter(
        now_expr()
            .lt(cast_to_int64_expr(to_timestamp_expr(ts_string)) + lit(50000)),
    )
    .unwrap()
    .build()
    .unwrap();

// Note that constant folder should be able to run again and fold
// this whole expression down to a single constant;
// https://github.com/apache/arrow-datafusion/issues/1160
let expected = "Filter: TimestampNanosecond(1599566400000000000) < CAST(totimestamp(Utf8(\"2020-09-08T12:05:00+00:00\")) AS Int64) + Int32(50000)\
    \n TableScan: test projection=None";
let actual = get_optimized_plan_formatted(&plan, &time);
Contributor

Isn't the ConstEvaluator supposed to solve this in one run, as the expression is constant?

Contributor

@rdettai rdettai Oct 27, 2021

never mind, let's have this conversation in #1160 😉

Contributor Author

Conversation in #1160 (comment) 👍

@alamb alamb merged commit 156ebff into apache:master Oct 27, 2021
@alamb alamb deleted the alamb/constant_eval branch October 27, 2021 23:19
@houqp houqp added the enhancement New feature or request label Nov 6, 2021

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Implement General Purpose Constant Folding with the Expression Evaluator

4 participants