Skip to content
12 changes: 8 additions & 4 deletions datafusion/optimizer/src/common_subexpr_eliminate.rs
Original file line number Diff line number Diff line change
Expand Up @@ -171,9 +171,13 @@ impl CommonSubexprEliminate {
input.schema().iter().for_each(|(qualifier, field)| {
Copy link
Contributor

@peter-toth peter-toth Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should suppose that all extracted aliases remain in the plan:

  • Let's suppose that we have a plan like this after some CSE optimization:
    ...
      Projection: #1 as c1, #1 as c2, #2 as c3, #2 as c4, (a + 2) as c5, (a + 1 + 1) as c6
        Projection: (a + b) as "#1", (a + c) as "#2", a
    ...
    
  • Then some other optimization rule prunes "c1" and "c2" from the plan because they turn out to be unnecessary:
    ...
      Projection: #2 as c3, #2 as c4, (a + 2) as c5, (a + 1 + 1) as c6
        Projection: (a + c) as "#2", a
    ...
    
  • And then some other rule creates new CSE possibilities:
    ...
      Projection: #2 as c3, #2 as c4, (a + 2) as c5, (a + 2) as c6
        Projection: (a + c) as "#2", a
    ...
    
  • CSE rule runs again but indexes here and at build_common_expr_project_plan() are out of sync...

IMO the best thing we can do is to choose a unique aliases for a common expressions in CommonSubexprRewriter when we found the expression and store the alias in common_exprs together with the expression. In that case we don't need to deal with index sync issues and don't get plans with unnecessary aliases like here: https://github.com/apache/datafusion/pull/10868/files#diff-351499880963d6a383c92e156e75019cd9ce33107724a9635853d7d4cd1898d0R1403

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both issues don't affect correctness.

One thing I'd like to point out is that adding unused columns (all input's columns) in intermediate projection is the behavior of current CSE, it's not introduced in this PR. You can try copying the new test and running it against main. You'll get this output.

        let plan = LogicalPlanBuilder::from(table_scan.clone())
            .project(vec![(col("a") + col("b")).alias("#1"), col("c")])?
            .project(vec![
                (col("c") + lit(2)).alias("c3"),
                (col("c") + lit(2)).alias("c4"),
            ])?
            .build()?;
Projection: {test.c + Int32(2)|{Int32(2)}|{test.c}} AS test.c + Int32(2) AS c3, {test.c + Int32(2)|{Int32(2)}|{test.c}} AS test.c + Int32(2) AS c4
  Projection: test.c + Int32(2) AS {test.c + Int32(2)|{Int32(2)}|{test.c}}, #1, test.c
    Projection: test.a + test.b AS #1, test.c
      TableScan: test

Extra projections are removed by other rules, so the final plan doesn't contain these projections.

Also, you may have noticed that extra projections make the aliases "out-of-sync" and to be honest I don't mind the #2 instead of #1 (as long as it's not something ridiculous like #1023 for example), and I don't see a way to fix that without patching some hacky global state/counter or asking other rules to reuse aliases when removing the extra projections.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, what I meant by "idexes go out of sync" is that if your modified CSE rule runs on a plan that we got in the 3rd step (i.e. there is no #1 in the plan) e.g.:

let plan = LogicalPlanBuilder::from(table_scan.clone())
    .project(vec![(col("a") + col("b")).alias("#2"), col("c")])?
    .project(vec![
        col("#2").alias("c1"),
        col("#2").alias("c2"),
        (col("c") + lit(2)).alias("c3"),
        (col("c") + lit(2)).alias("c4"),
    ])?
    .build()?;

then it produces an incorrect plan:

Projection: #2 AS c1, #2 AS c2, #2 AS c3, #2 AS c4 // The issue here is that `#2` gets aliased to `#1` below, but `#2` doesn't change here.
  Projection: #2 AS #1, test.c + Int32(2) AS #2, test.c
    Projection: test.a + test.b AS #2, test.c
      TableScan: test

This is because you inject #2 into common_exprs, but you don't inject it to expr_stats (and others).

IMO modifying common_exprs is hacky if you don't do it in CommonSubexprRewriter, that's why I suggested the solution in my previous comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed index usage; now we keep the original alias inside the common_exprs.

Copy link
Contributor

@peter-toth peter-toth Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this starts to look as the suggested because we assign the unique aliases in CommonSubexprRewriter and store it in common_exprs together with the common expression.

But why do you still inject previous # aliases to common_exprs? I think you just need to find the biggest one here and pass that number to CommonSubexprRewriter and simply start assigning new # aliases in f_down() from that number + 1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a solution that can produce a unique alias fast. There is no problem with having gaps if we can do it constant time (vs. no gaps with linear time to the number of common expressions).

Copy link
Contributor Author

@MohamedAbdeen21 MohamedAbdeen21 Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n is usually really small. I don't think this is a big performance hit, and so filling the gaps is a good tradeoff IMO

Copy link
Contributor

@peter-toth peter-toth Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why do you want to fill the gaps? These are artifical aliases so having consecutive numbers has no use, all that matter is they are short, unique and easy to read. Also, if you don't inject anything into common_exprs then the pontless #1 AS #1 aliases won't get added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I don't think I'll be able to do that anytime soon.

If that's the only remaining issue, I can mark the PR as ready and a maintainer can push that change.

Copy link
Contributor

@peter-toth peter-toth Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do it and let me try to open a PR with the fix to your PR tomorrow or during the weekend.

let name = field.name();
if name.starts_with('#') {
let index = name.trim_start_matches('#').parse::<usize>().unwrap_or(1);
let expr = Expr::from((qualifier, field));
common_exprs.insert(name.clone(), (expr, index));
match name.trim_start_matches('#').parse::<usize>() {
Ok(index) => {
let expr = Expr::from((qualifier, field));
common_exprs.insert(name.clone(), (expr, index));
}
Err(_) => (), // probably user-assigned alias, skip if not numeric
}
}
});

Expand Down Expand Up @@ -342,7 +346,7 @@ impl CommonSubexprEliminate {
Aggregate::try_new(Arc::new(new_input), new_group_expr, new_aggr_expr)
.map(LogicalPlan::Aggregate)
} else {
let mut expr_number = common_exprs.len();
let mut expr_number = common_exprs.values().map(|t| t.1).max().unwrap_or(0);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a test case


let mut agg_exprs = common_exprs
.into_iter()
Expand Down