Fix to enable index scan with LIKE operator for non-const nodes#4589
RuchaSK1 wants to merge 27 commits into babelfish-for-postgresql:BABEL_6_X_DEV from
Conversation
Signed-off-by: Rucha Kulkarni <[email protected]>
```c
 */
check_node = rightop;
while (IsA(check_node, RelabelType))
```
Why do we need this? Is this needed for AI collations?
Yes, needed for AI collations. When rightop is RelabelType(remove_accents_internal(...)), without the peek we can't detect remove_accents_internal inside. evaluate_expr would reduce it to a plain Const, losing AI semantics.
```c
{
    rightop = eval_const_expressions(NULL, rightop);

    if (!IsA(rightop, Const) && !IsA(rightop, Param) &&
```
Why do we need these conditions?
evaluate_expr actually executes the expression via PG executor. Without these checks, it crashes or produces wrong results for expressions that can't be evaluated at plan time.
How do we make sure we are not leaving any other condition out? I mean handling this way can lead to cases which we are not handling explicitly. What happens then?
```c
RelabelType *relabel = (RelabelType *) leftop;
leftop = copyObject((Node*) relabel->arg);
prefix->consttype = rtypeId = ltypeId = exprType(leftop);
ltypeId = exprType(leftop);
```
What is the necessity to make this change? Is there any test case which fails?
Yes. Without this, BABEL_SCHEMATA-vu-verify crashes:

```sql
use db_schemata_10;
SELECT * FROM information_schema.schemata ORDER BY SCHEMA_NAME;
```

```
TRAP: failed Assert("ltypeId == rtypeId")
```
rtypeId is set once at the top from original rightop. After evaluate_expr changes rightop to a Const, rtypeId becomes stale. Previously this code was never reached because rightop was never a Const for such queries. Now it is, and types mismatch.
```c
if (OidIsValid(like_entry.like_oid) && list_length(op->args) >= 2)
{
    Node *left_arg = linitial(op->args);
    Node *right_arg = lsecond(op->args);
    if (!IsA(left_arg, CollateExpr) && !IsA(right_arg, CollateExpr) && !contain_var_clause(right_arg))
```
Please explain this if condition
These conditions check whether it is safe to consider the column's collation for the optimization:

- `!IsA(left_arg, CollateExpr)`: if the left side carries an explicit COLLATE (e.g. `Col COLLATE Latin1_General_CI_AI LIKE 'abc'`), `op->inputcollid` already reflects it. Without this check we would overwrite it with the column's default collation, ignoring the explicit COLLATE.
- `!IsA(right_arg, CollateExpr)`: if the right side carries an explicit COLLATE, `evaluate_expr` crashes on the CollateExpr node.
- `!contain_var_clause(right_arg)`: if the pattern references columns (e.g. `Col LIKE OtherCol`), it cannot be reduced to a Const, so the optimization would not apply.
It seems like we are putting some hack in our code; maybe in some cases that is unavoidable, but for most of these conditions I see that if they are not done, this crashes. I am not comfortable with such an explanation. Is there any way to have a generic solution?
Yes, agreed. I've implemented a better approach now which removes the need for all the manual guards that arose solely from the use of evaluate_expr. Replaced it with PG's built-in eval_const_expressions + estimate_expression_value; these handle safety checks internally, and any unhandled expression simply skips the optimization with correct results.

The conditions in this block are unrelated to index optimization. They are for collation recovery, needed because views lose column collation during storage due to ::text casts. They ensure we only recover when the collation wasn't explicitly set and don't unnecessarily override it. I think test coverage for views with LIKE patterns was limited earlier, which is why we didn't catch this issue.
```c
Pattern_Prefix_Status pstatus;
int collidx_of_cs_as;
CollateExpr *prefix_collate;
Node *check_node;
```
On a very high level, I would prefer to call simplify_function directly, because the current implementation relies heavily on handling everything on our end. We can create a wrapper over simplify_function (as it is static) and call that directly. Check: pattern_fixed_prefix_wrapper
Tested the simplify_function approach. It only handles FuncExpr nodes, while rightop for basic queries like `LIKE N'exact match'` is `RelabelType(CoerceToDomain(FuncExpr(...)))`. simplify_function cannot process RelabelType or CoerceToDomain, so all test cases fall back to a Seq Scan.

We use eval_const_expressions + evaluate_expr because:
- eval_const_expressions handles immutable functions, OpExpr (e.g., string concatenation `N'exact' + N' match'`), and nested type casts
- evaluate_expr handles stable functions (e.g., CONVERT, TRY_CONVERT) that eval_const_expressions leaves unreduced
```
@@ -0,0 +1,541 @@
CREATE TABLE BABEL_5966_TestLikeIndex (
```
This will pick the db-level collation. For AI collations to pick the index, we need to create the index from the PG endpoint using the following:

```sql
create index <fully-qualified-table-name> (sys.remove_accents_internal(col_name));
```

Check: test_like_index-vu-prepare

I would suggest not having this table, as in GitHub we have two workflows: one for the server default collation (CI_AS) and one for the db-level collation (CI_AI). If you create the index from the TSQL endpoint only, CI_AS will pick the index and CI_AI will NOT, as evident from the expected output.
Understood. Will remove BABEL_5966_TestLikeIndex and use only the explicit collation tables (CI_AS, CS_AS, CI_AI, CS_AI) which behave consistently across both workflows.
```sql
);
GO

CREATE INDEX BABEL_5966_idx_ci_as ON BABEL_5966_TestLikeCollation_CI_AS(Col);
```
Look at my comment on index creation for AI collations, these will never pick index scan
Agreed. Will remove AI indexes (idx_ci_ai, idx_cs_ai). Plain TSQL indexes can't match remove_accents_internal(col) used in AI queries.
```c
 * col LIKE PATTERN -> col = PATTERN
 * Case 1: if the pattern is a constant string
 * col LIKE PATTERN -> col = PATTERN AND col LIKE PATTERN
 * col NOT LIKE PATTERN -> col <> PATTERN OR col NOT LIKE PATTERN
```
With this OR clause, the planner may lose the ability to choose an index scan. Hence most of the cases where we had a correctness issue will be fixed by this, but perf might get degraded. But I think that should be okay.
```c
root.type = T_PlannerInfo;
root.glob = &glob;

return estimate_expression_value(&root, expr);
```
The comment over fold_like_pattern_to_const clearly says the following:

```c
 * This function attempts to estimate the value of an expression for
 * planning purposes. It is in essence a more aggressive version of
 * eval_const_expressions(): we will perform constant reductions that are
 * not necessarily 100% safe, but are reasonable for estimation purposes.
```

In essence, the runtime value might differ from the planned value, so I think we should avoid using it blindly. Instead we should only try to const-fold CONVERT/TRY_CONVERT selectively.

By the way, for what reason are the CONVERT/TRY_CONVERT functions marked STABLE in Babelfish?
Description
Previously, the LIKE operator index scan optimization only worked when the pattern was already a Const node. Expressions that could be simplified to a constant, such as stable functions or type casts, were left unreduced, forcing a sequential scan.
Additionally, view definitions after dump/restore lost the original column collation due to ::text casts, causing incorrect LIKE/ILIKE operator selection and skipping the optimization entirely.
Changes
Constant folding of RHS expressions: apply eval_const_expressions() and evaluate_expr() to fold reducible pattern expressions down to a Const before prefix extraction. like_escape and remove_accents_internal are excluded to preserve ESCAPE clause and AI collation correctness.
Collation recovery after dump/restore: unwrap RelabelType coercions on the left-hand side to recover the original column collation when ::text casts have stripped it.
NOT LIKE fix: changed the exact-match NOT LIKE optimization from incorrect AND to correct OR semantics: col <> pattern OR col NOT LIKE pattern.
Type reconciliation: ensure the generated equality/range expressions always use the column's actual type rather than the LIKE-coerced TEXT type, so the planner can match the correct index operator class.
Issues Resolved
Task: BABEL-5966
Authored-by: Rucha Kulkarni [email protected]
Signed-off-by: Rucha Kulkarni [email protected]
Test Scenarios Covered
Use case based - Yes
Boundary conditions - Yes
Arbitrary inputs -
Negative test cases -
Minor version upgrade tests -
Major version upgrade tests -
Performance tests -
Tooling impact -
Client tests -
Check List
By submitting this pull request, I confirm that my contribution is under the terms of the Apache 2.0 and PostgreSQL licenses, and grant any person obtaining a copy of the contribution permission to relicense all or a portion of my contribution to the PostgreSQL License solely to contribute all or a portion of my contribution to the PostgreSQL open source project.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.