Implement Parallel-aware Hash Left Anti Semi (Not-In) Join by avamingli · Pull Request #3 · avamingli/cloudberrydb

avamingli · 2023-08-14T06:24:23Z

Implement Parallel-aware Hash Left Anti Semi (Not-In) Join

For a parallel-aware hash join, the parallel workers must synchronize
with each other to return correct results when there are NULL values.

If the join is a LASJ and this worker or a sibling process has found a
NULL value, the worker quits and tells its siblings to quit if possible.
It is safe to read and set phs_lasj_has_null without a lock, here and
elsewhere: it is a boolean, and we do not need the most recent value
from the CPU or memory cache. We should also avoid adding more locks to
the HashJoin implementation.
If a worker misses the flag here while another process is setting it,
it simply carries on and may see it at the next hash batch.
If it misses the flag across all batches, it will see it when
PHJ_BUILD_HASHING_INNER ends, with the help of build_barrier.
If it never participated in building the hash table, it checks the flag
once the hash table creation job has finished.
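
The early-exit protocol described above can be sketched in a self-contained form. This is only an illustration under stated assumptions, not the patch's code: threads stand in for the parallel worker processes, `SharedLasjState`, `WorkerArg`, `worker_build` and `run_lasj_build` are hypothetical names, joining the threads stands in for build_barrier, and relaxed C11 atomics are used as the portable equivalent of reading and writing the flag without a lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared parallel hash state. */
typedef struct SharedLasjState
{
    atomic_bool lasj_has_null;  /* plays the role of phs_lasj_has_null */
} SharedLasjState;

/* Per-worker input and output; all names here are illustrative. */
typedef struct WorkerArg
{
    SharedLasjState *shared;
    const bool *is_null;        /* is_null[i]: inner row i is NULL */
    size_t      nrows;
    size_t      rows_hashed;    /* out: rows processed before stopping */
} WorkerArg;

static void *
worker_build(void *p)
{
    WorkerArg  *arg = p;

    for (size_t i = 0; i < arg->nrows; i++)
    {
        /* Relaxed read: a stale value only delays the early exit. */
        if (atomic_load_explicit(&arg->shared->lasj_has_null,
                                 memory_order_relaxed))
            break;
        if (arg->is_null[i])
        {
            /* Tell the siblings; the flag is never reset to false. */
            atomic_store_explicit(&arg->shared->lasj_has_null, true,
                                  memory_order_relaxed);
            break;
        }
        arg->rows_hashed++;     /* stand-in for a hash table insert */
    }
    return NULL;
}

/* Runs two workers and reports whether either one saw a NULL. */
static bool
run_lasj_build(WorkerArg *a, WorkerArg *b)
{
    pthread_t   ta, tb;
    int         rc;

    rc = pthread_create(&ta, NULL, worker_build, a);
    assert(rc == 0);
    rc = pthread_create(&tb, NULL, worker_build, b);
    assert(rc == 0);
    (void) rc;

    /* Joining both threads plays the role of the build barrier. */
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return atomic_load(&a->shared->lasj_has_null);
}
```

Because the flag only ever moves from false to true, a worker that reads a stale false does some extra work but cannot produce a wrong answer; the final check after the join (the barrier in the real executor) is what guarantees every worker eventually sees it.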

gpadmin=# explain(costs off) select c1 from ao1 where c1 not in(select c2 from ao2);
                                QUERY PLAN
---------------------------------------------------------------------------
 Gather Motion 12:1  (slice1; segments: 12)
   ->  Parallel Hash Left Anti Semi (Not-In) Join
         Hash Cond: (ao1.c1 = ao2.c2)
         ->  Parallel Seq Scan on ao1
         ->  Parallel Hash
               ->  Parallel Broadcast Motion 12:12  (slice2; segments: 12)
                     ->  Parallel Seq Scan on ao2
 Optimizer: Postgres query optimizer
(8 rows)
gpadmin=# set enable_parallel=off;
SET
Time: 1.020 ms
gpadmin=# explain(costs off) select c1 from ao1 where c1 not in(select c2 from ao2);
                          QUERY PLAN
---------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   ->  Hash Left Anti Semi (Not-In) Join
         Hash Cond: (ao1.c1 = ao2.c2)
         ->  Seq Scan on ao1
         ->  Hash
               ->  Broadcast Motion 3:3  (slice2; segments: 3)
                     ->  Seq Scan on ao2
 Optimizer: Postgres query optimizer
(8 rows)

performance:

A special case: the NOT IN subselect has a NULL value.

Table ao2 has 1 billion rows in seg files 0-3 and a NULL value in seg file 4; the parallel plan is launched with 4 workers.

gpadmin=# select count(*) from ao2;
   count
------------
 1000000003
(1 row)

gpadmin=# select c1 from ao1 where c1 not in(select c2 from ao2);
 c1
----
(0 rows)

Time: 309224.911 ms (05:09.225)
set enable_parallel = on;
gpadmin=# select c1 from ao1 where c1 not in(select c2 from ao2);
 c1
----
(0 rows)

Time: 192.844 ms

Time: 309224.911 ms with the non-parallel plan versus 192.844 ms with the parallel-aware plan, about 1600x faster.
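
Why a single NULL lets every worker quit: under SQL's three-valued logic, `x NOT IN (subselect)` can only be false or unknown once the subselect contains a NULL, so no outer row can pass the WHERE clause. The following is a minimal sketch of that rule for integers; `TriBool`, `NullableInt`, `not_in` and `where_passes` are hypothetical names introduced for illustration and do not come from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* SQL three-valued logic result. */
typedef enum { TV_FALSE, TV_TRUE, TV_UNKNOWN } TriBool;

/* An integer that may be NULL. */
typedef struct NullableInt
{
    bool    is_null;
    int     value;
} NullableInt;

/* Evaluates "x NOT IN (set[0], ..., set[n-1])". */
static TriBool
not_in(NullableInt x, const NullableInt *set, size_t n)
{
    bool    saw_unknown = false;

    for (size_t i = 0; i < n; i++)
    {
        if (x.is_null || set[i].is_null)
            saw_unknown = true;     /* comparison with NULL is unknown */
        else if (x.value == set[i].value)
            return TV_FALSE;        /* a definite match: x IN set */
    }
    /* With an empty set the loop never runs and the result is true. */
    return saw_unknown ? TV_UNKNOWN : TV_TRUE;
}

/* A WHERE clause keeps a row only when the predicate is true. */
static bool
where_passes(TriBool t)
{
    return t == TV_TRUE;
}
```

With a NULL anywhere in `set`, `not_in` never returns `TV_TRUE`, which is why the query above returns 0 rows and why the parallel workers can stop as soon as any of them finds the NULL.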

The NOT IN subselect has no NULL values:

select count(*) from t2 where c1 not in (select c2 from t1);

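 parallel workers | avg duration (s) | 1st run (s) | 2nd run (s) | 3rd run (s)
------------------+------------------+-------------+-------------+-------------
                0 |           41.504 |      41.792 |      41.446 |      41.275
                2 |           27.757 |      28.637 |      27.099 |      27.534
                4 |           24.990 |      25.130 |      24.482 |      25.360
                6 |           24.056 |      24.489 |      23.721 |      23.958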
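
The averages in the table can be turned into speedups over the non-parallel run. The sketch below recomputes each average from the three reported runs and divides the baseline by it; the array and function names are illustrative only.

```c
/* Durations in seconds, copied from the three runs reported above. */
static const double RUNS_0_WORKERS[3] = {41.792, 41.446, 41.275};
static const double RUNS_2_WORKERS[3] = {28.637, 27.099, 27.534};
static const double RUNS_4_WORKERS[3] = {25.130, 24.482, 25.360};
static const double RUNS_6_WORKERS[3] = {24.489, 23.721, 23.958};

/* Arithmetic mean of three runs. */
static double
mean3(const double runs[3])
{
    return (runs[0] + runs[1] + runs[2]) / 3.0;
}

/* Speedup of a parallel configuration relative to the baseline. */
static double
speedup(const double baseline[3], const double parallel[3])
{
    return mean3(baseline) / mean3(parallel);
}
```

For these numbers the speedup is about 1.50x with 2 workers, 1.66x with 4 and 1.73x with 6, so most of the gain in this no-NULL case comes from the first two workers.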

DDL & DML

create table t1(c1 int, c2 int);
create table t2(c1 int, c2 int);
insert into t1 select i, i+1 from generate_series(1, 40000000) i;
insert into t2 select i, i+1 from generate_series(1, 40000000) i;

closes: #ISSUE_Number

Change logs

Describe your change clearly, including what problem is being solved or what feature is being added.

If it breaks backward or forward compatibility, please clarify.

Why are the changes needed?

Describe why the changes are necessary.

Does this PR introduce any user-facing change?

If yes, please clarify the previous behavior and the change this PR proposes.

How was this patch tested?

Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.

Contributor's Checklist

Here are some reminders and checklists before/when submitting your pull request, please check them:

Make sure your Pull Request has a clear title and commit message. You can take git-commit template as a reference.
Sign the Contributor License Agreement as prompted for your first-time contribution.
List your communication in the GitHub Issues or Discussions (if any, or if needed).
Document changes.
Add tests for the change
Pass make installcheck
Pass make -C src/test installcheck-cbdb-parallel
Feel free to mention the @cloudberrydb/dev team for review and approval when your PR is ready 🥳

Authored-by: Zhang Mingli <avamingli@gmail.com>

For the test case:

create table t0(c0 inet) distributed randomly;
create table t2(c0 inet) distributed randomly;
create table t3(c0 inet) distributed randomly;
SELECT ALL t2.c0, t3.c0, t0.c0 FROM t0, ONLY t3 FULL OUTER JOIN t2 ON ((t2.c0)=(t3.c0)) WHERE (((('0.5496844753539182')||(t3.c0)))LIKE(CAST((0.13292931)::MONEY AS VARCHAR(971))))
UNION ALL
SELECT t2.c0, t3.c0, t0.c0 FROM t0, ONLY t3 FULL OUTER JOIN t2 ON ((t2.c0)=(t3.c0)) WHERE NOT ((((('0.5496844753539182')||(t3.c0)))LIKE((CAST(0.13292931 AS MONEY))::VARCHAR(971))))
UNION ALL
SELECT ALL t2.c0, t3.c0, t0.c0 FROM t0*, ONLY t3 FULL OUTER JOIN t2 ON ((t2.c0)=(t3.c0)) WHERE ((((('0.5496844753539182')||(t3.c0)))LIKE((CAST(0.13292931 AS MONEY))::VARCHAR(971)))) ISNULL;

This causes a crash because of an assertion failure in 'create_plan_recurse':

#3  0x00007fe94eccf476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fe94ecb57f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007fe94fcdd548 in ExceptionalCondition (conditionName=0x7fe95043dcd0 "best_path->parallel_workers == best_path->locus.parallel_workers", errorType=0x7fe95043db06 "FailedAssertion", fileName=0x7fe95043dbdb "createplan.c", lineNumber=623) at assert.c:48
#6  0x00007fe94f94918f in create_plan_recurse (root=0x55d7cbe96f78, best_path=0x55d7cbec0380, flags=1) at createplan.c:623
#7  0x00007fe94f94a1f8 in create_append_plan (root=0x55d7cbe96f78, best_path=0x55d7cbec0700, flags=1) at createplan.c:1380
#8  0x00007fe94f948d37 in create_plan_recurse (root=0x55d7cbe96f78, best_path=0x55d7cbec0700, flags=1) at createplan.c:481
#9  0x00007fe94f94e2d1 in create_motion_plan (root=0x55d7cbe96f78, path=0x55d7cbec0e50) at createplan.c:3316
#10 0x00007fe94f9490dc in create_plan_recurse (root=0x55d7cbe96f78, best_path=0x55d7cbec0e50, flags=1) at createplan.c:608
#11 0x00007fe94f948ba3 in create_plan (root=0x55d7cbe96f78, best_path=0x55d7cbec0e50, curSlice=0x55d7cbe96f20) at createplan.c:392

The parallel_workers should be set to zero because parallel full join is not supported yet.

We've heard a couple of reports of people having trouble with multi-gigabyte-sized query-texts files. It occurred to me that on 32-bit platforms, there could be an issue with integer overflow of calculations associated with the total query text size. Address that with several changes:

1. Limit pg_stat_statements.max to INT_MAX / 2 not INT_MAX. The hashtable code will bound it to that anyway unless "long" is 64 bits. We still need overflow guards on its use, but this helps.

2. Add a check to prevent extending the query-texts file to more than MaxAllocHugeSize. If it got that big, qtext_load_file would certainly fail, so there's not much point in allowing it. Without this, we'd need to consider whether extent, query_offset, and related variables shouldn't be off_t not size_t.

3. Adjust the comparisons in need_gc_qtexts() to be done in 64-bit arithmetic on all platforms. It appears possible that under duress those multiplications could overflow 32 bits, yielding a false conclusion that we need to garbage-collect the texts file, which could lead to repeatedly garbage-collecting after every hash table insertion.

Per report from Bruno da Silva. I'm not convinced that these issues fully explain his problem; there may be some other bug that's contributing to the query-texts file becoming so large in the first place. But it did get that big, so #2 is a reasonable defense, and #3 could explain the reported performance difficulties.

(See also commit 8bbe4cb, which addressed some related bugs. The second Discussion: link is the thread that led up to that.)

This issue is old, and is primarily a problem for old platforms, so back-patch.

Discussion: https://postgr.es/m/CAB+Nuk93fL1Q9eLOCotvLP07g7RAv4vbdrkm0cVQohDVMpAb9A@mail.gmail.com
Discussion: https://postgr.es/m/5601D354.5000703@BlueTreble.com
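
The overflow described in change 3 can be illustrated with a small sketch. It is not the actual need_gc_qtexts() code: the function names are hypothetical, `uint32_t` is used to emulate a 32-bit size_t, and the simplified threshold (mean query length times maximum entries times 2) is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check done in 32-bit arithmetic: the product can wrap
 * around, making the threshold far smaller than intended.
 */
static bool
needs_gc_32bit(uint32_t extent, uint32_t mean_query_len, uint32_t max_entries)
{
    uint32_t    threshold = mean_query_len * max_entries * 2u; /* may wrap */

    return extent >= threshold;
}

/*
 * The same check with the multiplication widened to 64 bits first, so
 * the threshold is computed exactly.
 */
static bool
needs_gc_64bit(uint32_t extent, uint32_t mean_query_len, uint32_t max_entries)
{
    uint64_t    threshold = (uint64_t) mean_query_len * max_entries * 2u;

    return (uint64_t) extent >= threshold;
}
```

For example, with a mean query length of 3000 bytes and 1,000,000 entries the exact threshold is 6,000,000,000, which does not fit in 32 bits. The wrapped value is much smaller, so a 2,000,000,000-byte file is wrongly reported as needing garbage collection by the 32-bit version and correctly left alone by the 64-bit one.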

## Problem

An error occurs in the Python library when a plpython function is executed. According to our analysis, a plpython UDF was running in the user's cluster over an unstable network and got a timeout error: `failed to acquire resources on one or more segments`. Then another plpython UDF was run in the same session, and it failed with a GC error. Here is the core dump:

```
2023-11-24 10:15:18.945507 CST,,,p2705198,th2081832064,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log:
#0 0x7f7c68b6d55b in frame_dealloc /home/cc/repo/cpython/Objects/frameobject.c:509:5
#1 0x7f7c68b5109d in gen_send_ex /home/cc/repo/cpython/Objects/genobject.c:108:9
#2 0x7f7c68af9ddd in PyIter_Next /home/cc/repo/cpython/Objects/abstract.c:3118:14
#3 0x7f7c78caa5c0 in PLy_exec_function /home/cc/repo/gpdb6/src/pl/plpython/plpy_exec.c:134:11
#4 0x7f7c78cb5ffb in plpython_call_handler /home/cc/repo/gpdb6/src/pl/plpython/plpy_main.c:387:13
#5 0x562f5e008bb5 in ExecMakeTableFunctionResult /home/cc/repo/gpdb6/src/backend/executor/execQual.c:2395:13
#6 0x562f5e0dddec in FunctionNext_guts /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:142:5
#7 0x562f5e0da094 in FunctionNext /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:350:11
#8 0x562f5e03d4b0 in ExecScanFetch /home/cc/repo/gpdb6/src/backend/executor/execScan.c:84:9
#9 0x562f5e03cd8f in ExecScan /home/cc/repo/gpdb6/src/backend/executor/execScan.c:154:10
#10 0x562f5e0da072 in ExecFunctionScan /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:380:9
#11 0x562f5e001a1c in ExecProcNode /home/cc/repo/gpdb6/src/backend/executor/execProcnode.c:1071:13
#12 0x562f5dfe6377 in ExecutePlan /home/cc/repo/gpdb6/src/backend/executor/execMain.c:3202:10
#13 0x562f5dfe5bf4 in standard_ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:1171:5
#14 0x562f5dfe4877 in ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:992:4
#15 0x562f5e857e69 in PortalRunSelect /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1164:4
#16 0x562f5e856d3f in PortalRun /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1005:18
#17 0x562f5e84607a in exec_simple_query /home/cc/repo/gpdb6/src/backend/tcop/postgres.c:1848:10
```

## Reproduce

We can use a simple procedure to reproduce the above problem:

- set the timeout GUC: `gpconfig -c gp_segment_connect_timeout -v 5` and `gpstop -ari`
- prepare the function:

```
CREATE EXTENSION plpythonu;
CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS
$$
plpy.execute("select pg_backend_pid()")
for i in range(0, 5):
    yield (i)
$$ LANGUAGE plpythonu;
```

- exit from the current psql session.
- stop the postmaster of a segment: `gdb -p "the pid of segment postmaster"`
- enter a psql session.
- call `SELECT test_func();` and get an error:

```
gpadmin=# select test_func();
ERROR: function "test_func" error fetching next item from iterator (plpy_elog.c:121)
DETAIL: Exception: failed to acquire resources on one or more segments
CONTEXT: Traceback (most recent call last):
PL/Python function "test_func"
```

- quit gdb so that the postmaster is runnable again.
- call `SELECT test_func();` again and get a panic:

```
gpadmin=# SELECT test_func();
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>
```

## Analysis

- There is an SPI call in test_func(): `plpy.execute()`.
- The coordinator then starts a subtransaction via PLy_spi_subtransaction_begin().
- Meanwhile, if the segment cannot receive the instruction from the coordinator, beginning the subtransaction fails.
- BUT! The Python interpreter does not know that an error happened and does not clean up its environment.
- The next plpython UDF in the same session then fails because of the inconsistent Python environment.

## Solution

- Use try-catch to catch the exception raised by PLy_spi_subtransaction_begin().
- Set the Python error indicator with PLy_spi_exception_set().

Co-authored-by: Chen Mulong <chenmulong@gmail.com>
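
The shape of the fix can be sketched outside PostgreSQL. PostgreSQL's `PG_TRY`/`PG_CATCH` macros are built on `sigsetjmp`/`siglongjmp`, so the example below uses plain `setjmp`/`longjmp` as a stand-in. Every name in it (`error_handler`, `pending_error`, `raise_error`, `subtransaction_begin`, `spi_execute_guarded`) is hypothetical, and `pending_error` merely plays the role of the Python error indicator; this is an illustration of the pattern, not the patch itself.

```c
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Innermost active handler; stand-in for the backend's exception stack. */
static jmp_buf *error_handler = NULL;

/* Stand-in for the Python error indicator. */
static const char *pending_error = NULL;

/* Jumps to the active handler, or aborts when there is none. */
static void
raise_error(void)
{
    if (error_handler == NULL)
        abort();
    longjmp(*error_handler, 1);
}

/* Hypothetical subtransaction start that fails when segments are down. */
static void
subtransaction_begin(bool segments_reachable)
{
    if (!segments_reachable)
        raise_error();
}

/* Guards the subtransaction start and records any failure. */
static bool
spi_execute_guarded(bool segments_reachable)
{
    jmp_buf     local;
    jmp_buf    *saved = error_handler;
    bool        ok;

    error_handler = &local;
    if (setjmp(local) == 0)
    {
        subtransaction_begin(segments_reachable);
        ok = true;
    }
    else
    {
        /* Caught: record the error so the interpreter side sees it. */
        pending_error = "failed to acquire resources on one or more segments";
        ok = false;
    }
    error_handler = saved;      /* always restore the outer handler */
    return ok;
}
```

The point of the pattern is that the failure is both caught and reported to the interpreter, so the caller can unwind cleanly instead of leaving state behind for the next function call in the session.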

avamingli force-pushed the implement_parallel_aware_lasj_hashjoin branch 2 times, most recently from 4991120 to e040b6e, on August 15, 2023 04:22

avamingli force-pushed the implement_parallel_aware_lasj_hashjoin branch from e040b6e to 0078600 on August 31, 2023 10:01

avamingli force-pushed the implement_parallel_aware_lasj_hashjoin branch from 0078600 to ec90764 on October 8, 2023 08:21

avamingli closed this Nov 6, 2023

avamingli deleted the implement_parallel_aware_lasj_hashjoin branch December 16, 2024 13:30

avamingli wants to merge 1 commit into main from implement_parallel_aware_lasj_hashjoin