
Conversation

@yuxiqian
Member

@yuxiqian yuxiqian commented Feb 8, 2025

This closes FLINK-37278.

Currently, the regular schema evolution (SE) topology uses the following process to drain in-flight DataChangeEvents from the pipeline:

  1. SchemaOperator ("client") emits FlushEvent to downstream.
  2. The "client" keeps polling the SchemaCoordinator ("server") at a 1-second interval.
  3. The "server" rejects all requests from clients until it has collected enough FlushSuccessEvent notifications from the Sink.

As a result, every schema change request takes at least 1 second to finish, since it must wait for at least one polling interval.
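The fixed polling cost described above can be sketched with a toy client/server model. All class, field, and method names here are illustrative, not the actual Flink CDC API; the point is only that a flush finishing in milliseconds still costs a full polling round:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the old polling protocol: the client emits a flush request,
// then re-polls the coordinator once per second until every sink has
// reported a FlushSuccessEvent.
public class PollingSketch {
    static final int REQUIRED_FLUSHES = 3;                // e.g. sink parallelism
    static final AtomicInteger flushed = new AtomicInteger();

    // "Server" side: reject schema change requests until all flushes arrived.
    static boolean tryApplySchemaChange() {
        return flushed.get() >= REQUIRED_FLUSHES;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate sinks reporting FlushSuccessEvent shortly after the flush.
        new Thread(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(100);
            } catch (InterruptedException ignored) {
            }
            flushed.addAndGet(REQUIRED_FLUSHES);
        }).start();

        long start = System.nanoTime();
        // "Client" side: poll at a fixed 1-second interval, so even a flush
        // that completes in ~100 ms costs a full polling round.
        while (!tryApplySchemaChange()) {
            TimeUnit.SECONDS.sleep(1);
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("schema change finished after ~" + elapsedMs + " ms");
    }
}
```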


This PR replaces the polling code with a pending schema change request queue: the SchemaCoordinator manages all pending clients and effectively blocks them from handling upstream events. The schema evolution process can start immediately after the FlushSuccessEvents are reported, without waiting for polling requests from clients.
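The queue-based protocol can be sketched as follows. Names and types are illustrative (the real coordinator works through Flink's coordination RPC, not plain method calls); the sketch only shows how parking each request as a future removes the polling delay:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

// Toy model of the new protocol: instead of clients re-polling, the
// coordinator parks each schema change request as a future and completes
// it the moment the last FlushSuccessEvent arrives.
public class QueueSketch {
    private final Queue<CompletableFuture<String>> pendingRequests = new ArrayDeque<>();
    private final int requiredFlushes;
    private int flushCount;

    QueueSketch(int requiredFlushes) {
        this.requiredFlushes = requiredFlushes;
    }

    // Client side: submit once, then block on the future (no polling loop).
    synchronized CompletableFuture<String> requestSchemaChange() {
        CompletableFuture<String> response = new CompletableFuture<>();
        pendingRequests.add(response);
        return response;
    }

    // Server side: called for each FlushSuccessEvent reported by a sink.
    synchronized void onFlushSuccess() {
        if (++flushCount == requiredFlushes) {
            // Evolution starts immediately; unblock every queued client.
            while (!pendingRequests.isEmpty()) {
                pendingRequests.poll().complete("SUCCESS");
            }
        }
    }

    public static void main(String[] args) {
        QueueSketch coordinator = new QueueSketch(2);
        CompletableFuture<String> response = coordinator.requestSchemaChange();
        coordinator.onFlushSuccess();
        coordinator.onFlushSuccess();       // last flush completes the request
        System.out.println(response.join()); // completes with no fixed delay
    }
}
```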


With this change, the running time of the testRegularTablesSourceInMultipleParallelism test case has been reduced from ~6 minutes to ~50 seconds.

@yuxiqian
Member Author

yuxiqian commented Feb 8, 2025

Would @hiliuxg like to take a look?

@yuxiqian yuxiqian marked this pull request as ready for review February 8, 2025 08:58
Contributor

@Shawn-Hx Shawn-Hx left a comment


It seems that SchemaChangeResponse#ResponseCode can only be SUCCESS now. Can we remove SchemaChangeResponse#ResponseCode and simplify the logic in SchemaOperator#handleSchemaChangeEvent?

@yuxiqian
Member Author

Thanks for Shawn's kind review, comments addressed.

Contributor

@Shawn-Hx Shawn-Hx left a comment


LGTM

@gongzexin

> It seems that SchemaChangeResponse#ResponseCode can only be SUCCESS now. Can we remove SchemaChangeResponse#ResponseCode and simplify the logic in SchemaOperator#handleSchemaChangeEvent?

@yuxiqian @Shawn-Hx
Hi, have you noticed that during fault tolerance, the same table will be flushed multiple times (related to the task parallelism)? So I think SchemaChangeResponse#ResponseCode#DUPLICATE should not be deleted, but rather strengthened.

@yuxiqian
Member Author

> It seems that SchemaChangeResponse#ResponseCode can only be SUCCESS now. Can we remove SchemaChangeResponse#ResponseCode and simplify the logic in SchemaOperator#handleSchemaChangeEvent?

> @yuxiqian @Shawn-Hx Hi, have you noticed that during fault tolerance, the same table will be flushed multiple times (related to the task parallelism)? So I think SchemaChangeResponse#ResponseCode#DUPLICATE should not be deleted, but rather strengthened.

Thanks for @gongzexin's report. IIUC, the root cause of this problem is that PreTransformOperator invokes getUnionListState to store persisted schemas: all subTasks of SchemaOperator will obtain the same set of table schemas when restoring from state, so SchemaCoordinator is expected to receive $N$ duplicate requests ($N$ = parallelism). Worse still, UnionListState will block the checkpointing process when some subTasks have entered the FINISHED state (FLINK-37368).
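The restore-time duplication described above can be illustrated with a toy coordinator. All names are hypothetical; the sketch only shows why a DUPLICATE response code is useful when $N$ subtasks restore the same schema from union list state and re-send the same request:

```java
import java.util.HashSet;
import java.util.Set;

// Toy illustration: with union list state, every subtask restores the full
// schema set, so the coordinator sees the same request N times
// (N = parallelism). A DUPLICATE response lets it acknowledge repeats
// without re-applying them.
public class DuplicateRequestSketch {
    enum ResponseCode { SUCCESS, DUPLICATE }

    private final Set<String> appliedChanges = new HashSet<>();

    synchronized ResponseCode handleSchemaChangeRequest(String tableId, String change) {
        String key = tableId + "#" + change;
        // The first subtask to report the change actually applies it...
        if (appliedChanges.add(key)) {
            return ResponseCode.SUCCESS;
        }
        // ...the other N-1 restored copies are acknowledged as duplicates.
        return ResponseCode.DUPLICATE;
    }

    public static void main(String[] args) {
        DuplicateRequestSketch coordinator = new DuplicateRequestSketch();
        int parallelism = 4;
        int applied = 0;
        // All subtasks restored the same CreateTableEvent from union state.
        for (int subTask = 0; subTask < parallelism; subTask++) {
            if (coordinator.handleSchemaChangeRequest("db.orders", "CREATE TABLE")
                    == ResponseCode.SUCCESS) {
                applied++;
            }
        }
        System.out.println("applied " + applied + " of " + parallelism + " requests");
    }
}
```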

I wonder if we can handle it in another PR, and focus on modifying the schema evolution request queueing logic here?

Contributor

@lvyanquan lvyanquan left a comment


LGTM.

@lvyanquan
Contributor

Hi, @leonardBang @ruanhang1993, could you take a look at this?

@leonardBang leonardBang merged commit 602abde into apache:master Mar 5, 2025
25 of 26 checks passed
SML0127 pushed a commit to SML0127/flink-cdc-connectors that referenced this pull request Mar 12, 2025
@nihaoya2025

@yuxiqian I have a table that I want to delete and re-synchronize with both full and incremental data. However, the schema recorded by the SchemaManager is not automatically deleted when the table is dropped, so the new table creation statement gets skipped. What should I do?

```java
// For redundant schema change events (possibly coming from duplicate emitted
// CreateTableEvents in snapshot stage), we just skip them.
if (!SchemaUtils.isSchemaChangeEventRedundant(currentUpstreamSchema, originalEvent)) {
    schemaManager.applyOriginalSchemaChange(originalEvent);
    deducedSchemaChangeEvents.addAll(deduceEvolvedSchemaChanges(originalEvent));
}
```
