Fix/handle async lsn component receiver validation #2180

thuongle2210 · 2025-10-26T09:47:30Z

Summary

This PR resolves asynchronous problems in event handling by adding a retry mechanism. The purpose of the retries is to synchronize state from the next function, wait_for_relevant_lsn_change, because the events are received from independent channels, which leads to asynchronous problems.

Related Issues

links to related issues: #2181

Changes

Update mechanism to validate lsn ordering when reading state in ReadingStateManager
Update Error type in moonlink package

Checklist

Code builds correctly
Tests have been added or updated
Documentation updated if necessary
I have reviewed my own changes

release: bump toolchain chanel version

cursor · 2025-10-26T09:48:37Z

src/moonlink/src/union_read/read_state_manager.rs

        }
    }

+    #[inline]


Bug: Unwrapped None Causes Panic in Snapshot Read

Calling requested_lsn.unwrap() at line 131 will panic when requested_lsn is None. This occurs when try_read(None) is called to get the latest snapshot and LSN validation fails in can_satisfy_read_from_snapshot. The validation failure causes the function to return false without satisfying the read, leading to the call to wait_for_relevant_lsn_change with an unwrapped None value. The code should handle the None case explicitly, either by using unwrap_or with a sensible default or by restructuring the logic to avoid waiting when no specific LSN is requested.
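One way to avoid the panic described above is to branch on the `Option` explicitly instead of unwrapping it. The following is only a minimal sketch: `ReadDecision` and `decide` are hypothetical names invented for illustration, the LSN is assumed to be a `u64`, and the real signatures of `try_read` and `wait_for_relevant_lsn_change` are not shown in this thread.

```rust
/// Hypothetical outcome of one read attempt (not a type from the moonlink codebase).
#[derive(Debug, PartialEq)]
enum ReadDecision {
    /// The snapshot passed validation and can serve the read now.
    Serve,
    /// A specific LSN was requested and is not yet visible: wait for that LSN.
    WaitFor(u64),
    /// "Latest" was requested (no LSN) and validation failed: re-check
    /// the snapshot later instead of waiting on a concrete LSN.
    RecheckLatest,
}

/// Decide the next step without ever calling `requested_lsn.unwrap()`.
fn decide(requested_lsn: Option<u64>, snapshot_ok: bool) -> ReadDecision {
    match (snapshot_ok, requested_lsn) {
        (true, _) => ReadDecision::Serve,
        (false, Some(lsn)) => ReadDecision::WaitFor(lsn),
        (false, None) => ReadDecision::RecheckLatest,
    }
}
```

With this shape, `decide(None, false)` yields `RecheckLatest` rather than panicking, so the `None` case is handled by the type system instead of by convention.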

dentiny · 2025-10-26T10:03:26Z

src/moonlink/src/union_read/read_state_manager.rs

+                snapshot_lsn, commit_lsn
+            )));
+        }
+        if commit_lsn > replication_lsn {


What's the purpose of retry?

The purpose of retries is to synchronize state from the next function wait_for_relevant_lsn_change, because the events are received from independent channels, which leads to asynchronous problems. You can reproduce the issue in my test case by setting const MAX_READ_SNAPSHOT_RETRIES: u8 = 0 to observe the problems.
In my case, with MAX_READ_SNAPSHOT_RETRIES: u8 = 0, I see 36 cases that exhibit the problem replication_lsn <= commit_lsn, which causes a failure.
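The bounded retry described here might look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: only the constant name `MAX_READ_SNAPSHOT_RETRIES` and the `commit_lsn > replication_lsn` check come from the thread, while `LsnOrderingError`, `validate_ordering`, `read_with_retries`, the `u64` LSN type, and the retry count of 3 are hypothetical. The `load` closure stands in for re-reading the two independently updated values.

```rust
/// Retry limit; the value 3 is illustrative only.
const MAX_READ_SNAPSHOT_RETRIES: u8 = 3;

/// Hypothetical error for an inconsistent LSN pair.
#[derive(Debug, PartialEq)]
struct LsnOrderingError {
    commit_lsn: u64,
    replication_lsn: u64,
}

/// The commit LSN must not be ahead of the replication LSN.
fn validate_ordering(commit_lsn: u64, replication_lsn: u64) -> Result<(), LsnOrderingError> {
    if commit_lsn > replication_lsn {
        Err(LsnOrderingError { commit_lsn, replication_lsn })
    } else {
        Ok(())
    }
}

/// Re-read `(commit_lsn, replication_lsn)` until the pair validates or the
/// retry budget is used up. Total attempts = 1 + MAX_READ_SNAPSHOT_RETRIES.
fn read_with_retries<F>(mut load: F) -> Result<(u64, u64), LsnOrderingError>
where
    F: FnMut() -> (u64, u64),
{
    let mut retries_used: u8 = 0;
    loop {
        let (commit_lsn, replication_lsn) = load();
        match validate_ordering(commit_lsn, replication_lsn) {
            Ok(()) => return Ok((commit_lsn, replication_lsn)),
            // Budget exhausted: surface the last ordering error.
            Err(e) if retries_used >= MAX_READ_SNAPSHOT_RETRIES => return Err(e),
            Err(_) => retries_used += 1,
        }
    }
}
```

The sketch retries immediately; real async code would more likely await a change notification or a short delay between attempts. Note that when the pair never becomes consistent, the function still returns the error after the last attempt.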

Not sure if it's a good fix or if it's hiding the problem? It reads like a synchronization problem to me; a retry could also fail.

Asynchronous problems are common in engineering, especially when handling events in systems like Kafka, Spark, and Flink. There are two main ways to address these issues:
First, combine all related events into sequential packages and use a single channel (similar to one partition in Kafka).
Second, add a retry mechanism with a timeout or retry limit to handle asynchronous events (such as watermarking).
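For comparison, the first option (a single ordered channel) could be sketched as follows. This is a simplified illustration using the standard library's `std::sync::mpsc`, not moonlink's actual channel setup; `LsnEvent`, `LsnState`, and `apply_all` are hypothetical names, and LSNs are assumed to be `u64`.

```rust
use std::sync::mpsc;

/// Both kinds of LSN update travel over the same channel, so the
/// consumer observes them in exactly the order they were sent.
#[derive(Debug, Clone, Copy)]
enum LsnEvent {
    Commit(u64),
    Replication(u64),
}

/// State maintained by the single consumer.
#[derive(Debug, Default, PartialEq)]
struct LsnState {
    commit_lsn: u64,
    replication_lsn: u64,
}

/// Drain the channel until every sender is dropped, applying events in
/// send order and keeping each LSN monotonically non-decreasing.
fn apply_all(rx: mpsc::Receiver<LsnEvent>) -> LsnState {
    let mut state = LsnState::default();
    for event in rx {
        match event {
            LsnEvent::Commit(lsn) => state.commit_lsn = state.commit_lsn.max(lsn),
            LsnEvent::Replication(lsn) => {
                state.replication_lsn = state.replication_lsn.max(lsn)
            }
        }
    }
    state
}
```

If the producer always sends the replication update before the matching commit, a single consumer applying events in order never sees the commit LSN ahead of the replication LSN. The trade-off is that all updates are serialized through one queue.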

Curious why we don't pick (1)?

That's because I see MoonLink handles many background events, for example:
CdcEvent::PrimaryKeepAlive
CdcEvent::StreamCommit
CdcEvent::Commit
... etc.

Each message-passing channel can be used by different events, or anywhere else. To boost performance and make the code easier to scale, I see your team has split the work across many channels that run asynchronously.

There aren't too many channels: one for replication LSN, another for commit LSN?
Seems related to:

Fix commit boundary race #2032

Swap send and receive order #2103

Let me check. What about table_snapshot_watch_receiver? It is also an independent receiver and completely operates asynchronously.

Honestly, retries are also a common mechanism to handle such issues. You can configure the retry count to be high or set an eventual timeout. This approach will help your code scale easily in the future.

github-actions · 2025-11-24T03:27:05Z

This PR has been inactive for 14 days and is now marked as stale. If this is still being worked on, please comment to keep it open.

thuong and others added 9 commits October 11, 2025 00:04

release: bump toolchain chanel version

3b7b597

Merge pull request #1 from thuongle2210/release/bump-toolchain-version

6cb52c0

release: bump toolchain chanel version

Merge remote-tracking branch 'refs/remotes/origin/main'

f65a7e1

Merge branch 'Mooncake-Labs:main' into main

465cddb

sync from upstream repo

d8dbace

Merge remote-tracking branch 'refs/remotes/origin/main'

15d807e

fix: handle read LSN ordering validation asynchronously

0229180

fix: remove redundant import

321496e

fix: update the row line for loc in test_error_propagation_with_source

1811b42

github-actions bot added the community label Oct 26, 2025

cursor bot reviewed Oct 26, 2025

dentiny reviewed Oct 26, 2025

github-actions bot added the stale label Nov 24, 2025
