Skip to content

Conversation

@yzc-yzc
Copy link
Contributor

@yzc-yzc yzc-yzc commented Aug 15, 2025

valkey/src/networking.c

Lines 2279 to 2293 in aefed3d

while (nwritten >= o->used) {
next = listNextNode(curr);
if (!next) break; /* End of list */
nwritten -= o->used;
o->refcount--;
curr = next;
o = listNodeValue(curr);
o->refcount++;
}
serverAssert(nwritten <= o->used);
c->repl_data->ref_repl_buf_node = curr;
c->repl_data->ref_block_pos = nwritten;

From above code, we can see that c->repl_data->ref_block_pos could be equal to o->used.
When o->used == o->size, we may call SSL_write() with num=0 which does not comply with the openSSL specification.
(ref: https://docs.openssl.org/master/man3/SSL_write/#warnings)

What's worse is that it's still the case after the reconnection. See

valkey/src/replication.c

Lines 756 to 769 in aefed3d

while (node != NULL) {
replBufBlock *o = listNodeValue(node);
if (o->repl_offset + (long long)o->used >= offset) break;
node = listNextNode(node);
}
serverAssert(node != NULL);
/* Install a writer handler first.*/
prepareClientToWrite(c);
/* Setting output buffer of the replica. */
replBufBlock *o = listNodeValue(node);
o->refcount++;
c->repl_data->ref_repl_buf_node = node;
c->repl_data->ref_block_pos = offset - o->repl_offset;
. So in this case the replica will keep reconnecting again and again until it doesn't meet the requirements for partial synchronization.

Resolves #2119

@codecov
Copy link

codecov bot commented Aug 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.00%. Comparing base (aefed3d) to head (b24b55f).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2490      +/-   ##
============================================
- Coverage     72.05%   72.00%   -0.05%     
============================================
  Files           126      126              
  Lines         70446    70448       +2     
============================================
- Hits          50760    50727      -33     
- Misses        19686    19721      +35     
Files with missing lines Coverage Δ
src/networking.c 87.92% <100.00%> (-0.10%) ⬇️

... and 15 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: yzc-yzc <[email protected]>
@zuiderkwast
Copy link
Contributor

Great to see progress on this!

So there can exist reply blocks (list nodes) with zero bytes? How/why do we create them?

@yzc-yzc
Copy link
Contributor Author

yzc-yzc commented Aug 15, 2025

So there can exist reply blocks (list nodes) with zero bytes? How/why do we create them?

e.g. The last node of the replication buffer list has used==size at some point, and we send all the data successfully. Then in function postWriteToReplica the replica client buffer will pointer to the last node, and the block_pos will be exactly the size. Then the next time we want to send data, it will be a multi block case. And we triggered SSL_write(_, _, 0).

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, makes sense!

@zuiderkwast
Copy link
Contributor

Looking at git blame, it seems that this bug was introduced in

which was released in 8.1, so it means we need to backport this fix to 8.1 only. 8.0 and before are not affected AFAICT.

@zuiderkwast zuiderkwast merged commit e7bb235 into valkey-io:unstable Aug 15, 2025
51 of 52 checks passed
@github-project-automation github-project-automation bot moved this to To be backported in Valkey 8.1 Aug 15, 2025
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request Aug 19, 2025
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/networking.c#L2279-L2293
From above code, we can see that `c->repl_data->ref_block_pos` could be
equal to `o->used`.
When `o->used == o->size`, we may call SSL_write() with num=0 which does
not comply with the openSSL specification.
(ref: https://docs.openssl.org/master/man3/SSL_write/#warnings)

What's worse is that it's still the case after the reconnection. See
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/replication.c#L756-L769.
So in this case the replica will keep reconnecting again and again until
it doesn't meet the requirements for partial synchronization.

Resolves valkey-io#2119

---------

Signed-off-by: yzc-yzc <[email protected]>
asagege pushed a commit to asagege/valkey that referenced this pull request Aug 19, 2025
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/networking.c#L2279-L2293
From above code, we can see that `c->repl_data->ref_block_pos` could be
equal to `o->used`.
When `o->used == o->size`, we may call SSL_write() with num=0 which does
not comply with the openSSL specification.
(ref: https://docs.openssl.org/master/man3/SSL_write/#warnings)

What's worse is that it's still the case after the reconnection. See
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/replication.c#L756-L769.
So in this case the replica will keep reconnecting again and again until
it doesn't meet the requirements for partial synchronization.

Resolves valkey-io#2119

---------

Signed-off-by: yzc-yzc <[email protected]>
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request Sep 19, 2025
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/networking.c#L2279-L2293
From above code, we can see that `c->repl_data->ref_block_pos` could be
equal to `o->used`.
When `o->used == o->size`, we may call SSL_write() with num=0 which does
not comply with the openSSL specification.
(ref: https://docs.openssl.org/master/man3/SSL_write/#warnings)

What's worse is that it's still the case after the reconnection. See
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/replication.c#L756-L769.
So in this case the replica will keep reconnecting again and again until
it doesn't meet the requirements for partial synchronization.

Resolves valkey-io#2119

---------

Signed-off-by: yzc-yzc <[email protected]>
rjd15372 pushed a commit that referenced this pull request Sep 23, 2025
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/networking.c#L2279-L2293
From above code, we can see that `c->repl_data->ref_block_pos` could be
equal to `o->used`.
When `o->used == o->size`, we may call SSL_write() with num=0 which does
not comply with the openSSL specification.
(ref: https://docs.openssl.org/master/man3/SSL_write/#warnings)

What's worse is that it's still the case after the reconnection. See
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/replication.c#L756-L769.
So in this case the replica will keep reconnecting again and again until
it doesn't meet the requirements for partial synchronization.

Resolves #2119

---------

Signed-off-by: yzc-yzc <[email protected]>
@ranshid ranshid moved this from To be backported to In Progress in Valkey 8.1 Sep 30, 2025
ranshid pushed a commit to ranshid/valkey that referenced this pull request Sep 30, 2025
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/networking.c#L2279-L2293
From above code, we can see that `c->repl_data->ref_block_pos` could be
equal to `o->used`.
When `o->used == o->size`, we may call SSL_write() with num=0 which does
not comply with the openSSL specification.
(ref: https://docs.openssl.org/master/man3/SSL_write/#warnings)

What's worse is that it's still the case after the reconnection. See
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/replication.c#L756-L769.
So in this case the replica will keep reconnecting again and again until
it doesn't meet the requirements for partial synchronization.

Resolves valkey-io#2119

---------

Signed-off-by: yzc-yzc <[email protected]>
ranshid pushed a commit to ranshid/valkey that referenced this pull request Sep 30, 2025
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/networking.c#L2279-L2293
From above code, we can see that `c->repl_data->ref_block_pos` could be
equal to `o->used`.
When `o->used == o->size`, we may call SSL_write() with num=0 which does
not comply with the openSSL specification.
(ref: https://docs.openssl.org/master/man3/SSL_write/#warnings)

What's worse is that it's still the case after the reconnection. See
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/replication.c#L756-L769.
So in this case the replica will keep reconnecting again and again until
it doesn't meet the requirements for partial synchronization.

Resolves valkey-io#2119

---------

Signed-off-by: yzc-yzc <[email protected]>
@ranshid ranshid moved this from In Progress to 8.1.4 in Valkey 8.1 Sep 30, 2025
@ranshid ranshid moved this from 8.1.4 to To be backported in Valkey 8.1 Sep 30, 2025
zuiderkwast pushed a commit that referenced this pull request Oct 1, 2025
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/networking.c#L2279-L2293
From above code, we can see that `c->repl_data->ref_block_pos` could be
equal to `o->used`.
When `o->used == o->size`, we may call SSL_write() with num=0 which does
not comply with the openSSL specification.
(ref: https://docs.openssl.org/master/man3/SSL_write/#warnings)

What's worse is that it's still the case after the reconnection. See
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/replication.c#L756-L769.
So in this case the replica will keep reconnecting again and again until
it doesn't meet the requirements for partial synchronization.

Resolves #2119

---------

Signed-off-by: yzc-yzc <[email protected]>
@zuiderkwast zuiderkwast moved this from To be backported to 8.1.4 in Valkey 8.1 Oct 1, 2025
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/networking.c#L2279-L2293
From above code, we can see that `c->repl_data->ref_block_pos` could be
equal to `o->used`.
When `o->used == o->size`, we may call SSL_write() with num=0 which does
not comply with the openSSL specification.
(ref: https://docs.openssl.org/master/man3/SSL_write/#warnings)

What's worse is that it's still the case after the reconnection. See
https://github.com/valkey-io/valkey/blob/aefed3d363d1c7d7cb391d3f605484c78c9a88f2/src/replication.c#L756-L769.
So in this case the replica will keep reconnecting again and again until
it doesn't meet the requirements for partial synchronization.

Resolves valkey-io#2119

---------

Signed-off-by: yzc-yzc <[email protected]>
Signed-off-by: Harkrishn Patro <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 8.1.4

Development

Successfully merging this pull request may close these issues.

[BUG] Unsuccessful (extremely high number) of attempts to perform Partial resynchronization causing CPU/Memory spikes.

2 participants