Skip to content

Conversation

@enjoy-binbin
Copy link
Member

@enjoy-binbin enjoy-binbin commented Mar 6, 2025

There will be two issues in this test:

test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}

The first one is the test output, we can see that after executing FUNCTION FLUSH,
used_memory_vm_functions has not changed at all:

flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576

The second one is there is a crash when loading the functions during the async
flush:

=== VALKEY BUG REPORT START: Cut & paste starting from here ===
 # valkey 255.255.255 crashed by signal: 11, si_code: 2
 # Accessing address: 0xe0429b7100000a3c
 # Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                       0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib            0x000000018b066584 _sigtramp + 56
1   valkey-server                       0x0000000102e01054 luaD_precall + 96
2   valkey-server                       0x0000000102e01b10 luaD_call + 104
3   valkey-server                       0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                       0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                       0x0000000102dfc630 lua_pcall + 300
6   valkey-server                       0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                       0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                       0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                       0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                       0x0000000102e3978c call + 416
11  valkey-server                       0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                       0x0000000102e563cc processInputBuffer + 520
13  valkey-server                       0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                       0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                       0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                       0x0000000102e4aad0 main + 26412
17  dyld                                0x000000018acab154 start + 2476

------ STACK TRACE DONE ------

The reason is that, in the old implementation (introduced in 7.0), FUNCTION FLUSH
use lua_unref to remove the script from lua VM. lua_unref does not trigger the gc,
it causes us to not be able to effectively reclaim memory after the FUNCTION FLUSH.

The other issue is that, since we don't re-create the lua VM in FUNCTION FLUSH,
loading the functions during a FUNCTION FLUSH ASYNC will result a crash because
lua engine state is not thread-safe.

The correct solution is to re-create a new Lua VM to use, just like SCRIPT FLUSH.

@enjoy-binbin enjoy-binbin requested a review from rjd15372 March 6, 2025 11:58
…load crash

There will be two issues in this test:
```
test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}
```

The first one is the test output, we can see that after executing FUNCTION FLUSH,
used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576`
```

The second one is there is a crash when loading the functions during the async
flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
 # valkey 255.255.255 crashed by signal: 11, si_code: 2
 # Accessing address: 0xe0429b7100000a3c
 # Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                       0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib            0x000000018b066584 _sigtramp + 56
1   valkey-server                       0x0000000102e01054 luaD_precall + 96
2   valkey-server                       0x0000000102e01b10 luaD_call + 104
3   valkey-server                       0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                       0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                       0x0000000102dfc630 lua_pcall + 300
6   valkey-server                       0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                       0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                       0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                       0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                       0x0000000102e3978c call + 416
11  valkey-server                       0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                       0x0000000102e563cc processInputBuffer + 520
13  valkey-server                       0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                       0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                       0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                       0x0000000102e4aad0 main + 26412
17  dyld                                0x000000018acab154 start + 2476

------ STACK TRACE DONE ------
```

The reason is that, in the old implementation (introduced in 7.0), FUNCTION FLUSH
use lua_unref to remove the script from lua VM. lua_unref does not trigger the gc,
it causes us to not be able to effectively reclaim memory after the FUNCTION FLUSH.

The other issue is that, since we don't re-create the lua VM in FUNCTION FLUSH,
loading the functions during a FUNCTION FLUSH ASYNC will result a crash because
lua engine state is not thread-safe.

The correct solution is to re-create a new Lua VM to use, just like SCRIPT FLUSH.

Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin force-pushed the function_flush_async branch from 3454da4 to ae694e0 Compare March 6, 2025 11:59
@enjoy-binbin
Copy link
Member Author

enjoy-binbin commented Mar 6, 2025

This problem has existed since 7.0, which means it affects our 7.2 and 8.0. So should we backport it? This code cannot be backported due to the refactoring related to the script module. If we want to backport it, i can re-write the code based on the old branchs.

@codecov
Copy link

codecov bot commented Mar 6, 2025

Codecov Report

❌ Patch coverage is 94.11765% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.55%. Comparing base (95154fe) to head (fafffa8).
⚠️ Report is 1 commits behind head on unstable.

Files with missing lines Patch % Lines
src/scripting_engine.c 62.50% 3 Missing ⚠️
src/functions.c 96.87% 1 Missing ⚠️
src/replication.c 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1826      +/-   ##
============================================
- Coverage     72.56%   72.55%   -0.02%     
============================================
  Files           128      128              
  Lines         71278    71301      +23     
============================================
+ Hits          51725    51733       +8     
- Misses        19553    19568      +15     
Files with missing lines Coverage Δ
src/db.c 93.24% <100.00%> (ø)
src/eval.c 89.23% <100.00%> (ø)
src/lazyfree.c 92.56% <100.00%> (+6.17%) ⬆️
src/lua/engine_lua.c 88.12% <100.00%> (-0.22%) ⬇️
src/lua/function_lua.c 98.24% <100.00%> (-0.46%) ⬇️
src/server.h 100.00% <ø> (ø)
src/functions.c 94.52% <96.87%> (+0.20%) ⬆️
src/replication.c 85.96% <0.00%> (-0.08%) ⬇️
src/scripting_engine.c 73.43% <62.50%> (-1.36%) ⬇️

... and 12 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@rjd15372
Copy link
Member

rjd15372 commented Mar 6, 2025

@enjoy-binbin thanks for fixing these bugs! I'll do a review later today.

Copy link
Member

@rjd15372 rjd15372 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the problem of the struct used to represent the compiled function that I mentioned in the inline comment, everything else looks good.

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Mar 10, 2025
@madolson madolson moved this to In Progress in Valkey 8.1 Mar 17, 2025
@ranshid ranshid moved this to To be backported in Valkey 8.0 Mar 18, 2025
@ranshid ranshid moved this from In Progress to To be backported in Valkey 8.1 Mar 18, 2025
@ranshid ranshid moved this from To be backported to In Progress in Valkey 8.1 Mar 18, 2025
@ranshid ranshid moved this to To be backported in Valkey 7.2 Mar 18, 2025
@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Mar 19, 2025
@madolson madolson self-assigned this Apr 7, 2025
@zuiderkwast zuiderkwast moved this from To be backported to In Progress in Valkey 7.2 Apr 15, 2025
@zuiderkwast zuiderkwast moved this from To be backported to In Progress in Valkey 8.0 Apr 15, 2025
@zuiderkwast
Copy link
Contributor

@madolson You assigned yourself to this PR. Will you review it so we can include it in the patch releases?

@madolson madolson removed their assignment Jul 21, 2025
@madolson madolson moved this to In Progress in Valkey 9.0 Jul 21, 2025
@rjd15372 rjd15372 added major-decision-approved Major decision approved by TSC team and removed major-decision-pending Major decision pending by TSC team labels Oct 17, 2025
@rjd15372 rjd15372 dismissed their stale review October 17, 2025 09:20

I've made the changes that I requested in my review.

@rjd15372 rjd15372 merged commit b4c93cc into valkey-io:unstable Oct 17, 2025
58 of 60 checks passed
@github-project-automation github-project-automation bot moved this from Todo to To be backported in Valkey 7.2 Oct 17, 2025
@github-project-automation github-project-automation bot moved this from Todo to To be backported in Valkey 8.0 Oct 17, 2025
@github-project-automation github-project-automation bot moved this from Todo to To be backported in Valkey 8.1 Oct 17, 2025
@github-project-automation github-project-automation bot moved this from In Progress to To be backported in Valkey 9.0 Oct 17, 2025
@zuiderkwast
Copy link
Contributor

Note: For backporting to 8.0 and 7.2, we need separate PRs. Cherry-picking will not work for this one.

cherukum-Amazon pushed a commit to cherukum-Amazon/valkey that referenced this pull request Oct 17, 2025
…load crash (valkey-io#1826)

There will be two issues in this test:
```
test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}
```

The first one is the test output, we can see that after executing
FUNCTION FLUSH,
used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576
```

The second one is there is a crash when loading the functions during the
async
flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
 # valkey 255.255.255 crashed by signal: 11, si_code: 2
 # Accessing address: 0xe0429b7100000a3c
 # Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                       0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib            0x000000018b066584 _sigtramp + 56
1   valkey-server                       0x0000000102e01054 luaD_precall + 96
2   valkey-server                       0x0000000102e01b10 luaD_call + 104
3   valkey-server                       0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                       0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                       0x0000000102dfc630 lua_pcall + 300
6   valkey-server                       0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                       0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                       0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                       0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                       0x0000000102e3978c call + 416
11  valkey-server                       0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                       0x0000000102e563cc processInputBuffer + 520
13  valkey-server                       0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                       0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                       0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                       0x0000000102e4aad0 main + 26412
17  dyld                                0x000000018acab154 start + 2476

------ STACK TRACE DONE ------
```

The reason is that, in the old implementation (introduced in 7.0),
FUNCTION FLUSH
use lua_unref to remove the script from lua VM. lua_unref does not
trigger the gc,
it causes us to not be able to effectively reclaim memory after the
FUNCTION FLUSH.

The other issue is that, since we don't re-create the lua VM in FUNCTION
FLUSH,
loading the functions during a FUNCTION FLUSH ASYNC will result a crash
because
lua engine state is not thread-safe.

The correct solution is to re-create a new Lua VM to use, just like
SCRIPT FLUSH.

---------

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ricardo Dias <[email protected]>
Co-authored-by: Ricardo Dias <[email protected]>
(cherry picked from commit b4c93cc)
Signed-off-by: cherukum-amazon <[email protected]>
@enjoy-binbin enjoy-binbin deleted the function_flush_async branch October 18, 2025 08:16
cherukum-Amazon pushed a commit to cherukum-Amazon/valkey that referenced this pull request Oct 19, 2025
…load crash (valkey-io#1826)

There will be two issues in this test:
```
test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}
```

The first one is the test output, we can see that after executing
FUNCTION FLUSH,
used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576
```

The second one is there is a crash when loading the functions during the
async
flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
 # valkey 255.255.255 crashed by signal: 11, si_code: 2
 # Accessing address: 0xe0429b7100000a3c
 # Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                       0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib            0x000000018b066584 _sigtramp + 56
1   valkey-server                       0x0000000102e01054 luaD_precall + 96
2   valkey-server                       0x0000000102e01b10 luaD_call + 104
3   valkey-server                       0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                       0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                       0x0000000102dfc630 lua_pcall + 300
6   valkey-server                       0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                       0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                       0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                       0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                       0x0000000102e3978c call + 416
11  valkey-server                       0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                       0x0000000102e563cc processInputBuffer + 520
13  valkey-server                       0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                       0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                       0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                       0x0000000102e4aad0 main + 26412
17  dyld                                0x000000018acab154 start + 2476

------ STACK TRACE DONE ------
```

The reason is that, in the old implementation (introduced in 7.0),
FUNCTION FLUSH
use lua_unref to remove the script from lua VM. lua_unref does not
trigger the gc,
it causes us to not be able to effectively reclaim memory after the
FUNCTION FLUSH.

The other issue is that, since we don't re-create the lua VM in FUNCTION
FLUSH,
loading the functions during a FUNCTION FLUSH ASYNC will result a crash
because
lua engine state is not thread-safe.

The correct solution is to re-create a new Lua VM to use, just like
SCRIPT FLUSH.

---------

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ricardo Dias <[email protected]>
Co-authored-by: Ricardo Dias <[email protected]>
(cherry picked from commit b4c93cc)
Signed-off-by: cherukum-amazon <[email protected]>
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Oct 19, 2025
zuiderkwast pushed a commit that referenced this pull request Oct 20, 2025
This was introduced in #1826. This create an `Uninitialised value was
created by a heap allocation` in the CI.

Signed-off-by: Binbin <[email protected]>
@zuiderkwast
Copy link
Contributor

I think we should include this one in the release notes but the title is very hard to understand for users.

Can we come up with a better title to use in release notes? How about Fix Lua VM crash after FUNCTION FLUSH ASYNC + FUNCTION LOAD ?

cherukum-Amazon pushed a commit to cherukum-Amazon/valkey that referenced this pull request Oct 21, 2025
…load crash (valkey-io#1826)

There will be two issues in this test:
```
test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}
```

The first one is the test output, we can see that after executing
FUNCTION FLUSH,
used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576
```

The second one is there is a crash when loading the functions during the
async
flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
 # valkey 255.255.255 crashed by signal: 11, si_code: 2
 # Accessing address: 0xe0429b7100000a3c
 # Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                       0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib            0x000000018b066584 _sigtramp + 56
1   valkey-server                       0x0000000102e01054 luaD_precall + 96
2   valkey-server                       0x0000000102e01b10 luaD_call + 104
3   valkey-server                       0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                       0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                       0x0000000102dfc630 lua_pcall + 300
6   valkey-server                       0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                       0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                       0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                       0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                       0x0000000102e3978c call + 416
11  valkey-server                       0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                       0x0000000102e563cc processInputBuffer + 520
13  valkey-server                       0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                       0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                       0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                       0x0000000102e4aad0 main + 26412
17  dyld                                0x000000018acab154 start + 2476

------ STACK TRACE DONE ------
```

The reason is that, in the old implementation (introduced in 7.0),
FUNCTION FLUSH
use lua_unref to remove the script from lua VM. lua_unref does not
trigger the gc,
it causes us to not be able to effectively reclaim memory after the
FUNCTION FLUSH.

The other issue is that, since we don't re-create the lua VM in FUNCTION
FLUSH,
loading the functions during a FUNCTION FLUSH ASYNC will result a crash
because
lua engine state is not thread-safe.

The correct solution is to re-create a new Lua VM to use, just like
SCRIPT FLUSH.

---------

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ricardo Dias <[email protected]>
Co-authored-by: Ricardo Dias <[email protected]>
(cherry picked from commit b4c93cc)
Signed-off-by: cherukum-amazon <[email protected]>
cherukum-Amazon pushed a commit to cherukum-Amazon/valkey that referenced this pull request Oct 21, 2025
This was introduced in valkey-io#1826. This create an `Uninitialised value was
created by a heap allocation` in the CI.

Signed-off-by: Binbin <[email protected]>
(cherry picked from commit 5d3cb3d)
Signed-off-by: cherukum-amazon <[email protected]>
diego-ciciani01 pushed a commit to diego-ciciani01/valkey that referenced this pull request Oct 21, 2025
…load crash (valkey-io#1826)

There will be two issues in this test:
```
test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}
```

The first one is the test output, we can see that after executing
FUNCTION FLUSH,
used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576
```

The second one is there is a crash when loading the functions during the
async
flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
 # valkey 255.255.255 crashed by signal: 11, si_code: 2
 # Accessing address: 0xe0429b7100000a3c
 # Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                       0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib            0x000000018b066584 _sigtramp + 56
1   valkey-server                       0x0000000102e01054 luaD_precall + 96
2   valkey-server                       0x0000000102e01b10 luaD_call + 104
3   valkey-server                       0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                       0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                       0x0000000102dfc630 lua_pcall + 300
6   valkey-server                       0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                       0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                       0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                       0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                       0x0000000102e3978c call + 416
11  valkey-server                       0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                       0x0000000102e563cc processInputBuffer + 520
13  valkey-server                       0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                       0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                       0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                       0x0000000102e4aad0 main + 26412
17  dyld                                0x000000018acab154 start + 2476

------ STACK TRACE DONE ------
```

The reason is that, in the old implementation (introduced in 7.0),
FUNCTION FLUSH
use lua_unref to remove the script from lua VM. lua_unref does not
trigger the gc,
it causes us to not be able to effectively reclaim memory after the
FUNCTION FLUSH.

The other issue is that, since we don't re-create the lua VM in FUNCTION
FLUSH,
loading the functions during a FUNCTION FLUSH ASYNC will result a crash
because
lua engine state is not thread-safe.

The correct solution is to re-create a new Lua VM to use, just like
SCRIPT FLUSH.

---------

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ricardo Dias <[email protected]>
Co-authored-by: Ricardo Dias <[email protected]>
diego-ciciani01 pushed a commit to diego-ciciani01/valkey that referenced this pull request Oct 21, 2025
This was introduced in valkey-io#1826. This create an `Uninitialised value was
created by a heap allocation` in the CI.

Signed-off-by: Binbin <[email protected]>
zuiderkwast pushed a commit to cherukum-Amazon/valkey that referenced this pull request Oct 21, 2025
This was introduced in valkey-io#1826. This create an `Uninitialised value was
created by a heap allocation` in the CI.

Signed-off-by: Binbin <[email protected]>
(cherry picked from commit 5d3cb3d)
Signed-off-by: cherukum-amazon <[email protected]>
madolson pushed a commit that referenced this pull request Oct 21, 2025
…load crash (#1826)

There will be two issues in this test:
```
test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}
```

The first one is the test output, we can see that after executing
FUNCTION FLUSH,
used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576
```

The second one is there is a crash when loading the functions during the
async
flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
 # valkey 255.255.255 crashed by signal: 11, si_code: 2
 # Accessing address: 0xe0429b7100000a3c
 # Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                       0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib            0x000000018b066584 _sigtramp + 56
1   valkey-server                       0x0000000102e01054 luaD_precall + 96
2   valkey-server                       0x0000000102e01b10 luaD_call + 104
3   valkey-server                       0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                       0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                       0x0000000102dfc630 lua_pcall + 300
6   valkey-server                       0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                       0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                       0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                       0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                       0x0000000102e3978c call + 416
11  valkey-server                       0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                       0x0000000102e563cc processInputBuffer + 520
13  valkey-server                       0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                       0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                       0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                       0x0000000102e4aad0 main + 26412
17  dyld                                0x000000018acab154 start + 2476

------ STACK TRACE DONE ------
```

The reason is that, in the old implementation (introduced in 7.0),
FUNCTION FLUSH
use lua_unref to remove the script from lua VM. lua_unref does not
trigger the gc,
it causes us to not be able to effectively reclaim memory after the
FUNCTION FLUSH.

The other issue is that, since we don't re-create the lua VM in FUNCTION
FLUSH,
loading the functions during a FUNCTION FLUSH ASYNC will result a crash
because
lua engine state is not thread-safe.

The correct solution is to re-create a new Lua VM to use, just like
SCRIPT FLUSH.

---------

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ricardo Dias <[email protected]>
Co-authored-by: Ricardo Dias <[email protected]>
(cherry picked from commit b4c93cc)
Signed-off-by: cherukum-amazon <[email protected]>
madolson pushed a commit that referenced this pull request Oct 21, 2025
This was introduced in #1826. This create an `Uninitialised value was
created by a heap allocation` in the CI.

Signed-off-by: Binbin <[email protected]>
(cherry picked from commit 5d3cb3d)
Signed-off-by: cherukum-amazon <[email protected]>
@cherukum-Amazon cherukum-Amazon moved this from To be backported to Done in Valkey 9.0 Nov 3, 2025
@enjoy-binbin enjoy-binbin changed the title FUNCTION FLUSH re-create lua VM, fix flush not gc, fix flush async + load crash Fix Lua VM crash after FUNCTION FLUSH ASYNC + FUNCTION LOAD Nov 13, 2025
zhijun42 pushed a commit to zhijun42/valkey that referenced this pull request Nov 15, 2025
…load crash (valkey-io#1826)

There will be two issues in this test:
```
test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}
```

The first one is the test output, we can see that after executing
FUNCTION FLUSH,
used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576
```

The second one is there is a crash when loading the functions during the
async
flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
 # valkey 255.255.255 crashed by signal: 11, si_code: 2
 # Accessing address: 0xe0429b7100000a3c
 # Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                       0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib            0x000000018b066584 _sigtramp + 56
1   valkey-server                       0x0000000102e01054 luaD_precall + 96
2   valkey-server                       0x0000000102e01b10 luaD_call + 104
3   valkey-server                       0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                       0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                       0x0000000102dfc630 lua_pcall + 300
6   valkey-server                       0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                       0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                       0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                       0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                       0x0000000102e3978c call + 416
11  valkey-server                       0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                       0x0000000102e563cc processInputBuffer + 520
13  valkey-server                       0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                       0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                       0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                       0x0000000102e4aad0 main + 26412
17  dyld                                0x000000018acab154 start + 2476

------ STACK TRACE DONE ------
```

The reason is that, in the old implementation (introduced in 7.0),
FUNCTION FLUSH
use lua_unref to remove the script from lua VM. lua_unref does not
trigger the gc,
it causes us to not be able to effectively reclaim memory after the
FUNCTION FLUSH.

The other issue is that, since we don't re-create the lua VM in FUNCTION
FLUSH,
loading the functions during a FUNCTION FLUSH ASYNC will result a crash
because
lua engine state is not thread-safe.

The correct solution is to re-create a new Lua VM to use, just like
SCRIPT FLUSH.

---------

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ricardo Dias <[email protected]>
Co-authored-by: Ricardo Dias <[email protected]>
zhijun42 pushed a commit to zhijun42/valkey that referenced this pull request Nov 15, 2025
This was introduced in valkey-io#1826. This create an `Uninitialised value was
created by a heap allocation` in the CI.

Signed-off-by: Binbin <[email protected]>
zhijun42 pushed a commit to zhijun42/valkey that referenced this pull request Nov 15, 2025
…load crash (valkey-io#1826)

There will be two issues in this test:
```
test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}
```

The first one is the test output, we can see that after executing
FUNCTION FLUSH,
used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576
```

The second one is there is a crash when loading the functions during the
async
flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
 # valkey 255.255.255 crashed by signal: 11, si_code: 2
 # Accessing address: 0xe0429b7100000a3c
 # Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                       0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib            0x000000018b066584 _sigtramp + 56
1   valkey-server                       0x0000000102e01054 luaD_precall + 96
2   valkey-server                       0x0000000102e01b10 luaD_call + 104
3   valkey-server                       0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                       0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                       0x0000000102dfc630 lua_pcall + 300
6   valkey-server                       0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                       0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                       0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                       0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                       0x0000000102e3978c call + 416
11  valkey-server                       0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                       0x0000000102e563cc processInputBuffer + 520
13  valkey-server                       0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                       0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                       0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                       0x0000000102e4aad0 main + 26412
17  dyld                                0x000000018acab154 start + 2476

------ STACK TRACE DONE ------
```

The reason is that, in the old implementation (introduced in 7.0),
FUNCTION FLUSH
use lua_unref to remove the script from lua VM. lua_unref does not
trigger the gc,
it causes us to not be able to effectively reclaim memory after the
FUNCTION FLUSH.

The other issue is that, since we don't re-create the lua VM in FUNCTION
FLUSH,
loading the functions during a FUNCTION FLUSH ASYNC will result a crash
because
lua engine state is not thread-safe.

The correct solution is to re-create a new Lua VM to use, just like
SCRIPT FLUSH.

---------

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ricardo Dias <[email protected]>
Co-authored-by: Ricardo Dias <[email protected]>
zhijun42 pushed a commit to zhijun42/valkey that referenced this pull request Nov 15, 2025
This was introduced in valkey-io#1826. This create an `Uninitialised value was
created by a heap allocation` in the CI.

Signed-off-by: Binbin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

major-decision-approved Major decision approved by TSC team release-notes This issue should get a line item in the release notes run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: To be backported
Status: To be backported
Status: To be backported
Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants