llama-swap should never need to be restarted due to upstream issues #155
Conversation
- remove Process.shutdownCtx, Process.shutdownCancel
- simplify logic by checking Process.CurrentState() instead of more complex channel management
""" WalkthroughA new Go program is introduced to test process termination behaviors using Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant MainProgram
participant ChildProcess
User->>MainProgram: Start program
MainProgram->>ChildProcess: Start child process (with/without SIGTERM ignore)
MainProgram->>MainProgram: Set up signal listener (SIGINT, SIGTERM)
User->>MainProgram: Send OS signal (SIGINT/SIGTERM)
MainProgram->>MainProgram: Cancel context (triggers custom cancel logic)
MainProgram->>ChildProcess: Send SIGTERM
alt Child ignores SIGTERM
MainProgram->>ChildProcess: Wait 500ms, then send SIGKILL
end
ChildProcess-->>MainProgram: Exit
MainProgram->>User: Print exit status and errors
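A minimal, self-contained sketch of this flow. The child command ("sleep"), the 500ms delay, and the log strings are illustrative stand-ins; the real test program is misc/process-cmd-test/main.go.

package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Child process; substitute a binary that ignores SIGTERM to exercise WaitDelay.
	cmd := exec.CommandContext(ctx, "sleep", "30")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	// Custom cancel: send SIGTERM instead of the default SIGKILL when ctx is canceled.
	cmd.Cancel = func() error {
		fmt.Println("Cancel() called, sending SIGTERM")
		return cmd.Process.Signal(syscall.SIGTERM)
	}
	// If the child ignores SIGTERM, Wait() sends SIGKILL after this delay.
	cmd.WaitDelay = 500 * time.Millisecond

	if err := cmd.Start(); err != nil {
		fmt.Println("start failed:", err)
		return
	}

	// Forward SIGINT/SIGTERM from the OS into a context cancellation.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-sigs
		cancel()
	}()

	err := cmd.Wait()
	fmt.Printf("child exited, state=%v err=%v\n", cmd.ProcessState, err)
}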
Possibly related PRs
Actionable comments posted: 2
🔭 Outside diff range comments (3)
proxy/process.go (3)
139-139: 🛠️ Refactor suggestion
⚠️ Potential issue: Remove StateFailed transitions per PR objectives
The PR objectives require removing StateFailed to ensure automatic retries. However, the code still transitions to StateFailed in multiple places:
- Line 139: StateStarting → StateFailed transition
- Lines 203-209: Command start failure → StateFailed
- Lines 269-273: Health check timeout → StateFailed
- Lines 277-281: Command exit with error → StateFailed
- Lines 284-288: Command exit without error → StateFailed
These should transition to StateStopping instead to align with the PR's reliability goals.
Also applies to: 203-209, 269-273, 277-281, 284-288
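A hypothetical, self-contained sketch of the retry-friendly state machine this comment is asking for. The constant names mirror proxy/process.go, but the transition helper below is illustrative only, not the project's code.

package main

import "fmt"

type ProcessState string

const (
	StateStopped  ProcessState = "stopped"
	StateStarting ProcessState = "starting"
	StateReady    ProcessState = "ready"
	StateStopping ProcessState = "stopping"
	StateShutdown ProcessState = "shutdown"
	// StateFailed is intentionally absent: failures are no longer terminal.
)

// onStartFailure models the suggested behavior for a health-check timeout or a
// command that exits during startup: move toward StateStopping/StateStopped so
// the next request can start the process again.
func onStartFailure(cur ProcessState) ProcessState {
	if cur == StateStarting {
		return StateStopping // the exit handler would later swap StateStopping -> StateStopped
	}
	return cur
}

func main() {
	fmt.Println(onStartFailure(StateStarting)) // "stopping": recoverable, retried on the next request
}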
394-394: ⚠️ Potential issue: Use swapState() for thread-safe state transitions
Directly assigning to p.state bypasses thread safety and state transition validation. This could lead to race conditions.
- // just force it to this state since there is no recovery from shutdown
- p.state = StateShutdown
+ if curState, err := p.swapState(StateStopping, StateShutdown); err != nil {
+     p.proxyLogger.Warnf("<%s> Shutdown() failed to transition to StateShutdown: %v, current state: %v", p.ID, err, curState)
+ }
28-28: 💡 Verification agent 🧩 Analysis chain
StateFailed should be removed according to PR objectives
The PR objectives state that the StateFailed status should be removed so that upstream processes will always be retried automatically. However, this constant is still defined in the code.
🏁 Script executed:
#!/bin/bash
# Description: Check for remaining references to StateFailed in the codebase

# Search for StateFailed usage
ast-grep --pattern 'StateFailed'

# Also check with ripgrep for any string references
rg "StateFailed" -A 2 -B 2
Length of output: 6796
Remove all StateFailed references as per PR objectives
We still have numerous definitions and usages of StateFailed that contradict the goal of always retrying upstream processes. Please remove StateFailed (constant, state-machine transitions, error handling, and tests) and adjust the logic to skip any “failed” state.
Key locations to update:
- proxy/process.go
  - Line 28: StateFailed constant definition
  - Lines 139–144: isValidTransition cases including StateFailed
  - Swap-state calls around lines 203, 269, 277, 284, 362–363 for transitions into/out of StateFailed
  - Health-check and exit-error branches (lines 262–265, 286–295) that swap to StateFailed
  - Stop logic at lines 362–365 and request handling at lines 507–510
- proxy/proxymanager.go
  - Case for StateFailed in status-string mapping (around line 351)
- proxy/process_test.go
  - All test cases expecting StateFailed transitions or current state
Ensure the state machine, logging, HTTP responses, and tests are updated so that failures immediately trigger retry behavior instead of a “failed” terminal state.
🧹 Nitpick comments (1)
misc/process-cmd-test/main.go (1)
23-24: Consider orphaned process handling in production code
The test reveals that child processes become orphaned when the parent is killed with SIGKILL. This could lead to resource leaks in production if llama-swap crashes unexpectedly.
Consider implementing a process group or using cmd.SysProcAttr with Setpgid to ensure child processes are cleaned up even when the parent crashes. This would improve the reliability goals mentioned in the PR objectives.
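A rough Unix-only sketch of that idea using standard-library fields (whether llama-swap should adopt it is left open by the review). Setpgid puts the child and its descendants into their own process group, so a supervisor can signal the whole tree at once instead of leaving grandchildren orphaned.

package main

import (
	"os/exec"
	"syscall"
)

// startInProcessGroup launches the child in a new process group whose PGID
// equals the child's PID (Unix-only; Windows uses different SysProcAttr fields).
func startInProcessGroup(name string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

// killProcessGroup signals every process in the child's group by passing a
// negative PID to kill(2).
func killProcessGroup(cmd *exec.Cmd) error {
	return syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
}

func main() {
	cmd, err := startInProcessGroup("sleep", "60")
	if err != nil {
		panic(err)
	}
	_ = killProcessGroup(cmd) // demonstration only
	_ = cmd.Wait()
}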
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- misc/process-cmd-test/main.go (1 hunks)
- proxy/process.go (1 hunks)
🧰 Additional context used
🪛 golangci-lint (1.64.8)
misc/process-cmd-test/main.go
52-52: Error return value of cmd.Process.Signal is not checked
(errcheck)
cmd.Cancel = func() error {
	fmt.Println("✔︎ Cancel() called, sending SIGTERM")
	cmd.Process.Signal(syscall.SIGTERM)
	return nil
}
Check error return value from Signal()
The error return value from cmd.Process.Signal should be checked, even in test code, to help debug signal delivery issues.
cmd.Cancel = func() error {
fmt.Println("✔︎ Cancel() called, sending SIGTERM")
- cmd.Process.Signal(syscall.SIGTERM)
- return nil
+ if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
+ fmt.Printf("Failed to send SIGTERM: %v\n", err)
+ return err
+ }
+ return nil
}
🧰 Tools
🪛 golangci-lint (1.64.8)
52-52: Error return value of cmd.Process.Signal is not checked
(errcheck)
Actionable comments posted: 0
♻️ Duplicate comments (1)
misc/process-cmd-test/main.go (1)
51-54: ⚠️ Potential issue: Check error return value from Signal()
The error return value from cmd.Process.Signal should be checked, even in test code, to help debug signal delivery issues.
 cmd.Cancel = func() error {
 	fmt.Println("✔︎ Cancel() called, sending SIGTERM")
-	cmd.Process.Signal(syscall.SIGTERM)
+	if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
+		fmt.Printf("Failed to send SIGTERM: %v\n", err)
+		return err
+	}
 	return nil
🧰 Tools
🪛 golangci-lint (1.64.8)
53-53: Error return value of cmd.Process.Signal is not checked (errcheck)
🧹 Nitpick comments (1)
misc/process-cmd-test/main.go (1)
40-44: Make the test binary path configurable for cross-platform compatibility.
The hardcoded path ../../build/simple-responder_darwin_arm64 is platform-specific. Consider making it configurable via command-line arguments or environment variables to support testing on different platforms.
+import "flag"
+
+var binaryPath = flag.String("binary", "../../build/simple-responder_darwin_arm64", "Path to the test binary")
+
 func main() {
+	flag.Parse()
 	// swap between these to use kill -9 <pid> on the cli to sim external crash
 	ctx, cancel := context.WithCancel(context.Background())
 	//ctx, cancel := context.WithTimeout(context.Background(), 1000*time.Millisecond)
 	defer cancel()

 	//cmd := exec.CommandContext(ctx, "sleep", "1")
 	cmd := exec.CommandContext(ctx,
-		"../../build/simple-responder_darwin_arm64",
+		*binaryPath,
 		//"-ignore-sig-term", /* so it doesn't exit on receiving SIGTERM, test cmd.WaitTimeout */
 	)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- misc/process-cmd-test/main.go (1 hunks)
- proxy/process.go (9 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
proxy/process.go (1)
proxy/config.go (1)
SanitizeCommand (297-330)
🪛 golangci-lint (1.64.8)
misc/process-cmd-test/main.go
53-53: Error return value of cmd.Process.Signal is not checked
(errcheck)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: run-tests
🔇 Additional comments (7)
misc/process-cmd-test/main.go (1)
68-91: Well-structured test program for process lifecycle management.
The main execution flow properly handles command lifecycle, signal handling, and provides clear logging for debugging different termination scenarios.
proxy/process.go (6)
46-78: Good migration path with clear deprecation markers.
The addition of cancelUpstream and clear documentation of fields to be removed provides a clean migration path to exec.CommandContext.
201-226: Correct implementation of exec.CommandContext with proper lifecycle management.
The command setup properly configures cancellation context, custom Cancel function, and wait delay. The waitForCmd() goroutine ensures proper handling of command exit.
392-428: Clean refactoring to use context cancellation.
The simplified stopCommand() properly uses the new cancellation mechanism and handles various exit scenarios comprehensively.
544-565: Well-designed command exit handler with proper race condition handling.
The waitForCmd() method correctly handles command exit, state transitions, and race conditions with SIGKILL.
373-388: Excellent simplification of stop and shutdown logic.
The refactored methods are much cleaner and rely on the context cancellation mechanism, aligning perfectly with the PR objectives.
567-609: ✅ Verification successful
Comprehensive graceful shutdown implementation with platform-specific handling.
The cmdStopUpstreamProcess() method properly handles:
- Windows-specific stop commands with taskkill
- Custom stop command execution with PID substitution
- Fallback to SIGTERM for Unix systems
- Proper error handling and logging
Let me verify that the SanitizeCommand function properly handles the PID-substituted command string:
🏁 Script executed:
#!/bin/bash
# Description: Verify SanitizeCommand usage and implementation

# Test: Check if SanitizeCommand is defined and handles command strings properly
ast-grep --pattern 'func SanitizeCommand($_) $$$'
Length of output: 1587
Review Approved: cmdStopUpstreamProcess() and SanitizeCommand Verified
The cmdStopUpstreamProcess() method correctly handles:
- Windows default (taskkill /f /t /pid ${PID}) and custom stop commands
- PID substitution followed by proper sanitization via SanitizeCommand (proxy/config.go:297–330)
- OS-specific argument splitting (Posix vs. Windows)
- Fallback to SIGTERM on Unix
- Thorough error handling and logging
No further changes are required.
Actionable comments posted: 0
♻️ Duplicate comments (1)
proxy/process.go (1)
260-260: Error message could be more accurate.
The error message "health check interrupted due to shutdown" may be misleading since the health check can be interrupted when transitioning to states other than shutdown (e.g., when the upstream exits and transitions to StateStopped).
🧹 Nitpick comments (2)
proxy/process.go (2)
532-535: Consider logging when forcing StateStopped.
While forcing StateStopped ensures the process ends in a recoverable state, the current implementation at line 534 could mask unexpected state transitions. Consider using swapState with appropriate error handling or at least log at ERROR level when this forced transition occurs to aid debugging.
540-581: Well-implemented graceful stop handler.
The cmdStopUpstreamProcess method provides excellent flexibility:
- Supports custom stop commands with PID substitution
- Falls back to SIGTERM for standard cases
- Good Windows support with default taskkill command
- Proper error handling and logging
Consider adding a timeout for the custom stop command execution to prevent hanging if the stop command itself becomes unresponsive.
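One possible shape for that suggestion, sketched here with an assumed stopCmdTimeout value that is not taken from the PR:

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

const stopCmdTimeout = 5 * time.Second // illustrative value, not from the PR

// runStopCommand bounds the custom stop command's runtime: CommandContext kills
// the stop command itself if it outlives the timeout, so a wedged stop command
// cannot block shutdown indefinitely.
func runStopCommand(args []string) error {
	ctx, cancel := context.WithTimeout(context.Background(), stopCmdTimeout)
	defer cancel()

	stopCmd := exec.CommandContext(ctx, args[0], args[1:]...)
	if err := stopCmd.Run(); err != nil {
		return fmt.Errorf("stop command failed: %w", err)
	}
	return nil
}

func main() {
	if err := runStopCommand([]string{"sleep", "1"}); err != nil {
		fmt.Println(err)
	}
}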
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- proxy/process.go (9 hunks)
- proxy/process_test.go (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
proxy/process_test.go (1)
proxy/process.go (1)
StateStopped (22-22)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: run-tests
🔇 Additional comments (5)
proxy/process_test.go (1)
342-342: LGTM! State transition aligns with PR objectives.
The change from StateFailed to StateStopped is correct. This ensures that when an upstream process exits prematurely, it transitions to a recoverable state that allows automatic retry on the next request, which aligns with the PR's goal of improving reliability.
proxy/process.go (4)
46-78: Good refactoring approach with clear migration path.
The struct changes properly support the migration to exec.CommandContext. The temporary fields are clearly marked for removal, which helps maintain code clarity during the transition. The change from chan error to chan struct{} for cmdWaitChan is appropriate since it's now used purely for signaling command completion.
201-211: Excellent implementation of exec.CommandContext with custom cancellation.
The migration to exec.CommandContext is well-implemented:
- Custom Cancel function allows graceful shutdown via cmdStopUpstreamProcess
- WaitDelay provides time for cleanup before forceful termination
- Creating a fresh cmdWaitChan for each start prevents channel reuse issues
- Launching waitForCmd() as a separate goroutine properly handles command exit
This approach significantly simplifies the shutdown logic compared to manual signal handling.
Also applies to: 226-226
366-379: Excellent simplification of stopCommand().
The refactored method is much cleaner by leveraging exec.CommandContext:
- Removes complex manual signal handling
- Eliminates timeout management (now handled by cmd.WaitDelay)
- Good defensive programming with the nil check
- Clear and concise implementation
This is a significant improvement in code maintainability.
257-259: Critical change for automatic retry functionality.
This logic correctly handles the case where the upstream process exits cleanly during the health check. By checking for StateStopped and returning a descriptive error, the system can automatically retry the process on the next request instead of permanently failing. This is a key improvement for reliability.
Actionable comments posted: 3
♻️ Duplicate comments (2)
proxy/process.go (2)
248-254: Inconsistent with PR objectives and past review feedback.
Two issues here:
- The different handling for StateStopped vs other states contradicts the PR objective of always retrying upstream processes
- The error message "health check interrupted due to shutdown" was flagged in past reviews as misleading
 currentState := p.CurrentState()
 if currentState != StateStarting {
-	if currentState == StateStopped {
-		return fmt.Errorf("upstream command exited prematurely but successfully")
-	}
-	return errors.New("health check interrupted due to shutdown")
+	return fmt.Errorf("health check interrupted: process state changed to %s", currentState)
 }
207-216: Part of StateFailed inconsistency issue.
This error handling transitions to StateFailed when command start fails, which contradicts the PR objective of removing StateFailed to ensure automatic retries.
Consider updating this to transition to StateStopped instead, allowing for automatic retry on the next request.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- proxy/process.go (9 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: run-tests
🔇 Additional comments (4)
proxy/process.go (4)
46-50: LGTM! Struct field changes align with new cancellation approach.
The addition of cancelUpstream and conversion of cmdWaitChan to a simple notification channel properly supports the new exec.CommandContext pattern.
Also applies to: 85-85, 98-99
28-28: ⚠️ Potential issue: Critical inconsistency with PR objectives.
The PR objectives state that StateFailed should be removed so upstream processes are always retried automatically, but the code still references StateFailed in multiple locations. This contradicts the reliability improvements intended by this PR.
Consider removing all StateFailed references and updating the logic to retry failed processes automatically:
 const (
 	StateStopped  ProcessState = ProcessState("stopped")
 	StateStarting ProcessState = ProcessState("starting")
 	StateReady    ProcessState = ProcessState("ready")
 	StateStopping ProcessState = ProcessState("stopping")
-
-	// failed a health check on start and will not be recovered
-	StateFailed ProcessState = ProcessState("failed")

 	// process is shutdown and will not be restarted
 	StateShutdown ProcessState = ProcessState("shutdown")
 )
Update state transition logic and error handling to remove StateFailed transitions and ensure automatic retry behavior.
Also applies to: 140-140, 146-146, 209-209, 328-328, 404-404
Likely an incorrect or invalid review comment.
1-567: Good technical approach but critical alignment issues with PR objectives.
The refactor successfully implements the exec.CommandContext pattern and simplifies the stop logic as intended. However, the retention of StateFailed throughout the codebase directly contradicts the PR objective of ensuring upstream processes are always retried automatically.
The key goals of this PR are not fully met. Consider completing the StateFailed removal to achieve the reliability improvements described in the PR objectives.
524-566: ✅ Verification successful
LGTM! Well-implemented graceful stop mechanism.
The method properly handles both custom stop commands and default SIGTERM, with appropriate Windows support. The PID substitution and command sanitization appear secure.
Verify that the SanitizeCommand function properly handles the PID substitution to prevent command injection:
🏁 Script executed:
#!/bin/bash
# Description: Check SanitizeCommand implementation for security
# Expected: Should find proper input validation and sanitization
ast-grep --pattern 'func SanitizeCommand($$$) { $$$ }'
Length of output: 58
🏁 Script executed:
#!/bin/bash
# Locate and display the SanitizeCommand implementation for review
rg -n "SanitizeCommand" -C 5 .
Length of output: 5131
🏁 Script executed:
#!/bin/bash
# Display the SanitizeCommand implementation for security review
rg -n "func SanitizeCommand" -A 50 proxy/config.go
Length of output: 1044
Safe PID Substitution Confirmed
SanitizeCommand simply parses the command string into a slice of arguments (handling quotes, comments, backslashes) without invoking a shell, and ${PID} is replaced via fmt.Sprintf("%d", pid), ensuring it’s strictly numeric. Since we call exec.Command(args[0], args[1:]...), there is no shell interpolation or injection risk from the PID value.
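A condensed illustration of that reasoning. splitArgs below is a hypothetical stand-in for the real SanitizeCommand parser in proxy/config.go, which also handles quotes, comments, and backslashes.

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// splitArgs is an illustrative stand-in for SanitizeCommand.
func splitArgs(cmd string) []string {
	return strings.Fields(cmd)
}

// buildStopCommand formats the PID as a plain integer and passes the resulting
// argv directly to exec.Command, so no shell ever interprets the string.
func buildStopCommand(cmdStop string, pid int) *exec.Cmd {
	expanded := strings.ReplaceAll(cmdStop, "${PID}", fmt.Sprintf("%d", pid))
	args := splitArgs(expanded)
	return exec.Command(args[0], args[1:]...) // argv form, no /bin/sh involved
}

func main() {
	cmd := buildStopCommand("kill -TERM ${PID}", 12345)
	fmt.Println(cmd.Args) // [kill -TERM 12345]
}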
func (p *Process) stopCommand() {
	stopStartTime := time.Now()
	defer func() {
		p.proxyLogger.Debugf("<%s> stopCommand took %v", p.ID, time.Since(stopStartTime))
	}()

	sigtermTimeout, cancelTimeout := context.WithTimeout(context.Background(), sigtermTTL)
	defer cancelTimeout()

	if p.cmd == nil || p.cmd.Process == nil {
		p.proxyLogger.Debugf("<%s> cmd or cmd.Process is nil (normal during config reload)", p.ID)
	if p.cancelUpstream == nil {
		p.proxyLogger.Errorf("<%s> stopCommand has a nil p.cancelUpstream()", p.ID)
		return
	}

	// if err := p.terminateProcess(); err != nil {
	// 	p.proxyLogger.Debugf("<%s> Process already terminated: %v (normal during shutdown)", p.ID, err)
	// }

	// the default cmdStop to taskkill /f /t /pid ${PID}
	if runtime.GOOS == "windows" && strings.TrimSpace(p.config.CmdStop) == "" {
		p.config.CmdStop = "taskkill /f /t /pid ${PID}"
	}

	if p.config.CmdStop != "" {
		// replace ${PID} with the pid of the process
		stopArgs, err := SanitizeCommand(strings.ReplaceAll(p.config.CmdStop, "${PID}", fmt.Sprintf("%d", p.cmd.Process.Pid)))
		if err != nil {
			p.proxyLogger.Errorf("<%s> Failed to sanitize stop command: %v", p.ID, err)
			return
		}

		p.proxyLogger.Debugf("<%s> Executing stop command: %s", p.ID, strings.Join(stopArgs, " "))

		stopCmd := exec.Command(stopArgs[0], stopArgs[1:]...)
		stopCmd.Stdout = p.processLogger
		stopCmd.Stderr = p.processLogger
		stopCmd.Env = p.config.Env

		if err := stopCmd.Run(); err != nil {
			p.proxyLogger.Errorf("<%s> Failed to exec stop command: %v", p.ID, err)
			return
		}
	} else {
		if err := p.cmd.Process.Signal(syscall.SIGTERM); err != nil {
			p.proxyLogger.Errorf("<%s> Failed to send SIGTERM to process: %v", p.ID, err)
			return
		}
	}

	select {
	case <-sigtermTimeout.Done():
		p.proxyLogger.Debugf("<%s> Process timed out waiting to stop, sending KILL signal (normal during shutdown)", p.ID)
		p.upstreamWasStoppedWithKill = true
		if err := p.cmd.Process.Kill(); err != nil {
			p.proxyLogger.Errorf("<%s> Failed to kill process: %v", p.ID, err)
		}
	case err := <-p.cmdWaitChan:
		// Note: in start(), p.cmdWaitChan also has a select { ... }. That should be OK
		// because if we make it here then the cmd has been successfully running and made it
		// through the health check. There is a possibility that the cmd crashed after the health check
		// succeeded but that's not a case llama-swap is handling for now.
		if err != nil {
			if errno, ok := err.(syscall.Errno); ok {
				p.proxyLogger.Errorf("<%s> errno >> %v", p.ID, errno)
			} else if exitError, ok := err.(*exec.ExitError); ok {
				if strings.Contains(exitError.String(), "signal: terminated") {
					p.proxyLogger.Debugf("<%s> Process stopped OK", p.ID)
				} else if strings.Contains(exitError.String(), "signal: interrupt") {
					p.proxyLogger.Debugf("<%s> Process interrupted OK", p.ID)
				} else {
					p.proxyLogger.Warnf("<%s> ExitError >> %v, exit code: %d", p.ID, exitError, exitError.ExitCode())
				}
			} else {
				p.proxyLogger.Errorf("<%s> Process exited >> %v", p.ID, err)
			}
		}
	}
	p.cancelUpstream()
	<-p.cmdWaitChan
}
🛠️ Refactor suggestion
Potential indefinite blocking in stopCommand.
The method waits on cmdWaitChan without a timeout, which could block indefinitely if the command doesn't exit properly. Consider adding a timeout mechanism.
func (p *Process) stopCommand() {
stopStartTime := time.Now()
defer func() {
p.proxyLogger.Debugf("<%s> stopCommand took %v", p.ID, time.Since(stopStartTime))
}()
if p.cancelUpstream == nil {
p.proxyLogger.Errorf("<%s> stopCommand has a nil p.cancelUpstream()", p.ID)
return
}
p.cancelUpstream()
- <-p.cmdWaitChan
+ select {
+ case <-p.cmdWaitChan:
+ // Command exited normally
+ case <-time.After(p.gracefulStopTimeout + 5*time.Second):
+ p.proxyLogger.Warnf("<%s> stopCommand timed out waiting for command to exit", p.ID)
+ }
}
no this is fine as the p.cmdWaitChan is closed, it's just a signal channel.
@mostlygeek You're absolutely right! Thank you for the clarification. I see now that cmdWaitChan is indeed a signal channel that gets closed by waitForCmd() when cmd.Wait() returns. Reading from a closed channel returns immediately, so <-p.cmdWaitChan won't block indefinitely - it will unblock as soon as the command exits and the channel is closed.
I appreciate you pointing out this signal channel pattern. My suggestion was based on a misunderstanding of how the channel was being used.
(\ /)
( ._.)
o_(")(")
learned something new
✏️ Learnings added
Learnt from: mostlygeek
PR: mostlygeek/llama-swap#155
File: proxy/process.go:359-372
Timestamp: 2025-06-05T22:57:53.766Z
Learning: In Go, cmdWaitChan in proxy/process.go is a signal channel (chan struct{}) that follows the pattern where waitForCmd() closes the channel when cmd.Wait() returns, and stopCommand() waits on it. Reading from a closed channel returns immediately, so <-cmdWaitChan won't block indefinitely.
// waitForCmd waits for the command to exit and handles exit conditions depending on current state
func (p *Process) waitForCmd() {
	exitErr := p.cmd.Wait()
	p.proxyLogger.Debugf("<%s> cmd.Wait() returned error: %v", p.ID, exitErr)

	if exitErr != nil {
		if errno, ok := exitErr.(syscall.Errno); ok {
			p.proxyLogger.Errorf("<%s> errno >> %v", p.ID, errno)
		} else if exitError, ok := exitErr.(*exec.ExitError); ok {
			if strings.Contains(exitError.String(), "signal: terminated") {
				p.proxyLogger.Debugf("<%s> Process stopped OK", p.ID)
			} else if strings.Contains(exitError.String(), "signal: interrupt") {
				p.proxyLogger.Debugf("<%s> Process interrupted OK", p.ID)
			} else {
				p.proxyLogger.Warnf("<%s> ExitError >> %v, exit code: %d", p.ID, exitError, exitError.ExitCode())
			}
		} else {
			if exitErr.Error() != "context canceled" /* this is normal */ {
				p.proxyLogger.Errorf("<%s> Process exited >> %v", p.ID, exitErr)
			}
		}
	}

	currentState := p.CurrentState()
	switch currentState {
	case StateStopping:
		if curState, err := p.swapState(StateStopping, StateStopped); err != nil {
			p.proxyLogger.Errorf("<%s> Process exited but could not swap to StateStopped. curState=%s, err: %v", p.ID, curState, err)
			p.state = StateStopped
		}
	default:
		p.proxyLogger.Infof("<%s> process exited but not StateStopping, current state: %s", p.ID, currentState)
		p.state = StateStopped // force it to be in this state
	}
	close(p.cmdWaitChan)
}
Concurrency safety violation in state management.
Lines 515 and 519 directly assign to p.state without proper locking, breaking the concurrency safety pattern used elsewhere in the code.
switch currentState {
case StateStopping:
if curState, err := p.swapState(StateStopping, StateStopped); err != nil {
p.proxyLogger.Errorf("<%s> Process exited but could not swap to StateStopped. curState=%s, err: %v", p.ID, curState, err)
- p.state = StateStopped
+ // Force state transition with proper locking
+ p.stateMutex.Lock()
+ p.state = StateStopped
+ p.stateMutex.Unlock()
}
default:
p.proxyLogger.Infof("<%s> process exited but not StateStopping, current state: %s", p.ID, currentState)
- p.state = StateStopped // force it to be in this state
+ // Force state transition with proper locking
+ p.stateMutex.Lock()
+ p.state = StateStopped
+ p.stateMutex.Unlock()
}
cmdContext, ctxCancelUpstream := context.WithCancel(context.Background())
p.cmd = exec.CommandContext(cmdContext, args[0], args[1:]...)
p.cmd.Stdout = p.processLogger
p.cmd.Stderr = p.processLogger
p.cmd.Env = p.config.Env

p.cmd.Cancel = p.cmdStopUpstreamProcess
p.cmd.WaitDelay = p.gracefulStopTimeout
p.cancelUpstream = ctxCancelUpstream
p.cmdWaitChan = make(chan struct{})
🛠️ Refactor suggestion
Potential race condition with channel recreation.
Creating a new cmdWaitChan at line 203 could cause issues if other goroutines are waiting on the previous channel. Consider moving this to the beginning of the method or ensuring proper synchronization.
p.waitStarting.Add(1)
defer p.waitStarting.Done()
+ p.cmdWaitChan = make(chan struct{})
cmdContext, ctxCancelUpstream := context.WithCancel(context.Background())
p.cmd = exec.CommandContext(cmdContext, args[0], args[1:]...)
p.cmd.Stdout = p.processLogger
p.cmd.Stderr = p.processLogger
p.cmd.Env = p.config.Env
p.cmd.Cancel = p.cmdStopUpstreamProcess
p.cmd.WaitDelay = p.gracefulStopTimeout
p.cancelUpstream = ctxCancelUpstream
- p.cmdWaitChan = make(chan struct{})
Committable suggestion skipped: line range outside the PR's diff.
- removed StateFailed
- StateStarting --> StateStopped is OK
- updated tests
Nice! llama-server crashed and llama-swap automatically restarted it.
[attachment: llama-swap detecting the crash/exit and restarting it]
[attachment: llama-server crash logs]
When upstream processes fail, crash, or get stuck for some reason, the only way to resolve it is to restart llama-swap. Reliability is a goal, and having to restart llama-swap due to upstream process issues goes against that. Ideally, llama-swap should only need to be restarted for:
Since the Process management code is quite complex at this point, the top line goals are:
- Remove StateFailed so upstream processes will always be retried. Make it an operator task to resolve starting issues (ref: Failed state considered unrecoverable? #120)
Summary by CodeRabbit
New Features
Refactor
Bug Fixes