
Conversation

@mostlygeek mostlygeek (Owner) commented Jun 5, 2025

When upstream processes fail, crash, or get stuck, the only way to resolve the problem is to restart llama-swap. Reliability is a goal, and having to restart llama-swap because of upstream process issues works against that. Ideally, llama-swap should only need to be restarted for:

  • configuration changes
  • upgrading versions.

Since the Process management code is quite complex at this point, the top-line goals are:

  1. remove StateFailed so upstream processes will always be retried. Make it an operator task to resolve starting issues (ref: Failed state considered unrecoverable? #120)
  2. Simplify start/stop/shutdown with Go's built-in exec.CommandContext. This should reduce code as well (a minimal sketch of the pattern follows this list).
  3. Few or no changes to the current test suite.
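
For reference, a minimal sketch of the exec.CommandContext pattern from goal 2. This is illustrative only; the command, signal, and timeout choices are assumptions, not the llama-swap implementation:

package main

import (
	"context"
	"log"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	cmd := exec.CommandContext(ctx, "sleep", "60") // placeholder upstream command
	cmd.Cancel = func() error {
		// called when ctx is cancelled; try a graceful stop first
		return cmd.Process.Signal(syscall.SIGTERM)
	}
	cmd.WaitDelay = 5 * time.Second // Go sends SIGKILL if still alive after this

	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	cancel()                            // request shutdown via the context
	log.Printf("exit: %v", cmd.Wait()) // returns once the process is gone
}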

Summary by CodeRabbit

  • New Features

    • Introduced a new executable program for testing process termination scenarios, including handling of signals and process timeouts.
  • Refactor

    • Improved process management by integrating context-based cancellation for cleaner shutdown and stop operations.
    • Enhanced process lifecycle handling with dedicated wait routines and consolidated stop logic for better reliability.
    • Removed the failed state from process lifecycle, simplifying state transitions and request handling.
    • Streamlined health check and error logging for more straightforward process monitoring.
  • Bug Fixes

    • Updated process state handling to correctly reflect stopped status after upstream command exits prematurely but successfully.
    • Adjusted error responses for broken model configurations to improve clarity and consistency.

- remove Process.shutdownCtx, Process.shutdownCancel
- simplify logic by checking Process.CurrentState() instead of more complex channel management

coderabbitai bot commented Jun 5, 2025

"""

Walkthrough

A new Go program is introduced to test process termination behaviors using exec.CommandContext. Additionally, the shutdown context and cancellation mechanism are removed from the Process struct in the proxy, simplifying shutdown management to rely solely on process state checks and context cancellation with a custom cancel function.

Changes

File(s) — Change Summary

misc/process-cmd-test/main.go — Added a new Go program for testing process termination scenarios and signal handling with subprocesses.
proxy/process.go — Removed shutdown context/cancel fields and logic from Process; replaced the inline wait goroutine with a waitForCmd() method; introduced the cancelUpstream context cancel func and a custom Cancel for exec.CommandContext; refactored shutdown and stop logic to use context cancellation and process state checks; removed the StateFailed state and related logic.
proxy/process_test.go — Updated TestProcess_ExitInterruptsHealthCheck to expect StateStopped instead of StateFailed after the upstream command exits prematurely but successfully; removed tests involving StateFailed; adjusted error expectations in TestProcess_BrokenModelConfig.
proxy/proxymanager.go — Removed handling of StateFailed in the process state string representation, causing StateFailed to be shown as "Unknown".

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant MainProgram
    participant ChildProcess

    User->>MainProgram: Start program
    MainProgram->>ChildProcess: Start child process (with/without SIGTERM ignore)
    MainProgram->>MainProgram: Set up signal listener (SIGINT, SIGTERM)
    User->>MainProgram: Send OS signal (SIGINT/SIGTERM)
    MainProgram->>MainProgram: Cancel context (triggers custom cancel logic)
    MainProgram->>ChildProcess: Send SIGTERM
    alt Child ignores SIGTERM
        MainProgram->>ChildProcess: Wait 500ms, then send SIGKILL
    end
    ChildProcess-->>MainProgram: Exit
    MainProgram->>User: Print exit status and errors


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f68dd6 and df36c41.

📒 Files selected for processing (3)
  • proxy/process.go (12 hunks)
  • proxy/process_test.go (3 hunks)
  • proxy/proxymanager.go (0 hunks)
💤 Files with no reviewable changes (1)
  • proxy/proxymanager.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • proxy/process_test.go
🧰 Additional context used
🧠 Learnings (1)
proxy/process.go (1)
Learnt from: mostlygeek
PR: mostlygeek/llama-swap#155
File: proxy/process.go:359-372
Timestamp: 2025-06-05T22:57:53.766Z
Learning: In Go, cmdWaitChan in proxy/process.go is a signal channel (chan struct{}) that follows the pattern where waitForCmd() closes the channel when cmd.Wait() returns, and stopCommand() waits on it. Reading from a closed channel returns immediately, so <-cmdWaitChan won't block indefinitely.
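
A runnable sketch of that signal-channel pattern, with hypothetical names mirroring cmdWaitChan (the command and prints are placeholders):

package main

import (
	"context"
	"fmt"
	"os/exec"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cmd := exec.CommandContext(ctx, "sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// signal channel: closed exactly once when cmd.Wait() returns
	cmdWaitChan := make(chan struct{})
	go func() { // plays the role of waitForCmd()
		fmt.Println("cmd.Wait() returned:", cmd.Wait())
		close(cmdWaitChan)
	}()

	cancel()      // plays the role of cancelUpstream()
	<-cmdWaitChan // a receive on a closed channel never blocks
	fmt.Println("stopped")
}
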
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: run-tests
🔇 Additional comments (12)
proxy/process.go (12)

43-47: Good simplification of the cancellation mechanism.

The change from chan error to chan struct{} aligns with the signal channel pattern, and the introduction of cancelUpstream provides a cleaner cancellation mechanism using Go's context.


68-73: LGTM: Maintaining backward compatibility during migration.

The comments and field definitions look good for the transition period.


85-85: Constructor properly initializes new fields.

The initialization of cancelUpstream to nil and cmdWaitChan as a struct channel is correct for the new design.

Also applies to: 95-99


140-140: Correctly removes StateFailed from valid transitions.

The updated transition rule properly eliminates StateFailed while maintaining logical state flow.


208-210: Properly handles start failure without StateFailed.

The change to transition to StateStopped instead of StateFailed on start failure aligns with the PR objective to always retry processes.


219-219: Good refactoring to use dedicated waitForCmd method.

Moving the command waiting logic to a separate method improves code organization and testability.


302-303: Proper reset of failure count on successful start.

Resetting failedStartCount to 0 on successful start maintains proper retry tracking.


327-333: Clean simplification of StopImmediately method.

The streamlined logic with a single call to stopCommand() improves clarity and maintainability.


344-347: Appropriate simplification for shutdown logic.

The streamlined approach with direct state setting for the terminal StateShutdown state is correct.


351-364: Excellent simplification leveraging Go's cancellation mechanism.

The new implementation is much cleaner, using cancelUpstream() and the signal channel pattern. This aligns with the learning that cmdWaitChan is safe to wait on since it gets closed when the command exits.


396-396: Correctly removes StateFailed from request blocking conditions.

Removing StateFailed from the blocking states aligns with the PR objective to always retry processes instead of permanently failing them.


516-558: Comprehensive graceful stop implementation.

The method provides excellent graceful stopping functionality with proper Windows support, PID replacement, and error handling. This enhances the reliability of process management.


@mostlygeek mostlygeek self-assigned this Jun 5, 2025
@mostlygeek mostlygeek added the enhancement New feature or request label Jun 5, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (3)
proxy/process.go (3)

139-139: 🛠️ Refactor suggestion

⚠️ Potential issue

Remove StateFailed transitions per PR objectives

The PR objectives require removing StateFailed to ensure automatic retries. However, the code still transitions to StateFailed in multiple places:

  • Line 139: StateStarting → StateFailed transition
  • Lines 203-209: Command start failure → StateFailed
  • Lines 269-273: Health check timeout → StateFailed
  • Lines 277-281: Command exit with error → StateFailed
  • Lines 284-288: Command exit without error → StateFailed

These should transition to StateStopping instead to align with the PR's reliability goals.

Also applies to: 203-209, 269-273, 277-281, 284-288


394-394: ⚠️ Potential issue

Use swapState() for thread-safe state transitions

Directly assigning to p.state bypasses thread safety and state transition validation. This could lead to race conditions.

-	// just force it to this state since there is no recovery from shutdown
-	p.state = StateShutdown
+	if curState, err := p.swapState(StateStopping, StateShutdown); err != nil {
+		p.proxyLogger.Warnf("<%s> Shutdown() failed to transition to StateShutdown: %v, current state: %v", p.ID, err, curState)
+	}

28-28: 💡 Verification agent

🧩 Analysis chain

StateFailed should be removed according to PR objectives

The PR objectives state that StateFailed status should be removed so that upstream processes will always be retried automatically. However, this constant is still defined in the code.


🏁 Script executed:

#!/bin/bash
# Description: Check for remaining references to StateFailed in the codebase

# Search for StateFailed usage
ast-grep --pattern 'StateFailed'

# Also check with ripgrep for any string references
rg "StateFailed" -A 2 -B 2

Length of output: 6796


Remove all StateFailed references as per PR objectives

We still have numerous definitions and usages of StateFailed that contradict the goal of always retrying upstream processes. Please remove StateFailed (constant, state‐machine transitions, error handling, and tests) and adjust the logic to skip any “failed” state.

Key locations to update:

  • proxy/process.go
    • Line 28: StateFailed constant definition
    • Line 139–144: isValidTransition cases including StateFailed
    • Swap‐state calls around lines 203, 269, 277, 284, 362–363 for transitions into/out of StateFailed
    • Health‐check and exit‐error branches (lines 262–265, 286–295) that swap to StateFailed
    • Stop logic at lines 362–365 and request handling at lines 507–510
  • proxy/proxymanager.go
    • Case for StateFailed in status‐string mapping (around line 351)
  • proxy/process_test.go
    • All test cases expecting StateFailed transitions or current state

Ensure the state machine, logging, HTTP responses, and tests are updated so that failures immediately trigger retry behavior instead of a “failed” terminal state.

🧹 Nitpick comments (1)
misc/process-cmd-test/main.go (1)

23-24: Consider orphaned process handling in production code

The test reveals that child processes become orphaned when the parent is killed with SIGKILL. This could lead to resource leaks in production if llama-swap crashes unexpectedly.

Consider implementing a process group or using cmd.SysProcAttr with Setpgid to ensure child processes are cleaned up even when the parent crashes. This would improve the reliability goals mentioned in the PR objectives.
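
A Unix-only sketch of the Setpgid idea (the binary name is a placeholder; this is one possible approach, not code from the PR):

package main

import (
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("some-upstream-server") // placeholder binary
	// put the child in its own process group so the whole tree
	// can be signalled together
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	// a negative PID targets the entire process group (Unix only)
	_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGTERM)
	_ = cmd.Wait()
}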

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a84098d and c699aab.

📒 Files selected for processing (2)
  • misc/process-cmd-test/main.go (1 hunks)
  • proxy/process.go (1 hunks)
🧰 Additional context used
🪛 golangci-lint (1.64.8)
misc/process-cmd-test/main.go

52-52: Error return value of cmd.Process.Signal is not checked

(errcheck)

Comment on lines 50 to 54
cmd.Cancel = func() error {
fmt.Println("✔︎ Cancel() called, sending SIGTERM")
cmd.Process.Signal(syscall.SIGTERM)
return nil
}

⚠️ Potential issue

Check error return value from Signal()

The error return value from cmd.Process.Signal should be checked, even in test code, to help debug signal delivery issues.

 	cmd.Cancel = func() error {
 		fmt.Println("✔︎ Cancel() called, sending SIGTERM")
-		cmd.Process.Signal(syscall.SIGTERM)
-		return nil
+		if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
+			fmt.Printf("Failed to send SIGTERM: %v\n", err)
+			return err
+		}
+		return nil
 	}



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
misc/process-cmd-test/main.go (1)

51-54: ⚠️ Potential issue

Check error return value from Signal()

The error return value from cmd.Process.Signal should be checked, even in test code, to help debug signal delivery issues.

 	cmd.Cancel = func() error {
 		fmt.Println("✔︎ Cancel() called, sending SIGTERM")
-		cmd.Process.Signal(syscall.SIGTERM)
+		if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
+			fmt.Printf("Failed to send SIGTERM: %v\n", err)
+			return err
+		}
 
 		//return nil

🧹 Nitpick comments (1)
misc/process-cmd-test/main.go (1)

40-44: Make the test binary path configurable for cross-platform compatibility.

The hardcoded path ../../build/simple-responder_darwin_arm64 is platform-specific. Consider making it configurable via command-line arguments or environment variables to support testing on different platforms.

+import "flag"
+
+var binaryPath = flag.String("binary", "../../build/simple-responder_darwin_arm64", "Path to the test binary")
+
 func main() {
+    flag.Parse()
 
     // swap between these to use kill -9 <pid> on the cli to sim external crash
     ctx, cancel := context.WithCancel(context.Background())
     //ctx, cancel := context.WithTimeout(context.Background(), 1000*time.Millisecond)
     defer cancel()
 
     //cmd := exec.CommandContext(ctx, "sleep", "1")
     cmd := exec.CommandContext(ctx,
-        "../../build/simple-responder_darwin_arm64",
+        *binaryPath,
         //"-ignore-sig-term", /* so it doesn't exit on receiving SIGTERM, test cmd.WaitTimeout */
     )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1e16c83 and 16da4c3.

📒 Files selected for processing (2)
  • misc/process-cmd-test/main.go (1 hunks)
  • proxy/process.go (9 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
proxy/process.go (1)
proxy/config.go (1)
  • SanitizeCommand (297-330)
🪛 golangci-lint (1.64.8)
misc/process-cmd-test/main.go

53-53: Error return value of cmd.Process.Signal is not checked

(errcheck)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: run-tests
🔇 Additional comments (7)
misc/process-cmd-test/main.go (1)

68-91: Well-structured test program for process lifecycle management.

The main execution flow properly handles command lifecycle, signal handling, and provides clear logging for debugging different termination scenarios.

proxy/process.go (6)

46-78: Good migration path with clear deprecation markers.

The addition of cancelUpstream and clear documentation of fields to be removed provides a clean migration path to exec.CommandContext.


201-226: Correct implementation of exec.CommandContext with proper lifecycle management.

The command setup properly configures cancellation context, custom Cancel function, and wait delay. The waitForCmd() goroutine ensures proper handling of command exit.


392-428: Clean refactoring to use context cancellation.

The simplified stopCommand() properly uses the new cancellation mechanism and handles various exit scenarios comprehensively.


544-565: Well-designed command exit handler with proper race condition handling.

The waitForCmd() method correctly handles command exit, state transitions, and race conditions with SIGKILL.


373-388: Excellent simplification of stop and shutdown logic.

The refactored methods are much cleaner and rely on the context cancellation mechanism, aligning perfectly with the PR objectives.


567-609:

✅ Verification successful

Comprehensive graceful shutdown implementation with platform-specific handling.

The cmdStopUpstreamProcess() method properly handles:

  • Windows-specific stop commands with taskkill
  • Custom stop command execution with PID substitution
  • Fallback to SIGTERM for Unix systems
  • Proper error handling and logging

Let me verify that the SanitizeCommand function properly handles the PID-substituted command string:


🏁 Script executed:

#!/bin/bash
# Description: Verify SanitizeCommand usage and implementation
# Test: Check if SanitizeCommand is defined and handles command strings properly
ast-grep --pattern 'func SanitizeCommand($_) $$$'

Length of output: 1587


Review Approved: cmdStopUpstreamProcess() and SanitizeCommand Verified

The cmdStopUpstreamProcess() method correctly handles:

  • Windows default (taskkill /f /t /pid ${PID}) and custom stop commands
  • PID substitution followed by proper sanitization via SanitizeCommand (proxy/config.go:297–330)
  • OS-specific argument splitting (Posix vs. Windows)
  • Fallback to SIGTERM on Unix
  • Thorough error handling and logging

No further changes are required.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
proxy/process.go (1)

260-260: Error message could be more accurate.

The error message "health check interrupted due to shutdown" may be misleading since the health check can be interrupted when transitioning to states other than shutdown (e.g., when the upstream exits and transitions to StateStopped).

🧹 Nitpick comments (2)
proxy/process.go (2)

532-535: Consider logging when forcing StateStopped.

While forcing StateStopped ensures the process ends in a recoverable state, the current implementation at line 534 could mask unexpected state transitions. Consider using swapState with appropriate error handling or at least log at ERROR level when this forced transition occurs to aid debugging.


540-581: Well-implemented graceful stop handler.

The cmdStopUpstreamProcess method provides excellent flexibility:

  • Supports custom stop commands with PID substitution
  • Falls back to SIGTERM for standard cases
  • Good Windows support with default taskkill command
  • Proper error handling and logging

Consider adding a timeout for the custom stop command execution to prevent hanging if the stop command itself becomes unresponsive.
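
One way to add such a timeout, sketched with a placeholder stop command and an assumed 10-second bound:

package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

func main() {
	stopArgs := []string{"kill", "-TERM", "12345"} // placeholder stop command
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// the context kills the stop command itself if it hangs
	stopCmd := exec.CommandContext(ctx, stopArgs[0], stopArgs[1:]...)
	if err := stopCmd.Run(); err != nil {
		log.Printf("stop command failed or timed out: %v", err)
	}
}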

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 16da4c3 and fddc14f.

📒 Files selected for processing (2)
  • proxy/process.go (9 hunks)
  • proxy/process_test.go (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
proxy/process_test.go (1)
proxy/process.go (1)
  • StateStopped (22-22)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: run-tests
🔇 Additional comments (5)
proxy/process_test.go (1)

342-342: LGTM! State transition aligns with PR objectives.

The change from StateFailed to StateStopped is correct. This ensures that when an upstream process exits prematurely, it transitions to a recoverable state that allows automatic retry on the next request, which aligns with the PR's goal of improving reliability.

proxy/process.go (4)

46-78: Good refactoring approach with clear migration path.

The struct changes properly support the migration to exec.CommandContext. The temporary fields are clearly marked for removal, which helps maintain code clarity during the transition. The change from chan error to chan struct{} for cmdWaitChan is appropriate since it's now used purely for signaling command completion.


201-211: Excellent implementation of exec.CommandContext with custom cancellation.

The migration to exec.CommandContext is well-implemented:

  • Custom Cancel function allows graceful shutdown via cmdStopUpstreamProcess
  • WaitDelay provides time for cleanup before forceful termination
  • Creating a fresh cmdWaitChan for each start prevents channel reuse issues
  • Launching waitForCmd() as a separate goroutine properly handles command exit

This approach significantly simplifies the shutdown logic compared to manual signal handling.

Also applies to: 226-226


366-379: Excellent simplification of stopCommand().

The refactored method is much cleaner by leveraging exec.CommandContext:

  • Removes complex manual signal handling
  • Eliminates timeout management (now handled by cmd.WaitDelay)
  • Good defensive programming with the nil check
  • Clear and concise implementation

This is a significant improvement in code maintainability.


257-259: Critical change for automatic retry functionality.

This logic correctly handles the case where the upstream process exits cleanly during health check. By checking for StateStopped and returning a descriptive error, the system can automatically retry the process on the next request instead of permanently failing. This is a key improvement for reliability.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
proxy/process.go (2)

248-254: Inconsistent with PR objectives and past review feedback.

Two issues here:

  1. The different handling for StateStopped vs other states contradicts the PR objective of always retrying upstream processes
  2. The error message "health check interrupted due to shutdown" was flagged in past reviews as misleading
	currentState := p.CurrentState()
	if currentState != StateStarting {
-		if currentState == StateStopped {
-			return fmt.Errorf("upstream command exited prematurely but successfully")
-		}
-		return errors.New("health check interrupted due to shutdown")
+		return fmt.Errorf("health check interrupted: process state changed to %s", currentState)
	}

207-216: Part of StateFailed inconsistency issue.

This error handling transitions to StateFailed when command start fails, which contradicts the PR objective of removing StateFailed to ensure automatic retries.

Consider updating this to transition to StateStopped instead, allowing for automatic retry on the next request.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fddc14f and 8f68dd6.

📒 Files selected for processing (1)
  • proxy/process.go (9 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: run-tests
🔇 Additional comments (4)
proxy/process.go (4)

46-50: LGTM! Struct field changes align with new cancellation approach.

The addition of cancelUpstream and conversion of cmdWaitChan to a simple notification channel properly supports the new exec.CommandContext pattern.

Also applies to: 85-85, 98-99


28-28: ⚠️ Potential issue

Critical inconsistency with PR objectives.

The PR objectives state that StateFailed should be removed so upstream processes are always retried automatically, but the code still references StateFailed in multiple locations. This contradicts the reliability improvements intended by this PR.

Consider removing all StateFailed references and updating the logic to retry failed processes automatically:

const (
	StateStopped  ProcessState = ProcessState("stopped")
	StateStarting ProcessState = ProcessState("starting")
	StateReady    ProcessState = ProcessState("ready")
	StateStopping ProcessState = ProcessState("stopping")
-
-	// failed a health check on start and will not be recovered
-	StateFailed ProcessState = ProcessState("failed")

	// process is shutdown and will not be restarted
	StateShutdown ProcessState = ProcessState("shutdown")
)

Update state transition logic and error handling to remove StateFailed transitions and ensure automatic retry behavior.

Also applies to: 140-140, 146-146, 209-209, 328-328, 404-404

Likely an incorrect or invalid review comment.


1-567: Good technical approach but critical alignment issues with PR objectives.

The refactor successfully implements the exec.CommandContext pattern and simplifies the stop logic as intended. However, the retention of StateFailed throughout the codebase directly contradicts the PR objectives of ensuring upstream processes are always retried automatically.

The key goals of this PR are not fully met. Consider completing the StateFailed removal to achieve the reliability improvements described in the PR objectives.


524-566:

✅ Verification successful

LGTM! Well-implemented graceful stop mechanism.

The method properly handles both custom stop commands and default SIGTERM, with appropriate Windows support. The PID substitution and command sanitization appear secure.

Verify that the SanitizeCommand function properly handles the PID substitution to prevent command injection:


🏁 Script executed:

#!/bin/bash
# Description: Check SanitizeCommand implementation for security
# Expected: Should find proper input validation and sanitization

ast-grep --pattern 'func SanitizeCommand($$$) {
  $$$
}'

Length of output: 58


🏁 Script executed:

#!/bin/bash
# Locate and display the SanitizeCommand implementation for review
rg -n "SanitizeCommand" -C 5 .

Length of output: 5131


🏁 Script executed:

#!/bin/bash
# Display the SanitizeCommand implementation for security review
rg -n "func SanitizeCommand" -A 50 proxy/config.go

Length of output: 1044


Safe PID Substitution Confirmed

SanitizeCommand simply parses the command string into a slice of arguments (handling quotes, comments, backslashes) without invoking a shell, and ${PID} is replaced via fmt.Sprintf("%d", pid), ensuring it’s strictly numeric. Since we call exec.Command(args[0], args[1:]...), there is no shell interpolation or injection risk from the PID value.
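
A toy illustration of why the substitution is injection-safe; strings.Fields stands in for the project's SanitizeCommand, and the values are made up:

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	pid := 12345 // placeholder PID
	cmdStop := "taskkill /f /t /pid ${PID}"

	// ${PID} becomes a strictly numeric string before splitting, and
	// exec.Command receives argv directly (no shell), so the PID value
	// cannot smuggle in extra arguments
	line := strings.ReplaceAll(cmdStop, "${PID}", fmt.Sprintf("%d", pid))
	args := strings.Fields(line)
	cmd := exec.Command(args[0], args[1:]...)
	fmt.Println(cmd.Args) // [taskkill /f /t /pid 12345]
}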

Comment on lines +359 to 372
func (p *Process) stopCommand() {
stopStartTime := time.Now()
defer func() {
p.proxyLogger.Debugf("<%s> stopCommand took %v", p.ID, time.Since(stopStartTime))
}()

sigtermTimeout, cancelTimeout := context.WithTimeout(context.Background(), sigtermTTL)
defer cancelTimeout()

if p.cmd == nil || p.cmd.Process == nil {
p.proxyLogger.Debugf("<%s> cmd or cmd.Process is nil (normal during config reload)", p.ID)
if p.cancelUpstream == nil {
p.proxyLogger.Errorf("<%s> stopCommand has a nil p.cancelUpstream()", p.ID)
return
}

// if err := p.terminateProcess(); err != nil {
// p.proxyLogger.Debugf("<%s> Process already terminated: %v (normal during shutdown)", p.ID, err)
// }
// the default cmdStop to taskkill /f /t /pid ${PID}
if runtime.GOOS == "windows" && strings.TrimSpace(p.config.CmdStop) == "" {
p.config.CmdStop = "taskkill /f /t /pid ${PID}"
}

if p.config.CmdStop != "" {
// replace ${PID} with the pid of the process
stopArgs, err := SanitizeCommand(strings.ReplaceAll(p.config.CmdStop, "${PID}", fmt.Sprintf("%d", p.cmd.Process.Pid)))
if err != nil {
p.proxyLogger.Errorf("<%s> Failed to sanitize stop command: %v", p.ID, err)
return
}

p.proxyLogger.Debugf("<%s> Executing stop command: %s", p.ID, strings.Join(stopArgs, " "))

stopCmd := exec.Command(stopArgs[0], stopArgs[1:]...)
stopCmd.Stdout = p.processLogger
stopCmd.Stderr = p.processLogger
stopCmd.Env = p.config.Env

if err := stopCmd.Run(); err != nil {
p.proxyLogger.Errorf("<%s> Failed to exec stop command: %v", p.ID, err)
return
}
} else {
if err := p.cmd.Process.Signal(syscall.SIGTERM); err != nil {
p.proxyLogger.Errorf("<%s> Failed to send SIGTERM to process: %v", p.ID, err)
return
}
}

select {
case <-sigtermTimeout.Done():
p.proxyLogger.Debugf("<%s> Process timed out waiting to stop, sending KILL signal (normal during shutdown)", p.ID)
p.upstreamWasStoppedWithKill = true
if err := p.cmd.Process.Kill(); err != nil {
p.proxyLogger.Errorf("<%s> Failed to kill process: %v", p.ID, err)
}
case err := <-p.cmdWaitChan:
// Note: in start(), p.cmdWaitChan also has a select { ... }. That should be OK
// because if we make it here then the cmd has been successfully running and made it
// through the health check. There is a possibility that the cmd crashed after the health check
// succeeded but that's not a case llama-swap is handling for now.
if err != nil {
if errno, ok := err.(syscall.Errno); ok {
p.proxyLogger.Errorf("<%s> errno >> %v", p.ID, errno)
} else if exitError, ok := err.(*exec.ExitError); ok {
if strings.Contains(exitError.String(), "signal: terminated") {
p.proxyLogger.Debugf("<%s> Process stopped OK", p.ID)
} else if strings.Contains(exitError.String(), "signal: interrupt") {
p.proxyLogger.Debugf("<%s> Process interrupted OK", p.ID)
} else {
p.proxyLogger.Warnf("<%s> ExitError >> %v, exit code: %d", p.ID, exitError, exitError.ExitCode())
}
} else {
p.proxyLogger.Errorf("<%s> Process exited >> %v", p.ID, err)
}
}
}
p.cancelUpstream()
<-p.cmdWaitChan
}

@coderabbitai coderabbitai bot Jun 5, 2025


🛠️ Refactor suggestion

Potential indefinite blocking in stopCommand.

The method waits on cmdWaitChan without a timeout, which could block indefinitely if the command doesn't exit properly. Consider adding a timeout mechanism.

func (p *Process) stopCommand() {
	stopStartTime := time.Now()
	defer func() {
		p.proxyLogger.Debugf("<%s> stopCommand took %v", p.ID, time.Since(stopStartTime))
	}()

	if p.cancelUpstream == nil {
		p.proxyLogger.Errorf("<%s> stopCommand has a nil p.cancelUpstream()", p.ID)
		return
	}

	p.cancelUpstream()
-	<-p.cmdWaitChan
+	select {
+	case <-p.cmdWaitChan:
+		// Command exited normally
+	case <-time.After(p.gracefulStopTimeout + 5*time.Second):
+		p.proxyLogger.Warnf("<%s> stopCommand timed out waiting for command to exit", p.ID)
+	}
}

@mostlygeek mostlygeek (Owner, Author) commented:

No, this is fine as p.cmdWaitChan is closed; it's just a signal channel.

@coderabbitai coderabbitai bot replied:

@mostlygeek You're absolutely right! Thank you for the clarification. I see now that cmdWaitChan is indeed a signal channel that gets closed by waitForCmd() when cmd.Wait() returns. Reading from a closed channel returns immediately, so <-p.cmdWaitChan won't block indefinitely - it will unblock as soon as the command exits and the channel is closed.

I appreciate you pointing out this signal channel pattern. My suggestion was based on a misunderstanding of how the channel was being used.

  (\   /)
 ( ._.)
o_(")(")

learned something new


✏️ Learnings added
Learnt from: mostlygeek
PR: mostlygeek/llama-swap#155
File: proxy/process.go:359-372
Timestamp: 2025-06-05T22:57:53.766Z
Learning: In Go, cmdWaitChan in proxy/process.go is a signal channel (chan struct{}) that follows the pattern where waitForCmd() closes the channel when cmd.Wait() returns, and stopCommand() waits on it. Reading from a closed channel returns immediately, so <-cmdWaitChan won't block indefinitely.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines +487 to +522
// waitForCmd waits for the command to exit and handles exit conditions depending on current state
func (p *Process) waitForCmd() {
exitErr := p.cmd.Wait()
p.proxyLogger.Debugf("<%s> cmd.Wait() returned error: %v", p.ID, exitErr)

if exitErr != nil {
if errno, ok := exitErr.(syscall.Errno); ok {
p.proxyLogger.Errorf("<%s> errno >> %v", p.ID, errno)
} else if exitError, ok := exitErr.(*exec.ExitError); ok {
if strings.Contains(exitError.String(), "signal: terminated") {
p.proxyLogger.Debugf("<%s> Process stopped OK", p.ID)
} else if strings.Contains(exitError.String(), "signal: interrupt") {
p.proxyLogger.Debugf("<%s> Process interrupted OK", p.ID)
} else {
p.proxyLogger.Warnf("<%s> ExitError >> %v, exit code: %d", p.ID, exitError, exitError.ExitCode())
}
} else {
if exitErr.Error() != "context canceled" /* this is normal */ {
p.proxyLogger.Errorf("<%s> Process exited >> %v", p.ID, exitErr)
}
}
}

currentState := p.CurrentState()
switch currentState {
case StateStopping:
if curState, err := p.swapState(StateStopping, StateStopped); err != nil {
p.proxyLogger.Errorf("<%s> Process exited but could not swap to StateStopped. curState=%s, err: %v", p.ID, curState, err)
p.state = StateStopped
}
default:
p.proxyLogger.Infof("<%s> process exited but not StateStopping, current state: %s", p.ID, currentState)
p.state = StateStopped // force it to be in this state
}
close(p.cmdWaitChan)
}

⚠️ Potential issue

Concurrency safety violation in state management.

Lines 515 and 519 directly assign to p.state without proper locking, breaking the concurrency safety pattern used elsewhere in the code.

	switch currentState {
	case StateStopping:
		if curState, err := p.swapState(StateStopping, StateStopped); err != nil {
			p.proxyLogger.Errorf("<%s> Process exited but could not swap to StateStopped. curState=%s, err: %v", p.ID, curState, err)
-			p.state = StateStopped
+			// Force state transition with proper locking
+			p.stateMutex.Lock()
+			p.state = StateStopped
+			p.stateMutex.Unlock()
		}
	default:
		p.proxyLogger.Infof("<%s> process exited but not StateStopping, current state: %s", p.ID, currentState)
-		p.state = StateStopped // force it to be in this state
+		// Force state transition with proper locking
+		p.stateMutex.Lock()
+		p.state = StateStopped
+		p.stateMutex.Unlock()
	}

Comment on lines +194 to +204
cmdContext, ctxCancelUpstream := context.WithCancel(context.Background())
p.cmd = exec.CommandContext(cmdContext, args[0], args[1:]...)
p.cmd.Stdout = p.processLogger
p.cmd.Stderr = p.processLogger
p.cmd.Env = p.config.Env

p.cmd.Cancel = p.cmdStopUpstreamProcess
p.cmd.WaitDelay = p.gracefulStopTimeout
p.cancelUpstream = ctxCancelUpstream
p.cmdWaitChan = make(chan struct{})


🛠️ Refactor suggestion

Potential race condition with channel recreation.

Creating a new cmdWaitChan at line 203 could cause issues if other goroutines are waiting on the previous channel. Consider moving this to the beginning of the method or ensuring proper synchronization.

	p.waitStarting.Add(1)
	defer p.waitStarting.Done()
+	p.cmdWaitChan = make(chan struct{})
	cmdContext, ctxCancelUpstream := context.WithCancel(context.Background())
	p.cmd = exec.CommandContext(cmdContext, args[0], args[1:]...)
	p.cmd.Stdout = p.processLogger
	p.cmd.Stderr = p.processLogger
	p.cmd.Env = p.config.Env

	p.cmd.Cancel = p.cmdStopUpstreamProcess
	p.cmd.WaitDelay = p.gracefulStopTimeout
	p.cancelUpstream = ctxCancelUpstream
-	p.cmdWaitChan = make(chan struct{})

Committable suggestion skipped: line range outside the PR's diff.


- removed StateFailed
- StateStarting --> StateStopped is OK
- updated tests
@mostlygeek mostlygeek merged commit 2dc0ca0 into main Jun 5, 2025
3 checks passed
@mostlygeek mostlygeek deleted the process-cmd-test branch June 5, 2025 23:25
@mostlygeek mostlygeek (Owner, Author) commented Jun 6, 2025

Nice! llama-server crashed and llama-swap automatically restarted it:

llama-swap detecting the crash/exit and restarting it:

(everything is normal ...) 

[INFO] Request 10.0.1.135 "POST /upstream/qwen-coder-14B-FIM/infill HTTP/1.1" 200 15667 "axios/1.8.4" 190.311173ms
[DEBUG] Exclusive mode for group SUPDev, stopping other process groups
[DEBUG] <qwen-coder-14B-FIM> request /upstream/qwen-coder-14B-FIM/infill - start: 0s, total: 123.181511ms
[INFO] Request 10.0.1.135 "POST /upstream/qwen-coder-14B-FIM/infill HTTP/1.1" 200 15656 "axios/1.8.4" 123.315148ms
[DEBUG] Exclusive mode for group SUPDev, stopping other process groups
[DEBUG] <qwen-coder-14B-FIM> request /upstream/qwen-coder-14B-FIM/infill - start: 0s, total: 111.109629ms
[INFO] Request 10.0.1.135 "POST /upstream/qwen-coder-14B-FIM/infill HTTP/1.1" 200 14893 "axios/1.8.4" 111.279777ms

(llama-server crashes ... )

[DEBUG] Exclusive mode for group SUPDev, stopping other process groups
[INFO] Request 10.0.1.135 "POST /upstream/qwen-coder-14B-FIM/infill HTTP/1.1" 502 41 "axios/1.8.4" 293.769421ms
[DEBUG] Exclusive mode for group SUPDev, stopping other process groups
[INFO] Request 10.0.1.135 "POST /upstream/qwen-coder-14B-FIM/infill HTTP/1.1" 502 90 "axios/1.8.4" 492.078µs
[DEBUG] <qwen-coder-14B-FIM> cmd.Wait() returned error: signal: aborted (core dumped)
[WARN] <qwen-coder-14B-FIM> ExitError >> signal: aborted (core dumped), exit code: -1
[INFO] <qwen-coder-14B-FIM> process exited but not StateStopping, current state: ready

(llama-server is restarted, previously would require a llama-swap restart! )

[DEBUG] Exclusive mode for group SUPDev, stopping other process groups
[DEBUG] <qwen-coder-14B-FIM> swapState() State transitioned from stopped to starting
[DEBUG] <qwen-coder-14B-FIM> Health check error on http://localhost:5815/health, status code: 503 (normal during startup)
[INFO] <qwen-coder-14B-FIM> Health check passed on http://localhost:5815/health
[DEBUG] <qwen-coder-14B-FIM> swapState() State transitioned from starting to ready
[DEBUG] <qwen-coder-14B-FIM> request /upstream/qwen-coder-14B-FIM/infill - start: 5.255732227s, total: 6.851577272s
[INFO] Request 10.0.1.135 "POST /upstream/qwen-coder-14B-FIM/infill HTTP/1.1" 200 15831 "axios/1.8.4" 6.851691632s

llama-server crash logs

srv  update_slots: all slots are idle
srv  log_server_r: request: POST /infill 127.0.0.1 200
slot launch_slot_: id  0 | task 1513 | processing task
slot update_slots: id  0 | task 1513 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 3077
slot update_slots: id  0 | task 1513 | kv cache rm [2606, end)
slot update_slots: id  0 | task 1513 | prompt processing progress, n_past = 3077, n_tokens = 471, progress = 0.153071
slot update_slots: id  0 | task 1513 | prompt done, n_past = 3077, n_tokens = 471
/home/mostlygeek/llama.cpp/src/llama-kv-cache-unified.cpp:1188: GGML_ASSERT(nf == nh && "KV defrag bug: nf != nh") failed
/path/to/llama-server/llama-server-latest(+0x75683b)[0x5a59595d683b]
/path/to/llama-server/llama-server-latest(+0x756dfc)[0x5a59595d6dfc]
/path/to/llama-server/llama-server-latest(+0x756f51)[0x5a59595d6f51]
/path/to/llama-server/llama-server-latest(+0x38b79d)[0x5a595920b79d]
/path/to/llama-server/llama-server-latest(+0x38bb64)[0x5a595920bb64]
/path/to/llama-server/llama-server-latest(+0x2c2bf0)[0x5a5959142bf0]
/path/to/llama-server/llama-server-latest(+0x2c7b3d)[0x5a5959147b3d]
/path/to/llama-server/llama-server-latest(+0x2c933f)[0x5a595914933f]
/path/to/llama-server/llama-server-latest(+0xfcde2)[0x5a5958f7cde2]
/path/to/llama-server/llama-server-latest(+0xc460c)[0x5a5958f4460c]
/path/to/llama-server/llama-server-latest(+0x5eee0)[0x5a5958edeee0]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x737f6082a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x737f6082a28b]
/path/to/llama-server/llama-server-latest(+0x8ee45)[0x5a5958f0ee45]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
