Current Problem / 当前问题
The current embedding flow has partial concurrency protection but no single authority that coordinates all background sync jobs.
Today, provider calls are serialized through the existing provider queue, but job orchestration is still fragmented across multiple entrypoints. This creates the following risks:
- Overlapping jobs (init, reconnect, partial sync, full resync, cleanup).
- Redundant work caused by duplicate triggers for the same server/tool.
- Cross-job races (for example, cleanup deletes vectors while a stale sync writes them again).
- Fragile full resync behavior based on local flags/timers instead of centralized scheduling.
- Higher risk of hitting provider rate limits (requests per minute, RPM) or producing inconsistent state under load.
In short, provider-level serialization exists, but system-level job coordination is incomplete.
Proposed Solution / 建议方案
Introduce a centralized embedding sync scheduler that becomes the single entrypoint for all background embedding mutation work, while preserving the existing provider queue as the final mandatory execution barrier.
Architecture and invariants:
- Exactly one embedding mutation job runs at a time.
- All real provider calls must continue to pass through the current provider queue.
- A full resync must never run in parallel with other mutation jobs.
- Cleanup must be coordinated with sync jobs so stale writes cannot follow deletions.
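As an illustration of the first invariant, a promise chain is one common way to guarantee that only one job runs at a time inside a single process. The sketch below is a minimal example under that assumption; the `SerialExecutor` class and its `run` method are hypothetical names and are not part of the existing codebase. Real provider calls inside a job would still go through the existing provider queue.

```typescript
// Hypothetical sketch: run mutation jobs strictly one after another.
type Job<T> = () => Promise<T>;

class SerialExecutor {
  // Tail of the chain; every new job waits for it before starting.
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(job: Job<T>): Promise<T> {
    // Start the job only after all previously submitted jobs have settled.
    const result = this.tail.then(() => job());
    // A rejected job must not block later jobs, so the chain swallows the error.
    // The caller still observes the rejection through `result`.
    this.tail = result.catch(() => undefined);
    return result;
  }
}
```

With this shape, a full resync submitted through `run` cannot overlap with any other mutation job, because it only starts once the previous job has settled.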
Recommended implementation:
- Add a dedicated scheduler service (preferred: src/services/embeddingSyncScheduler.ts).
- Expose high-level APIs:
  - enqueueServerSync
  - enqueueSingleToolSync
  - requestFullResync
  - enqueueServerCleanup
  - waitForIdle (for tests)
Queue behavior requirements:
- Deduplicate server sync requests.
- Coalesce single-tool sync into pending server sync when applicable.
- Coalesce multiple full resync requests into one real execution.
- Let cleanup supersede/invalidate pending sync work for the same server.
- Add a generation counter to invalidate stale queued work after provider/model/dimension changes.
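The queue rules above can be sketched as a small in-memory structure. This is only an illustration under stated assumptions: the `PendingJob` shape, the `PendingQueue` class, the `bumpGeneration` and `next` methods, and the string-typed server and tool identifiers are all hypothetical. Only the four enqueue/request method names come from this proposal, and their real signatures may differ.

```typescript
// Hypothetical job shapes; field names are illustrative only.
type PendingJob =
  | { kind: "server-sync"; server: string; generation: number }
  | { kind: "tool-sync"; server: string; tool: string; generation: number }
  | { kind: "full-resync"; generation: number }
  | { kind: "cleanup"; server: string; generation: number };

type EnqueueOutcome = "enqueued" | "merged";

class PendingQueue {
  private jobs: PendingJob[] = [];
  private generation = 0;

  // Call after a provider/model/dimension change; older queued work becomes stale.
  bumpGeneration(): void {
    this.generation += 1;
  }

  enqueueServerSync(server: string): EnqueueOutcome {
    // Deduplicate: one pending sync per server is enough.
    if (this.hasCurrent((j) => j.kind === "server-sync" && j.server === server)) return "merged";
    this.jobs.push({ kind: "server-sync", server, generation: this.generation });
    return "enqueued";
  }

  enqueueSingleToolSync(server: string, tool: string): EnqueueOutcome {
    // A pending server sync already covers every tool of that server.
    const covered = this.hasCurrent(
      (j) =>
        (j.kind === "server-sync" && j.server === server) ||
        (j.kind === "tool-sync" && j.server === server && j.tool === tool),
    );
    if (covered) return "merged";
    this.jobs.push({ kind: "tool-sync", server, tool, generation: this.generation });
    return "enqueued";
  }

  requestFullResync(): EnqueueOutcome {
    // Coalesce repeated requests into one pending execution.
    if (this.hasCurrent((j) => j.kind === "full-resync")) return "merged";
    this.jobs.push({ kind: "full-resync", generation: this.generation });
    return "enqueued";
  }

  enqueueServerCleanup(server: string): EnqueueOutcome {
    // Cleanup wins: drop pending sync work for the same server first.
    this.jobs = this.jobs.filter(
      (j) => !((j.kind === "server-sync" || j.kind === "tool-sync") && j.server === server),
    );
    if (this.hasCurrent((j) => j.kind === "cleanup" && j.server === server)) return "merged";
    this.jobs.push({ kind: "cleanup", server, generation: this.generation });
    return "enqueued";
  }

  // Return the next job that is still valid, skipping stale generations.
  next(): PendingJob | undefined {
    while (this.jobs.length > 0) {
      const job = this.jobs.shift();
      if (job !== undefined && job.generation === this.generation) return job;
    }
    return undefined;
  }

  private hasCurrent(match: (job: PendingJob) => boolean): boolean {
    return this.jobs.some((j) => j.generation === this.generation && match(j));
  }
}
```

Two simplifications are worth noting. The sketch only coalesces against jobs that are still pending, so a request that arrives while a job of the same kind is already running is queued again. It also treats every job kind, including cleanup, as stale after a generation bump; whether cleanup should survive such a change is a design decision this proposal leaves open.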
Refactor scope:
- Route all embedding job triggers through the scheduler (no direct background mutation execution from controllers/services).
- Replace current full resync timer/flag behavior with scheduler-managed coalesced requests.
- Keep provider queue logic unchanged as second-layer safety.
Observability and reliability:
- Add structured logs for enqueue, merge, superseded, cleanup-wins, full-resync-requested, full-resync-coalesced, start, finish, fail.
- Preserve EMBED_SYNC_ERROR behavior.
- Guarantee failure recovery so scheduler state cannot remain stuck.
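One way to meet the failure-recovery requirement is to reset the scheduler's running state in a `finally` block, so a thrown error can never leave it marked busy. The sketch below is a hedged illustration: `JobRunner`, `LogEntry`, and the injected `log` callback are hypothetical, and only the event names are taken from the list above. How failures map onto the existing EMBED_SYNC_ERROR behavior is not shown here.

```typescript
// Event names from the proposal's log list; everything else is illustrative.
type SchedulerEvent =
  | "enqueue" | "merge" | "superseded" | "cleanup-wins"
  | "full-resync-requested" | "full-resync-coalesced"
  | "start" | "finish" | "fail";

interface LogEntry {
  event: SchedulerEvent;
  job: string;
  error?: string;
}

class JobRunner {
  private busy = false;
  private log: (entry: LogEntry) => void;

  constructor(log: (entry: LogEntry) => void) {
    this.log = log;
  }

  get isBusy(): boolean {
    return this.busy;
  }

  // Runs one job, emits lifecycle logs, and always clears the busy flag.
  async run(job: string, work: () => Promise<void>): Promise<boolean> {
    if (this.busy) throw new Error("a job is already running");
    this.busy = true;
    this.log({ event: "start", job });
    try {
      await work();
      this.log({ event: "finish", job });
      return true;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      this.log({ event: "fail", job, error: message });
      return false;
    } finally {
      // Recovery guarantee: state is reset on success and on failure alike.
      this.busy = false;
    }
  }
}
```

Injecting the logger keeps the runner easy to test: a unit test can pass a callback that collects entries and then assert on the exact event sequence.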
Test and validation plan:
- Unit tests for serialization, deduplication, coalescing, cleanup-vs-sync ordering, and recovery after errors.
- Tests proving provider queue remains the only path for real provider calls.
- Verify that repeated full resync requests produce one real run.
- Verify cleanup prevents stale post-delete writes.
- Run:
  - pnpm lint
  - pnpm backend:build
  - pnpm test:ci
- Manual pnpm dev checks for expected scheduler lifecycle logs.
Alternatives / 替代方案
Considered alternatives:
- Keep current architecture and only patch individual race conditions.
  - Rejected: does not provide a global concurrency model and is likely to regress.
- Replace in-process scheduling with an external distributed queue.
  - Rejected for now: increases complexity and operational cost beyond current scope.
- Move all logic into the existing provider queue only.
  - Rejected: provider queue serializes API calls but does not resolve job-level deduplication/supersession semantics.
Chosen direction: centralized in-process scheduler + existing provider queue as a mandatory final barrier.
Additional Context / 补充说明
This proposal derives from Copilot's comment in pull request #702.
Primary files impacted:
- src/services/vectorSearchService.ts
- src/services/mcpService.ts
- src/controllers/serverController.ts
- src/services/embeddingSyncScheduler.ts (new, recommended)
Out of scope for this proposal:
- Introducing external queue infrastructure.
- Redesigning 403/429 retry policy unless a direct implementation bug is discovered.
Expected outcome:
- No parallel provider calls.
- No overlapping or stale embedding sync jobs.
- Reduced redundant processing and safer behavior during reconnects, config changes, and dimension resets.