fix: allow SaveModel after Complete() in VowpalWabbitThreadedLearning#4913
Merged
JohnLangford merged 4 commits intomasterfrom Mar 19, 2026
Merged
fix: allow SaveModel after Complete() in VowpalWabbitThreadedLearning#4913JohnLangford merged 4 commits intomasterfrom
JohnLangford merged 4 commits intomasterfrom
Conversation
…WabbitThreadedLearning Previously, calling SaveModel() or accessing PerformanceStatistics after Complete() threw InvalidOperationException because Complete() immediately closed the sync action queue. This forced users into a counter-intuitive pattern of enqueuing saves before signaling completion. Changes: - Move CompleteAdding from Complete() into the root completion continuation using a new atomic CompleteAndRemoveAll(), so sync actions can be enqueued between the Complete() call and the continuation executing - Make SaveModel/PerformanceStatistics detect post-completion state and operate directly on the root VW instance via TryAdd fallback - Add Flush() method to force AllReduce sync on demand without waiting for ExampleCountPerRun threshold Fixes #4911
Task.CompletedTask was introduced in .NET 5 and is not available in netstandard2.0 which is the target framework for vw.parallel.
Improve XML doc comments on VowpalWabbitThreadedLearning to explain: - Learn() enqueues and returns immediately (async dispatch, not blocking) - Typical usage flow with code example (learn, complete, save) - What Complete() guarantees (all examples learned, final allreduce done) - That SaveModel/PerformanceStatistics work synchronously after Complete Addresses feedback from #4911 about the TPL completion model being unclear.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #4911 —
VowpalWabbitThreadedLearningwas nearly impossible to use becauseSaveModel()andPerformanceStatisticshad to be called beforeComplete(), which is the opposite of what users expect.Root cause:
Complete()immediately calledsyncActions.CompleteAdding(), closing the queue before the completion continuation (which actually drains it) had a chance to run. AnySaveModel()call afterComplete()threwInvalidOperationException.Changes:
TryAddon the sync action list as a fallback for the race window betweenComplete()and the continuation.CompleteAddinginto the root completion continuation — uses a new atomicCompleteAndRemoveAll()onConcurrentList<T>, so sync actions can be enqueued between callingComplete()and the continuation executing.Flush()method — forces an AllReduce synchronization and drains pending sync actions on demand, without requiring the example count to hit a multiple ofExampleCountPerRun.All changes are backward-compatible — existing code that calls
SaveModelbeforeComplete()continues to work.The natural pattern now works:
Mid-training saves also work without count alignment:
Test plan
TestSaveModelAfterComplete— save afterComplete()produces a valid model fileTestPerformanceStatisticsAfterComplete— stats accessible afterComplete()TestFlush—Flush()triggers sync and save mid-training, learning continues afterTestSaveModelBeforeComplete— original pattern (save before complete) still worksTestAllReducecontinues to pass (CI)