fix(coverage): fail fast when concurrent runs share coverage.reportsDirectory#10466
fix(coverage): fail fast when concurrent runs share coverage.reportsDirectory#10466jgamaraalv wants to merge 14 commits into
Conversation
…ent same-dir runs don't ENOENT
…NOENT-safe cleanup
…same-dir runs don't ENOENT clean() and cleanAfterRun() removed the shared reportsDirectory recursively, deleting a concurrent run's in-flight .tmp dir. clean() now preserves .tmp* entries (only prior reports are removed) and cleanAfterRun() uses a non-recursive rmdir, so 3 parallel 'vitest run --coverage' in one cwd all exit 0.
…ts from orphan temp dirs Add unit tests proving clean()/cleanAfterRun() preserve a concurrent run's temp dir; make temporary-files and empty-coverage-directory tests start from a clean reportsDirectory now that clean() no longer sweeps other runs' .tmp* dirs.
clean() and cleanAfterRun() read reportsDirectory with readdirSync guarded only by a separate preceding existsSync, leaving a TOCTOU window where a concurrent same-cwd run can remove the directory between the check and the read and surface a raw ENOENT — the exact failure class this feature promises to eliminate. Replace the existsSync-then- readdirSync pattern with an ENOENT-safe safeReaddir helper in both call sites. Also drop the task-added explanatory comments on the rewritten lines per the repo's no-explanatory-comments rule.
Remove the task-id (T-01, AC-1, AC-3) and explanatory comments added to the coverage temp-dir tests; they leak internal planning ids into the suite and violate the repo's no-explanatory-comments rule. Assertions unchanged.
|
Hello @jgamaraalv. Your PR has been labeled To keep your PR open, please follow these steps:
Please, do not generate or format the response with AI. If you do not speak English, reply in your native language or use translation software like Google Translate or Deepl. If the response is generated, the PR will be closed automatically. These measures help us reduce maintenance burden and keep the team's work efficient. See our AI contributions policy for more context. |
✅ Deploy Preview for vitest-dev ready!Built without sensitive environment variables
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
My problem before saw issue #10111: v8 coverage: concurrent runs in the same project dir crash with ENOENT lstat coverage/.tmp
Repro:In any project with coverage.provider: 'v8' and default reportsDirectory: Expected:concurrent same-dir runs either isolate their tmp dir per process or fail with a clear "coverage dir already in use" message, instead of an unhandled |
|
AriPerkkio
left a comment
There was a problem hiding this comment.
Two or more
vitest run --coverageprocesses running in the same working directory
Each of them will also remove ./coverage during start, removing each others possible results. This fix is not enough.
Proper fix is from user side to use vitest run --coverage --coverage.reportsDirectory=some-unique-path.
@AriPerkkio You're right — isolating only the temp dir wasn't enough. Even with separate .tmp dirs, both runs still write their final reports into the same coverage/ directory and clean() wipes the other run's results, so the data loss stayed; I had only hidden the ENOENT crash. I pushed a different approach. Instead of isolating, Vitest now takes a lock on coverage.reportsDirectory (stored in the OS temp dir, keyed by the resolved path) for the duration of the run. If a second process targets the same directory while another is still running, it fails right away with an actionable error telling the user to give each run its own --coverage.reportsDirectory, which is exactly the fix you described, just surfaced by Vitest instead of an intermittent crash. The lock is re-entrant for the same process (watch reruns) and stale locks from dead processes are reclaimed automatically. The temp dir name is back to the deterministic .tmp[-shard]. So this doesn't try to make same-directory concurrent runs "work" — it makes them fail fast with a clear message and points to the unique-directory fix. If you'd prefer not to add this and just document the unique reportsDirectory requirement instead let me know. |
…irectory Two or more `vitest run --coverage` processes pointed at the same reportsDirectory delete each other's reports, which previously surfaced as an intermittent raw `ENOENT: lstat coverage/.tmp` crash. Per-run temp-dir isolation stopped the crash but still let concurrent runs clobber each other's final reports, so it was not enough. Replace it with a cross-process lock, keyed by the resolved reportsDirectory and stored in the OS temp dir (outside the reports directory it guards): - A second live process targeting the same reportsDirectory now fails immediately with an actionable error pointing to `--coverage.reportsDirectory`, instead of crashing or silently losing reports. - The lock is re-entrant for the owning process (watch mode reruns). - Stale locks left by processes that no longer exist are reclaimed automatically. The temp directory name is restored to the deterministic `.tmp[-shard]`.
2d33ce7 to
231ef67
Compare
I like this approach, but need some time to think through it before commenting about the implementation. 👍 Maybe the lockfiles could be stored in |
AriPerkkio
left a comment
There was a problem hiding this comment.
Overall looks good. This will prevent two simultaneous Vitest runs from deleting each other's results. Some minor comments below.
Addressing review feedback on the reportsDirectory lock: - Simplify acquisition from a 10-iteration loop to a single exclusive create plus one retry, which is only reached after reclaiming a stale lock. The bounded retry was the only reason to loop at all. - Reclaim a stale lock with an atomic rename instead of an unconditional remove-then-recreate. Two processes racing to reclaim the same stale lock could previously both succeed and run coverage concurrently in the same directory; now only the process whose rename wins removes the file, the loser retries the exclusive create and fails fast. - Release the lock only when this process can confirm it owns it, so a transient unreadable read can no longer delete another process's lock. - Document the signal-0 liveness check and drop the practically unreachable secondary throw in favour of a single shared error helper.
Replace the module-level cleanups array and afterEach hook with onTestFinished registrations inside the helpers that allocate the temp reports directory, lock file, and child process. Cleanup now lives next to the resource it tears down. Assertions unchanged.
AriPerkkio
left a comment
There was a problem hiding this comment.
Looks good to me. @hi-ogawa & @sheremet-va any concerns?
Move the cross-process lock logic out of BaseCoverageProvider into a standalone ReportsDirectoryLock class with acquire()/release(), and drop the path-returning getter in favor of the lock owning its own lockFile. Addresses review feedback that the lock implementation cluttered the provider and the getter felt unjustified.
… locks Address review findings on the cross-process reportsDirectory lock: - Close a write-window TOCTOU: publish the lock atomically by writing the full payload to a temp file then fs.link()-ing it into place, so the lock file is never observable empty/partial and can't be stolen mid-write. - Release the lock on process exit (best-effort, owner-checked) so a throwing report path, reportOnFailure, watch keepResults, or graceful close no longer leaks a stale lock into the OS temp directory. - Make stale-lock reclaim race-safe: rename the contested lock to a private path, re-read that inode, and only discard it when it is still the stale owner observed; otherwise restore it (never destroying a displaced live owner's copy) and fail closed. Document the residual, inherent 3+-process advisory-lock window. - Validate the full owner identity (pid + reportsDirectory + timestamp + per-acquire nonce) before honoring or releasing a lock, defeating PID reuse and hash-collision false matches. - Treat only ESRCH as a dead process in the liveness check; let unexpected readOwner I/O errors (EACCES/EMFILE) propagate instead of silently stealing an unreadable lock. - Tolerate EPERM/EBUSY/EACCES on the reclaim rename (Windows) and surface the lock timestamp in the actionable in-use error.
|
@hi-ogawa @AriPerkkio @sheremet-va Last two review comments (resolved in e5199e2) Two nits are addressed by e5199e2 (
About the follow-up commit 4d81d3a Heads up that I pushed one more commit after the approval —
This sits on top of an already-reviewed/approved feature, so if you'd rather keep the PR minimal and scoped to just the requested fixes, I'm happy to drop 4d81d3a and leave only the review-comment resolution — just let me know your preference. Equally happy to keep it if the extra hardening is welcome. |
|
@AriPerkkio any news on this issue? |
AriPerkkio
left a comment
There was a problem hiding this comment.
This PR started as small improvement but now the LLM behind this code is getting quite verbose. I'm starting to consider whether we should just discard this PR and let human rewrite it without LLMs wall-of-text comments and never-going-to-happen-edge-case -handlings. In the end we will be the ones maintaining this for years. 😬
Maybe this lockfile stuff and all possible edge cases should be extracted into its own npm package even.
Drop the speculative machinery that crept into the lock (atomic write+link publish, nonce/timestamp owner identity, rename-aside stale reclaim with restore, process-exit handler) in favour of a plain create-with-wx lock that reclaims a dead owner's lock on the next run. Same fail-fast behaviour for concurrent same-directory runs, far less to maintain. A leaked lock from a hard-killed process is cleared by the next run's dead-pid check, so the exit handler was redundant.
|
Yeah, fair, and you're right. That last hardening commit got away from me.. I went down a rabbit hole guarding race windows that don't really happen (the 3-process SIGKILL reclaim especially) and ended up with comments longer than the code they were explaining. Pushed 888edb5 that rips all of it back out. What's left is about as small as this lock gets:
No more nonce/timestamp identity, no write+link publish, no rename-aside/restore dance, no exit handler. coverage.ts is ~280 lines lighter and the tests dropped the cases that only existed to prove the machinery I just deleted. Honestly I'd rather keep it this small than spin it into its own package, there isn't much left to extract now. If you still want a human to rewrite it from scratch I won't argue, but I think this version is one you'd actually be ok maintaining for years. Sorry for the noise on the way here. |
A propósito, quero frisar que nenhuma alteração feita com o auxilio de LLM foi feita sem a minha revisão e testes de que as implementações realmente funcionavam. Os comentários extras foram adicionados depois de surgirem algumas dúvidas sobre algumas implementações, conforme respondi em um dos seus request changes e mais uma vez peço desculpas por isso. Novamente, acho as correções desse PR 100% válidas, não descartaria, mas fica ao seu critério e a critério do time, que respeito muito o trabalho. Encerro as minhas contribuições nesse PR por aqui. |
Description
Two or more
vitest run --coverageprocesses pointed at the same working directory (the defaultcoverage.reportsDirectory) delete each other's reports. Bothclean()(start of a run) andcleanAfterRun()(end of a run) operate on the sharedcoverage/directory, so this surfaced intermittently as a rawENOENT: lstat coverage/.tmpcrash — and even when it didn't crash, the runs silently clobbered each other's final reports.What changed
The first approach in this PR isolated each run's temp directory (
.tmp-<nanoid>). As raised in review, that stopped the crash but was not enough: the final reports still sharecoverage/, so concurrent same-directory runs keep destroying each other's results.So instead of isolating, this PR makes concurrent same-directory runs fail fast with an actionable error:
BaseCoverageProvidernow acquires a cross-process lock oncoverage.reportsDirectoryfor the duration of the run. The lock lives in the OS temp directory, keyed by the resolved reports directory, so it is not wiped by the directory cleanup it guards.--coverage.reportsDirectory, instead of crashing or losing reports..tmp[-shard].The fix lives entirely in
BaseCoverageProvider, so both thev8andistanbulproviders inherit it.Running coverage for several processes in parallel is still done the supported way — giving each run its own
coverage.reportsDirectory— which is exactly what the new error message tells users to do.Related to #10111.
Please don't delete this checklist!
Before submitting the PR, please make sure you do the following:
It's really useful if your PR references an issue where it is discussed ahead of time.
reportsDirectory/.tmpexists before writingcoverage-*.json#10111 (related)Ideally, include a test that fails without this PR but passes with it.
test/coverage-test/test/temporary-files.unit.test.tsPlease, don't make changes to
pnpm-lock.yamlunless you introduce a new test example.Please check Allow edits by maintainers
Tests
pnpm test:ciDocumentation
coverage.reportsDirectory.Changesets