Windows: Preliminary lit tooling pass#2914
Conversation
Signed-off-by: thomthehound <[email protected]>
Signed-off-by: thomthehound <[email protected]>
|
I did not want to have to do this, but this patch now also issues a file lock and serializes NPU test runners to avoid contention issues and false-failures. |
@hunhoffe for concurrency consultation |
|
@thomthehound Overall this looks good to me! However, can you elaborate on the concurrency issues you were having and what you believe is the root cause? |
Well, it appeared that Erwei and I pushed commits are roughly the same time. The CI ran fine, but we both ended up with timeout errors during the NPU tests (again, at suspiciously similar points). I looked back at some other PRs/merges with failures, and it seems like this may be an intermittent issue (usually showing up as some flavor of failure in "Build and Test with AIE tools on Ryzen AI / Run Tests on Ryzen AI (amdhx370)" or the phoenix variety) when the hardware is heavily loaded. A contention issue was my best guess at the root of the problem, because there doesn't seem to be much of a pattern to it aside from high demand. I'm not sure if the simple lock I'm issuing here will be enough to fully fix the problem, but hopefully it might reduce it. |
This is a preliminary pass on the lit tooling to make it more convenient to run in Windows environments/runners. This patch is the first of several I have planned in order for testability on Windows to reach near-parity with that on Linux.
tl;dr breakdown of changes:
lit.site.cfg.pyfiles tolerant of Windows paths and flags..pytool substitutions more reliably in lit on Windows.The changes here are scoped and gated so they should remain low-risk/transparent for existing Linux workflows. The most notable single change is swapping
utils/run_on_npu.shfor a cross-platform Python counterpart,utils/run_on_npu.py. Because multiple identical copies oflit.site.cfg.py.inexist, the line-alteration count of this patch is artificially inflated. It may be worth consolidating/centralizing that file in the future.This patch is not intended to fully solve Windows lit coverage by itself, and it deliberately does not enable the Windows lit test targets in CI. We are not ready for that yet. Once I can prove stability of the local/manual path, we can begin activating them at a later date.
I am currently building and exercising the lit configuration manually using the following commands (assuming the Windows wheels have been built and installed into
ironenvas-per #2861):At the moment, the vast majority of tests still fail on Windows. This is due to drift between the current
aieccAPI and the relatively ancient llvm-aie wheels presently distributed for Windows. It is not due to the lit-plumbing changes here. In fact, this patch elucidated that discovery.I am working on the upstream Windows wheel/distribution side in parallel (first patch in Xilinx/llvm-aie#809). Once the wheel situation is corrected, and the local/manual path is more stable, the next steps will be closing the remaining Windows lit gaps and beginning the enablement of the existing Windows CI test targets in
.github/workflows/buildAndTestMulti.yml.