Skip to content

Windows: Preliminary lit tooling pass#2914

Open
thomthehound wants to merge 8 commits intoXilinx:mainfrom
thomthehound:win-lit-patch-1
Open

Windows: Preliminary lit tooling pass#2914
thomthehound wants to merge 8 commits intoXilinx:mainfrom
thomthehound:win-lit-patch-1

Conversation

@thomthehound
Copy link
Contributor

This is a preliminary pass on the lit tooling to make it more convenient to run in Windows environments/runners. This patch is the first of several I have planned in order for testability on Windows to reach near-parity with that on Linux.

tl;dr breakdown of changes:

  • Make the generated lit.site.cfg.py files tolerant of Windows paths and flags.
  • Handle .py tool substitutions more reliably in lit on Windows.
  • Replace the shell-only utils/run_on_npu.sh path with a cross-platform Python wrapper.

The changes here are scoped and gated so they should remain low-risk/transparent for existing Linux workflows. The most notable single change is swapping utils/run_on_npu.sh for a cross-platform Python counterpart, utils/run_on_npu.py. Because multiple identical copies of lit.site.cfg.py.in exist, the line-alteration count of this patch is artificially inflated. It may be worth consolidating/centralizing that file in the future.

This patch is not intended to fully solve Windows lit coverage by itself, and it deliberately does not enable the Windows lit test targets in CI. We are not ready for that yet. Once I can prove stability of the local/manual path, we can begin activating them at a later date.

I am currently building and exercising the lit configuration manually using the following commands (assuming the Windows wheels have been built and installed into ironenv as-per #2861):

python utils/iron_setup.py env --shell pwsh | iex

$lit = (Get-Command lit).Source

# Using staged MLIR from the wheel-building step
cmake -S . -B build_lit -G Ninja `
  -DMLIR_DIR="$PWD\utils\mlir_aie_wheels\python_bindings\mlir\lib\cmake\mlir" `
  -DLLVM_DIR="$PWD\utils\mlir_aie_wheels\python_bindings\mlir\lib\cmake\llvm" `
  -DPEANO_INSTALL_DIR="$env:PEANO_INSTALL_DIR" `
  -DLLVM_EXTERNAL_LIT="$lit" `
  -DAIE_COMPILER=NONE `
  -DAIE_LINKER=NONE `
  -DHOST_COMPILER=NONE

New-Item -ItemType Directory -Force .\build_lit\bin
Copy-Item .\ironenv\Lib\site-packages\mlir_aie\bin\* .\build_lit\bin\

At the moment, the vast majority of tests still fail on Windows. This is due to drift between the current aiecc API and the relatively ancient llvm-aie wheels presently distributed for Windows. It is not due to the lit-plumbing changes here. In fact, this patch elucidated that discovery.

I am working on the upstream Windows wheel/distribution side in parallel (first patch in Xilinx/llvm-aie#809). Once the wheel situation is corrected, and the local/manual path is more stable, the next steps will be closing the remaining Windows lit gaps and beginning the enablement of the existing Windows CI test targets in .github/workflows/buildAndTestMulti.yml.

Signed-off-by: thomthehound <[email protected]>
@thomthehound
Copy link
Contributor Author

I did not want to have to do this, but this patch now also issues a file lock and serializes NPU test runners to avoid contention issues and false-failures.

@jgmelber
Copy link
Collaborator

jgmelber commented Mar 3, 2026

I did not want to have to do this, but this patch now also issues a file lock and serializes NPU test runners to avoid contention issues and false-failures.

@hunhoffe for concurrency consultation

@hunhoffe
Copy link
Collaborator

hunhoffe commented Mar 4, 2026

@thomthehound Overall this looks good to me! However, can you elaborate on the concurrency issues you were having and what you believe is the root cause?

@thomthehound
Copy link
Contributor Author

@thomthehound Overall this looks good to me! However, can you elaborate on the concurrency issues you were having and what you believe is the root cause?

Well, it appeared that Erwei and I pushed commits are roughly the same time. The CI ran fine, but we both ended up with timeout errors during the NPU tests (again, at suspiciously similar points). I looked back at some other PRs/merges with failures, and it seems like this may be an intermittent issue (usually showing up as some flavor of failure in "Build and Test with AIE tools on Ryzen AI / Run Tests on Ryzen AI (amdhx370)" or the phoenix variety) when the hardware is heavily loaded. A contention issue was my best guess at the root of the problem, because there doesn't seem to be much of a pattern to it aside from high demand.

I'm not sure if the simple lock I'm issuing here will be enough to fully fix the problem, but hopefully it might reduce it.

Copy link
Collaborator

@hunhoffe hunhoffe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me, @jgmelber you have spent more time on the windows path than I have, does this look good to you too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants