Add lock over cache dir#2622

Merged
jgmelber merged 21 commits into main from muhaawad/process-safe-jit
Oct 8, 2025

Conversation

@mawad-amd (Collaborator)

Closes #2621

mawad-amd and others added 5 commits September 30, 2025 16:10
```python
try:
    # Create lock file if it doesn't exist
    os.makedirs(os.path.dirname(lock_file_path), exist_ok=True)
    lock_file = open(lock_file_path, "w")
```
Collaborator

Suggested change:

```diff
-lock_file = open(lock_file_path, "w")
+try:
+    f = os.open(lock_file_path, os.O_CREAT | os.O_EXCL)
+    os.close(f)
+except FileExistsError:
+    pass  # File already exists
+lock_file = open(lock_file_path, "a")
```

I'm suspicious the lock file creation is not atomic, and that is why errors remain. I'm testing a version of this locally and haven't seen any errors...so far.
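For reference, a minimal sketch of what an atomic create-then-flock acquire/release pair could look like. The function names and path handling here are illustrative, not the PR's actual implementation; the key points are that `os.open` with `O_CREAT` creates-or-opens atomically (no truncating `"w"` mode), and that closing the descriptor is enough to release the lock:

```python
import fcntl
import os


def acquire_cache_lock(lock_file_path):
    """Atomically create-or-open the lock file, then take an
    exclusive flock on it. Blocks until the lock is available."""
    os.makedirs(os.path.dirname(lock_file_path), exist_ok=True)
    # os.open with O_CREAT atomically creates the file if missing and
    # opens it otherwise; unlike open(path, "w"), it never truncates,
    # so it cannot race with another process's locked file.
    fd = os.open(lock_file_path, os.O_CREAT | os.O_RDWR)
    lock_file = os.fdopen(fd, "r+")
    fcntl.flock(lock_file.fileno(), fcntl.LOCK_EX)
    return lock_file


def release_cache_lock(lock_file):
    # Closing the descriptor releases the flock; no explicit
    # LOCK_UN is required.
    lock_file.close()
```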

Collaborator Author

Yeah I tested locally and can't reproduce it yet. Was just about to push some debug printfs.

Collaborator

I saw another error twice running tests in a loop:

FAILED ../../../../aie/test/python/cache_functionality.py::test_cache_lambda_functions

seems less frequent, but hard to know for sure.

Collaborator Author

So we did serialize compilation which I think rules out #2544 and anything related to the compiler. I will test locally. I would like to resolve any weird concurrency issues.

Collaborator Author

I think it may be a driver/runtime issue:

```text
Failed: Only 3/5 processes succeeded

Process details:

Process 0: FAILED (return code: 1)
  STDOUT: ERROR: RuntimeError: DRM_IOCTL_AMDXDNA_CREATE_HWCTX IOCTL failed (err=-2): No such file or directory

Process 1: SUCCESS (return code: 0)
  STDOUT: SUCCESS

Process 2: SUCCESS (return code: 0)
  STDOUT: SUCCESS

Process 3: SUCCESS (return code: 0)
  STDOUT: SUCCESS

Process 4: FAILED (return code: 1)
  STDOUT: ERROR: RuntimeError: DRM_IOCTL_AMDXDNA_CREATE_HWCTX IOCTL failed (err=-2): No such file or directory
```

@mawad-amd (Collaborator Author), Oct 2, 2025

I was running into a similar issue the other day; this actually explains the weird behavior I was seeing. There seems to be some limit on how many things we can run in parallel. In #2611 I had to keep lowering the number of cached kernels to 1 and didn't have time to debug it back then. I now think we should be doing something like `make run -JMAX_CONCURRENT_HW_CONTEXTS` or similar.

@mawad-amd (Collaborator Author), Oct 3, 2025

Not sure how to fix this in a robust way at the moment, but I am pretty sure we are running out of hardware contexts. Eddie said the max is 6 hardware contexts on Phoenix.

  • Setting parallel jobs to a lower number should fix this, but it doesn't make sense to lower concurrency for all tests to get one test passing, and doing so would hide other concurrency bugs.
  • Some tests may have multiple kernels in flight, which means multiple contexts in flight; so even if we remove this test, other tests will not always be reliable. Making tests run one kernel at a time doesn't make sense either.
  • Treating the DRM_IOCTL_AMDXDNA_CREATE_HWCTX failure as a non-error is not a solution either.

Let me know your thoughts.
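If the root cause is indeed a hard cap on hardware contexts, one possible direction is to gate context creation behind a counting semaphore shared by the worker processes a test driver spawns. This is only a sketch: `MAX_CONCURRENT_HW_CONTEXTS`, `with_hw_context`, and the `dispatch` callback are hypothetical names, and the value 6 assumes the Phoenix limit mentioned above.

```python
import multiprocessing as mp

# Hypothetical cap based on the discussion: Phoenix reportedly
# supports at most 6 hardware contexts at once.
MAX_CONCURRENT_HW_CONTEXTS = 6

# Created in the parent so it is inherited by (or passed to) the
# worker processes spawned from the same test driver.
hw_ctx_sem = mp.Semaphore(MAX_CONCURRENT_HW_CONTEXTS)


def with_hw_context(dispatch):
    """Run `dispatch` only while holding one of the limited context
    slots, so a context-creation IOCTL is never attempted past the
    hardware limit. Blocks until a slot frees up."""
    with hw_ctx_sem:
        return dispatch()
```

This would not help unrelated processes on the same machine (the semaphore is scoped to one process tree), but it would keep a single test run from exceeding the limit.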

@ypapadop-amd (Collaborator)

filelock uses a different approach to locking (https://github.com/tox-dev/filelock/blob/main/src/filelock/_unix.py), where the file is closed when the lock is released. When I implemented file locking in C++ it was the same: you close the file at lock release. I don't remember the reason why, but that was the recommendation I've read. Maybe it's causing some of the problems?

```python
try:
    fcntl.flock(lock_file.fileno(), fcntl.LOCK_UN)
except OSError:
    pass  # Ignore errors when releasing lock
```
@andrej (Collaborator), Oct 3, 2025

Why? I think we'd rather know if it fails and why.

From man(2) flock:

> Furthermore, the lock is released either by an explicit LOCK_UN operation on any of these duplicate file descriptors, or when all such file descriptors have been closed.

Keeping in mind that lock_file is presumably the only file descriptor for the lock file in the current process, the explicit unlock here is redundant with the lock_file.close() on the next line, which will also release the lock.
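The man-page behavior is easy to demonstrate: closing the last descriptor drops the lock without any explicit LOCK_UN. A small self-contained check (the path and function name are arbitrary; this illustrates flock semantics, not project code):

```python
import fcntl
import os


def lock_released_on_close(path):
    """Return True if a second, independently opened descriptor is
    blocked while the first holds the flock, and succeeds once the
    first descriptor is simply closed (no LOCK_UN)."""
    a = os.open(path, os.O_CREAT | os.O_RDWR)
    b = os.open(path, os.O_CREAT | os.O_RDWR)  # independent open(), own lock
    fcntl.flock(a, fcntl.LOCK_EX)
    try:
        # Separate descriptors from separate open() calls contend
        # with each other, even within one process.
        fcntl.flock(b, fcntl.LOCK_EX | fcntl.LOCK_NB)
        contended = False
    except BlockingIOError:
        contended = True
    os.close(a)  # close alone releases the lock
    fcntl.flock(b, fcntl.LOCK_EX | fcntl.LOCK_NB)  # now succeeds
    os.close(b)
    return contended
```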

Comment on lines +117 to +118:

```sh
elif [ x"${{ matrix.runner_type }}" == x"amd7940hs" ]; then
  LIT_OPTS="-j6 $LIT_OPTS"
```
@fifield (Collaborator), Oct 6, 2025

I'd rather not do this. It adds an artificial bottleneck to something that's already a bottleneck. In practice we don't see concurrency failures with -j12. Testing in mlir-aie is set up to control concurrency at the top level, and tests are supposed to be good citizens by using 1 thread or process so they don't blow things up.

I suggest pulling the concurrency tests into a new top level target with no inter-test concurrency, so that those tests can control it for themselves.

Collaborator Author

Addressed here 67dbfc3

mawad-amd and others added 6 commits October 6, 2025 12:46
@fifield fifield self-requested a review October 6, 2025 21:06
@jgmelber jgmelber enabled auto-merge October 8, 2025 16:56
@jgmelber jgmelber added this pull request to the merge queue Oct 8, 2025
Merged via the queue into main with commit 74a54d9 Oct 8, 2025
55 of 61 checks passed
@jgmelber jgmelber deleted the muhaawad/process-safe-jit branch October 8, 2025 17:25
fifield added a commit to fifield/mlir-aie that referenced this pull request Nov 12, 2025
Signed-off-by: Muhammad Awad <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jeff Fifield <[email protected]>
Co-authored-by: Joseph Melber <[email protected]>

Development

Successfully merging this pull request may close these issues.

JIT cache is not atomic

5 participants