Add lock over cache dir#2622

Merged
jgmelber merged 21 commits into main from muhaawad/process-safe-jit
Oct 8, 2025

Conversation

@mawad-amd (Collaborator)

Closes #2621

mawad-amd and others added 5 commits September 30, 2025 16:10
```python
try:
    # Create lock file if it doesn't exist
    os.makedirs(os.path.dirname(lock_file_path), exist_ok=True)
    lock_file = open(lock_file_path, "w")
```
Collaborator

Suggested change:

```diff
-lock_file = open(lock_file_path, "w")
+try:
+    f = os.open(lock_file_path, os.O_CREAT | os.O_EXCL)
+    os.close(f)
+except FileExistsError:
+    pass  # File already exists
+lock_file = open(lock_file_path, "a")
```

I'm suspicious the lock file creation is not atomic, and that is why errors remain. I'm testing a version of this locally and haven't seen any errors...so far.
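For reference, a minimal sketch of what an atomic create-then-flock acquire/release pair could look like. The function names and path handling here are illustrative, not the PR's actual implementation; the key points are that `os.open` with `O_CREAT` creates-or-opens atomically (no truncating `"w"` mode), and that closing the descriptor is enough to release the lock:

```python
import fcntl
import os


def acquire_cache_lock(lock_file_path):
    """Atomically create-or-open the lock file, then take an
    exclusive flock on it. Blocks until the lock is available."""
    os.makedirs(os.path.dirname(lock_file_path), exist_ok=True)
    # os.open with O_CREAT atomically creates the file if missing and
    # opens it otherwise; unlike open(path, "w"), it never truncates,
    # so it cannot race with another process's locked file.
    fd = os.open(lock_file_path, os.O_CREAT | os.O_RDWR)
    lock_file = os.fdopen(fd, "r+")
    fcntl.flock(lock_file.fileno(), fcntl.LOCK_EX)
    return lock_file


def release_cache_lock(lock_file):
    # Closing the descriptor releases the flock; no explicit
    # LOCK_UN is required.
    lock_file.close()
```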

Collaborator Author

Yeah I tested locally and can't reproduce it yet. Was just about to push some debug printfs.

Collaborator

I saw another error twice running tests in a loop:

FAILED ../../../../aie/test/python/cache_functionality.py::test_cache_lambda_functions

seems less frequent, but hard to know for sure.

Collaborator Author

So we did serialize compilation which I think rules out #2544 and anything related to the compiler. I will test locally. I would like to resolve any weird concurrency issues.

Collaborator Author

I think it may be a driver/runtime issue:

```text
Failed: Only 3/5 processes succeeded

Process details:

Process 0: FAILED (return code: 1)
  STDOUT: ERROR: RuntimeError: DRM_IOCTL_AMDXDNA_CREATE_HWCTX IOCTL failed (err=-2): No such file or directory

Process 1: SUCCESS (return code: 0)
  STDOUT: SUCCESS

Process 2: SUCCESS (return code: 0)
  STDOUT: SUCCESS

Process 3: SUCCESS (return code: 0)
  STDOUT: SUCCESS

Process 4: FAILED (return code: 1)
  STDOUT: ERROR: RuntimeError: DRM_IOCTL_AMDXDNA_CREATE_HWCTX IOCTL failed (err=-2): No such file or directory
```

@mawad-amd (Collaborator Author), Oct 2, 2025

I was running into a similar issue the other day; this actually explains the weird behavior I was seeing. There seems to be some limit on how many things we can run in parallel. In #2611 I had to keep lowering the number of cached kernels to 1 and didn't have time to debug it back then. I now think we should be doing something like `make run -JMAX_CONCURRENT_HW_CONTEXTS` or similar.

@mawad-amd (Collaborator Author), Oct 3, 2025

Not sure how to fix this in a robust way at the moment, but I am pretty sure we are running out of hardware contexts. Eddie said the max is 6 hardware contexts on Phoenix.

  • Setting parallel jobs to a lower number should fix this, but it doesn't make sense to lower concurrency for all tests to get one test passing, and doing so would hide other concurrency bugs.
  • Some tests may have multiple kernels in flight, which means multiple contexts in flight; so even if we remove this test, other tests will not always be reliable. Making tests run one kernel at a time doesn't make sense either.
  • Treating the DRM_IOCTL_AMDXDNA_CREATE_HWCTX failure as a non-error is not a solution either.

Let me know your thoughts.
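If the root cause is indeed a hard cap on hardware contexts, one possible direction is to gate context creation behind a counting semaphore shared by the worker processes a test driver spawns. This is only a sketch: `MAX_CONCURRENT_HW_CONTEXTS`, `with_hw_context`, and the `dispatch` callback are hypothetical names, and the value 6 assumes the Phoenix limit mentioned above.

```python
import multiprocessing as mp

# Hypothetical cap based on the discussion: Phoenix reportedly
# supports at most 6 hardware contexts at once.
MAX_CONCURRENT_HW_CONTEXTS = 6

# Created in the parent so it is inherited by (or passed to) the
# worker processes spawned from the same test driver.
hw_ctx_sem = mp.Semaphore(MAX_CONCURRENT_HW_CONTEXTS)


def with_hw_context(dispatch):
    """Run `dispatch` only while holding one of the limited context
    slots, so a context-creation IOCTL is never attempted past the
    hardware limit. Blocks until a slot frees up."""
    with hw_ctx_sem:
        return dispatch()
```

This would not help unrelated processes on the same machine (the semaphore is scoped to one process tree), but it would keep a single test run from exceeding the limit.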

@ypapadop-amd (Collaborator)

filelock uses a different approach to locking (https://github.com/tox-dev/filelock/blob/main/src/filelock/_unix.py), where the file is closed when the lock is released. When I implemented file locking in C++ it was the same: you close the file at lock release. I don't remember the reason why, but that was the recommendation I've read. Maybe it's causing some of the problems?

```python
try:
    fcntl.flock(lock_file.fileno(), fcntl.LOCK_UN)
except OSError:
    pass  # Ignore errors when releasing lock
```
@andrej (Collaborator), Oct 3, 2025

Why? I think we'd rather know if it fails and why.

From man(2) flock:

> Furthermore, the lock is released either by an explicit LOCK_UN operation on any of these duplicate file descriptors, or when all such file descriptors have been closed.

Keeping in mind that lock_file is presumably the only file descriptor for the lock file in the current process, the explicit unlock here is redundant with the lock_file.close() on the next line, which will also release the lock.
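The man-page behavior is easy to demonstrate: closing the last descriptor drops the lock without any explicit LOCK_UN. A small self-contained check (the path and function name are arbitrary; this illustrates flock semantics, not project code):

```python
import fcntl
import os


def lock_released_on_close(path):
    """Return True if a second, independently opened descriptor is
    blocked while the first holds the flock, and succeeds once the
    first descriptor is simply closed (no LOCK_UN)."""
    a = os.open(path, os.O_CREAT | os.O_RDWR)
    b = os.open(path, os.O_CREAT | os.O_RDWR)  # independent open(), own lock
    fcntl.flock(a, fcntl.LOCK_EX)
    try:
        # Separate descriptors from separate open() calls contend
        # with each other, even within one process.
        fcntl.flock(b, fcntl.LOCK_EX | fcntl.LOCK_NB)
        contended = False
    except BlockingIOError:
        contended = True
    os.close(a)  # close alone releases the lock
    fcntl.flock(b, fcntl.LOCK_EX | fcntl.LOCK_NB)  # now succeeds
    os.close(b)
    return contended
```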

Comment on lines +117 to +118:

```sh
elif [ x"${{ matrix.runner_type }}" == x"amd7940hs" ]; then
  LIT_OPTS="-j6 $LIT_OPTS"
```
@fifield (Collaborator), Oct 6, 2025

I'd rather not do this. It adds an artificial bottleneck to something that's already a bottleneck. In practice we don't see concurrency failures with -j12. Testing in mlir-aie is set up to control concurrency at the top level, and tests are supposed to be good citizens by using 1 thread or process so they don't blow things up.

I suggest pulling the concurrency tests into a new top level target with no inter-test concurrency, so that those tests can control it for themselves.

Collaborator Author

Addressed here 67dbfc3

mawad-amd and others added 6 commits October 6, 2025 12:46
@fifield fifield self-requested a review October 6, 2025 21:06
@jgmelber jgmelber enabled auto-merge October 8, 2025 16:56
@jgmelber jgmelber added this pull request to the merge queue Oct 8, 2025
Merged via the queue into main with commit 74a54d9 Oct 8, 2025
55 of 61 checks passed
@jgmelber jgmelber deleted the muhaawad/process-safe-jit branch October 8, 2025 17:25
fifield added a commit to fifield/mlir-aie that referenced this pull request Nov 12, 2025
Signed-off-by: Muhammad Awad <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jeff Fifield <[email protected]>
Co-authored-by: Joseph Melber <[email protected]>

Development

Successfully merging this pull request may close these issues.

JIT cache is not atomic

5 participants