Conversation
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
python/iron/jit.py
    try:
        # Create lock file if it doesn't exist
        os.makedirs(os.path.dirname(lock_file_path), exist_ok=True)
        lock_file = open(lock_file_path, "w")
Suggested change:

    -    lock_file = open(lock_file_path, "w")
    +    try:
    +        f = os.open(lock_file_path, os.O_CREAT | os.O_EXCL)
    +        os.close(f)
    +    except FileExistsError:
    +        pass  # File already exists
    +    lock_file = open(lock_file_path, "a")
I'm suspicious the lock file creation is not atomic, and that is why errors remain. I'm testing a version of this locally and haven't seen any errors...so far.
Yeah I tested locally and can't reproduce it yet. Was just about to push some debug printfs.
I saw another error twice running tests in a loop:
FAILED ../../../../aie/test/python/cache_functionality.py::test_cache_lambda_functions
seems less frequent, but hard to know for sure.
So we did serialize compilation, which I think rules out #2544 and anything related to the compiler. I will test locally. I would like to resolve any weird concurrency issues.
I think it may be a driver/runtime issue:
Failed: Only 3/5 processes succeeded
E
E Process details:
E
E Process 0: FAILED (return code: 1)
E STDOUT: ERROR: RuntimeError: DRM_IOCTL_AMDXDNA_CREATE_HWCTX IOCTL failed (err=-2): No such file or directory
E
E Process 1: SUCCESS (return code: 0)
E STDOUT: SUCCESS
E
E Process 2: SUCCESS (return code: 0)
E STDOUT: SUCCESS
E
E Process 3: SUCCESS (return code: 0)
E STDOUT: SUCCESS
E
E Process 4: FAILED (return code: 1)
E STDOUT: ERROR: RuntimeError: DRM_IOCTL_AMDXDNA_CREATE_HWCTX IOCTL failed (err=-2): No such file or directory
I was running into a similar issue the other day. This actually explains the weird behavior I was seeing. There seems to be some limit on how many things we can run in parallel. In #2611 I had to keep lowering the number of cached kernels down to 1 and didn't have time to debug it back then. I now think we should be doing something like make run -JMAX_CONCURRENT_HW_CONTEXTS or similar.
Not sure how to fix this in a robust way at the moment, but I am pretty sure we are running out of hardware contexts. Eddie said the max is 6 hardware contexts on Phoenix.
- Setting parallel jobs to a lower number should fix this, but it doesn't make sense to lower concurrency for all tests just to get one test passing, and doing so would hide other concurrency bugs.
- Some tests may have multiple kernels in flight, which means multiple contexts in flight, so even if we remove this test, other tests will not always be reliable. Making tests run one kernel at a time doesn't make sense either.
- Treating the DRM_IOCTL_AMDXDNA_CREATE_HWCTX failure as a non-error is not a solution either.
Let me know your thoughts.
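One possible mitigation, sketched below, is to cap the number of simultaneously live hardware contexts with a cross-process semaphore when a test spawns its own workers. This is not mlir-aie code; the cap value (derived from the 6-context limit mentioned above) and all names are assumptions for illustration:

```python
import multiprocessing as mp

# Assumed cap: the thread above reports a max of 6 hardware contexts on
# Phoenix, so we leave one slot of headroom below that limit.
MAX_HW_CONTEXTS = 6


def _worker(sem, queue, idx):
    # Each worker holds a semaphore slot for the lifetime of its
    # (simulated) hardware context, so at most MAX_HW_CONTEXTS - 1
    # contexts exist concurrently.
    with sem:
        # Stand-in for: create hw context, run kernel, destroy context.
        queue.put((idx, "SUCCESS"))


def run_processes(n):
    sem = mp.BoundedSemaphore(MAX_HW_CONTEXTS - 1)
    queue = mp.Queue()
    procs = [mp.Process(target=_worker, args=(sem, queue, i)) for i in range(n)]
    for p in procs:
        p.start()
    # Drain results before joining to avoid blocking on a full queue.
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    return sorted(results)


if __name__ == "__main__":
    print(run_processes(5))
```

This only helps when one parent process owns all the workers; unrelated test processes would need a system-wide mechanism (e.g. a file lock per context slot) instead.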
filelock uses a different approach to locking (https://github.com/tox-dev/filelock/blob/main/src/filelock/_unix.py), where the file is closed when the lock is released. When I implemented file locking in C++ it was the same: you close the file at lock release. I don't remember the reason why, but that was the recommendation I've read. Maybe it's causing some of the problems?
    try:
        fcntl.flock(lock_file.fileno(), fcntl.LOCK_UN)
    except OSError:
        pass  # Ignore errors when releasing lock
Why? I think we'd rather know if it fails and why.
From man(2) flock:
Furthermore, the lock is released either by an explicit LOCK_UN operation on any of these duplicate file descriptors, or when all such file descriptors have been closed.
Keeping in mind that lock_file is presumably the only file descriptor for the lock file in the current process, the explicit unlock here is redundant with the lock_file.close() on the next line, which also releases the lock.
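The flock(2) behavior quoted above is easy to demonstrate: close() alone releases the lock, so an explicit LOCK_UN just before it does nothing extra. A small self-contained check (assuming Linux and a single descriptor per open file description):

```python
import fcntl
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.lock")

# One descriptor holds the exclusive lock.
holder = os.open(path, os.O_CREAT | os.O_RDWR)
fcntl.flock(holder, fcntl.LOCK_EX)

# A second open() creates a separate open file description,
# so a non-blocking lock attempt on it fails while the lock is held.
probe = os.open(path, os.O_RDWR)
try:
    fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
    locked_while_held = False
except BlockingIOError:
    locked_while_held = True

# No explicit LOCK_UN: closing the descriptor releases the lock.
os.close(holder)

# Now the probe descriptor can take the lock without blocking.
fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
os.close(probe)
```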
    elif [ x"${{ matrix.runner_type }}" == x"amd7940hs" ]; then
      LIT_OPTS="-j6 $LIT_OPTS"
I'd rather not do this. It adds an artificial bottleneck to something that's already a bottleneck. In practice we don't see concurrency failures with -j12. Testing in mlir-aie is set up to control concurrency at the top level, and tests are supposed to be good citizens by using 1 thread or process so as not to blow things up.
I suggest pulling the concurrency tests into a new top level target with no inter-test concurrency, so that those tests can control it for themselves.
Signed-off-by: Muhammad Awad <[email protected]>
Co-authored-by: Jeff Fifield <[email protected]>
Co-authored-by: Joseph Melber <[email protected]>
Closes #2621