
v1.0.1



nv-nmailhot released this 14 Apr 19:56
d196ff6

1.0.1

Summary

NVIDIA® NIXL Release 1.0.1 is a targeted maintenance release focusing on NIXL-EP stability fixes, libfabric transport reliability improvements, and build/packaging improvements across UCX, Python wheel, and Docker environments.

NIXL-EP Fixes

  • Fix Destruction Flows: Fixed resource cleanup and destruction ordering in NIXL-EP to prevent crashes and resource leaks during shutdown (#1452).
  • Fix Signaling Buffer Corruption During Elastic Scale-Up: Fixed a signaling buffer corruption issue in NIXL-EP that could occur when new nodes join during elastic scale-up, ensuring correct buffer state across topology changes (#1453).
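These notes do not describe NIXL-EP's internal teardown sequence. As a generic illustration of destruction ordering, and not NIXL's own code, the sketch below assumes three hypothetical resources, each depending on the one created before it. It uses Python's `contextlib.ExitStack`, which runs registered callbacks in last-in, first-out order, so dependents are released before the resources they rely on. The `Resource` class and the names `agent`, `memory_registration`, and `signal_buffer` are illustrative assumptions.

```python
from contextlib import ExitStack

teardown_log = []

class Resource:
    """Hypothetical stand-in for an EP resource (agent, registration, buffer)."""

    def __init__(self, name, depends_on=None):
        self.name = name
        self.depends_on = depends_on
        self.alive = True

    def close(self):
        # A dependent must be closed while the resource it depends on is still alive.
        if self.depends_on is not None and not self.depends_on.alive:
            raise RuntimeError(
                f"{self.name} closed after its dependency {self.depends_on.name}"
            )
        self.alive = False
        teardown_log.append(self.name)

def run_and_shutdown():
    with ExitStack() as stack:
        agent = Resource("agent")
        stack.callback(agent.close)
        registration = Resource("memory_registration", depends_on=agent)
        stack.callback(registration.close)
        signal_buffer = Resource("signal_buffer", depends_on=registration)
        stack.callback(signal_buffer.close)
        # ... communication work would happen here ...
    # On exit (normal or via exception), ExitStack runs callbacks in LIFO order.

run_and_shutdown()
print(teardown_log)  # ['signal_buffer', 'memory_registration', 'agent']
```

Registering each cleanup immediately after acquisition means the reverse order falls out automatically, and cleanup still runs if an exception interrupts the work.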

Libfabric Fixes

  • Fix Notification Override on Transfer Handle Repost: Fixed an issue in the libfabric backend where updated notification messages were ignored when transfer handles were reposted, causing reposted transfers to always use the original notification from initial preparation time (#1482, #1433).
  • Fix Endpoint Thread Safety: Added proper mutex locking for all endpoint access in the libfabric backend to satisfy FI_THREAD_COMPLETION thread-safety requirements, preventing potential race conditions during concurrent I/O operations (#1483, #1457).
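As a rough model of the endpoint locking fix, and not the libfabric backend's actual code, the sketch below wraps a hypothetical, deliberately non-thread-safe `FakeEndpoint` in a `GuardedEndpoint` that holds a single `threading.Lock` around every access. Under libfabric's `FI_THREAD_COMPLETION` model, the application rather than the provider is expected to serialize access to objects that share a completion context; here one lock plays that role. The class names, the `post()` method, and the `hammer()` helper are illustrative assumptions.

```python
import threading
import time

class FakeEndpoint:
    """Hypothetical non-thread-safe endpoint. It counts overlapping calls
    instead of corrupting state, so a missing lock is observable."""

    def __init__(self):
        self._busy = False
        self.posted = 0
        self.overlaps = 0

    def post(self, op):
        if self._busy:
            self.overlaps += 1  # another thread entered while a call was in flight
        self._busy = True
        time.sleep(0.0001)      # widen the window in which a race could occur
        self.posted += 1
        self._busy = False

class GuardedEndpoint:
    """Serializes every access to the wrapped endpoint with one mutex."""

    def __init__(self, ep):
        self._ep = ep
        self._lock = threading.Lock()

    def post(self, op):
        with self._lock:
            self._ep.post(op)

def hammer(endpoint, threads=8, ops_per_thread=50):
    """Issue posts from several threads at once (hypothetical load generator)."""
    def worker():
        for i in range(ops_per_thread):
            endpoint.post(i)

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

raw = FakeEndpoint()
hammer(GuardedEndpoint(raw))
print(raw.posted, raw.overlaps)  # 400 0
```

Passing `raw` to `hammer()` directly, without the wrapper, removes the serialization and lets calls overlap. Guarding every access path with the same lock, rather than only the hot I/O path, is what closes the gap.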

Build & Packaging

  • Enable UCX EP Support in Python Wheel Build: Added UCX endpoint support to the Python wheel build, enabling NIXL-EP functionality for pip-installed deployments (#1440).
  • Disable gdrcopy in UCX Build: Disabled gdrcopy in the UCX build to avoid linkage conflicts in environments where gdrcopy is not available or not needed (#1436).
  • Fix Abseil Version Conflicts: Resolved Abseil version conflicts in NIXL builds and Docker images that could cause linker errors or runtime symbol mismatches (#1432).
  • Bump RDMA Memory Check UCX Version: Updated the UCX version used for RDMA memory checks to align with the latest supported UCX release (#1445).
  • Pin Torch Version to 2.11: Pinned the PyTorch dependency to version 2.11 for reproducible builds and compatibility (#1471).
  • Add pkg-config Install: Added missing pkg-config installation to the build environment, fixing build failures in minimal container images (#1450).
  • Fix Dependency Issues: Removed the strict PyTorch version check during module initialization to allow broader compatibility, and unified UCX checkout behavior to consistently use the configured UCX reference (#1488).
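The notes say only that a strict PyTorch version check was removed from module initialization. They do not say what, if anything, replaced it. The sketch below contrasts two hypothetical import-time checks: a strict one that refuses to load on any major.minor mismatch, and a lenient one that warns and continues. `TESTED_TORCH` reuses the 2.11 pin mentioned above. The function names are assumptions, and the installed version is passed in as a string so the sketch runs without PyTorch.

```python
import warnings

TESTED_TORCH = (2, 11)  # hypothetical: version the package was built and tested against

def parse_major_minor(version):
    """'2.11.0+cu128' -> (2, 11). Ignores patch level and local build tags."""
    parts = version.split("+", 1)[0].split(".")
    return int(parts[0]), int(parts[1])

def strict_check(installed):
    # Strict behavior: refuse to load on any major.minor mismatch.
    if parse_major_minor(installed) != TESTED_TORCH:
        raise ImportError(f"torch {installed} is not supported; expected {TESTED_TORCH}")

def lenient_check(installed):
    # Lenient behavior: warn on mismatch but let the module load.
    if parse_major_minor(installed) != TESTED_TORCH:
        warnings.warn(f"torch {installed} differs from tested {TESTED_TORCH}; continuing")
        return False
    return True

strict_check("2.11.0+cu128")    # passes silently
print(lenient_check("2.10.1"))  # emits a UserWarning, then prints False
```

The strict variant turns every minor PyTorch upgrade into an import failure. The lenient variant keeps the signal but leaves the decision to the user.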

Full Changelog: 1.0.0...1.0.1