step() crashes on RTX 5080 (Blackwell) with musculoskeletal models #1280

@YONGJINJO

Hi, I'm trying to run MyoLegs (80-muscle model from musclemimic-models) on an RTX 5080 and hitting a consistent crash in tendon_velocity kernel loading.

What happens

Calling mjw.step() fails with CUDA error 700: illegal memory access during the tendon_velocity module load. Simple models (humanoid, small tendon models) work fine — the issue only shows up with complex spatial tendon models like MyoLegs (80 tendons, nwrap=324).

Setup

  • RTX 5080 (Blackwell, sm_120)
  • Driver 590.48.01 (CUDA 13.1) — also tried 570.211.01 (CUDA 12.8), same result
  • mujoco-warp 3.6.0, warp-lang 1.12.1, mujoco 3.6.0
  • Python 3.11.9, Ubuntu 22.04

Minimal repro

import mujoco, mujoco_warp as mjw, warp as wp
import musclemimic_models

mjm = mujoco.MjModel.from_xml_path(str(musclemimic_models.get_xml_path("myofullbody")))
m = mjw.put_model(mjm)
d = mjw.make_data(mjm, nworld=1)
mjw.step(m, d)  # crashes here

Module _tendon_velocity_..._8e72f5af load on device 'cuda:0' took 43.99 ms  (error)
Exception: Failed to load CUDA module '_tendon_velocity__locals__tendon_velocity_8e72f5af'

What I've found so far

I spent some time digging into this and it seems to be a module loading order issue, not a problem with the kernel itself:

  • Calling sub-functions individually (kinematics(), tendon(), fwd_velocity(), etc.) all work fine
  • But when forward() runs them together, tendon_velocity fails to load after the larger modules (smooth, constraint, CCD) are already loaded
  • If I pre-compile fwd_velocity() before anything else, tendon_velocity passes — but then a different module fails instead
  • A simple chain model with the same nv=34 loads 19+ modules successfully, so it's not a hard module count limit
  • All built-in test models (test_data/tendon/) pass, but they only have 3-4 tendons
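The per-stage checks in the list above could be scripted roughly as follows. This is only a sketch: `first_failing_stage` and `mjw_stages` are hypothetical helper names, and the `(m, d)` call signatures for `mjw.kinematics`, `mjw.tendon` and `mjw.fwd_velocity` are assumed to mirror `mjw.step(m, d)`. Passing `wp.synchronize` as `sync` forces pending GPU work to finish after each stage, so an asynchronous CUDA error is reported at the stage that caused it rather than at a later module load.

```python
from typing import Callable, Iterable, Optional, Tuple

Stage = Tuple[str, Callable[[], None]]


def first_failing_stage(
    stages: Iterable[Stage],
    sync: Callable[[], None] = lambda: None,
) -> Optional[Tuple[str, Exception]]:
    """Run stages in order; return (name, exception) for the first one that raises.

    Returns None if every stage completes. Execution stops at the first failure,
    because after a CUDA error the context is usually unusable anyway.
    """
    for name, fn in stages:
        try:
            fn()
            sync()  # wait for queued GPU work so async errors surface here
        except Exception as exc:
            return name, exc
    return None


def mjw_stages(m, d):
    """Hypothetical stage list built from the sub-functions named in this issue.

    Assumption: each function takes (m, d) like mjw.step does.
    """
    import mujoco_warp as mjw  # imported lazily so the helper above stays standalone

    return [
        ("kinematics", lambda: mjw.kinematics(m, d)),
        ("tendon", lambda: mjw.tendon(m, d)),
        ("fwd_velocity", lambda: mjw.fwd_velocity(m, d)),
    ]
```

With the model and data from the repro, usage would look like `first_failing_stage(mjw_stages(m, d), sync=wp.synchronize)`. Testing a different load order means reordering the list, ideally in a fresh Python process per order since a failed run can leave the CUDA context in a bad state.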

So it looks like the combination of many large compiled modules plus launch_tiled for tendon_velocity triggers a problem specific to sm_120.

Model                          nv    ntendon  nwrap   Result
test_data/tendon/site.xml      3     4        0       pass
chain model (28 joints)        34    27       0       pass
chain (5 joints, 80 tendons)   11    80       0
MyoLegs (80 muscles)           34    80       324     crash
MyoFullBody (416 muscles)      128   424      ~1600   crash
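For anyone comparing their own model against the table, the three size columns can be read straight off the compiled model. A small sketch, assuming an object with the `nv`, `ntendon` and `nwrap` fields of `mujoco.MjModel`; `tendon_dims` and `format_row` are hypothetical helper names, and the stand-in object below just reuses the MyoLegs numbers from the table so the snippet runs without MuJoCo installed.

```python
from types import SimpleNamespace


def tendon_dims(mjm):
    """Return the size fields used in the table for an MjModel-like object."""
    return {"nv": int(mjm.nv), "ntendon": int(mjm.ntendon), "nwrap": int(mjm.nwrap)}


def format_row(name, mjm):
    """Format one model as a single comparison line."""
    dims = tendon_dims(mjm)
    return f"{name}: nv={dims['nv']} ntendon={dims['ntendon']} nwrap={dims['nwrap']}"


# Stand-in with the MyoLegs sizes reported in the table; a real run would pass
# the result of mujoco.MjModel.from_xml_path(...) instead.
myolegs_like = SimpleNamespace(nv=34, ntendon=80, nwrap=324)
print(format_row("MyoLegs", myolegs_like))
```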

I saw that Blackwell is officially supported (RTX PRO 6000 benchmarks), so I'm guessing this might be an edge case that hasn't been tested with large spatial tendon models on consumer Blackwell GPUs.

Any pointers would be appreciated — happy to provide more info or test patches.
