step() crashes on RTX 5080 (Blackwell) with musculoskeletal models #1280

Hi, I'm trying to run MyoLegs (the 80-muscle model from musclemimic-models) on an RTX 5080 and hitting a consistent crash while the `tendon_velocity` kernel module is being loaded.

### What happens

Calling `mjw.step()` fails with `CUDA error 700: illegal memory access` during the `tendon_velocity` module load. Simple models (humanoid, small tendon models) work fine — the issue only shows up with complex spatial tendon models like MyoLegs (80 tendons, nwrap=324).

### Setup

- RTX 5080 (Blackwell, sm_120)
- Driver 590.48.01 (CUDA 13.1) — also tried 570.211.01 (CUDA 12.8), same result
- mujoco-warp 3.6.0, warp-lang 1.12.1, mujoco 3.6.0
- Python 3.11.9, Ubuntu 22.04
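
For anyone trying to reproduce this, the package versions above can be collected with the standard library alone. This is a minimal sketch: the distribution names are the ones listed in this section, and `collect_versions` is a hypothetical helper, not part of any of these packages.

```python
from importlib import metadata
import platform

# Distribution names as listed in the setup section of this issue.
PACKAGES = ("mujoco-warp", "warp-lang", "mujoco")


def collect_versions(packages=PACKAGES):
    """Return {distribution name: installed version, or 'not installed'}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    # Include the interpreter version alongside the package versions.
    versions["python"] = platform.python_version()
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

The GPU driver and CUDA versions are not covered here; they come from the driver tooling rather than from Python package metadata.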

### Minimal repro

```python
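import mujoco, mujoco_warp as mjw, warp as wp
import musclemimic_models

# Loads MyoFullBody, which fails the same way as MyoLegs (see the table below).
mjm = mujoco.MjModel.from_xml_path(str(musclemimic_models.get_xml_path("myofullbody")))
m = mjw.put_model(mjm)            # copy the model to the GPU
d = mjw.make_data(mjm, nworld=1)  # allocate simulation data for a single world
mjw.step(m, d)  # crashes here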
```

```
Module _tendon_velocity_..._8e72f5af load on device 'cuda:0' took 43.99 ms  (error)
Exception: Failed to load CUDA module '_tendon_velocity__locals__tendon_velocity_8e72f5af'
```

### What I've found so far

I spent some time digging into this and it seems to be a module loading order issue, not a problem with the kernel itself:

- Calling the sub-functions individually (`kinematics()`, `tendon()`, `fwd_velocity()`, etc.) works fine
- But when `forward()` runs them together, `tendon_velocity` fails to load after the larger modules (smooth, constraint, CCD) are already loaded
- If I pre-compile `fwd_velocity()` before anything else, `tendon_velocity` passes — but then a *different* module fails instead
- A simple chain model with the same nv=34 loads 19+ modules successfully, so it's not a hard module count limit
- All built-in test models (test_data/tendon/) pass, but they only have 3-4 tendons
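
The stage-by-stage check described in the bullets above could be scripted roughly as follows. This is a sketch under stated assumptions: `run_stages` is a hypothetical helper, and the stage list in the trailing comment assumes the sub-functions are called as `mjw.kinematics(m, d)` and so on. CUDA reports kernel faults asynchronously, so an illegal memory access can surface at a later API call such as a module load; synchronizing after every stage is one way to rule that out, not a confirmed diagnosis.

```python
import traceback


def run_stages(stages, sync=lambda: None):
    """Run (name, callable) pairs in order.

    Returns the name of the first stage that raises, or None if all succeed.
    `sync` is called after each stage so that asynchronous GPU errors are
    reported at the stage that caused them rather than at a later call.
    """
    for name, fn in stages:
        try:
            fn()
            sync()
        except Exception:
            traceback.print_exc()
            return name
    return None


# Assumed usage with mujoco_warp (not executed here; call signatures are
# an assumption based on the function names mentioned in this issue):
#
#   stages = [
#       ("kinematics", lambda: mjw.kinematics(m, d)),
#       ("tendon", lambda: mjw.tendon(m, d)),
#       ("fwd_velocity", lambda: mjw.fwd_velocity(m, d)),
#   ]
#   print(run_stages(stages, sync=wp.synchronize))
```

If a stage other than the one named in the error message is reported once synchronization is in place, that would point at an earlier kernel rather than at the module load itself.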

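My working hypothesis is that the combination of many large compiled modules and the `launch_tiled` launch of `tendon_velocity` triggers a problem specific to sm_120. The table below also shows that every passing model has nwrap=0, so tendon wrapping may be a factor as well.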
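
Because an illegal memory access leaves the CUDA context unusable for the rest of the process, different module load orders are easiest to compare when each one runs in a fresh interpreter. The following is a minimal sketch of such a harness using only the standard library; `run_isolated` and the snippet names are hypothetical, and the snippets themselves would contain the warm-up order under test.

```python
import subprocess
import sys


def run_isolated(snippets, timeout=600):
    """Run each named Python snippet in its own interpreter.

    Returns {name: exit code}. A non-zero exit code means the snippet
    raised or the process crashed; the tail of its stderr is printed.
    """
    results = {}
    for name, code in snippets.items():
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        results[name] = proc.returncode
        if proc.returncode != 0:
            # Keep only the end of stderr so the report stays readable.
            print(f"[{name}] failed:\n{proc.stderr[-2000:]}")
    return results


if __name__ == "__main__":
    # Placeholder snippets; replace with the repro script in different
    # warm-up orders (for example, fwd_velocity first versus step directly).
    demo = {"passes": "print('ok')", "fails": "raise SystemExit(3)"}
    print(run_isolated(demo))
```

Each entry in the result maps one ordering to its exit code, which makes it straightforward to see whether the failing module moves with the load order.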

| Model | nv | ntendon | nwrap | Result |
|-------|-----|---------|-------|--------|
| test_data/tendon/site.xml | 3 | 4 | 0 | ✅ |
| chain model (28 joints) | 34 | 27 | 0 | ✅ |
| chain (5 joints, 80 tendons) | 11 | 80 | 0 | ✅ |
| MyoLegs (80 muscles) | 34 | 80 | 324 | ❌ |
| MyoFullBody (416 muscles) | 128 | 424 | ~1600 | ❌ |

I saw that Blackwell is officially supported (RTX PRO 6000 benchmarks), so I'm guessing this might be an edge case that hasn't been tested with large spatial tendon models on consumer Blackwell GPUs.

Any pointers would be appreciated — happy to provide more info or test patches.
