Skip to content

Conversation

@jorgem-seq
Copy link
Contributor

…the compilation unit

Signed-off-by: Jorge Maidana [email protected]

@evadot
Copy link
Contributor

evadot commented May 16, 2020

I don't like modifying upstream code, what is your motiviation for that ?

@jorgem-seq
Copy link
Contributor Author

They are used in drivers/gpu/drm/drm_sysctl_freebsd.c, it was my mistake, they should have the #ifdef linux ...

@jorgem-seq
Copy link
Contributor Author

I was thinking that the drm_* files were FreeBSD only, that's why I obviate the explanation.

@jorgem-seq
Copy link
Contributor Author

This is better: https://github.com/jorgem-seq/drm-kmod/commit/b725fb652a2cf967b0bc34f957d4caa4c29ea747
You can close this PR, thanks for your time.

evadot pushed a commit that referenced this pull request May 27, 2020
The idea to break the circular lock dependency is to move allocate_mqd
out of dqm lock protection. See callstack #1 below.

[   59.510149] [drm] Initialized amdgpu 3.30.0 20150101 for 0000:04:00.0 on minor 0

[  513.604034] ======================================================
[  513.604205] WARNING: possible circular locking dependency detected
[  513.604375] 4.18.0-kfd-root #2 Tainted: G        W
[  513.604530] ------------------------------------------------------
[  513.604699] kswapd0/611 is trying to acquire lock:
[  513.604840] 00000000d254022e (&dqm->lock_hidden){+.+.}, at: evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.605150]
               but task is already holding lock:
[  513.605307] 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
[  513.605540]
               which lock already depends on the new lock.

[  513.605747]
               the existing dependency chain (in reverse order) is:
[  513.605944]
               -> #4 (&anon_vma->rwsem){++++}:
[  513.606106]        __vma_adjust+0x147/0x7f0
[  513.606231]        __split_vma+0x179/0x190
[  513.606353]        mprotect_fixup+0x217/0x260
[  513.606553]        do_mprotect_pkey+0x211/0x380
[  513.606752]        __x64_sys_mprotect+0x1b/0x20
[  513.606954]        do_syscall_64+0x50/0x1a0
[  513.607149]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.607380]
               -> #3 (&mapping->i_mmap_rwsem){++++}:
[  513.607678]        rmap_walk_file+0x1f0/0x280
[  513.607887]        page_referenced+0xdd/0x180
[  513.608081]        shrink_page_list+0x853/0xcb0
[  513.608279]        shrink_inactive_list+0x33b/0x700
[  513.608483]        shrink_node_memcg+0x37a/0x7f0
[  513.608682]        shrink_node+0xd8/0x490
[  513.608869]        balance_pgdat+0x18b/0x3b0
[  513.609062]        kswapd+0x203/0x5c0
[  513.609241]        kthread+0x100/0x140
[  513.609420]        ret_from_fork+0x24/0x30
[  513.609607]
               -> #2 (fs_reclaim){+.+.}:
[  513.609883]        kmem_cache_alloc_trace+0x34/0x2e0
[  513.610093]        reservation_object_reserve_shared+0x139/0x300
[  513.610326]        ttm_bo_init_reserved+0x291/0x480 [ttm]
[  513.610567]        amdgpu_bo_do_create+0x1d2/0x650 [amdgpu]
[  513.610811]        amdgpu_bo_create+0x40/0x1f0 [amdgpu]
[  513.611041]        amdgpu_bo_create_reserved+0x249/0x2d0 [amdgpu]
[  513.611290]        amdgpu_bo_create_kernel+0x12/0x70 [amdgpu]
[  513.611584]        amdgpu_ttm_init+0x2cb/0x560 [amdgpu]
[  513.611823]        gmc_v9_0_sw_init+0x400/0x750 [amdgpu]
[  513.612491]        amdgpu_device_init+0x14eb/0x1990 [amdgpu]
[  513.612730]        amdgpu_driver_load_kms+0x78/0x290 [amdgpu]
[  513.612958]        drm_dev_register+0x111/0x1a0
[  513.613171]        amdgpu_pci_probe+0x11c/0x1e0 [amdgpu]
[  513.613389]        local_pci_probe+0x3f/0x90
[  513.613581]        pci_device_probe+0x102/0x1c0
[  513.613779]        driver_probe_device+0x2a7/0x480
[  513.613984]        __driver_attach+0x10a/0x110
[  513.614179]        bus_for_each_dev+0x67/0xc0
[  513.614372]        bus_add_driver+0x1eb/0x260
[  513.614565]        driver_register+0x5b/0xe0
[  513.614756]        do_one_initcall+0xac/0x357
[  513.614952]        do_init_module+0x5b/0x213
[  513.615145]        load_module+0x2542/0x2d30
[  513.615337]        __do_sys_finit_module+0xd2/0x100
[  513.615541]        do_syscall_64+0x50/0x1a0
[  513.615731]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.615963]
               -> #1 (reservation_ww_class_mutex){+.+.}:
[  513.616293]        amdgpu_amdkfd_alloc_gtt_mem+0xcf/0x2c0 [amdgpu]
[  513.616554]        init_mqd+0x223/0x260 [amdgpu]
[  513.616779]        create_queue_nocpsch+0x4d9/0x600 [amdgpu]
[  513.617031]        pqm_create_queue+0x37c/0x520 [amdgpu]
[  513.617270]        kfd_ioctl_create_queue+0x2f9/0x650 [amdgpu]
[  513.617522]        kfd_ioctl+0x202/0x350 [amdgpu]
[  513.617724]        do_vfs_ioctl+0x9f/0x6c0
[  513.617914]        ksys_ioctl+0x66/0x70
[  513.618095]        __x64_sys_ioctl+0x16/0x20
[  513.618286]        do_syscall_64+0x50/0x1a0
[  513.618476]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.618695]
               -> #0 (&dqm->lock_hidden){+.+.}:
[  513.618984]        __mutex_lock+0x98/0x970
[  513.619197]        evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.619459]        kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
[  513.619710]        kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
[  513.620103]        amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
[  513.620363]        amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
[  513.620614]        __mmu_notifier_invalidate_range_start+0x70/0xb0
[  513.620851]        try_to_unmap_one+0x7fc/0x8f0
[  513.621049]        rmap_walk_anon+0x121/0x290
[  513.621242]        try_to_unmap+0x93/0xf0
[  513.621428]        shrink_page_list+0x606/0xcb0
[  513.621625]        shrink_inactive_list+0x33b/0x700
[  513.621835]        shrink_node_memcg+0x37a/0x7f0
[  513.622034]        shrink_node+0xd8/0x490
[  513.622219]        balance_pgdat+0x18b/0x3b0
[  513.622410]        kswapd+0x203/0x5c0
[  513.622589]        kthread+0x100/0x140
[  513.622769]        ret_from_fork+0x24/0x30
[  513.622957]
               other info that might help us debug this:

[  513.623354] Chain exists of:
                 &dqm->lock_hidden --> &mapping->i_mmap_rwsem --> &anon_vma->rwsem

[  513.623900]  Possible unsafe locking scenario:

[  513.624189]        CPU0                    CPU1
[  513.624397]        ----                    ----
[  513.624594]   lock(&anon_vma->rwsem);
[  513.624771]                                lock(&mapping->i_mmap_rwsem);
[  513.625020]                                lock(&anon_vma->rwsem);
[  513.625253]   lock(&dqm->lock_hidden);
[  513.625433]
                *** DEADLOCK ***

[  513.625783] 3 locks held by kswapd0/611:
[  513.625967]  #0: 00000000f14edf84 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
[  513.626309]  #1: 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
[  513.626671]  #2: 0000000067b5cd12 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x5/0xb0
[  513.627037]
               stack backtrace:
[  513.627292] CPU: 0 PID: 611 Comm: kswapd0 Tainted: G        W         4.18.0-kfd-root #2
[  513.627632] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  513.627990] Call Trace:
[  513.628143]  dump_stack+0x7c/0xbb
[  513.628315]  print_circular_bug.isra.37+0x21b/0x228
[  513.628581]  __lock_acquire+0xf7d/0x1470
[  513.628782]  ? unwind_next_frame+0x6c/0x4f0
[  513.628974]  ? lock_acquire+0xec/0x1e0
[  513.629154]  lock_acquire+0xec/0x1e0
[  513.629357]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.629587]  __mutex_lock+0x98/0x970
[  513.629790]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630047]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630309]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630562]  evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630816]  kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
[  513.631057]  kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
[  513.631288]  amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
[  513.631536]  amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
[  513.632076]  __mmu_notifier_invalidate_range_start+0x70/0xb0
[  513.632299]  try_to_unmap_one+0x7fc/0x8f0
[  513.632487]  ? page_lock_anon_vma_read+0x68/0x250
[  513.632690]  rmap_walk_anon+0x121/0x290
[  513.632875]  try_to_unmap+0x93/0xf0
[  513.633050]  ? page_remove_rmap+0x330/0x330
[  513.633239]  ? rcu_read_unlock+0x60/0x60
[  513.633422]  ? page_get_anon_vma+0x160/0x160
[  513.633613]  shrink_page_list+0x606/0xcb0
[  513.633800]  shrink_inactive_list+0x33b/0x700
[  513.633997]  shrink_node_memcg+0x37a/0x7f0
[  513.634186]  ? shrink_node+0xd8/0x490
[  513.634363]  shrink_node+0xd8/0x490
[  513.634537]  balance_pgdat+0x18b/0x3b0
[  513.634718]  kswapd+0x203/0x5c0
[  513.634887]  ? wait_woken+0xb0/0xb0
[  513.635062]  kthread+0x100/0x140
[  513.635231]  ? balance_pgdat+0x3b0/0x3b0
[  513.635414]  ? kthread_delayed_work_timer_fn+0x80/0x80
[  513.635626]  ret_from_fork+0x24/0x30
[  513.636042] Evicting PASID 32768 queues
[  513.936236] Restoring PASID 32768 queues
[  524.708912] Evicting PASID 32768 queues
[  524.999875] Restoring PASID 32768 queues

Signed-off-by: Oak Zeng <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
evadot pushed a commit that referenced this pull request May 27, 2020
Acquiring drm_client_dev.modeset_mutex after the locks in drm_fb_helper.dev
creates a deadlock with drm_setup_crtcs() as shown below:

  [    4.959319] fbcon: radeondrmfb (fb0) is primary device
  [    4.993952] Console: switching to colour frame buffer device 240x67
  [    4.994040]
  [    4.994041] ======================================================
  [    4.994041] WARNING: possible circular locking dependency detected
  [    4.994042] 5.2.0-rc4-1-default+ #39 Tainted: G            E
  [    4.994043] ------------------------------------------------------
  [    4.994043] systemd-udevd/369 is trying to acquire lock:
  [    4.994044] 00000000fb622acb (&client->modeset_mutex){+.+.}, at: drm_fb_helper_pan_display+0x103/0x1f0 [drm_kms_helper]
  [    4.994055]
  [    4.994055] but task is already holding lock:
  [    4.994055] 0000000028767ae4 (crtc_ww_class_mutex){+.+.}, at: drm_modeset_lock+0x42/0xf0 [drm]
  [    4.994072]
  [    4.994072] which lock already depends on the new lock.
  [    4.994072]
  [    4.994072]
  [    4.994072] the existing dependency chain (in reverse order) is:
  [    4.994073]
  [    4.994073] -> #3 (crtc_ww_class_mutex){+.+.}:
  [    4.994076]        lock_acquire+0x9e/0x170
  [    4.994079]        __ww_mutex_lock.constprop.18+0x97/0xf40
  [    4.994080]        ww_mutex_lock+0x30/0x90
  [    4.994091]        drm_modeset_lock+0x42/0xf0 [drm]
  [    4.994102]        drm_modeset_lock_all_ctx+0x1f/0xe0 [drm]
  [    4.994113]        drm_modeset_lock_all+0x5e/0x1a0 [drm]
  [    4.994163]        intel_modeset_init+0x60b/0xda0 [i915]
  ..
  [    4.994253]
  [    4.994253] -> #2 (crtc_ww_class_acquire){+.+.}:
  [    4.994255]        lock_acquire+0x9e/0x170
  [    4.994270]        drm_modeset_acquire_init+0xcc/0x100 [drm]
  [    4.994280]        drm_modeset_lock_all+0x44/0x1a0 [drm]
  [    4.994320]        intel_modeset_init+0x60b/0xda0 [i915]
  ..
  [    4.994403]
  [    4.994403] -> #1 (&dev->mode_config.mutex){+.+.}:
  [    4.994405]        lock_acquire+0x9e/0x170
  [    4.994408]        __mutex_lock+0x62/0x8c0
  [    4.994413]        drm_setup_crtcs+0x17c/0xc50 [drm_kms_helper]
  [    4.994418]        __drm_fb_helper_initial_config_and_unlock+0x34/0x530 [drm_kms_helper]
  [    4.994450]        radeon_fbdev_init+0x110/0x130 [radeon]
  ..
  [    4.994535]
  [    4.994535] -> #0 (&client->modeset_mutex){+.+.}:
  [    4.994537]        __lock_acquire+0xa85/0xe90
  [    4.994538]        lock_acquire+0x9e/0x170
  [    4.994540]        __mutex_lock+0x62/0x8c0
  [    4.994545]        drm_fb_helper_pan_display+0x103/0x1f0 [drm_kms_helper]
  [    4.994547]        fb_pan_display+0x92/0x120
  [    4.994549]        bit_update_start+0x1a/0x40
  [    4.994550]        fbcon_switch+0x392/0x580
  [    4.994552]        redraw_screen+0x12c/0x220
  [    4.994553]        do_bind_con_driver.cold.30+0xe1/0x10d
  [    4.994554]        do_take_over_console+0x113/0x190
  [    4.994555]        do_fbcon_takeover+0x58/0xb0
  [    4.994557]        notifier_call_chain+0x47/0x70
  [    4.994558]        blocking_notifier_call_chain+0x44/0x60
  [    4.994559]        register_framebuffer+0x231/0x310
  [    4.994564]        __drm_fb_helper_initial_config_and_unlock+0x2fd/0x530 [drm_kms_helper]
  [    4.994590]        radeon_fbdev_init+0x110/0x130 [radeon]
  ..

This problem was introduced in

  d81294afe	drm/fb-helper: Remove drm_fb_helper_crtc

Reversing the lock ordering in pan_display_legacy() fixes the issue.

Fixes: d81294afeecd ("drm/fb-helper: Remove drm_fb_helper_crtc")
Cc: Noralf Trønnes <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Maxime Ripard <[email protected]>
Cc: Sean Paul <[email protected]>
Cc: [email protected]
Signed-off-by: Thomas Zimmermann <[email protected]>
Suggested-by: Daniel Vetter <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
evadot pushed a commit that referenced this pull request May 27, 2020
The idea to break the circular lock dependency is to temporarily drop
dqm lock before calling allocate_mqd. See callstack #1 below.

[   59.510149] [drm] Initialized amdgpu 3.30.0 20150101 for 0000:04:00.0 on minor 0

[  513.604034] ======================================================
[  513.604205] WARNING: possible circular locking dependency detected
[  513.604375] 4.18.0-kfd-root #2 Tainted: G        W
[  513.604530] ------------------------------------------------------
[  513.604699] kswapd0/611 is trying to acquire lock:
[  513.604840] 00000000d254022e (&dqm->lock_hidden){+.+.}, at: evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.605150]
               but task is already holding lock:
[  513.605307] 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
[  513.605540]
               which lock already depends on the new lock.

[  513.605747]
               the existing dependency chain (in reverse order) is:
[  513.605944]
               -> #4 (&anon_vma->rwsem){++++}:
[  513.606106]        __vma_adjust+0x147/0x7f0
[  513.606231]        __split_vma+0x179/0x190
[  513.606353]        mprotect_fixup+0x217/0x260
[  513.606553]        do_mprotect_pkey+0x211/0x380
[  513.606752]        __x64_sys_mprotect+0x1b/0x20
[  513.606954]        do_syscall_64+0x50/0x1a0
[  513.607149]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.607380]
               -> #3 (&mapping->i_mmap_rwsem){++++}:
[  513.607678]        rmap_walk_file+0x1f0/0x280
[  513.607887]        page_referenced+0xdd/0x180
[  513.608081]        shrink_page_list+0x853/0xcb0
[  513.608279]        shrink_inactive_list+0x33b/0x700
[  513.608483]        shrink_node_memcg+0x37a/0x7f0
[  513.608682]        shrink_node+0xd8/0x490
[  513.608869]        balance_pgdat+0x18b/0x3b0
[  513.609062]        kswapd+0x203/0x5c0
[  513.609241]        kthread+0x100/0x140
[  513.609420]        ret_from_fork+0x24/0x30
[  513.609607]
               -> #2 (fs_reclaim){+.+.}:
[  513.609883]        kmem_cache_alloc_trace+0x34/0x2e0
[  513.610093]        reservation_object_reserve_shared+0x139/0x300
[  513.610326]        ttm_bo_init_reserved+0x291/0x480 [ttm]
[  513.610567]        amdgpu_bo_do_create+0x1d2/0x650 [amdgpu]
[  513.610811]        amdgpu_bo_create+0x40/0x1f0 [amdgpu]
[  513.611041]        amdgpu_bo_create_reserved+0x249/0x2d0 [amdgpu]
[  513.611290]        amdgpu_bo_create_kernel+0x12/0x70 [amdgpu]
[  513.611584]        amdgpu_ttm_init+0x2cb/0x560 [amdgpu]
[  513.611823]        gmc_v9_0_sw_init+0x400/0x750 [amdgpu]
[  513.612491]        amdgpu_device_init+0x14eb/0x1990 [amdgpu]
[  513.612730]        amdgpu_driver_load_kms+0x78/0x290 [amdgpu]
[  513.612958]        drm_dev_register+0x111/0x1a0
[  513.613171]        amdgpu_pci_probe+0x11c/0x1e0 [amdgpu]
[  513.613389]        local_pci_probe+0x3f/0x90
[  513.613581]        pci_device_probe+0x102/0x1c0
[  513.613779]        driver_probe_device+0x2a7/0x480
[  513.613984]        __driver_attach+0x10a/0x110
[  513.614179]        bus_for_each_dev+0x67/0xc0
[  513.614372]        bus_add_driver+0x1eb/0x260
[  513.614565]        driver_register+0x5b/0xe0
[  513.614756]        do_one_initcall+0xac/0x357
[  513.614952]        do_init_module+0x5b/0x213
[  513.615145]        load_module+0x2542/0x2d30
[  513.615337]        __do_sys_finit_module+0xd2/0x100
[  513.615541]        do_syscall_64+0x50/0x1a0
[  513.615731]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.615963]
               -> #1 (reservation_ww_class_mutex){+.+.}:
[  513.616293]        amdgpu_amdkfd_alloc_gtt_mem+0xcf/0x2c0 [amdgpu]
[  513.616554]        init_mqd+0x223/0x260 [amdgpu]
[  513.616779]        create_queue_nocpsch+0x4d9/0x600 [amdgpu]
[  513.617031]        pqm_create_queue+0x37c/0x520 [amdgpu]
[  513.617270]        kfd_ioctl_create_queue+0x2f9/0x650 [amdgpu]
[  513.617522]        kfd_ioctl+0x202/0x350 [amdgpu]
[  513.617724]        do_vfs_ioctl+0x9f/0x6c0
[  513.617914]        ksys_ioctl+0x66/0x70
[  513.618095]        __x64_sys_ioctl+0x16/0x20
[  513.618286]        do_syscall_64+0x50/0x1a0
[  513.618476]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.618695]
               -> #0 (&dqm->lock_hidden){+.+.}:
[  513.618984]        __mutex_lock+0x98/0x970
[  513.619197]        evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.619459]        kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
[  513.619710]        kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
[  513.620103]        amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
[  513.620363]        amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
[  513.620614]        __mmu_notifier_invalidate_range_start+0x70/0xb0
[  513.620851]        try_to_unmap_one+0x7fc/0x8f0
[  513.621049]        rmap_walk_anon+0x121/0x290
[  513.621242]        try_to_unmap+0x93/0xf0
[  513.621428]        shrink_page_list+0x606/0xcb0
[  513.621625]        shrink_inactive_list+0x33b/0x700
[  513.621835]        shrink_node_memcg+0x37a/0x7f0
[  513.622034]        shrink_node+0xd8/0x490
[  513.622219]        balance_pgdat+0x18b/0x3b0
[  513.622410]        kswapd+0x203/0x5c0
[  513.622589]        kthread+0x100/0x140
[  513.622769]        ret_from_fork+0x24/0x30
[  513.622957]
               other info that might help us debug this:

[  513.623354] Chain exists of:
                 &dqm->lock_hidden --> &mapping->i_mmap_rwsem --> &anon_vma->rwsem

[  513.623900]  Possible unsafe locking scenario:

[  513.624189]        CPU0                    CPU1
[  513.624397]        ----                    ----
[  513.624594]   lock(&anon_vma->rwsem);
[  513.624771]                                lock(&mapping->i_mmap_rwsem);
[  513.625020]                                lock(&anon_vma->rwsem);
[  513.625253]   lock(&dqm->lock_hidden);
[  513.625433]
                *** DEADLOCK ***

[  513.625783] 3 locks held by kswapd0/611:
[  513.625967]  #0: 00000000f14edf84 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
[  513.626309]  #1: 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
[  513.626671]  #2: 0000000067b5cd12 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x5/0xb0
[  513.627037]
               stack backtrace:
[  513.627292] CPU: 0 PID: 611 Comm: kswapd0 Tainted: G        W         4.18.0-kfd-root #2
[  513.627632] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  513.627990] Call Trace:
[  513.628143]  dump_stack+0x7c/0xbb
[  513.628315]  print_circular_bug.isra.37+0x21b/0x228
[  513.628581]  __lock_acquire+0xf7d/0x1470
[  513.628782]  ? unwind_next_frame+0x6c/0x4f0
[  513.628974]  ? lock_acquire+0xec/0x1e0
[  513.629154]  lock_acquire+0xec/0x1e0
[  513.629357]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.629587]  __mutex_lock+0x98/0x970
[  513.629790]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630047]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630309]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630562]  evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630816]  kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
[  513.631057]  kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
[  513.631288]  amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
[  513.631536]  amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
[  513.632076]  __mmu_notifier_invalidate_range_start+0x70/0xb0
[  513.632299]  try_to_unmap_one+0x7fc/0x8f0
[  513.632487]  ? page_lock_anon_vma_read+0x68/0x250
[  513.632690]  rmap_walk_anon+0x121/0x290
[  513.632875]  try_to_unmap+0x93/0xf0
[  513.633050]  ? page_remove_rmap+0x330/0x330
[  513.633239]  ? rcu_read_unlock+0x60/0x60
[  513.633422]  ? page_get_anon_vma+0x160/0x160
[  513.633613]  shrink_page_list+0x606/0xcb0
[  513.633800]  shrink_inactive_list+0x33b/0x700
[  513.633997]  shrink_node_memcg+0x37a/0x7f0
[  513.634186]  ? shrink_node+0xd8/0x490
[  513.634363]  shrink_node+0xd8/0x490
[  513.634537]  balance_pgdat+0x18b/0x3b0
[  513.634718]  kswapd+0x203/0x5c0
[  513.634887]  ? wait_woken+0xb0/0xb0
[  513.635062]  kthread+0x100/0x140
[  513.635231]  ? balance_pgdat+0x3b0/0x3b0
[  513.635414]  ? kthread_delayed_work_timer_fn+0x80/0x80
[  513.635626]  ret_from_fork+0x24/0x30
[  513.636042] Evicting PASID 32768 queues
[  513.936236] Restoring PASID 32768 queues
[  524.708912] Evicting PASID 32768 queues
[  524.999875] Restoring PASID 32768 queues

Signed-off-by: Oak Zeng <[email protected]>
Reviewed-by: Philip Yang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
evadot pushed a commit that referenced this pull request Jun 2, 2020
The idea to break the circular lock dependency is to move allocate_mqd
out of dqm lock protection. See callstack #1 below.

[   59.510149] [drm] Initialized amdgpu 3.30.0 20150101 for 0000:04:00.0 on minor 0

[  513.604034] ======================================================
[  513.604205] WARNING: possible circular locking dependency detected
[  513.604375] 4.18.0-kfd-root #2 Tainted: G        W
[  513.604530] ------------------------------------------------------
[  513.604699] kswapd0/611 is trying to acquire lock:
[  513.604840] 00000000d254022e (&dqm->lock_hidden){+.+.}, at: evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.605150]
               but task is already holding lock:
[  513.605307] 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
[  513.605540]
               which lock already depends on the new lock.

[  513.605747]
               the existing dependency chain (in reverse order) is:
[  513.605944]
               -> #4 (&anon_vma->rwsem){++++}:
[  513.606106]        __vma_adjust+0x147/0x7f0
[  513.606231]        __split_vma+0x179/0x190
[  513.606353]        mprotect_fixup+0x217/0x260
[  513.606553]        do_mprotect_pkey+0x211/0x380
[  513.606752]        __x64_sys_mprotect+0x1b/0x20
[  513.606954]        do_syscall_64+0x50/0x1a0
[  513.607149]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.607380]
               -> #3 (&mapping->i_mmap_rwsem){++++}:
[  513.607678]        rmap_walk_file+0x1f0/0x280
[  513.607887]        page_referenced+0xdd/0x180
[  513.608081]        shrink_page_list+0x853/0xcb0
[  513.608279]        shrink_inactive_list+0x33b/0x700
[  513.608483]        shrink_node_memcg+0x37a/0x7f0
[  513.608682]        shrink_node+0xd8/0x490
[  513.608869]        balance_pgdat+0x18b/0x3b0
[  513.609062]        kswapd+0x203/0x5c0
[  513.609241]        kthread+0x100/0x140
[  513.609420]        ret_from_fork+0x24/0x30
[  513.609607]
               -> #2 (fs_reclaim){+.+.}:
[  513.609883]        kmem_cache_alloc_trace+0x34/0x2e0
[  513.610093]        reservation_object_reserve_shared+0x139/0x300
[  513.610326]        ttm_bo_init_reserved+0x291/0x480 [ttm]
[  513.610567]        amdgpu_bo_do_create+0x1d2/0x650 [amdgpu]
[  513.610811]        amdgpu_bo_create+0x40/0x1f0 [amdgpu]
[  513.611041]        amdgpu_bo_create_reserved+0x249/0x2d0 [amdgpu]
[  513.611290]        amdgpu_bo_create_kernel+0x12/0x70 [amdgpu]
[  513.611584]        amdgpu_ttm_init+0x2cb/0x560 [amdgpu]
[  513.611823]        gmc_v9_0_sw_init+0x400/0x750 [amdgpu]
[  513.612491]        amdgpu_device_init+0x14eb/0x1990 [amdgpu]
[  513.612730]        amdgpu_driver_load_kms+0x78/0x290 [amdgpu]
[  513.612958]        drm_dev_register+0x111/0x1a0
[  513.613171]        amdgpu_pci_probe+0x11c/0x1e0 [amdgpu]
[  513.613389]        local_pci_probe+0x3f/0x90
[  513.613581]        pci_device_probe+0x102/0x1c0
[  513.613779]        driver_probe_device+0x2a7/0x480
[  513.613984]        __driver_attach+0x10a/0x110
[  513.614179]        bus_for_each_dev+0x67/0xc0
[  513.614372]        bus_add_driver+0x1eb/0x260
[  513.614565]        driver_register+0x5b/0xe0
[  513.614756]        do_one_initcall+0xac/0x357
[  513.614952]        do_init_module+0x5b/0x213
[  513.615145]        load_module+0x2542/0x2d30
[  513.615337]        __do_sys_finit_module+0xd2/0x100
[  513.615541]        do_syscall_64+0x50/0x1a0
[  513.615731]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.615963]
               -> #1 (reservation_ww_class_mutex){+.+.}:
[  513.616293]        amdgpu_amdkfd_alloc_gtt_mem+0xcf/0x2c0 [amdgpu]
[  513.616554]        init_mqd+0x223/0x260 [amdgpu]
[  513.616779]        create_queue_nocpsch+0x4d9/0x600 [amdgpu]
[  513.617031]        pqm_create_queue+0x37c/0x520 [amdgpu]
[  513.617270]        kfd_ioctl_create_queue+0x2f9/0x650 [amdgpu]
[  513.617522]        kfd_ioctl+0x202/0x350 [amdgpu]
[  513.617724]        do_vfs_ioctl+0x9f/0x6c0
[  513.617914]        ksys_ioctl+0x66/0x70
[  513.618095]        __x64_sys_ioctl+0x16/0x20
[  513.618286]        do_syscall_64+0x50/0x1a0
[  513.618476]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.618695]
               -> #0 (&dqm->lock_hidden){+.+.}:
[  513.618984]        __mutex_lock+0x98/0x970
[  513.619197]        evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.619459]        kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
[  513.619710]        kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
[  513.620103]        amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
[  513.620363]        amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
[  513.620614]        __mmu_notifier_invalidate_range_start+0x70/0xb0
[  513.620851]        try_to_unmap_one+0x7fc/0x8f0
[  513.621049]        rmap_walk_anon+0x121/0x290
[  513.621242]        try_to_unmap+0x93/0xf0
[  513.621428]        shrink_page_list+0x606/0xcb0
[  513.621625]        shrink_inactive_list+0x33b/0x700
[  513.621835]        shrink_node_memcg+0x37a/0x7f0
[  513.622034]        shrink_node+0xd8/0x490
[  513.622219]        balance_pgdat+0x18b/0x3b0
[  513.622410]        kswapd+0x203/0x5c0
[  513.622589]        kthread+0x100/0x140
[  513.622769]        ret_from_fork+0x24/0x30
[  513.622957]
               other info that might help us debug this:

[  513.623354] Chain exists of:
                 &dqm->lock_hidden --> &mapping->i_mmap_rwsem --> &anon_vma->rwsem

[  513.623900]  Possible unsafe locking scenario:

[  513.624189]        CPU0                    CPU1
[  513.624397]        ----                    ----
[  513.624594]   lock(&anon_vma->rwsem);
[  513.624771]                                lock(&mapping->i_mmap_rwsem);
[  513.625020]                                lock(&anon_vma->rwsem);
[  513.625253]   lock(&dqm->lock_hidden);
[  513.625433]
                *** DEADLOCK ***

[  513.625783] 3 locks held by kswapd0/611:
[  513.625967]  #0: 00000000f14edf84 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
[  513.626309]  #1: 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
[  513.626671]  #2: 0000000067b5cd12 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x5/0xb0
[  513.627037]
               stack backtrace:
[  513.627292] CPU: 0 PID: 611 Comm: kswapd0 Tainted: G        W         4.18.0-kfd-root #2
[  513.627632] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  513.627990] Call Trace:
[  513.628143]  dump_stack+0x7c/0xbb
[  513.628315]  print_circular_bug.isra.37+0x21b/0x228
[  513.628581]  __lock_acquire+0xf7d/0x1470
[  513.628782]  ? unwind_next_frame+0x6c/0x4f0
[  513.628974]  ? lock_acquire+0xec/0x1e0
[  513.629154]  lock_acquire+0xec/0x1e0
[  513.629357]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.629587]  __mutex_lock+0x98/0x970
[  513.629790]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630047]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630309]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630562]  evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630816]  kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
[  513.631057]  kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
[  513.631288]  amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
[  513.631536]  amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
[  513.632076]  __mmu_notifier_invalidate_range_start+0x70/0xb0
[  513.632299]  try_to_unmap_one+0x7fc/0x8f0
[  513.632487]  ? page_lock_anon_vma_read+0x68/0x250
[  513.632690]  rmap_walk_anon+0x121/0x290
[  513.632875]  try_to_unmap+0x93/0xf0
[  513.633050]  ? page_remove_rmap+0x330/0x330
[  513.633239]  ? rcu_read_unlock+0x60/0x60
[  513.633422]  ? page_get_anon_vma+0x160/0x160
[  513.633613]  shrink_page_list+0x606/0xcb0
[  513.633800]  shrink_inactive_list+0x33b/0x700
[  513.633997]  shrink_node_memcg+0x37a/0x7f0
[  513.634186]  ? shrink_node+0xd8/0x490
[  513.634363]  shrink_node+0xd8/0x490
[  513.634537]  balance_pgdat+0x18b/0x3b0
[  513.634718]  kswapd+0x203/0x5c0
[  513.634887]  ? wait_woken+0xb0/0xb0
[  513.635062]  kthread+0x100/0x140
[  513.635231]  ? balance_pgdat+0x3b0/0x3b0
[  513.635414]  ? kthread_delayed_work_timer_fn+0x80/0x80
[  513.635626]  ret_from_fork+0x24/0x30
[  513.636042] Evicting PASID 32768 queues
[  513.936236] Restoring PASID 32768 queues
[  524.708912] Evicting PASID 32768 queues
[  524.999875] Restoring PASID 32768 queues

Signed-off-by: Oak Zeng <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
evadot pushed a commit that referenced this pull request Jun 2, 2020
Acquiring drm_client_dev.modeset_mutex after the locks in drm_fb_helper.dev
creates a deadlock with drm_setup_crtcs() as shown below:

  [    4.959319] fbcon: radeondrmfb (fb0) is primary device
  [    4.993952] Console: switching to colour frame buffer device 240x67
  [    4.994040]
  [    4.994041] ======================================================
  [    4.994041] WARNING: possible circular locking dependency detected
  [    4.994042] 5.2.0-rc4-1-default+ #39 Tainted: G            E
  [    4.994043] ------------------------------------------------------
  [    4.994043] systemd-udevd/369 is trying to acquire lock:
  [    4.994044] 00000000fb622acb (&client->modeset_mutex){+.+.}, at: drm_fb_helper_pan_display+0x103/0x1f0 [drm_kms_helper]
  [    4.994055]
  [    4.994055] but task is already holding lock:
  [    4.994055] 0000000028767ae4 (crtc_ww_class_mutex){+.+.}, at: drm_modeset_lock+0x42/0xf0 [drm]
  [    4.994072]
  [    4.994072] which lock already depends on the new lock.
  [    4.994072]
  [    4.994072]
  [    4.994072] the existing dependency chain (in reverse order) is:
  [    4.994073]
  [    4.994073] -> #3 (crtc_ww_class_mutex){+.+.}:
  [    4.994076]        lock_acquire+0x9e/0x170
  [    4.994079]        __ww_mutex_lock.constprop.18+0x97/0xf40
  [    4.994080]        ww_mutex_lock+0x30/0x90
  [    4.994091]        drm_modeset_lock+0x42/0xf0 [drm]
  [    4.994102]        drm_modeset_lock_all_ctx+0x1f/0xe0 [drm]
  [    4.994113]        drm_modeset_lock_all+0x5e/0x1a0 [drm]
  [    4.994163]        intel_modeset_init+0x60b/0xda0 [i915]
  ..
  [    4.994253]
  [    4.994253] -> #2 (crtc_ww_class_acquire){+.+.}:
  [    4.994255]        lock_acquire+0x9e/0x170
  [    4.994270]        drm_modeset_acquire_init+0xcc/0x100 [drm]
  [    4.994280]        drm_modeset_lock_all+0x44/0x1a0 [drm]
  [    4.994320]        intel_modeset_init+0x60b/0xda0 [i915]
  ..
  [    4.994403]
  [    4.994403] -> #1 (&dev->mode_config.mutex){+.+.}:
  [    4.994405]        lock_acquire+0x9e/0x170
  [    4.994408]        __mutex_lock+0x62/0x8c0
  [    4.994413]        drm_setup_crtcs+0x17c/0xc50 [drm_kms_helper]
  [    4.994418]        __drm_fb_helper_initial_config_and_unlock+0x34/0x530 [drm_kms_helper]
  [    4.994450]        radeon_fbdev_init+0x110/0x130 [radeon]
  ..
  [    4.994535]
  [    4.994535] -> #0 (&client->modeset_mutex){+.+.}:
  [    4.994537]        __lock_acquire+0xa85/0xe90
  [    4.994538]        lock_acquire+0x9e/0x170
  [    4.994540]        __mutex_lock+0x62/0x8c0
  [    4.994545]        drm_fb_helper_pan_display+0x103/0x1f0 [drm_kms_helper]
  [    4.994547]        fb_pan_display+0x92/0x120
  [    4.994549]        bit_update_start+0x1a/0x40
  [    4.994550]        fbcon_switch+0x392/0x580
  [    4.994552]        redraw_screen+0x12c/0x220
  [    4.994553]        do_bind_con_driver.cold.30+0xe1/0x10d
  [    4.994554]        do_take_over_console+0x113/0x190
  [    4.994555]        do_fbcon_takeover+0x58/0xb0
  [    4.994557]        notifier_call_chain+0x47/0x70
  [    4.994558]        blocking_notifier_call_chain+0x44/0x60
  [    4.994559]        register_framebuffer+0x231/0x310
  [    4.994564]        __drm_fb_helper_initial_config_and_unlock+0x2fd/0x530 [drm_kms_helper]
  [    4.994590]        radeon_fbdev_init+0x110/0x130 [radeon]
  ..

This problem was introduced in

  d81294afe	drm/fb-helper: Remove drm_fb_helper_crtc

Reversing the lock ordering in pan_display_legacy() fixes the issue.

Fixes: d81294afeecd ("drm/fb-helper: Remove drm_fb_helper_crtc")
Cc: Noralf Trønnes <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Maxime Ripard <[email protected]>
Cc: Sean Paul <[email protected]>
Cc: [email protected]
Signed-off-by: Thomas Zimmermann <[email protected]>
Suggested-by: Daniel Vetter <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
evadot pushed a commit that referenced this pull request Jun 2, 2020
The idea to break the circular lock dependency is to temporarily drop
dqm lock before calling allocate_mqd. See callstack #1 below.

[   59.510149] [drm] Initialized amdgpu 3.30.0 20150101 for 0000:04:00.0 on minor 0

[  513.604034] ======================================================
[  513.604205] WARNING: possible circular locking dependency detected
[  513.604375] 4.18.0-kfd-root #2 Tainted: G        W
[  513.604530] ------------------------------------------------------
[  513.604699] kswapd0/611 is trying to acquire lock:
[  513.604840] 00000000d254022e (&dqm->lock_hidden){+.+.}, at: evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.605150]
               but task is already holding lock:
[  513.605307] 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
[  513.605540]
               which lock already depends on the new lock.

[  513.605747]
               the existing dependency chain (in reverse order) is:
[  513.605944]
               -> #4 (&anon_vma->rwsem){++++}:
[  513.606106]        __vma_adjust+0x147/0x7f0
[  513.606231]        __split_vma+0x179/0x190
[  513.606353]        mprotect_fixup+0x217/0x260
[  513.606553]        do_mprotect_pkey+0x211/0x380
[  513.606752]        __x64_sys_mprotect+0x1b/0x20
[  513.606954]        do_syscall_64+0x50/0x1a0
[  513.607149]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.607380]
               -> #3 (&mapping->i_mmap_rwsem){++++}:
[  513.607678]        rmap_walk_file+0x1f0/0x280
[  513.607887]        page_referenced+0xdd/0x180
[  513.608081]        shrink_page_list+0x853/0xcb0
[  513.608279]        shrink_inactive_list+0x33b/0x700
[  513.608483]        shrink_node_memcg+0x37a/0x7f0
[  513.608682]        shrink_node+0xd8/0x490
[  513.608869]        balance_pgdat+0x18b/0x3b0
[  513.609062]        kswapd+0x203/0x5c0
[  513.609241]        kthread+0x100/0x140
[  513.609420]        ret_from_fork+0x24/0x30
[  513.609607]
               -> #2 (fs_reclaim){+.+.}:
[  513.609883]        kmem_cache_alloc_trace+0x34/0x2e0
[  513.610093]        reservation_object_reserve_shared+0x139/0x300
[  513.610326]        ttm_bo_init_reserved+0x291/0x480 [ttm]
[  513.610567]        amdgpu_bo_do_create+0x1d2/0x650 [amdgpu]
[  513.610811]        amdgpu_bo_create+0x40/0x1f0 [amdgpu]
[  513.611041]        amdgpu_bo_create_reserved+0x249/0x2d0 [amdgpu]
[  513.611290]        amdgpu_bo_create_kernel+0x12/0x70 [amdgpu]
[  513.611584]        amdgpu_ttm_init+0x2cb/0x560 [amdgpu]
[  513.611823]        gmc_v9_0_sw_init+0x400/0x750 [amdgpu]
[  513.612491]        amdgpu_device_init+0x14eb/0x1990 [amdgpu]
[  513.612730]        amdgpu_driver_load_kms+0x78/0x290 [amdgpu]
[  513.612958]        drm_dev_register+0x111/0x1a0
[  513.613171]        amdgpu_pci_probe+0x11c/0x1e0 [amdgpu]
[  513.613389]        local_pci_probe+0x3f/0x90
[  513.613581]        pci_device_probe+0x102/0x1c0
[  513.613779]        driver_probe_device+0x2a7/0x480
[  513.613984]        __driver_attach+0x10a/0x110
[  513.614179]        bus_for_each_dev+0x67/0xc0
[  513.614372]        bus_add_driver+0x1eb/0x260
[  513.614565]        driver_register+0x5b/0xe0
[  513.614756]        do_one_initcall+0xac/0x357
[  513.614952]        do_init_module+0x5b/0x213
[  513.615145]        load_module+0x2542/0x2d30
[  513.615337]        __do_sys_finit_module+0xd2/0x100
[  513.615541]        do_syscall_64+0x50/0x1a0
[  513.615731]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.615963]
               -> #1 (reservation_ww_class_mutex){+.+.}:
[  513.616293]        amdgpu_amdkfd_alloc_gtt_mem+0xcf/0x2c0 [amdgpu]
[  513.616554]        init_mqd+0x223/0x260 [amdgpu]
[  513.616779]        create_queue_nocpsch+0x4d9/0x600 [amdgpu]
[  513.617031]        pqm_create_queue+0x37c/0x520 [amdgpu]
[  513.617270]        kfd_ioctl_create_queue+0x2f9/0x650 [amdgpu]
[  513.617522]        kfd_ioctl+0x202/0x350 [amdgpu]
[  513.617724]        do_vfs_ioctl+0x9f/0x6c0
[  513.617914]        ksys_ioctl+0x66/0x70
[  513.618095]        __x64_sys_ioctl+0x16/0x20
[  513.618286]        do_syscall_64+0x50/0x1a0
[  513.618476]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.618695]
               -> #0 (&dqm->lock_hidden){+.+.}:
[  513.618984]        __mutex_lock+0x98/0x970
[  513.619197]        evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.619459]        kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
[  513.619710]        kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
[  513.620103]        amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
[  513.620363]        amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
[  513.620614]        __mmu_notifier_invalidate_range_start+0x70/0xb0
[  513.620851]        try_to_unmap_one+0x7fc/0x8f0
[  513.621049]        rmap_walk_anon+0x121/0x290
[  513.621242]        try_to_unmap+0x93/0xf0
[  513.621428]        shrink_page_list+0x606/0xcb0
[  513.621625]        shrink_inactive_list+0x33b/0x700
[  513.621835]        shrink_node_memcg+0x37a/0x7f0
[  513.622034]        shrink_node+0xd8/0x490
[  513.622219]        balance_pgdat+0x18b/0x3b0
[  513.622410]        kswapd+0x203/0x5c0
[  513.622589]        kthread+0x100/0x140
[  513.622769]        ret_from_fork+0x24/0x30
[  513.622957]
               other info that might help us debug this:

[  513.623354] Chain exists of:
                 &dqm->lock_hidden --> &mapping->i_mmap_rwsem --> &anon_vma->rwsem

[  513.623900]  Possible unsafe locking scenario:

[  513.624189]        CPU0                    CPU1
[  513.624397]        ----                    ----
[  513.624594]   lock(&anon_vma->rwsem);
[  513.624771]                                lock(&mapping->i_mmap_rwsem);
[  513.625020]                                lock(&anon_vma->rwsem);
[  513.625253]   lock(&dqm->lock_hidden);
[  513.625433]
                *** DEADLOCK ***

[  513.625783] 3 locks held by kswapd0/611:
[  513.625967]  #0: 00000000f14edf84 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
[  513.626309]  #1: 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
[  513.626671]  #2: 0000000067b5cd12 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x5/0xb0
[  513.627037]
               stack backtrace:
[  513.627292] CPU: 0 PID: 611 Comm: kswapd0 Tainted: G        W         4.18.0-kfd-root #2
[  513.627632] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  513.627990] Call Trace:
[  513.628143]  dump_stack+0x7c/0xbb
[  513.628315]  print_circular_bug.isra.37+0x21b/0x228
[  513.628581]  __lock_acquire+0xf7d/0x1470
[  513.628782]  ? unwind_next_frame+0x6c/0x4f0
[  513.628974]  ? lock_acquire+0xec/0x1e0
[  513.629154]  lock_acquire+0xec/0x1e0
[  513.629357]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.629587]  __mutex_lock+0x98/0x970
[  513.629790]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630047]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630309]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630562]  evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630816]  kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
[  513.631057]  kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
[  513.631288]  amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
[  513.631536]  amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
[  513.632076]  __mmu_notifier_invalidate_range_start+0x70/0xb0
[  513.632299]  try_to_unmap_one+0x7fc/0x8f0
[  513.632487]  ? page_lock_anon_vma_read+0x68/0x250
[  513.632690]  rmap_walk_anon+0x121/0x290
[  513.632875]  try_to_unmap+0x93/0xf0
[  513.633050]  ? page_remove_rmap+0x330/0x330
[  513.633239]  ? rcu_read_unlock+0x60/0x60
[  513.633422]  ? page_get_anon_vma+0x160/0x160
[  513.633613]  shrink_page_list+0x606/0xcb0
[  513.633800]  shrink_inactive_list+0x33b/0x700
[  513.633997]  shrink_node_memcg+0x37a/0x7f0
[  513.634186]  ? shrink_node+0xd8/0x490
[  513.634363]  shrink_node+0xd8/0x490
[  513.634537]  balance_pgdat+0x18b/0x3b0
[  513.634718]  kswapd+0x203/0x5c0
[  513.634887]  ? wait_woken+0xb0/0xb0
[  513.635062]  kthread+0x100/0x140
[  513.635231]  ? balance_pgdat+0x3b0/0x3b0
[  513.635414]  ? kthread_delayed_work_timer_fn+0x80/0x80
[  513.635626]  ret_from_fork+0x24/0x30
[  513.636042] Evicting PASID 32768 queues
[  513.936236] Restoring PASID 32768 queues
[  524.708912] Evicting PASID 32768 queues
[  524.999875] Restoring PASID 32768 queues

Signed-off-by: Oak Zeng <[email protected]>
Reviewed-by: Philip Yang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
evadot pushed a commit that referenced this pull request Jun 2, 2020
If we hit an error while allocating the page tables, we have to unwind
the incomplete updates, and wish to free the unused pd. However, we are
not allowed to be hoding the spinlock at that point, and so must use the
later free to defer it until after we drop the lock.

<3> [414.363795] BUG: sleeping function called from invalid context at drivers/gpu/drm/i915/i915_gem_gtt.c:472
<3> [414.364167] in_atomic(): 1, irqs_disabled(): 0, pid: 3905, name: i915_selftest
<4> [414.364406] 3 locks held by i915_selftest/3905:
<4> [414.364408]  #0: 0000000034fe8aa8 (&dev->mutex){....}, at: device_driver_attach+0x18/0x50
<4> [414.364415]  #1: 000000006bd8a560 (&dev->struct_mutex){+.+.}, at: igt_ctx_exec+0xb7/0x410 [i915]
<4> [414.364476]  #2: 000000003dfdc766 (&(&pd->lock)->rlock){+.+.}, at: gen8_ppgtt_alloc_pdp+0x448/0x540 [i915]
<3> [414.364529] Preemption disabled at:
<4> [414.364530] [<0000000000000000>] 0x0
<4> [414.364696] CPU: 0 PID: 3905 Comm: i915_selftest Tainted: G     U            5.2.0-rc7-CI-CI_DRM_6403+ #1
<4> [414.364698] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
<4> [414.364699] Call Trace:
<4> [414.364704]  dump_stack+0x67/0x9b
<4> [414.364708]  ___might_sleep+0x167/0x250
<4> [414.364777]  vm_free_page+0x24/0xc0 [i915]
<4> [414.364852]  free_pd+0xf/0x20 [i915]
<4> [414.364897]  gen8_ppgtt_alloc_pdp+0x489/0x540 [i915]
<4> [414.364946]  gen8_ppgtt_alloc_4lvl+0x8e/0x2e0 [i915]
<4> [414.364992]  ppgtt_bind_vma+0x2e/0x60 [i915]
<4> [414.365039]  i915_vma_bind+0xe8/0x2c0 [i915]
<4> [414.365088]  __i915_vma_do_pin+0xa1/0xd20 [i915]

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111050
Fixes: 1d1b5490b91c ("drm/i915/gtt: Replace struct_mutex serialisation for allocation")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Matthew Auld <[email protected]>
Cc: Mika Kuoppala <[email protected]>
Reviewed-by: Mika Kuoppala <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 068610895ebd4bd86f496f01eb7b97e56d7269b2)
Signed-off-by: Jani Nikula <[email protected]>
evadot pushed a commit that referenced this pull request Jun 2, 2020
The idea to break the circular lock dependency is to move allocate_mqd
out of dqm lock protection. See callstack #1 below.

[   59.510149] [drm] Initialized amdgpu 3.30.0 20150101 for 0000:04:00.0 on minor 0

[  513.604034] ======================================================
[  513.604205] WARNING: possible circular locking dependency detected
[  513.604375] 4.18.0-kfd-root #2 Tainted: G        W
[  513.604530] ------------------------------------------------------
[  513.604699] kswapd0/611 is trying to acquire lock:
[  513.604840] 00000000d254022e (&dqm->lock_hidden){+.+.}, at: evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.605150]
               but task is already holding lock:
[  513.605307] 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
[  513.605540]
               which lock already depends on the new lock.

[  513.605747]
               the existing dependency chain (in reverse order) is:
[  513.605944]
               -> #4 (&anon_vma->rwsem){++++}:
[  513.606106]        __vma_adjust+0x147/0x7f0
[  513.606231]        __split_vma+0x179/0x190
[  513.606353]        mprotect_fixup+0x217/0x260
[  513.606553]        do_mprotect_pkey+0x211/0x380
[  513.606752]        __x64_sys_mprotect+0x1b/0x20
[  513.606954]        do_syscall_64+0x50/0x1a0
[  513.607149]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.607380]
               -> #3 (&mapping->i_mmap_rwsem){++++}:
[  513.607678]        rmap_walk_file+0x1f0/0x280
[  513.607887]        page_referenced+0xdd/0x180
[  513.608081]        shrink_page_list+0x853/0xcb0
[  513.608279]        shrink_inactive_list+0x33b/0x700
[  513.608483]        shrink_node_memcg+0x37a/0x7f0
[  513.608682]        shrink_node+0xd8/0x490
[  513.608869]        balance_pgdat+0x18b/0x3b0
[  513.609062]        kswapd+0x203/0x5c0
[  513.609241]        kthread+0x100/0x140
[  513.609420]        ret_from_fork+0x24/0x30
[  513.609607]
               -> #2 (fs_reclaim){+.+.}:
[  513.609883]        kmem_cache_alloc_trace+0x34/0x2e0
[  513.610093]        reservation_object_reserve_shared+0x139/0x300
[  513.610326]        ttm_bo_init_reserved+0x291/0x480 [ttm]
[  513.610567]        amdgpu_bo_do_create+0x1d2/0x650 [amdgpu]
[  513.610811]        amdgpu_bo_create+0x40/0x1f0 [amdgpu]
[  513.611041]        amdgpu_bo_create_reserved+0x249/0x2d0 [amdgpu]
[  513.611290]        amdgpu_bo_create_kernel+0x12/0x70 [amdgpu]
[  513.611584]        amdgpu_ttm_init+0x2cb/0x560 [amdgpu]
[  513.611823]        gmc_v9_0_sw_init+0x400/0x750 [amdgpu]
[  513.612491]        amdgpu_device_init+0x14eb/0x1990 [amdgpu]
[  513.612730]        amdgpu_driver_load_kms+0x78/0x290 [amdgpu]
[  513.612958]        drm_dev_register+0x111/0x1a0
[  513.613171]        amdgpu_pci_probe+0x11c/0x1e0 [amdgpu]
[  513.613389]        local_pci_probe+0x3f/0x90
[  513.613581]        pci_device_probe+0x102/0x1c0
[  513.613779]        driver_probe_device+0x2a7/0x480
[  513.613984]        __driver_attach+0x10a/0x110
[  513.614179]        bus_for_each_dev+0x67/0xc0
[  513.614372]        bus_add_driver+0x1eb/0x260
[  513.614565]        driver_register+0x5b/0xe0
[  513.614756]        do_one_initcall+0xac/0x357
[  513.614952]        do_init_module+0x5b/0x213
[  513.615145]        load_module+0x2542/0x2d30
[  513.615337]        __do_sys_finit_module+0xd2/0x100
[  513.615541]        do_syscall_64+0x50/0x1a0
[  513.615731]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.615963]
               -> #1 (reservation_ww_class_mutex){+.+.}:
[  513.616293]        amdgpu_amdkfd_alloc_gtt_mem+0xcf/0x2c0 [amdgpu]
[  513.616554]        init_mqd+0x223/0x260 [amdgpu]
[  513.616779]        create_queue_nocpsch+0x4d9/0x600 [amdgpu]
[  513.617031]        pqm_create_queue+0x37c/0x520 [amdgpu]
[  513.617270]        kfd_ioctl_create_queue+0x2f9/0x650 [amdgpu]
[  513.617522]        kfd_ioctl+0x202/0x350 [amdgpu]
[  513.617724]        do_vfs_ioctl+0x9f/0x6c0
[  513.617914]        ksys_ioctl+0x66/0x70
[  513.618095]        __x64_sys_ioctl+0x16/0x20
[  513.618286]        do_syscall_64+0x50/0x1a0
[  513.618476]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.618695]
               -> #0 (&dqm->lock_hidden){+.+.}:
[  513.618984]        __mutex_lock+0x98/0x970
[  513.619197]        evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.619459]        kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
[  513.619710]        kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
[  513.620103]        amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
[  513.620363]        amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
[  513.620614]        __mmu_notifier_invalidate_range_start+0x70/0xb0
[  513.620851]        try_to_unmap_one+0x7fc/0x8f0
[  513.621049]        rmap_walk_anon+0x121/0x290
[  513.621242]        try_to_unmap+0x93/0xf0
[  513.621428]        shrink_page_list+0x606/0xcb0
[  513.621625]        shrink_inactive_list+0x33b/0x700
[  513.621835]        shrink_node_memcg+0x37a/0x7f0
[  513.622034]        shrink_node+0xd8/0x490
[  513.622219]        balance_pgdat+0x18b/0x3b0
[  513.622410]        kswapd+0x203/0x5c0
[  513.622589]        kthread+0x100/0x140
[  513.622769]        ret_from_fork+0x24/0x30
[  513.622957]
               other info that might help us debug this:

[  513.623354] Chain exists of:
                 &dqm->lock_hidden --> &mapping->i_mmap_rwsem --> &anon_vma->rwsem

[  513.623900]  Possible unsafe locking scenario:

[  513.624189]        CPU0                    CPU1
[  513.624397]        ----                    ----
[  513.624594]   lock(&anon_vma->rwsem);
[  513.624771]                                lock(&mapping->i_mmap_rwsem);
[  513.625020]                                lock(&anon_vma->rwsem);
[  513.625253]   lock(&dqm->lock_hidden);
[  513.625433]
                *** DEADLOCK ***

[  513.625783] 3 locks held by kswapd0/611:
[  513.625967]  #0: 00000000f14edf84 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
[  513.626309]  #1: 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
[  513.626671]  #2: 0000000067b5cd12 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x5/0xb0
[  513.627037]
               stack backtrace:
[  513.627292] CPU: 0 PID: 611 Comm: kswapd0 Tainted: G        W         4.18.0-kfd-root #2
[  513.627632] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  513.627990] Call Trace:
[  513.628143]  dump_stack+0x7c/0xbb
[  513.628315]  print_circular_bug.isra.37+0x21b/0x228
[  513.628581]  __lock_acquire+0xf7d/0x1470
[  513.628782]  ? unwind_next_frame+0x6c/0x4f0
[  513.628974]  ? lock_acquire+0xec/0x1e0
[  513.629154]  lock_acquire+0xec/0x1e0
[  513.629357]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.629587]  __mutex_lock+0x98/0x970
[  513.629790]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630047]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630309]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630562]  evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630816]  kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
[  513.631057]  kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
[  513.631288]  amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
[  513.631536]  amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
[  513.632076]  __mmu_notifier_invalidate_range_start+0x70/0xb0
[  513.632299]  try_to_unmap_one+0x7fc/0x8f0
[  513.632487]  ? page_lock_anon_vma_read+0x68/0x250
[  513.632690]  rmap_walk_anon+0x121/0x290
[  513.632875]  try_to_unmap+0x93/0xf0
[  513.633050]  ? page_remove_rmap+0x330/0x330
[  513.633239]  ? rcu_read_unlock+0x60/0x60
[  513.633422]  ? page_get_anon_vma+0x160/0x160
[  513.633613]  shrink_page_list+0x606/0xcb0
[  513.633800]  shrink_inactive_list+0x33b/0x700
[  513.633997]  shrink_node_memcg+0x37a/0x7f0
[  513.634186]  ? shrink_node+0xd8/0x490
[  513.634363]  shrink_node+0xd8/0x490
[  513.634537]  balance_pgdat+0x18b/0x3b0
[  513.634718]  kswapd+0x203/0x5c0
[  513.634887]  ? wait_woken+0xb0/0xb0
[  513.635062]  kthread+0x100/0x140
[  513.635231]  ? balance_pgdat+0x3b0/0x3b0
[  513.635414]  ? kthread_delayed_work_timer_fn+0x80/0x80
[  513.635626]  ret_from_fork+0x24/0x30
[  513.636042] Evicting PASID 32768 queues
[  513.936236] Restoring PASID 32768 queues
[  524.708912] Evicting PASID 32768 queues
[  524.999875] Restoring PASID 32768 queues

Signed-off-by: Oak Zeng <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
evadot pushed a commit that referenced this pull request Jun 2, 2020
Acquiring drm_client_dev.modeset_mutex after the locks in drm_fb_helper.dev
creates a deadlock with drm_setup_crtcs() as shown below:

  [    4.959319] fbcon: radeondrmfb (fb0) is primary device
  [    4.993952] Console: switching to colour frame buffer device 240x67
  [    4.994040]
  [    4.994041] ======================================================
  [    4.994041] WARNING: possible circular locking dependency detected
  [    4.994042] 5.2.0-rc4-1-default+ #39 Tainted: G            E
  [    4.994043] ------------------------------------------------------
  [    4.994043] systemd-udevd/369 is trying to acquire lock:
  [    4.994044] 00000000fb622acb (&client->modeset_mutex){+.+.}, at: drm_fb_helper_pan_display+0x103/0x1f0 [drm_kms_helper]
  [    4.994055]
  [    4.994055] but task is already holding lock:
  [    4.994055] 0000000028767ae4 (crtc_ww_class_mutex){+.+.}, at: drm_modeset_lock+0x42/0xf0 [drm]
  [    4.994072]
  [    4.994072] which lock already depends on the new lock.
  [    4.994072]
  [    4.994072]
  [    4.994072] the existing dependency chain (in reverse order) is:
  [    4.994073]
  [    4.994073] -> #3 (crtc_ww_class_mutex){+.+.}:
  [    4.994076]        lock_acquire+0x9e/0x170
  [    4.994079]        __ww_mutex_lock.constprop.18+0x97/0xf40
  [    4.994080]        ww_mutex_lock+0x30/0x90
  [    4.994091]        drm_modeset_lock+0x42/0xf0 [drm]
  [    4.994102]        drm_modeset_lock_all_ctx+0x1f/0xe0 [drm]
  [    4.994113]        drm_modeset_lock_all+0x5e/0x1a0 [drm]
  [    4.994163]        intel_modeset_init+0x60b/0xda0 [i915]
  ..
  [    4.994253]
  [    4.994253] -> #2 (crtc_ww_class_acquire){+.+.}:
  [    4.994255]        lock_acquire+0x9e/0x170
  [    4.994270]        drm_modeset_acquire_init+0xcc/0x100 [drm]
  [    4.994280]        drm_modeset_lock_all+0x44/0x1a0 [drm]
  [    4.994320]        intel_modeset_init+0x60b/0xda0 [i915]
  ..
  [    4.994403]
  [    4.994403] -> #1 (&dev->mode_config.mutex){+.+.}:
  [    4.994405]        lock_acquire+0x9e/0x170
  [    4.994408]        __mutex_lock+0x62/0x8c0
  [    4.994413]        drm_setup_crtcs+0x17c/0xc50 [drm_kms_helper]
  [    4.994418]        __drm_fb_helper_initial_config_and_unlock+0x34/0x530 [drm_kms_helper]
  [    4.994450]        radeon_fbdev_init+0x110/0x130 [radeon]
  ..
  [    4.994535]
  [    4.994535] -> #0 (&client->modeset_mutex){+.+.}:
  [    4.994537]        __lock_acquire+0xa85/0xe90
  [    4.994538]        lock_acquire+0x9e/0x170
  [    4.994540]        __mutex_lock+0x62/0x8c0
  [    4.994545]        drm_fb_helper_pan_display+0x103/0x1f0 [drm_kms_helper]
  [    4.994547]        fb_pan_display+0x92/0x120
  [    4.994549]        bit_update_start+0x1a/0x40
  [    4.994550]        fbcon_switch+0x392/0x580
  [    4.994552]        redraw_screen+0x12c/0x220
  [    4.994553]        do_bind_con_driver.cold.30+0xe1/0x10d
  [    4.994554]        do_take_over_console+0x113/0x190
  [    4.994555]        do_fbcon_takeover+0x58/0xb0
  [    4.994557]        notifier_call_chain+0x47/0x70
  [    4.994558]        blocking_notifier_call_chain+0x44/0x60
  [    4.994559]        register_framebuffer+0x231/0x310
  [    4.994564]        __drm_fb_helper_initial_config_and_unlock+0x2fd/0x530 [drm_kms_helper]
  [    4.994590]        radeon_fbdev_init+0x110/0x130 [radeon]
  ..

This problem was introduced in

  d81294afe	drm/fb-helper: Remove drm_fb_helper_crtc

Reversing the lock ordering in pan_display_legacy() fixes the issue.

Fixes: d81294afeecd ("drm/fb-helper: Remove drm_fb_helper_crtc")
Cc: Noralf Trønnes <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Maxime Ripard <[email protected]>
Cc: Sean Paul <[email protected]>
Cc: [email protected]
Signed-off-by: Thomas Zimmermann <[email protected]>
Suggested-by: Daniel Vetter <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
evadot pushed a commit that referenced this pull request Jun 2, 2020
The idea to break the circular lock dependency is to temporarily drop
dqm lock before calling allocate_mqd. See callstack #1 below.

[   59.510149] [drm] Initialized amdgpu 3.30.0 20150101 for 0000:04:00.0 on minor 0

[  513.604034] ======================================================
[  513.604205] WARNING: possible circular locking dependency detected
[  513.604375] 4.18.0-kfd-root #2 Tainted: G        W
[  513.604530] ------------------------------------------------------
[  513.604699] kswapd0/611 is trying to acquire lock:
[  513.604840] 00000000d254022e (&dqm->lock_hidden){+.+.}, at: evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.605150]
               but task is already holding lock:
[  513.605307] 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
[  513.605540]
               which lock already depends on the new lock.

[  513.605747]
               the existing dependency chain (in reverse order) is:
[  513.605944]
               -> #4 (&anon_vma->rwsem){++++}:
[  513.606106]        __vma_adjust+0x147/0x7f0
[  513.606231]        __split_vma+0x179/0x190
[  513.606353]        mprotect_fixup+0x217/0x260
[  513.606553]        do_mprotect_pkey+0x211/0x380
[  513.606752]        __x64_sys_mprotect+0x1b/0x20
[  513.606954]        do_syscall_64+0x50/0x1a0
[  513.607149]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.607380]
               -> #3 (&mapping->i_mmap_rwsem){++++}:
[  513.607678]        rmap_walk_file+0x1f0/0x280
[  513.607887]        page_referenced+0xdd/0x180
[  513.608081]        shrink_page_list+0x853/0xcb0
[  513.608279]        shrink_inactive_list+0x33b/0x700
[  513.608483]        shrink_node_memcg+0x37a/0x7f0
[  513.608682]        shrink_node+0xd8/0x490
[  513.608869]        balance_pgdat+0x18b/0x3b0
[  513.609062]        kswapd+0x203/0x5c0
[  513.609241]        kthread+0x100/0x140
[  513.609420]        ret_from_fork+0x24/0x30
[  513.609607]
               -> #2 (fs_reclaim){+.+.}:
[  513.609883]        kmem_cache_alloc_trace+0x34/0x2e0
[  513.610093]        reservation_object_reserve_shared+0x139/0x300
[  513.610326]        ttm_bo_init_reserved+0x291/0x480 [ttm]
[  513.610567]        amdgpu_bo_do_create+0x1d2/0x650 [amdgpu]
[  513.610811]        amdgpu_bo_create+0x40/0x1f0 [amdgpu]
[  513.611041]        amdgpu_bo_create_reserved+0x249/0x2d0 [amdgpu]
[  513.611290]        amdgpu_bo_create_kernel+0x12/0x70 [amdgpu]
[  513.611584]        amdgpu_ttm_init+0x2cb/0x560 [amdgpu]
[  513.611823]        gmc_v9_0_sw_init+0x400/0x750 [amdgpu]
[  513.612491]        amdgpu_device_init+0x14eb/0x1990 [amdgpu]
[  513.612730]        amdgpu_driver_load_kms+0x78/0x290 [amdgpu]
[  513.612958]        drm_dev_register+0x111/0x1a0
[  513.613171]        amdgpu_pci_probe+0x11c/0x1e0 [amdgpu]
[  513.613389]        local_pci_probe+0x3f/0x90
[  513.613581]        pci_device_probe+0x102/0x1c0
[  513.613779]        driver_probe_device+0x2a7/0x480
[  513.613984]        __driver_attach+0x10a/0x110
[  513.614179]        bus_for_each_dev+0x67/0xc0
[  513.614372]        bus_add_driver+0x1eb/0x260
[  513.614565]        driver_register+0x5b/0xe0
[  513.614756]        do_one_initcall+0xac/0x357
[  513.614952]        do_init_module+0x5b/0x213
[  513.615145]        load_module+0x2542/0x2d30
[  513.615337]        __do_sys_finit_module+0xd2/0x100
[  513.615541]        do_syscall_64+0x50/0x1a0
[  513.615731]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.615963]
               -> #1 (reservation_ww_class_mutex){+.+.}:
[  513.616293]        amdgpu_amdkfd_alloc_gtt_mem+0xcf/0x2c0 [amdgpu]
[  513.616554]        init_mqd+0x223/0x260 [amdgpu]
[  513.616779]        create_queue_nocpsch+0x4d9/0x600 [amdgpu]
[  513.617031]        pqm_create_queue+0x37c/0x520 [amdgpu]
[  513.617270]        kfd_ioctl_create_queue+0x2f9/0x650 [amdgpu]
[  513.617522]        kfd_ioctl+0x202/0x350 [amdgpu]
[  513.617724]        do_vfs_ioctl+0x9f/0x6c0
[  513.617914]        ksys_ioctl+0x66/0x70
[  513.618095]        __x64_sys_ioctl+0x16/0x20
[  513.618286]        do_syscall_64+0x50/0x1a0
[  513.618476]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  513.618695]
               -> #0 (&dqm->lock_hidden){+.+.}:
[  513.618984]        __mutex_lock+0x98/0x970
[  513.619197]        evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.619459]        kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
[  513.619710]        kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
[  513.620103]        amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
[  513.620363]        amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
[  513.620614]        __mmu_notifier_invalidate_range_start+0x70/0xb0
[  513.620851]        try_to_unmap_one+0x7fc/0x8f0
[  513.621049]        rmap_walk_anon+0x121/0x290
[  513.621242]        try_to_unmap+0x93/0xf0
[  513.621428]        shrink_page_list+0x606/0xcb0
[  513.621625]        shrink_inactive_list+0x33b/0x700
[  513.621835]        shrink_node_memcg+0x37a/0x7f0
[  513.622034]        shrink_node+0xd8/0x490
[  513.622219]        balance_pgdat+0x18b/0x3b0
[  513.622410]        kswapd+0x203/0x5c0
[  513.622589]        kthread+0x100/0x140
[  513.622769]        ret_from_fork+0x24/0x30
[  513.622957]
               other info that might help us debug this:

[  513.623354] Chain exists of:
                 &dqm->lock_hidden --> &mapping->i_mmap_rwsem --> &anon_vma->rwsem

[  513.623900]  Possible unsafe locking scenario:

[  513.624189]        CPU0                    CPU1
[  513.624397]        ----                    ----
[  513.624594]   lock(&anon_vma->rwsem);
[  513.624771]                                lock(&mapping->i_mmap_rwsem);
[  513.625020]                                lock(&anon_vma->rwsem);
[  513.625253]   lock(&dqm->lock_hidden);
[  513.625433]
                *** DEADLOCK ***

[  513.625783] 3 locks held by kswapd0/611:
[  513.625967]  #0: 00000000f14edf84 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
[  513.626309]  #1: 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
[  513.626671]  #2: 0000000067b5cd12 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x5/0xb0
[  513.627037]
               stack backtrace:
[  513.627292] CPU: 0 PID: 611 Comm: kswapd0 Tainted: G        W         4.18.0-kfd-root #2
[  513.627632] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  513.627990] Call Trace:
[  513.628143]  dump_stack+0x7c/0xbb
[  513.628315]  print_circular_bug.isra.37+0x21b/0x228
[  513.628581]  __lock_acquire+0xf7d/0x1470
[  513.628782]  ? unwind_next_frame+0x6c/0x4f0
[  513.628974]  ? lock_acquire+0xec/0x1e0
[  513.629154]  lock_acquire+0xec/0x1e0
[  513.629357]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.629587]  __mutex_lock+0x98/0x970
[  513.629790]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630047]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630309]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630562]  evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
[  513.630816]  kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
[  513.631057]  kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
[  513.631288]  amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
[  513.631536]  amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
[  513.632076]  __mmu_notifier_invalidate_range_start+0x70/0xb0
[  513.632299]  try_to_unmap_one+0x7fc/0x8f0
[  513.632487]  ? page_lock_anon_vma_read+0x68/0x250
[  513.632690]  rmap_walk_anon+0x121/0x290
[  513.632875]  try_to_unmap+0x93/0xf0
[  513.633050]  ? page_remove_rmap+0x330/0x330
[  513.633239]  ? rcu_read_unlock+0x60/0x60
[  513.633422]  ? page_get_anon_vma+0x160/0x160
[  513.633613]  shrink_page_list+0x606/0xcb0
[  513.633800]  shrink_inactive_list+0x33b/0x700
[  513.633997]  shrink_node_memcg+0x37a/0x7f0
[  513.634186]  ? shrink_node+0xd8/0x490
[  513.634363]  shrink_node+0xd8/0x490
[  513.634537]  balance_pgdat+0x18b/0x3b0
[  513.634718]  kswapd+0x203/0x5c0
[  513.634887]  ? wait_woken+0xb0/0xb0
[  513.635062]  kthread+0x100/0x140
[  513.635231]  ? balance_pgdat+0x3b0/0x3b0
[  513.635414]  ? kthread_delayed_work_timer_fn+0x80/0x80
[  513.635626]  ret_from_fork+0x24/0x30
[  513.636042] Evicting PASID 32768 queues
[  513.936236] Restoring PASID 32768 queues
[  524.708912] Evicting PASID 32768 queues
[  524.999875] Restoring PASID 32768 queues

Signed-off-by: Oak Zeng <[email protected]>
Reviewed-by: Philip Yang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
evadot pushed a commit that referenced this pull request Jun 2, 2020
If we hit an error while allocating the page tables, we have to unwind
the incomplete updates, and wish to free the unused pd. However, we are
not allowed to be hoding the spinlock at that point, and so must use the
later free to defer it until after we drop the lock.

<3> [414.363795] BUG: sleeping function called from invalid context at drivers/gpu/drm/i915/i915_gem_gtt.c:472
<3> [414.364167] in_atomic(): 1, irqs_disabled(): 0, pid: 3905, name: i915_selftest
<4> [414.364406] 3 locks held by i915_selftest/3905:
<4> [414.364408]  #0: 0000000034fe8aa8 (&dev->mutex){....}, at: device_driver_attach+0x18/0x50
<4> [414.364415]  #1: 000000006bd8a560 (&dev->struct_mutex){+.+.}, at: igt_ctx_exec+0xb7/0x410 [i915]
<4> [414.364476]  #2: 000000003dfdc766 (&(&pd->lock)->rlock){+.+.}, at: gen8_ppgtt_alloc_pdp+0x448/0x540 [i915]
<3> [414.364529] Preemption disabled at:
<4> [414.364530] [<0000000000000000>] 0x0
<4> [414.364696] CPU: 0 PID: 3905 Comm: i915_selftest Tainted: G     U            5.2.0-rc7-CI-CI_DRM_6403+ #1
<4> [414.364698] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
<4> [414.364699] Call Trace:
<4> [414.364704]  dump_stack+0x67/0x9b
<4> [414.364708]  ___might_sleep+0x167/0x250
<4> [414.364777]  vm_free_page+0x24/0xc0 [i915]
<4> [414.364852]  free_pd+0xf/0x20 [i915]
<4> [414.364897]  gen8_ppgtt_alloc_pdp+0x489/0x540 [i915]
<4> [414.364946]  gen8_ppgtt_alloc_4lvl+0x8e/0x2e0 [i915]
<4> [414.364992]  ppgtt_bind_vma+0x2e/0x60 [i915]
<4> [414.365039]  i915_vma_bind+0xe8/0x2c0 [i915]
<4> [414.365088]  __i915_vma_do_pin+0xa1/0xd20 [i915]

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111050
Fixes: 1d1b5490b91c ("drm/i915/gtt: Replace struct_mutex serialisation for allocation")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Matthew Auld <[email protected]>
Cc: Mika Kuoppala <[email protected]>
Reviewed-by: Mika Kuoppala <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 068610895ebd4bd86f496f01eb7b97e56d7269b2)
Signed-off-by: Jani Nikula <[email protected]>
@jorgem-seq
Copy link
Contributor Author

I'm closing this pull request and opening another for the updated tree
and to explain the fix.

@jorgem-seq jorgem-seq closed this Jun 6, 2020
@jorgem-seq jorgem-seq deleted the nostatic branch June 7, 2020 00:09
evadot pushed a commit that referenced this pull request Jul 24, 2020
If we hit an error while allocating the page tables, we have to unwind
the incomplete updates, and wish to free the unused pd. However, we are
not allowed to be hoding the spinlock at that point, and so must use the
later free to defer it until after we drop the lock.

<3> [414.363795] BUG: sleeping function called from invalid context at drivers/gpu/drm/i915/i915_gem_gtt.c:472
<3> [414.364167] in_atomic(): 1, irqs_disabled(): 0, pid: 3905, name: i915_selftest
<4> [414.364406] 3 locks held by i915_selftest/3905:
<4> [414.364408]  #0: 0000000034fe8aa8 (&dev->mutex){....}, at: device_driver_attach+0x18/0x50
<4> [414.364415]  #1: 000000006bd8a560 (&dev->struct_mutex){+.+.}, at: igt_ctx_exec+0xb7/0x410 [i915]
<4> [414.364476]  #2: 000000003dfdc766 (&(&pd->lock)->rlock){+.+.}, at: gen8_ppgtt_alloc_pdp+0x448/0x540 [i915]
<3> [414.364529] Preemption disabled at:
<4> [414.364530] [<0000000000000000>] 0x0
<4> [414.364696] CPU: 0 PID: 3905 Comm: i915_selftest Tainted: G     U            5.2.0-rc7-CI-CI_DRM_6403+ #1
<4> [414.364698] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
<4> [414.364699] Call Trace:
<4> [414.364704]  dump_stack+0x67/0x9b
<4> [414.364708]  ___might_sleep+0x167/0x250
<4> [414.364777]  vm_free_page+0x24/0xc0 [i915]
<4> [414.364852]  free_pd+0xf/0x20 [i915]
<4> [414.364897]  gen8_ppgtt_alloc_pdp+0x489/0x540 [i915]
<4> [414.364946]  gen8_ppgtt_alloc_4lvl+0x8e/0x2e0 [i915]
<4> [414.364992]  ppgtt_bind_vma+0x2e/0x60 [i915]
<4> [414.365039]  i915_vma_bind+0xe8/0x2c0 [i915]
<4> [414.365088]  __i915_vma_do_pin+0xa1/0xd20 [i915]

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111050
Fixes: 1d1b5490b91c ("drm/i915/gtt: Replace struct_mutex serialisation for allocation")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Matthew Auld <[email protected]>
Cc: Mika Kuoppala <[email protected]>
Reviewed-by: Mika Kuoppala <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
evadot pushed a commit that referenced this pull request Jul 24, 2020
[Why]
DRM private objects have no hw_done/flip_done fencing mechanism on their
own and cannot be used to sequence commits accordingly.

When issuing commits that don't touch the same set of hardware resources
like page-flips on different CRTCs we can run into the issue below
because of this:

1. Client requests non-blocking Commit #1, has a new dc_state #1,
state is swapped, commit tail is deferred to work queue

2. Client requests non-blocking Commit #2, has a new dc_state #2,
state is swapped, commit tail is deferred to work queue

3. Commit #2 work starts, commit tail finishes,
atomic state is cleared, dc_state #1 is freed

4. Commit #1 work starts,
commit tail encounters null pointer deref on dc_state #1

In order to change the DC state as in the private object we need to
ensure that we wait for all outstanding commits to finish and that
any other pending commits must wait for the current one to finish as
well.

We do this for MEDIUM and FULL updates. But not for FAST updates, nor
would we want to since it would cause stuttering from the delays.

FAST updates that go through dm_determine_update_type_for_commit always
create a new dc_state and lock the DRM private object if there are
any changed planes.

We need the old state to validate, but we don't actually need the new
state here.

[How]
If the commit isn't a full update then the use after free can be
resolved by simply discarding the new state entirely and retaining
the existing one instead.

With this change the sequence above can be reexamined. Commit #2 will
still free Commit #1's reference, but before this happens we actually
added an additional reference as part of Commit #2.

If an update comes in during this that needs to change the dc_state
it will need to wait on Commit #1 and Commit #2 to finish. Then it'll
swap the state, finish the work in commit tail and drop the last
reference on Commit #2's dc_state.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204181
Fixes: 004b3938e637 ("drm/amd/display: Check scaling info when determing update type")

Signed-off-by: Nicholas Kazlauskas <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: David Francis <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
evadot pushed a commit that referenced this pull request Aug 3, 2020
Lockdep reports a circular locking dependency with pages_lock taken in
the shrinker callback. The deadlock can't actually happen with current
users at least as a BO will never be purgeable when pages_lock is held.
To be safe, let's use mutex_trylock() instead and bail if a BO is locked
already.

WARNING: possible circular locking dependency detected
5.3.0-rc1+ #100 Tainted: G             L
------------------------------------------------------
kswapd0/171 is trying to acquire lock:
000000009b9823fd (&shmem->pages_lock){+.+.}, at: drm_gem_shmem_purge+0x20/0x40

but task is already holding lock:
00000000f82369b6 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x40

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}:
       fs_reclaim_acquire.part.18+0x34/0x40
       fs_reclaim_acquire+0x20/0x28
       __kmalloc_node+0x6c/0x4c0
       kvmalloc_node+0x38/0xa8
       drm_gem_get_pages+0x80/0x1d0
       drm_gem_shmem_get_pages+0x58/0xa0
       drm_gem_shmem_get_pages_sgt+0x48/0xd0
       panfrost_mmu_map+0x38/0xf8 [panfrost]
       panfrost_gem_open+0xc0/0xe8 [panfrost]
       drm_gem_handle_create_tail+0xe8/0x198
       drm_gem_handle_create+0x3c/0x50
       panfrost_gem_create_with_handle+0x70/0xa0 [panfrost]
       panfrost_ioctl_create_bo+0x48/0x80 [panfrost]
       drm_ioctl_kernel+0xb8/0x110
       drm_ioctl+0x244/0x3f0
       do_vfs_ioctl+0xbc/0x910
       ksys_ioctl+0x78/0xa8
       __arm64_sys_ioctl+0x1c/0x28
       el0_svc_common.constprop.0+0x90/0x168
       el0_svc_handler+0x28/0x78
       el0_svc+0x8/0xc

-> #0 (&shmem->pages_lock){+.+.}:
       __lock_acquire+0xa2c/0x1d70
       lock_acquire+0xdc/0x228
       __mutex_lock+0x8c/0x800
       mutex_lock_nested+0x1c/0x28
       drm_gem_shmem_purge+0x20/0x40
       panfrost_gem_shrinker_scan+0xc0/0x180 [panfrost]
       do_shrink_slab+0x208/0x500
       shrink_slab+0x10c/0x2c0
       shrink_node+0x28c/0x4d8
       balance_pgdat+0x2c8/0x570
       kswapd+0x22c/0x638
       kthread+0x128/0x130
       ret_from_fork+0x10/0x18

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(&shmem->pages_lock);
                               lock(fs_reclaim);
  lock(&shmem->pages_lock);

 *** DEADLOCK ***

3 locks held by kswapd0/171:
 #0: 00000000f82369b6 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x40
 #1: 00000000ceb37808 (shrinker_rwsem){++++}, at: shrink_slab+0xbc/0x2c0
 #2: 00000000f31efa81 (&pfdev->shrinker_lock){+.+.}, at: panfrost_gem_shrinker_scan+0x34/0x180 [panfrost]

Fixes: 17acb9f35ed7 ("drm/shmem: Add madvise state and purge helpers")
Cc: Maarten Lankhorst <[email protected]>
Cc: Maxime Ripard <[email protected]>
Cc: Sean Paul <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Acked-by: Alyssa Rosenzweig <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
evadot pushed a commit that referenced this pull request Aug 3, 2020
Daniel Vetter uncovered a nasty cycle in using the mmu-notifiers to
invalidate userptr objects which also happen to be pulled into GGTT
mmaps. That is when we unbind the userptr object (on mmu invalidation),
we revoke all CPU mmaps, which may then recurse into mmu invalidation.

We looked for ways of breaking the cycle, but the revocation on
invalidation is required and cannot be avoided. The only solution we
could see was to not allow such GGTT bindings of userptr objects in the
first place. In practice, no one really wants to use a GGTT mmapping of
a CPU pointer...

Just before Daniel's explosive lockdep patches land in v5.4-rc1, we got
a genuine blip from CI:

<4>[  246.793958] ======================================================
<4>[  246.793972] WARNING: possible circular locking dependency detected
<4>[  246.793989] 5.3.0-gbd6c56f50d15-drmtip_372+ #1 Tainted: G     U
<4>[  246.794003] ------------------------------------------------------
<4>[  246.794017] kswapd0/145 is trying to acquire lock:
<4>[  246.794030] 000000003f565be6 (&dev->struct_mutex/1){+.+.}, at: userptr_mn_invalidate_range_start+0x18f/0x220 [i915]
<4>[  246.794250]
                  but task is already holding lock:
<4>[  246.794263] 000000001799cef9 (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe6/0x2a0
<4>[  246.794291]
                  which lock already depends on the new lock.

<4>[  246.794307]
                  the existing dependency chain (in reverse order) is:
<4>[  246.794322]
                  -> #3 (&anon_vma->rwsem){++++}:
<4>[  246.794344]        down_write+0x33/0x70
<4>[  246.794357]        __vma_adjust+0x3d9/0x7b0
<4>[  246.794370]        __split_vma+0x16a/0x180
<4>[  246.794385]        mprotect_fixup+0x2a5/0x320
<4>[  246.794399]        do_mprotect_pkey+0x208/0x2e0
<4>[  246.794413]        __x64_sys_mprotect+0x16/0x20
<4>[  246.794429]        do_syscall_64+0x55/0x1c0
<4>[  246.794443]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  246.794456]
                  -> #2 (&mapping->i_mmap_rwsem){++++}:
<4>[  246.794478]        down_write+0x33/0x70
<4>[  246.794493]        unmap_mapping_pages+0x48/0x130
<4>[  246.794519]        i915_vma_revoke_mmap+0x81/0x1b0 [i915]
<4>[  246.794519]        i915_vma_unbind+0x11d/0x4a0 [i915]
<4>[  246.794519]        i915_vma_destroy+0x31/0x300 [i915]
<4>[  246.794519]        __i915_gem_free_objects+0xb8/0x4b0 [i915]
<4>[  246.794519]        drm_file_free.part.0+0x1e6/0x290
<4>[  246.794519]        drm_release+0xa6/0xe0
<4>[  246.794519]        __fput+0xc2/0x250
<4>[  246.794519]        task_work_run+0x82/0xb0
<4>[  246.794519]        do_exit+0x35b/0xdb0
<4>[  246.794519]        do_group_exit+0x34/0xb0
<4>[  246.794519]        __x64_sys_exit_group+0xf/0x10
<4>[  246.794519]        do_syscall_64+0x55/0x1c0
<4>[  246.794519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  246.794519]
                  -> #1 (&vm->mutex){+.+.}:
<4>[  246.794519]        i915_gem_shrinker_taints_mutex+0x6d/0xe0 [i915]
<4>[  246.794519]        i915_address_space_init+0x9f/0x160 [i915]
<4>[  246.794519]        i915_ggtt_init_hw+0x55/0x170 [i915]
<4>[  246.794519]        i915_driver_probe+0xc9f/0x1620 [i915]
<4>[  246.794519]        i915_pci_probe+0x43/0x1b0 [i915]
<4>[  246.794519]        pci_device_probe+0x9e/0x120
<4>[  246.794519]        really_probe+0xea/0x3d0
<4>[  246.794519]        driver_probe_device+0x10b/0x120
<4>[  246.794519]        device_driver_attach+0x4a/0x50
<4>[  246.794519]        __driver_attach+0x97/0x130
<4>[  246.794519]        bus_for_each_dev+0x74/0xc0
<4>[  246.794519]        bus_add_driver+0x13f/0x210
<4>[  246.794519]        driver_register+0x56/0xe0
<4>[  246.794519]        do_one_initcall+0x58/0x300
<4>[  246.794519]        do_init_module+0x56/0x1f6
<4>[  246.794519]        load_module+0x25bd/0x2a40
<4>[  246.794519]        __se_sys_finit_module+0xd3/0xf0
<4>[  246.794519]        do_syscall_64+0x55/0x1c0
<4>[  246.794519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  246.794519]
                  -> #0 (&dev->struct_mutex/1){+.+.}:
<4>[  246.794519]        __lock_acquire+0x15d8/0x1e90
<4>[  246.794519]        lock_acquire+0xa6/0x1c0
<4>[  246.794519]        __mutex_lock+0x9d/0x9b0
<4>[  246.794519]        userptr_mn_invalidate_range_start+0x18f/0x220 [i915]
<4>[  246.794519]        __mmu_notifier_invalidate_range_start+0x85/0x110
<4>[  246.794519]        try_to_unmap_one+0x76b/0x860
<4>[  246.794519]        rmap_walk_anon+0x104/0x280
<4>[  246.794519]        try_to_unmap+0xc0/0xf0
<4>[  246.794519]        shrink_page_list+0x561/0xc10
<4>[  246.794519]        shrink_inactive_list+0x220/0x440
<4>[  246.794519]        shrink_node_memcg+0x36e/0x740
<4>[  246.794519]        shrink_node+0xcb/0x490
<4>[  246.794519]        balance_pgdat+0x241/0x580
<4>[  246.794519]        kswapd+0x16c/0x530
<4>[  246.794519]        kthread+0x119/0x130
<4>[  246.794519]        ret_from_fork+0x24/0x50
<4>[  246.794519]
                  other info that might help us debug this:

<4>[  246.794519] Chain exists of:
                    &dev->struct_mutex/1 --> &mapping->i_mmap_rwsem --> &anon_vma->rwsem

<4>[  246.794519]  Possible unsafe locking scenario:

<4>[  246.794519]        CPU0                    CPU1
<4>[  246.794519]        ----                    ----
<4>[  246.794519]   lock(&anon_vma->rwsem);
<4>[  246.794519]                                lock(&mapping->i_mmap_rwsem);
<4>[  246.794519]                                lock(&anon_vma->rwsem);
<4>[  246.794519]   lock(&dev->struct_mutex/1);
<4>[  246.794519]
                   *** DEADLOCK ***

v2: Say no to mmap_ioctl

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111744
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111870
Signed-off-by: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: [email protected]
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit a4311745bba9763e3c965643d4531bd5765b0513)
Signed-off-by: Rodrigo Vivi <[email protected]>
evadot pushed a commit that referenced this pull request Aug 4, 2020
[Why]
DRM private objects have no hw_done/flip_done fencing mechanism on their
own and cannot be used to sequence commits accordingly.

When issuing commits that don't touch the same set of hardware resources
like page-flips on different CRTCs we can run into the issue below
because of this:

1. Client requests non-blocking Commit #1, has a new dc_state #1,
state is swapped, commit tail is deferred to work queue

2. Client requests non-blocking Commit #2, has a new dc_state #2,
state is swapped, commit tail is deferred to work queue

3. Commit #2 work starts, commit tail finishes,
atomic state is cleared, dc_state #1 is freed

4. Commit #1 work starts,
commit tail encounters null pointer deref on dc_state #1

In order to change the DC state as in the private object we need to
ensure that we wait for all outstanding commits to finish and that
any other pending commits must wait for the current one to finish as
well.

We do this for MEDIUM and FULL updates. But not for FAST updates, nor
would we want to since it would cause stuttering from the delays.

FAST updates that go through dm_determine_update_type_for_commit always
create a new dc_state and lock the DRM private object if there are
any changed planes.

We need the old state to validate, but we don't actually need the new
state here.

[How]
If the commit isn't a full update then the use after free can be
resolved by simply discarding the new state entirely and retaining
the existing one instead.

With this change the sequence above can be reexamined. Commit #2 will
still free Commit #1's reference, but before this happens we actually
added an additional reference as part of Commit #2.

If an update comes in during this that needs to change the dc_state
it will need to wait on Commit #1 and Commit #2 to finish. Then it'll
swap the state, finish the work in commit tail and drop the last
reference on Commit #2's dc_state.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204181
Fixes: 004b3938e637 ("drm/amd/display: Check scaling info when determing update type")

Signed-off-by: Nicholas Kazlauskas <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: David Francis <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
evadot pushed a commit that referenced this pull request Aug 4, 2020
Lockdep reports a circular locking dependency with pages_lock taken in
the shrinker callback. The deadlock can't actually happen with current
users at least as a BO will never be purgeable when pages_lock is held.
To be safe, let's use mutex_trylock() instead and bail if a BO is locked
already.

WARNING: possible circular locking dependency detected
5.3.0-rc1+ #100 Tainted: G             L
------------------------------------------------------
kswapd0/171 is trying to acquire lock:
000000009b9823fd (&shmem->pages_lock){+.+.}, at: drm_gem_shmem_purge+0x20/0x40

but task is already holding lock:
00000000f82369b6 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x40

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}:
       fs_reclaim_acquire.part.18+0x34/0x40
       fs_reclaim_acquire+0x20/0x28
       __kmalloc_node+0x6c/0x4c0
       kvmalloc_node+0x38/0xa8
       drm_gem_get_pages+0x80/0x1d0
       drm_gem_shmem_get_pages+0x58/0xa0
       drm_gem_shmem_get_pages_sgt+0x48/0xd0
       panfrost_mmu_map+0x38/0xf8 [panfrost]
       panfrost_gem_open+0xc0/0xe8 [panfrost]
       drm_gem_handle_create_tail+0xe8/0x198
       drm_gem_handle_create+0x3c/0x50
       panfrost_gem_create_with_handle+0x70/0xa0 [panfrost]
       panfrost_ioctl_create_bo+0x48/0x80 [panfrost]
       drm_ioctl_kernel+0xb8/0x110
       drm_ioctl+0x244/0x3f0
       do_vfs_ioctl+0xbc/0x910
       ksys_ioctl+0x78/0xa8
       __arm64_sys_ioctl+0x1c/0x28
       el0_svc_common.constprop.0+0x90/0x168
       el0_svc_handler+0x28/0x78
       el0_svc+0x8/0xc

-> #0 (&shmem->pages_lock){+.+.}:
       __lock_acquire+0xa2c/0x1d70
       lock_acquire+0xdc/0x228
       __mutex_lock+0x8c/0x800
       mutex_lock_nested+0x1c/0x28
       drm_gem_shmem_purge+0x20/0x40
       panfrost_gem_shrinker_scan+0xc0/0x180 [panfrost]
       do_shrink_slab+0x208/0x500
       shrink_slab+0x10c/0x2c0
       shrink_node+0x28c/0x4d8
       balance_pgdat+0x2c8/0x570
       kswapd+0x22c/0x638
       kthread+0x128/0x130
       ret_from_fork+0x10/0x18

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(&shmem->pages_lock);
                               lock(fs_reclaim);
  lock(&shmem->pages_lock);

 *** DEADLOCK ***

3 locks held by kswapd0/171:
 #0: 00000000f82369b6 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x40
 #1: 00000000ceb37808 (shrinker_rwsem){++++}, at: shrink_slab+0xbc/0x2c0
 #2: 00000000f31efa81 (&pfdev->shrinker_lock){+.+.}, at: panfrost_gem_shrinker_scan+0x34/0x180 [panfrost]

Fixes: 17acb9f35ed7 ("drm/shmem: Add madvise state and purge helpers")
Cc: Maarten Lankhorst <[email protected]>
Cc: Maxime Ripard <[email protected]>
Cc: Sean Paul <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Acked-by: Alyssa Rosenzweig <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
evadot pushed a commit that referenced this pull request Aug 4, 2020
Daniel Vetter uncovered a nasty cycle in using the mmu-notifiers to
invalidate userptr objects which also happen to be pulled into GGTT
mmaps. That is when we unbind the userptr object (on mmu invalidation),
we revoke all CPU mmaps, which may then recurse into mmu invalidation.

We looked for ways of breaking the cycle, but the revocation on
invalidation is required and cannot be avoided. The only solution we
could see was to not allow such GGTT bindings of userptr objects in the
first place. In practice, no one really wants to use a GGTT mmapping of
a CPU pointer...

Just before Daniel's explosive lockdep patches land in v5.4-rc1, we got
a genuine blip from CI:

<4>[  246.793958] ======================================================
<4>[  246.793972] WARNING: possible circular locking dependency detected
<4>[  246.793989] 5.3.0-gbd6c56f50d15-drmtip_372+ #1 Tainted: G     U
<4>[  246.794003] ------------------------------------------------------
<4>[  246.794017] kswapd0/145 is trying to acquire lock:
<4>[  246.794030] 000000003f565be6 (&dev->struct_mutex/1){+.+.}, at: userptr_mn_invalidate_range_start+0x18f/0x220 [i915]
<4>[  246.794250]
                  but task is already holding lock:
<4>[  246.794263] 000000001799cef9 (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe6/0x2a0
<4>[  246.794291]
                  which lock already depends on the new lock.

<4>[  246.794307]
                  the existing dependency chain (in reverse order) is:
<4>[  246.794322]
                  -> #3 (&anon_vma->rwsem){++++}:
<4>[  246.794344]        down_write+0x33/0x70
<4>[  246.794357]        __vma_adjust+0x3d9/0x7b0
<4>[  246.794370]        __split_vma+0x16a/0x180
<4>[  246.794385]        mprotect_fixup+0x2a5/0x320
<4>[  246.794399]        do_mprotect_pkey+0x208/0x2e0
<4>[  246.794413]        __x64_sys_mprotect+0x16/0x20
<4>[  246.794429]        do_syscall_64+0x55/0x1c0
<4>[  246.794443]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  246.794456]
                  -> #2 (&mapping->i_mmap_rwsem){++++}:
<4>[  246.794478]        down_write+0x33/0x70
<4>[  246.794493]        unmap_mapping_pages+0x48/0x130
<4>[  246.794519]        i915_vma_revoke_mmap+0x81/0x1b0 [i915]
<4>[  246.794519]        i915_vma_unbind+0x11d/0x4a0 [i915]
<4>[  246.794519]        i915_vma_destroy+0x31/0x300 [i915]
<4>[  246.794519]        __i915_gem_free_objects+0xb8/0x4b0 [i915]
<4>[  246.794519]        drm_file_free.part.0+0x1e6/0x290
<4>[  246.794519]        drm_release+0xa6/0xe0
<4>[  246.794519]        __fput+0xc2/0x250
<4>[  246.794519]        task_work_run+0x82/0xb0
<4>[  246.794519]        do_exit+0x35b/0xdb0
<4>[  246.794519]        do_group_exit+0x34/0xb0
<4>[  246.794519]        __x64_sys_exit_group+0xf/0x10
<4>[  246.794519]        do_syscall_64+0x55/0x1c0
<4>[  246.794519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  246.794519]
                  -> #1 (&vm->mutex){+.+.}:
<4>[  246.794519]        i915_gem_shrinker_taints_mutex+0x6d/0xe0 [i915]
<4>[  246.794519]        i915_address_space_init+0x9f/0x160 [i915]
<4>[  246.794519]        i915_ggtt_init_hw+0x55/0x170 [i915]
<4>[  246.794519]        i915_driver_probe+0xc9f/0x1620 [i915]
<4>[  246.794519]        i915_pci_probe+0x43/0x1b0 [i915]
<4>[  246.794519]        pci_device_probe+0x9e/0x120
<4>[  246.794519]        really_probe+0xea/0x3d0
<4>[  246.794519]        driver_probe_device+0x10b/0x120
<4>[  246.794519]        device_driver_attach+0x4a/0x50
<4>[  246.794519]        __driver_attach+0x97/0x130
<4>[  246.794519]        bus_for_each_dev+0x74/0xc0
<4>[  246.794519]        bus_add_driver+0x13f/0x210
<4>[  246.794519]        driver_register+0x56/0xe0
<4>[  246.794519]        do_one_initcall+0x58/0x300
<4>[  246.794519]        do_init_module+0x56/0x1f6
<4>[  246.794519]        load_module+0x25bd/0x2a40
<4>[  246.794519]        __se_sys_finit_module+0xd3/0xf0
<4>[  246.794519]        do_syscall_64+0x55/0x1c0
<4>[  246.794519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  246.794519]
                  -> #0 (&dev->struct_mutex/1){+.+.}:
<4>[  246.794519]        __lock_acquire+0x15d8/0x1e90
<4>[  246.794519]        lock_acquire+0xa6/0x1c0
<4>[  246.794519]        __mutex_lock+0x9d/0x9b0
<4>[  246.794519]        userptr_mn_invalidate_range_start+0x18f/0x220 [i915]
<4>[  246.794519]        __mmu_notifier_invalidate_range_start+0x85/0x110
<4>[  246.794519]        try_to_unmap_one+0x76b/0x860
<4>[  246.794519]        rmap_walk_anon+0x104/0x280
<4>[  246.794519]        try_to_unmap+0xc0/0xf0
<4>[  246.794519]        shrink_page_list+0x561/0xc10
<4>[  246.794519]        shrink_inactive_list+0x220/0x440
<4>[  246.794519]        shrink_node_memcg+0x36e/0x740
<4>[  246.794519]        shrink_node+0xcb/0x490
<4>[  246.794519]        balance_pgdat+0x241/0x580
<4>[  246.794519]        kswapd+0x16c/0x530
<4>[  246.794519]        kthread+0x119/0x130
<4>[  246.794519]        ret_from_fork+0x24/0x50
<4>[  246.794519]
                  other info that might help us debug this:

<4>[  246.794519] Chain exists of:
                    &dev->struct_mutex/1 --> &mapping->i_mmap_rwsem --> &anon_vma->rwsem

<4>[  246.794519]  Possible unsafe locking scenario:

<4>[  246.794519]        CPU0                    CPU1
<4>[  246.794519]        ----                    ----
<4>[  246.794519]   lock(&anon_vma->rwsem);
<4>[  246.794519]                                lock(&mapping->i_mmap_rwsem);
<4>[  246.794519]                                lock(&anon_vma->rwsem);
<4>[  246.794519]   lock(&dev->struct_mutex/1);
<4>[  246.794519]
                   *** DEADLOCK ***

v2: Say no to mmap_ioctl

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111744
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111870
Signed-off-by: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: [email protected]
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit a4311745bba9763e3c965643d4531bd5765b0513)
Signed-off-by: Rodrigo Vivi <[email protected]>
evadot pushed a commit that referenced this pull request Aug 13, 2020
Lockdep reports a circular locking dependency with pages_lock taken in
the shrinker callback. The deadlock can't actually happen with current
users at least as a BO will never be purgeable when pages_lock is held.
To be safe, let's use mutex_trylock() instead and bail if a BO is locked
already.

WARNING: possible circular locking dependency detected
5.3.0-rc1+ #100 Tainted: G             L
------------------------------------------------------
kswapd0/171 is trying to acquire lock:
000000009b9823fd (&shmem->pages_lock){+.+.}, at: drm_gem_shmem_purge+0x20/0x40

but task is already holding lock:
00000000f82369b6 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x40

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}:
       fs_reclaim_acquire.part.18+0x34/0x40
       fs_reclaim_acquire+0x20/0x28
       __kmalloc_node+0x6c/0x4c0
       kvmalloc_node+0x38/0xa8
       drm_gem_get_pages+0x80/0x1d0
       drm_gem_shmem_get_pages+0x58/0xa0
       drm_gem_shmem_get_pages_sgt+0x48/0xd0
       panfrost_mmu_map+0x38/0xf8 [panfrost]
       panfrost_gem_open+0xc0/0xe8 [panfrost]
       drm_gem_handle_create_tail+0xe8/0x198
       drm_gem_handle_create+0x3c/0x50
       panfrost_gem_create_with_handle+0x70/0xa0 [panfrost]
       panfrost_ioctl_create_bo+0x48/0x80 [panfrost]
       drm_ioctl_kernel+0xb8/0x110
       drm_ioctl+0x244/0x3f0
       do_vfs_ioctl+0xbc/0x910
       ksys_ioctl+0x78/0xa8
       __arm64_sys_ioctl+0x1c/0x28
       el0_svc_common.constprop.0+0x90/0x168
       el0_svc_handler+0x28/0x78
       el0_svc+0x8/0xc

-> #0 (&shmem->pages_lock){+.+.}:
       __lock_acquire+0xa2c/0x1d70
       lock_acquire+0xdc/0x228
       __mutex_lock+0x8c/0x800
       mutex_lock_nested+0x1c/0x28
       drm_gem_shmem_purge+0x20/0x40
       panfrost_gem_shrinker_scan+0xc0/0x180 [panfrost]
       do_shrink_slab+0x208/0x500
       shrink_slab+0x10c/0x2c0
       shrink_node+0x28c/0x4d8
       balance_pgdat+0x2c8/0x570
       kswapd+0x22c/0x638
       kthread+0x128/0x130
       ret_from_fork+0x10/0x18

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(&shmem->pages_lock);
                               lock(fs_reclaim);
  lock(&shmem->pages_lock);

 *** DEADLOCK ***

3 locks held by kswapd0/171:
 #0: 00000000f82369b6 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x40
 #1: 00000000ceb37808 (shrinker_rwsem){++++}, at: shrink_slab+0xbc/0x2c0
 #2: 00000000f31efa81 (&pfdev->shrinker_lock){+.+.}, at: panfrost_gem_shrinker_scan+0x34/0x180 [panfrost]

Fixes: 17acb9f35ed7 ("drm/shmem: Add madvise state and purge helpers")
Cc: Maarten Lankhorst <[email protected]>
Cc: Maxime Ripard <[email protected]>
Cc: Sean Paul <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Acked-by: Alyssa Rosenzweig <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
evadot pushed a commit that referenced this pull request Aug 13, 2020
Daniel Vetter uncovered a nasty cycle in using the mmu-notifiers to
invalidate userptr objects which also happen to be pulled into GGTT
mmaps. That is when we unbind the userptr object (on mmu invalidation),
we revoke all CPU mmaps, which may then recurse into mmu invalidation.

We looked for ways of breaking the cycle, but the revocation on
invalidation is required and cannot be avoided. The only solution we
could see was to not allow such GGTT bindings of userptr objects in the
first place. In practice, no one really wants to use a GGTT mmapping of
a CPU pointer...

Just before Daniel's explosive lockdep patches land in v5.4-rc1, we got
a genuine blip from CI:

<4>[  246.793958] ======================================================
<4>[  246.793972] WARNING: possible circular locking dependency detected
<4>[  246.793989] 5.3.0-gbd6c56f50d15-drmtip_372+ #1 Tainted: G     U
<4>[  246.794003] ------------------------------------------------------
<4>[  246.794017] kswapd0/145 is trying to acquire lock:
<4>[  246.794030] 000000003f565be6 (&dev->struct_mutex/1){+.+.}, at: userptr_mn_invalidate_range_start+0x18f/0x220 [i915]
<4>[  246.794250]
                  but task is already holding lock:
<4>[  246.794263] 000000001799cef9 (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe6/0x2a0
<4>[  246.794291]
                  which lock already depends on the new lock.

<4>[  246.794307]
                  the existing dependency chain (in reverse order) is:
<4>[  246.794322]
                  -> #3 (&anon_vma->rwsem){++++}:
<4>[  246.794344]        down_write+0x33/0x70
<4>[  246.794357]        __vma_adjust+0x3d9/0x7b0
<4>[  246.794370]        __split_vma+0x16a/0x180
<4>[  246.794385]        mprotect_fixup+0x2a5/0x320
<4>[  246.794399]        do_mprotect_pkey+0x208/0x2e0
<4>[  246.794413]        __x64_sys_mprotect+0x16/0x20
<4>[  246.794429]        do_syscall_64+0x55/0x1c0
<4>[  246.794443]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  246.794456]
                  -> #2 (&mapping->i_mmap_rwsem){++++}:
<4>[  246.794478]        down_write+0x33/0x70
<4>[  246.794493]        unmap_mapping_pages+0x48/0x130
<4>[  246.794519]        i915_vma_revoke_mmap+0x81/0x1b0 [i915]
<4>[  246.794519]        i915_vma_unbind+0x11d/0x4a0 [i915]
<4>[  246.794519]        i915_vma_destroy+0x31/0x300 [i915]
<4>[  246.794519]        __i915_gem_free_objects+0xb8/0x4b0 [i915]
<4>[  246.794519]        drm_file_free.part.0+0x1e6/0x290
<4>[  246.794519]        drm_release+0xa6/0xe0
<4>[  246.794519]        __fput+0xc2/0x250
<4>[  246.794519]        task_work_run+0x82/0xb0
<4>[  246.794519]        do_exit+0x35b/0xdb0
<4>[  246.794519]        do_group_exit+0x34/0xb0
<4>[  246.794519]        __x64_sys_exit_group+0xf/0x10
<4>[  246.794519]        do_syscall_64+0x55/0x1c0
<4>[  246.794519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  246.794519]
                  -> #1 (&vm->mutex){+.+.}:
<4>[  246.794519]        i915_gem_shrinker_taints_mutex+0x6d/0xe0 [i915]
<4>[  246.794519]        i915_address_space_init+0x9f/0x160 [i915]
<4>[  246.794519]        i915_ggtt_init_hw+0x55/0x170 [i915]
<4>[  246.794519]        i915_driver_probe+0xc9f/0x1620 [i915]
<4>[  246.794519]        i915_pci_probe+0x43/0x1b0 [i915]
<4>[  246.794519]        pci_device_probe+0x9e/0x120
<4>[  246.794519]        really_probe+0xea/0x3d0
<4>[  246.794519]        driver_probe_device+0x10b/0x120
<4>[  246.794519]        device_driver_attach+0x4a/0x50
<4>[  246.794519]        __driver_attach+0x97/0x130
<4>[  246.794519]        bus_for_each_dev+0x74/0xc0
<4>[  246.794519]        bus_add_driver+0x13f/0x210
<4>[  246.794519]        driver_register+0x56/0xe0
<4>[  246.794519]        do_one_initcall+0x58/0x300
<4>[  246.794519]        do_init_module+0x56/0x1f6
<4>[  246.794519]        load_module+0x25bd/0x2a40
<4>[  246.794519]        __se_sys_finit_module+0xd3/0xf0
<4>[  246.794519]        do_syscall_64+0x55/0x1c0
<4>[  246.794519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  246.794519]
                  -> #0 (&dev->struct_mutex/1){+.+.}:
<4>[  246.794519]        __lock_acquire+0x15d8/0x1e90
<4>[  246.794519]        lock_acquire+0xa6/0x1c0
<4>[  246.794519]        __mutex_lock+0x9d/0x9b0
<4>[  246.794519]        userptr_mn_invalidate_range_start+0x18f/0x220 [i915]
<4>[  246.794519]        __mmu_notifier_invalidate_range_start+0x85/0x110
<4>[  246.794519]        try_to_unmap_one+0x76b/0x860
<4>[  246.794519]        rmap_walk_anon+0x104/0x280
<4>[  246.794519]        try_to_unmap+0xc0/0xf0
<4>[  246.794519]        shrink_page_list+0x561/0xc10
<4>[  246.794519]        shrink_inactive_list+0x220/0x440
<4>[  246.794519]        shrink_node_memcg+0x36e/0x740
<4>[  246.794519]        shrink_node+0xcb/0x490
<4>[  246.794519]        balance_pgdat+0x241/0x580
<4>[  246.794519]        kswapd+0x16c/0x530
<4>[  246.794519]        kthread+0x119/0x130
<4>[  246.794519]        ret_from_fork+0x24/0x50
<4>[  246.794519]
                  other info that might help us debug this:

<4>[  246.794519] Chain exists of:
                    &dev->struct_mutex/1 --> &mapping->i_mmap_rwsem --> &anon_vma->rwsem

<4>[  246.794519]  Possible unsafe locking scenario:

<4>[  246.794519]        CPU0                    CPU1
<4>[  246.794519]        ----                    ----
<4>[  246.794519]   lock(&anon_vma->rwsem);
<4>[  246.794519]                                lock(&mapping->i_mmap_rwsem);
<4>[  246.794519]                                lock(&anon_vma->rwsem);
<4>[  246.794519]   lock(&dev->struct_mutex/1);
<4>[  246.794519]
                   *** DEADLOCK ***

v2: Say no to mmap_ioctl

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111744
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111870
Signed-off-by: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: [email protected]
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit a4311745bba9763e3c965643d4531bd5765b0513)
Signed-off-by: Rodrigo Vivi <[email protected]>
evadot pushed a commit that referenced this pull request Aug 18, 2020
[ Upstream commit 08ad9a3846fc72b047b110b36d162ffbcf298fa2 ]

As we may signal a request and take the engine->active.lock within the
signaler, the engine submission paths have to use a nested annotation on
their requests -- but we guarantee that we can never submit on the same
engine as the signaling fence.

<4>[  723.763281] WARNING: possible circular locking dependency detected
<4>[  723.763285] 5.3.0-g80fa0e042cdb-drmtip_379+ #1 Tainted: G     U
<4>[  723.763288] ------------------------------------------------------
<4>[  723.763291] gem_exec_await/1388 is trying to acquire lock:
<4>[  723.763294] ffff93a7b53221d8 (&engine->active.lock){..-.}, at: execlists_submit_request+0x2b/0x1e0 [i915]
<4>[  723.763378]
                  but task is already holding lock:
<4>[  723.763381] ffff93a7c25f6d20 (&i915_request_get(rq)->submit/1){-.-.}, at: __i915_sw_fence_complete+0x1b2/0x250 [i915]
<4>[  723.763420]
                  which lock already depends on the new lock.

<4>[  723.763423]
                  the existing dependency chain (in reverse order) is:
<4>[  723.763427]
                  -> #2 (&i915_request_get(rq)->submit/1){-.-.}:
<4>[  723.763434]        _raw_spin_lock_irqsave_nested+0x39/0x50
<4>[  723.763478]        __i915_sw_fence_complete+0x1b2/0x250 [i915]
<4>[  723.763513]        intel_engine_breadcrumbs_irq+0x3aa/0x5e0 [i915]
<4>[  723.763600]        cs_irq_handler+0x49/0x50 [i915]
<4>[  723.763659]        gen11_gt_irq_handler+0x17b/0x280 [i915]
<4>[  723.763690]        gen11_irq_handler+0x54/0xf0 [i915]
<4>[  723.763695]        __handle_irq_event_percpu+0x41/0x2d0
<4>[  723.763699]        handle_irq_event_percpu+0x2b/0x70
<4>[  723.763702]        handle_irq_event+0x2f/0x50
<4>[  723.763706]        handle_edge_irq+0xee/0x1a0
<4>[  723.763709]        do_IRQ+0x7e/0x160
<4>[  723.763712]        ret_from_intr+0x0/0x1d
<4>[  723.763717]        __slab_alloc.isra.28.constprop.33+0x4f/0x70
<4>[  723.763720]        kmem_cache_alloc+0x28d/0x2f0
<4>[  723.763724]        vm_area_dup+0x15/0x40
<4>[  723.763727]        dup_mm+0x2dd/0x550
<4>[  723.763730]        copy_process+0xf21/0x1ef0
<4>[  723.763734]        _do_fork+0x71/0x670
<4>[  723.763737]        __se_sys_clone+0x6e/0xa0
<4>[  723.763741]        do_syscall_64+0x4f/0x210
<4>[  723.763744]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  723.763747]
                  -> #1 (&(&rq->lock)->rlock#2){-.-.}:
<4>[  723.763752]        _raw_spin_lock+0x2a/0x40
<4>[  723.763789]        __unwind_incomplete_requests+0x3eb/0x450 [i915]
<4>[  723.763825]        __execlists_submission_tasklet+0x9ec/0x1d60 [i915]
<4>[  723.763864]        execlists_submission_tasklet+0x34/0x50 [i915]
<4>[  723.763874]        tasklet_action_common.isra.5+0x47/0xb0
<4>[  723.763878]        __do_softirq+0xd8/0x4ae
<4>[  723.763881]        irq_exit+0xa9/0xc0
<4>[  723.763883]        smp_apic_timer_interrupt+0xb7/0x280
<4>[  723.763887]        apic_timer_interrupt+0xf/0x20
<4>[  723.763892]        cpuidle_enter_state+0xae/0x450
<4>[  723.763895]        cpuidle_enter+0x24/0x40
<4>[  723.763899]        do_idle+0x1e7/0x250
<4>[  723.763902]        cpu_startup_entry+0x14/0x20
<4>[  723.763905]        start_secondary+0x15f/0x1b0
<4>[  723.763908]        secondary_startup_64+0xa4/0xb0
<4>[  723.763911]
                  -> #0 (&engine->active.lock){..-.}:
<4>[  723.763916]        __lock_acquire+0x15d8/0x1ea0
<4>[  723.763919]        lock_acquire+0xa6/0x1c0
<4>[  723.763922]        _raw_spin_lock_irqsave+0x33/0x50
<4>[  723.763956]        execlists_submit_request+0x2b/0x1e0 [i915]
<4>[  723.764002]        submit_notify+0xa8/0x13c [i915]
<4>[  723.764035]        __i915_sw_fence_complete+0x81/0x250 [i915]
<4>[  723.764054]        i915_sw_fence_wake+0x51/0x64 [i915]
<4>[  723.764054]        __i915_sw_fence_complete+0x1ee/0x250 [i915]
<4>[  723.764054]        dma_i915_sw_fence_wake_timer+0x14/0x20 [i915]
<4>[  723.764054]        dma_fence_signal_locked+0x9e/0x1c0
<4>[  723.764054]        dma_fence_signal+0x1f/0x40
<4>[  723.764054]        vgem_fence_signal_ioctl+0x67/0xc0 [vgem]
<4>[  723.764054]        drm_ioctl_kernel+0x83/0xf0
<4>[  723.764054]        drm_ioctl+0x2f3/0x3b0
<4>[  723.764054]        do_vfs_ioctl+0xa0/0x6f0
<4>[  723.764054]        ksys_ioctl+0x35/0x60
<4>[  723.764054]        __x64_sys_ioctl+0x11/0x20
<4>[  723.764054]        do_syscall_64+0x4f/0x210
<4>[  723.764054]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  723.764054]
                  other info that might help us debug this:

<4>[  723.764054] Chain exists of:
                    &engine->active.lock --> &(&rq->lock)->rlock#2 --> &i915_request_get(rq)->submit/1

<4>[  723.764054]  Possible unsafe locking scenario:

<4>[  723.764054]        CPU0                    CPU1
<4>[  723.764054]        ----                    ----
<4>[  723.764054]   lock(&i915_request_get(rq)->submit/1);
<4>[  723.764054]                                lock(&(&rq->lock)->rlock#2);
<4>[  723.764054]                                lock(&i915_request_get(rq)->submit/1);
<4>[  723.764054]   lock(&engine->active.lock);
<4>[  723.764054]
                   *** DEADLOCK ***

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111862
Signed-off-by: Chris Wilson <[email protected]>
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
evadot pushed a commit that referenced this pull request Aug 18, 2020
[ Upstream commit a86675968e2300fb567994459da3dbc4cd1b322a ]

This reverts commit 64e62bdf04ab8529f45ed0a85122c703035dec3a.

This commit ends up causing some lockdep splats due to trying to grab the
payload lock while holding the mgr's lock:

[   54.010099]
[   54.011765] ======================================================
[   54.018670] WARNING: possible circular locking dependency detected
[   54.025577] 5.5.0-rc6-02274-g77381c23ee63 #47 Not tainted
[   54.031610] ------------------------------------------------------
[   54.038516] kworker/1:6/1040 is trying to acquire lock:
[   54.044354] ffff888272af3228 (&mgr->payload_lock){+.+.}, at:
drm_dp_mst_topology_mgr_set_mst+0x218/0x2e4
[   54.054957]
[   54.054957] but task is already holding lock:
[   54.061473] ffff888272af3060 (&mgr->lock){+.+.}, at:
drm_dp_mst_topology_mgr_set_mst+0x3c/0x2e4
[   54.071193]
[   54.071193] which lock already depends on the new lock.
[   54.071193]
[   54.080334]
[   54.080334] the existing dependency chain (in reverse order) is:
[   54.088697]
[   54.088697] -> #1 (&mgr->lock){+.+.}:
[   54.094440]        __mutex_lock+0xc3/0x498
[   54.099015]        drm_dp_mst_topology_get_port_validated+0x25/0x80
[   54.106018]        drm_dp_update_payload_part1+0xa2/0x2e2
[   54.112051]        intel_mst_pre_enable_dp+0x144/0x18f
[   54.117791]        intel_encoders_pre_enable+0x63/0x70
[   54.123532]        hsw_crtc_enable+0xa1/0x722
[   54.128396]        intel_update_crtc+0x50/0x194
[   54.133455]        skl_commit_modeset_enables+0x40c/0x540
[   54.139485]        intel_atomic_commit_tail+0x5f7/0x130d
[   54.145418]        intel_atomic_commit+0x2c8/0x2d8
[   54.150770]        drm_atomic_helper_set_config+0x5a/0x70
[   54.156801]        drm_mode_setcrtc+0x2ab/0x833
[   54.161862]        drm_ioctl+0x2e5/0x424
[   54.166242]        vfs_ioctl+0x21/0x2f
[   54.170426]        do_vfs_ioctl+0x5fb/0x61e
[   54.175096]        ksys_ioctl+0x55/0x75
[   54.179377]        __x64_sys_ioctl+0x1a/0x1e
[   54.184146]        do_syscall_64+0x5c/0x6d
[   54.188721]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   54.194946]
[   54.194946] -> #0 (&mgr->payload_lock){+.+.}:
[   54.201463]
[   54.201463] other info that might help us debug this:
[   54.201463]
[   54.210410]  Possible unsafe locking scenario:
[   54.210410]
[   54.217025]        CPU0                    CPU1
[   54.222082]        ----                    ----
[   54.227138]   lock(&mgr->lock);
[   54.230643]                                lock(&mgr->payload_lock);
[   54.237742]                                lock(&mgr->lock);
[   54.244062]   lock(&mgr->payload_lock);
[   54.248346]
[   54.248346]  *** DEADLOCK ***
[   54.248346]
[   54.254959] 7 locks held by kworker/1:6/1040:
[   54.259822]  #0: ffff888275c4f528 ((wq_completion)events){+.+.},
at: worker_thread+0x455/0x6e2
[   54.269451]  #1: ffffc9000119beb0
((work_completion)(&(&dev_priv->hotplug.hotplug_work)->work)){+.+.},
at: worker_thread+0x455/0x6e2
[   54.282768]  #2: ffff888272a403f0 (&dev->mode_config.mutex){+.+.},
at: i915_hotplug_work_func+0x4b/0x2be
[   54.293368]  #3: ffffffff824fc6c0 (drm_connector_list_iter){.+.+},
at: i915_hotplug_work_func+0x17e/0x2be
[   54.304061]  #4: ffffc9000119bc58 (crtc_ww_class_acquire){+.+.},
at: drm_helper_probe_detect_ctx+0x40/0xfd
[   54.314855]  #5: ffff888272a40470 (crtc_ww_class_mutex){+.+.}, at:
drm_modeset_lock+0x74/0xe2
[   54.324385]  #6: ffff888272af3060 (&mgr->lock){+.+.}, at:
drm_dp_mst_topology_mgr_set_mst+0x3c/0x2e4
[   54.334597]
[   54.334597] stack backtrace:
[   54.339464] CPU: 1 PID: 1040 Comm: kworker/1:6 Not tainted
5.5.0-rc6-02274-g77381c23ee63 #47
[   54.348893] Hardware name: Google Fizz/Fizz, BIOS
Google_Fizz.10139.39.0 01/04/2018
[   54.357451] Workqueue: events i915_hotplug_work_func
[   54.362995] Call Trace:
[   54.365724]  dump_stack+0x71/0x9c
[   54.369427]  check_noncircular+0x91/0xbc
[   54.373809]  ? __lock_acquire+0xc9e/0xf66
[   54.378286]  ? __lock_acquire+0xc9e/0xf66
[   54.382763]  ? lock_acquire+0x175/0x1ac
[   54.387048]  ? drm_dp_mst_topology_mgr_set_mst+0x218/0x2e4
[   54.393177]  ? __mutex_lock+0xc3/0x498
[   54.397362]  ? drm_dp_mst_topology_mgr_set_mst+0x218/0x2e4
[   54.403492]  ? drm_dp_mst_topology_mgr_set_mst+0x218/0x2e4
[   54.409620]  ? drm_dp_dpcd_access+0xd9/0x101
[   54.414390]  ? drm_dp_mst_topology_mgr_set_mst+0x218/0x2e4
[   54.420517]  ? drm_dp_mst_topology_mgr_set_mst+0x218/0x2e4
[   54.426645]  ? intel_digital_port_connected+0x34d/0x35c
[   54.432482]  ? intel_dp_detect+0x227/0x44e
[   54.437056]  ? ww_mutex_lock+0x49/0x9a
[   54.441242]  ? drm_helper_probe_detect_ctx+0x75/0xfd
[   54.446789]  ? intel_encoder_hotplug+0x4b/0x97
[   54.451752]  ? intel_ddi_hotplug+0x61/0x2e0
[   54.456423]  ? mark_held_locks+0x53/0x68
[   54.460803]  ? _raw_spin_unlock_irqrestore+0x3a/0x51
[   54.466347]  ? lockdep_hardirqs_on+0x187/0x1a4
[   54.471310]  ? drm_connector_list_iter_next+0x89/0x9a
[   54.476953]  ? i915_hotplug_work_func+0x206/0x2be
[   54.482208]  ? worker_thread+0x4d5/0x6e2
[   54.486587]  ? worker_thread+0x455/0x6e2
[   54.490966]  ? queue_work_on+0x64/0x64
[   54.495151]  ? kthread+0x1e9/0x1f1
[   54.498946]  ? queue_work_on+0x64/0x64
[   54.503130]  ? kthread_unpark+0x5e/0x5e
[   54.507413]  ? ret_from_fork+0x3a/0x50

The proper fix for this is probably cleanup the VCPI allocations when we're
enabling the topology, or on the first payload allocation. For now though,
let's just revert.

Signed-off-by: Lyude Paul <[email protected]>
Fixes: 64e62bdf04ab ("drm/dp_mst: Remove VCPI while disabling topology mgr")
Cc: Sean Paul <[email protected]>
Cc: Wayne Lin <[email protected]>
Reviewed-by: Sean Paul <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Aug 16, 2025
The GGTT is currently a 32 bit address space, but the HW and GuC
support 48b addresses in GGTT-related operations, both to keep the
interface/HW paths common between PPGTT and GGTT and to allow for
future increase of the GGTT size.
This leaves us having to program a 64b field with a 32b offset, which
currently we're in some cases doing this by using an upper_32_bits()
call on a 32b variable, which doesn't make any sense. To do this cleanly
we have 2 options:

1 - Set the upper 32 bits directly to zero.
2 - Use 64b variables for the offset and keep programming the whole thing,
    so we're ready if we ever have bigger offsets.

This patch goes with option freebsd#2 and switches the related variables to u64.

v2: don't change the log ctl flag variable (John)

Signed-off-by: Daniele Ceraolo Spurio <[email protected]>
Cc: John Harrison <[email protected]>
Reviewed-by: John Harrison <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Aug 16, 2025
Bos can be put with multiple unrelated dma-resv locks held. But
imported bos attempt to grab the bo dma-resv during dma-buf detach
that typically happens during cleanup. That leads to lockde splats
similar to the below and a potential ABBA deadlock.

Fix this by always taking the delayed workqueue cleanup path for
imported bos.

Requesting stable fixes from when the Xe driver was introduced,
since its usage of drm_exec and wide vm dma_resvs appear to be
the first reliable trigger of this.

[22982.116427] ============================================
[22982.116428] WARNING: possible recursive locking detected
[22982.116429] 6.10.0-rc2+ freebsd#10 Tainted: G     U  W
[22982.116430] --------------------------------------------
[22982.116430] glxgears:sh0/5785 is trying to acquire lock:
[22982.116431] ffff8c2bafa539a8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: dma_buf_detach+0x3b/0xf0
[22982.116438]
               but task is already holding lock:
[22982.116438] ffff8c2d9aba6da8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[22982.116442]
               other info that might help us debug this:
[22982.116442]  Possible unsafe locking scenario:

[22982.116443]        CPU0
[22982.116444]        ----
[22982.116444]   lock(reservation_ww_class_mutex);
[22982.116445]   lock(reservation_ww_class_mutex);
[22982.116447]
                *** DEADLOCK ***

[22982.116447]  May be due to missing lock nesting notation

[22982.116448] 5 locks held by glxgears:sh0/5785:
[22982.116449]  #0: ffff8c2d9aba58c8 (&xef->vm.lock){+.+.}-{3:3}, at: xe_file_close+0xde/0x1c0 [xe]
[22982.116507]  freebsd#1: ffff8c2e28cc8480 (&vm->lock){++++}-{3:3}, at: xe_vm_close_and_put+0x161/0x9b0 [xe]
[22982.116578]  freebsd#2: ffff8c2e31982970 (&val->lock){.+.+}-{3:3}, at: xe_validation_ctx_init+0x6d/0x70 [xe]
[22982.116647]  freebsd#3: ffffacdc469478a8 (reservation_ww_class_acquire){+.+.}-{0:0}, at: xe_vma_destroy_unlocked+0x7f/0xe0 [xe]
[22982.116716]  freebsd#4: ffff8c2d9aba6da8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[22982.116719]
               stack backtrace:
[22982.116720] CPU: 8 PID: 5785 Comm: glxgears:sh0 Tainted: G     U  W          6.10.0-rc2+ freebsd#10
[22982.116721] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
[22982.116723] Call Trace:
[22982.116724]  <TASK>
[22982.116725]  dump_stack_lvl+0x77/0xb0
[22982.116727]  __lock_acquire+0x1232/0x2160
[22982.116730]  lock_acquire+0xcb/0x2d0
[22982.116732]  ? dma_buf_detach+0x3b/0xf0
[22982.116734]  ? __lock_acquire+0x417/0x2160
[22982.116736]  __ww_mutex_lock.constprop.0+0xd0/0x13b0
[22982.116738]  ? dma_buf_detach+0x3b/0xf0
[22982.116741]  ? dma_buf_detach+0x3b/0xf0
[22982.116743]  ? ww_mutex_lock+0x2b/0x90
[22982.116745]  ww_mutex_lock+0x2b/0x90
[22982.116747]  dma_buf_detach+0x3b/0xf0
[22982.116749]  drm_prime_gem_destroy+0x2f/0x40 [drm]
[22982.116775]  xe_ttm_bo_destroy+0x32/0x220 [xe]
[22982.116818]  ? __mutex_unlock_slowpath+0x3a/0x290
[22982.116821]  drm_exec_unlock_all+0xa1/0xd0 [drm_exec]
[22982.116823]  drm_exec_fini+0x12/0xb0 [drm_exec]
[22982.116824]  xe_validation_ctx_fini+0x15/0x40 [xe]
[22982.116892]  xe_vma_destroy_unlocked+0xb1/0xe0 [xe]
[22982.116959]  xe_vm_close_and_put+0x41a/0x9b0 [xe]
[22982.117025]  ? xa_find+0xe3/0x1e0
[22982.117028]  xe_file_close+0x10a/0x1c0 [xe]
[22982.117074]  drm_file_free+0x22a/0x280 [drm]
[22982.117099]  drm_release_noglobal+0x22/0x70 [drm]
[22982.117119]  __fput+0xf1/0x2d0
[22982.117122]  task_work_run+0x59/0x90
[22982.117125]  do_exit+0x330/0xb40
[22982.117127]  do_group_exit+0x36/0xa0
[22982.117129]  get_signal+0xbd2/0xbe0
[22982.117131]  arch_do_signal_or_restart+0x3e/0x240
[22982.117134]  syscall_exit_to_user_mode+0x1e7/0x290
[22982.117137]  do_syscall_64+0xa1/0x180
[22982.117139]  ? lock_acquire+0xcb/0x2d0
[22982.117140]  ? __set_task_comm+0x28/0x1e0
[22982.117141]  ? find_held_lock+0x2b/0x80
[22982.117144]  ? __set_task_comm+0xe1/0x1e0
[22982.117145]  ? lock_release+0xca/0x290
[22982.117147]  ? __do_sys_prctl+0x245/0xab0
[22982.117149]  ? lockdep_hardirqs_on_prepare+0xde/0x190
[22982.117150]  ? syscall_exit_to_user_mode+0xb0/0x290
[22982.117152]  ? do_syscall_64+0xa1/0x180
[22982.117154]  ? __lock_acquire+0x417/0x2160
[22982.117155]  ? reacquire_held_locks+0xd1/0x1f0
[22982.117156]  ? do_user_addr_fault+0x30c/0x790
[22982.117158]  ? lock_acquire+0xcb/0x2d0
[22982.117160]  ? find_held_lock+0x2b/0x80
[22982.117162]  ? do_user_addr_fault+0x357/0x790
[22982.117163]  ? lock_release+0xca/0x290
[22982.117164]  ? do_user_addr_fault+0x361/0x790
[22982.117166]  ? trace_hardirqs_off+0x4b/0xc0
[22982.117168]  ? clear_bhb_loop+0x45/0xa0
[22982.117170]  ? clear_bhb_loop+0x45/0xa0
[22982.117172]  ? clear_bhb_loop+0x45/0xa0
[22982.117174]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[22982.117176] RIP: 0033:0x7f943d267169
[22982.117192] Code: Unable to access opcode bytes at 0x7f943d26713f.
[22982.117193] RSP: 002b:00007f9430bffc80 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[22982.117195] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f943d267169
[22982.117196] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005622f89579d0
[22982.117197] RBP: 00007f9430bffcb0 R08: 0000000000000000 R09: 00000000ffffffff
[22982.117198] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[22982.117199] R13: 0000000000000000 R14: 0000000000000000 R15: 00005622f89579d0
[22982.117202]  </TASK>

Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Christian König <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: <[email protected]> # v6.8+
Signed-off-by: Thomas Hellström <[email protected]>
Reviewed-by: Matthew Brost <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Aug 16, 2025
 everywhere

This just standardizes the use of MIN() and MAX() macros, with the very
traditional semantics.  The goal is to use these for C constant
expressions and for top-level / static initializers, and so be able to
simplify the min()/max() macros.

These macro names were used by various kernel code - they are very
traditional, after all - and all such users have been fixed up, with a
few different approaches:

 - trivial duplicated macro definitions have been removed

   Note that 'trivial' here means that it's obviously kernel code that
   already included all the major kernel headers, and thus gets the new
   generic MIN/MAX macros automatically.

 - non-trivial duplicated macro definitions are guarded with #ifndef

   This is the "yes, they define their own versions, but no, the include
   situation is not entirely obvious, and maybe they don't get the
   generic version automatically" case.

 - strange use case freebsd#1

   A couple of drivers decided that the way they want to describe their
   versioning is with

        #define MAJ 1
        #define MIN 2
        #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN)

   which adds zero value and I just did my Alexander the Great
   impersonation, and rewrote that pointless Gordian knot as

        #define DRV_VERSION "1.2"

   instead.

 - strange use case freebsd#2

   A couple of drivers thought that it's a good idea to have a random
   'MIN' or 'MAX' define for a value or index into a table, rather than
   the traditional macro that takes arguments.

   These values were re-written as C enum's instead. The new
   function-line macros only expand when followed by an open
   parenthesis, and thus don't clash with enum use.

Happily, there weren't really all that many of these cases, and a lot of
users already had the pattern of using '#ifndef' guarding (or in one
case just using '#undef MIN') before defining their own private version
that does the same thing. I left such cases alone.

Cc: David Laight <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Aug 16, 2025
Remove intel_dp_mst_suspend/resume from runtime suspend resume
sequences. It is incorrect as it depends on AUX transfers which
itself depend on the device being runtime resumed. This is
also why we see a lock_dep splat here.

<4> [76.011119] kworker/4:2/192 is trying to acquire lock:
<4> [76.011122] ffff8881120b3210 (&mgr->lock#2){+.+.}-{3:3}, at:
drm_dp_mst_topology_mgr_suspend+0x33/0xd0 [drm_display_helper]
<4> [76.011142]
but task is already holding lock:
<4> [76.011144] ffffffffa0bc3420
(xe_pm_runtime_lockdep_map){+.+.}-{0:0}, at:
xe_pm_runtime_suspend+0x51/0x3f0 [xe]
<4> [76.011223]
which lock already depends on the new lock.
<4> [76.011226]
the existing dependency chain (in reverse order) is:
<4> [76.011229]
-> freebsd#2 (xe_pm_runtime_lockdep_map){+.+.}-{0:0}:
<4> [76.011233]        pm_runtime_lockdep_prime+0x2f/0x50 [xe]
<4> [76.011306]        xe_pm_runtime_resume_and_get+0x29/0x90 [xe]
<4> [76.011377]        intel_display_power_get+0x24/0x70 [xe]
<4> [76.011466]        intel_digital_port_connected_locked+0x4c/0xf0
[xe]
<4> [76.011551]        intel_dp_aux_xfer+0xb8/0x7c0 [xe]
<4> [76.011633]        intel_dp_aux_transfer+0x166/0x2e0 [xe]
<4> [76.011715]        drm_dp_dpcd_access+0x87/0x150
[drm_display_helper]
<4> [76.011726]        drm_dp_dpcd_probe+0x3d/0xf0 [drm_display_helper]
<4> [76.011737]        drm_dp_dpcd_read+0xdd/0x130 [drm_display_helper]
<4> [76.011747]        intel_dp_get_colorimetry_status+0x3a/0x70 [xe]
<4> [76.011886]        intel_dp_init_connector+0x4ff/0x1030 [xe]
<4> [76.011969]        intel_ddi_init+0xc5b/0x1030 [xe]
<4> [76.012058]        intel_bios_for_each_encoder+0x36/0x60 [xe]
<4> [76.012145]        intel_setup_outputs+0x201/0x460 [xe]
<4> [76.012233]        intel_display_driver_probe_nogem+0x155/0x1e0 [xe]
<4> [76.012320]        xe_display_init_noaccel+0x27/0x70 [xe]

Signed-off-by: Suraj Kandpal <[email protected]>
Reviewed-by: Rodrigo Vivi <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Aug 16, 2025
Do not do intel_fbdev_set_suspend during runtime_suspend/resume
functions. This cause a big circular lock_dep splat.

kworker/0:4/198 is trying to acquire lock:
<4> [77.185594] ffffffff83398500 (console_lock){+.+.}-{0:0}, at:
intel_fbdev_set_suspend+0x169/0x1f0 [xe]
<4> [77.185947]
but task is already holding lock:
<4> [77.185949] ffffffffa09e9460
(xe_pm_runtime_lockdep_map){+.+.}-{0:0}, at:
xe_pm_runtime_suspend+0x51/0x3f0 [xe]
<4> [77.186262]
which lock already depends on the new lock.
<4> [77.186264]
the existing dependency chain (in reverse order) is:
<4> [77.186266]
-> freebsd#2 (xe_pm_runtime_lockdep_map){+.+.}-{0:0}:
<4> [77.186276]        pm_runtime_lockdep_prime+0x2f/0x50 [xe]
<4> [77.186572]        xe_pm_runtime_resume_and_get+0x29/0x90 [xe]
<4> [77.186867]        intelfb_create+0x150/0x390 [xe]
<4> [77.187197]
__drm_fb_helper_initial_config_and_unlock+0x31c/0x5e0 [drm_kms_helper]
<4> [77.187243]        drm_fb_helper_initial_config+0x3d/0x50
[drm_kms_helper]
<4> [77.187274]        intel_fbdev_client_hotplug+0xb1/0x140 [xe]
<4> [77.187603]        drm_client_register+0x87/0xd0 [drm]
<4> [77.187704]        intel_fbdev_setup+0x51c/0x640 [xe]
<4> [77.188033]        intel_display_driver_register+0xb7/0xf0 [xe]
<4> [77.188438]        xe_display_register+0x21/0x40 [xe]
<4> [77.188809]        xe_device_probe+0xa8d/0xbf0 [xe]
<4> [77.189035]        xe_pci_probe+0x333/0x5b0 [xe]
<4> [77.189330]        local_pci_probe+0x48/0xb0
<4> [77.189341]        pci_device_probe+0xc8/0x280
<4> [77.189351]        really_probe+0xf8/0x390
<4> [77.189362]        __driver_probe_device+0x8a/0x170
<4> [77.189373]        driver_probe_device+0x23/0xb0

Signed-off-by: Suraj Kandpal <[email protected]>
Reviewed-by: Rodrigo Vivi <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Aug 16, 2025
This commit addresses a circular locking dependency issue within the GFX
isolation mechanism. The problem was identified by a warning indicating
a potential deadlock due to inconsistent lock acquisition order.

- The `amdgpu_gfx_enforce_isolation_ring_begin_use` and
  `amdgpu_gfx_enforce_isolation_ring_end_use` functions previously
  acquired `enforce_isolation_mutex` and called `amdgpu_gfx_kfd_sch_ctrl`,
  leading to potential deadlocks. ie., If `amdgpu_gfx_kfd_sch_ctrl` is
  called while `enforce_isolation_mutex` is held, and
  `amdgpu_gfx_enforce_isolation_handler` is called while `kfd_sch_mutex` is
  held, it can create a circular dependency.

By ensuring consistent lock usage, this fix resolves the issue:

[  606.297333] ======================================================
[  606.297343] WARNING: possible circular locking dependency detected
[  606.297353] 6.10.0-amd-mlkd-610-311224-lof freebsd#19 Tainted: G           OE
[  606.297365] ------------------------------------------------------
[  606.297375] kworker/u96:3/3825 is trying to acquire lock:
[  606.297385] ffff9aa64e431cb8 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}, at: __flush_work+0x232/0x610
[  606.297413]
               but task is already holding lock:
[  606.297423] ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.297725]
               which lock already depends on the new lock.

[  606.297738]
               the existing dependency chain (in reverse order) is:
[  606.297749]
               -> freebsd#2 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}:
[  606.297765]        __mutex_lock+0x85/0x930
[  606.297776]        mutex_lock_nested+0x1b/0x30
[  606.297786]        amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.298007]        amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.298225]        amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.298412]        amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.298603]        amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.298866]        drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.298880]        process_one_work+0x21e/0x680
[  606.298890]        worker_thread+0x190/0x350
[  606.298899]        kthread+0xe7/0x120
[  606.298908]        ret_from_fork+0x3c/0x60
[  606.298919]        ret_from_fork_asm+0x1a/0x30
[  606.298929]
               -> freebsd#1 (&adev->enforce_isolation_mutex){+.+.}-{3:3}:
[  606.298947]        __mutex_lock+0x85/0x930
[  606.298956]        mutex_lock_nested+0x1b/0x30
[  606.298966]        amdgpu_gfx_enforce_isolation_handler+0x87/0x370 [amdgpu]
[  606.299190]        process_one_work+0x21e/0x680
[  606.299199]        worker_thread+0x190/0x350
[  606.299208]        kthread+0xe7/0x120
[  606.299217]        ret_from_fork+0x3c/0x60
[  606.299227]        ret_from_fork_asm+0x1a/0x30
[  606.299236]
               -> #0 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}:
[  606.299257]        __lock_acquire+0x16f9/0x2810
[  606.299267]        lock_acquire+0xd1/0x300
[  606.299276]        __flush_work+0x250/0x610
[  606.299286]        cancel_delayed_work_sync+0x71/0x80
[  606.299296]        amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu]
[  606.299509]        amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.299723]        amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.299909]        amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.300101]        amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.300355]        drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.300369]        process_one_work+0x21e/0x680
[  606.300378]        worker_thread+0x190/0x350
[  606.300387]        kthread+0xe7/0x120
[  606.300396]        ret_from_fork+0x3c/0x60
[  606.300406]        ret_from_fork_asm+0x1a/0x30
[  606.300416]
               other info that might help us debug this:

[  606.300428] Chain exists of:
                 (work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work) --> &adev->enforce_isolation_mutex --> &adev->gfx.kfd_sch_mutex

[  606.300458]  Possible unsafe locking scenario:

[  606.300468]        CPU0                    CPU1
[  606.300476]        ----                    ----
[  606.300484]   lock(&adev->gfx.kfd_sch_mutex);
[  606.300494]                                lock(&adev->enforce_isolation_mutex);
[  606.300508]                                lock(&adev->gfx.kfd_sch_mutex);
[  606.300521]   lock((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work));
[  606.300536]
                *** DEADLOCK ***

[  606.300546] 5 locks held by kworker/u96:3/3825:
[  606.300555]  #0: ffff9aa5aa1f5d58 ((wq_completion)comp_1.1.0){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680
[  606.300577]  freebsd#1: ffffaa53c3c97e40 ((work_completion)(&sched->work_run_job)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680
[  606.300600]  freebsd#2: ffff9aa64e463c98 (&adev->enforce_isolation_mutex){+.+.}-{3:3}, at: amdgpu_gfx_enforce_isolation_ring_begin_use+0x1c3/0x5d0 [amdgpu]
[  606.300837]  freebsd#3: ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.301062]  freebsd#4: ffffffff8c1a5660 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x70/0x610
[  606.301083]
               stack backtrace:
[  606.301092] CPU: 14 PID: 3825 Comm: kworker/u96:3 Tainted: G           OE      6.10.0-amd-mlkd-610-311224-lof freebsd#19
[  606.301109] Hardware name: Gigabyte Technology Co., Ltd. X570S GAMING X/X570S GAMING X, BIOS F7 03/22/2024
[  606.301124] Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
[  606.301140] Call Trace:
[  606.301146]  <TASK>
[  606.301154]  dump_stack_lvl+0x9b/0xf0
[  606.301166]  dump_stack+0x10/0x20
[  606.301175]  print_circular_bug+0x26c/0x340
[  606.301187]  check_noncircular+0x157/0x170
[  606.301197]  ? register_lock_class+0x48/0x490
[  606.301213]  __lock_acquire+0x16f9/0x2810
[  606.301230]  lock_acquire+0xd1/0x300
[  606.301239]  ? __flush_work+0x232/0x610
[  606.301250]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.301261]  ? mark_held_locks+0x54/0x90
[  606.301274]  ? __flush_work+0x232/0x610
[  606.301284]  __flush_work+0x250/0x610
[  606.301293]  ? __flush_work+0x232/0x610
[  606.301305]  ? __pfx_wq_barrier_func+0x10/0x10
[  606.301318]  ? mark_held_locks+0x54/0x90
[  606.301331]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.301345]  cancel_delayed_work_sync+0x71/0x80
[  606.301356]  amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu]
[  606.301661]  amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.302050]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.302069]  amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.302452]  amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.302862]  ? drm_sched_entity_error+0x82/0x190 [gpu_sched]
[  606.302890]  amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.303366]  drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.303388]  process_one_work+0x21e/0x680
[  606.303409]  worker_thread+0x190/0x350
[  606.303424]  ? __pfx_worker_thread+0x10/0x10
[  606.303437]  kthread+0xe7/0x120
[  606.303449]  ? __pfx_kthread+0x10/0x10
[  606.303463]  ret_from_fork+0x3c/0x60
[  606.303476]  ? __pfx_kthread+0x10/0x10
[  606.303489]  ret_from_fork_asm+0x1a/0x30
[  606.303512]  </TASK>

v2: Refactor lock handling to resolve circular dependency (Alex)

- Introduced a `sched_work` flag to defer the call to
  `amdgpu_gfx_kfd_sch_ctrl` until after releasing
  `enforce_isolation_mutex`.
- This change ensures that `amdgpu_gfx_kfd_sch_ctrl` is called outside
  the critical section, preventing the circular dependency and deadlock.
- The `sched_work` flag is set within the mutex-protected section if
  conditions are met, and the actual function call is made afterward.
- This approach ensures consistent lock acquisition order.

Fixes: afefd6f24502 ("drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization")
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Suggested-by: Alex Deucher <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 0b6b2dd38336d5fd49214f0e4e6495e658e3ab44)
Cc: [email protected]
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Aug 16, 2025
…void Priority Inversion in SRIOV

RLCG Register Access is a way for virtual functions to safely access GPU
registers in a virtualized environment., including TLB flushes and
register reads. When multiple threads or VFs try to access the same
registers simultaneously, it can lead to race conditions. By using the
RLCG interface, the driver can serialize access to the registers. This
means that only one thread can access the registers at a time,
preventing conflicts and ensuring that operations are performed
correctly. Additionally, when a low-priority task holds a mutex that a
high-priority task needs, ie., If a thread holding a spinlock tries to
acquire a mutex, it can lead to priority inversion. register access in
amdgpu_virt_rlcg_reg_rw especially in a fast code path is critical.

The call stack shows that the function amdgpu_virt_rlcg_reg_rw is being
called, which attempts to acquire the mutex. This function is invoked
from amdgpu_sriov_wreg, which in turn is called from
gmc_v11_0_flush_gpu_tlb.

The [ BUG: Invalid wait context ] indicates that a thread is trying to
acquire a mutex while it is in a context that does not allow it to sleep
(like holding a spinlock).

Fixes the below:

[  253.013423] =============================
[  253.013434] [ BUG: Invalid wait context ]
[  253.013446] 6.12.0-amdstaging-drm-next-lol-050225 freebsd#14 Tainted: G     U     OE
[  253.013464] -----------------------------
[  253.013475] kworker/0:1/10 is trying to lock:
[  253.013487] ffff9f30542e3cf8 (&adev->virt.rlcg_reg_lock){+.+.}-{3:3}, at: amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.013815] other info that might help us debug this:
[  253.013827] context-{4:4}
[  253.013835] 3 locks held by kworker/0:1/10:
[  253.013847]  #0: ffff9f3040050f58 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680
[  253.013877]  freebsd#1: ffffb789c008be40 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680
[  253.013905]  freebsd#2: ffff9f3054281838 (&adev->gmc.invalidate_lock){+.+.}-{2:2}, at: gmc_v11_0_flush_gpu_tlb+0x198/0x4f0 [amdgpu]
[  253.014154] stack backtrace:
[  253.014164] CPU: 0 UID: 0 PID: 10 Comm: kworker/0:1 Tainted: G     U     OE      6.12.0-amdstaging-drm-next-lol-050225 freebsd#14
[  253.014189] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[  253.014203] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/18/2024
[  253.014224] Workqueue: events work_for_cpu_fn
[  253.014241] Call Trace:
[  253.014250]  <TASK>
[  253.014260]  dump_stack_lvl+0x9b/0xf0
[  253.014275]  dump_stack+0x10/0x20
[  253.014287]  __lock_acquire+0xa47/0x2810
[  253.014303]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.014321]  lock_acquire+0xd1/0x300
[  253.014333]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.014562]  ? __lock_acquire+0xa6b/0x2810
[  253.014578]  __mutex_lock+0x85/0xe20
[  253.014591]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.014782]  ? sched_clock_noinstr+0x9/0x10
[  253.014795]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.014808]  ? local_clock_noinstr+0xe/0xc0
[  253.014822]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.015012]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.015029]  mutex_lock_nested+0x1b/0x30
[  253.015044]  ? mutex_lock_nested+0x1b/0x30
[  253.015057]  amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.015249]  amdgpu_sriov_wreg+0xc5/0xd0 [amdgpu]
[  253.015435]  gmc_v11_0_flush_gpu_tlb+0x44b/0x4f0 [amdgpu]
[  253.015667]  gfx_v11_0_hw_init+0x499/0x29c0 [amdgpu]
[  253.015901]  ? __pfx_smu_v13_0_update_pcie_parameters+0x10/0x10 [amdgpu]
[  253.016159]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.016173]  ? smu_hw_init+0x18d/0x300 [amdgpu]
[  253.016403]  amdgpu_device_init+0x29ad/0x36a0 [amdgpu]
[  253.016614]  amdgpu_driver_load_kms+0x1a/0xc0 [amdgpu]
[  253.017057]  amdgpu_pci_probe+0x1c2/0x660 [amdgpu]
[  253.017493]  local_pci_probe+0x4b/0xb0
[  253.017746]  work_for_cpu_fn+0x1a/0x30
[  253.017995]  process_one_work+0x21e/0x680
[  253.018248]  worker_thread+0x190/0x330
[  253.018500]  ? __pfx_worker_thread+0x10/0x10
[  253.018746]  kthread+0xe7/0x120
[  253.018988]  ? __pfx_kthread+0x10/0x10
[  253.019231]  ret_from_fork+0x3c/0x60
[  253.019468]  ? __pfx_kthread+0x10/0x10
[  253.019701]  ret_from_fork_asm+0x1a/0x30
[  253.019939]  </TASK>

v2: s/spin_trylock/spin_lock_irqsave to be safe (Christian).

Fixes: e864180ee49b ("drm/amdgpu: Add lock around VF RLCG interface")
Cc: lin cao <[email protected]>
Cc: Jingwen Chen <[email protected]>
Cc: Victor Skvortsov <[email protected]>
Cc: Zhigang Luo <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Suggested-by: Alex Deucher <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Aug 16, 2025
Commit 7da55c27e76749b9 ("drm/amd/display: Remove incorrect FP context
start") removes the FP context protection of dml2_create(), and it said
"All the DC_FP_START/END should be used before call anything from DML2".

However, dml2_init()/dml21_init() are not protected from their callers,
causing such errors:

 do_fpu invoked from kernel context![freebsd#1]:
 CPU: 0 UID: 0 PID: 239 Comm: kworker/0:5 Not tainted 6.14.0-rc6+ freebsd#2
 Workqueue: events work_for_cpu_fn
 pc ffff80000319de80 ra ffff80000319de5c tp 900000010575c000 sp 900000010575f840
 a0 0000000000000000 a1 900000012f210130 a2 900000012f000000 a3 ffff80000357e268
 a4 ffff80000357e260 a5 900000012ea52cf0 a6 0000000400000004 a7 0000012c00001388
 t0 00001900000015e0 t1 ffff80000379d000 t2 0000000010624dd3 t3 0000006400000014
 t4 00000000000003e8 t5 0000005000000018 t6 0000000000000020 t7 0000000f00000064
 t8 000000000000002f u0 5f5e9200f8901912 s9 900000012d380010 s0 900000012ea51fd8
 s1 900000012f000000 s2 9000000109296000 s3 0000000000000001 s4 0000000000001fd8
 s5 0000000000000001 s6 ffff800003415000 s7 900000012d390000 s8 ffff800003211f80
    ra: ffff80000319de5c dml21_apply_soc_bb_overrides+0x3c/0x960 [amdgpu]
   ERA: ffff80000319de80 dml21_apply_soc_bb_overrides+0x60/0x960 [amdgpu]
  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
  PRMD: 00000004 (PPLV0 +PIE -PWE)
  EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
  ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
 ESTAT: 000f0000 [FPD] (IS= ECode=15 EsubCode=0)
  PRID: 0014d010 (Loongson-64bit, Loongson-3C6000/S)
 Process kworker/0:5 (pid: 239, threadinfo=00000000927eadc6, task=000000008fd31682)
 Stack : 00040dc000003164 0000000000000001 900000012f210130 900000012eabeeb8
         900000012f000000 ffff80000319fe48 900000012f210000 900000012f210130
         900000012f000000 900000012eabeeb8 0000000000000001 ffff8000031a0064
         900000010575f9f0 900000012f210130 900000012eac0000 900000012ea80000
         900000012f000000 ffff8000031cefc4 900000010575f9f0 ffff8000035859c0
         ffff800003414000 900000010575fa78 900000012f000000 ffff8000031b4c50
         0000000000000000 9000000101c9d700 9000000109c40000 5f5e9200f8901912
         900000012d3c4bd0 900000012d3c5000 ffff8000034aed18 900000012d380010
         900000012d3c4bd0 ffff800003414000 900000012d380000 ffff800002ea49dc
         0000000000000001 900000012d3c6000 00000000ffffe423 0000000000010000
         ...
 Call Trace:
 [<ffff80000319de80>] dml21_apply_soc_bb_overrides+0x60/0x960 [amdgpu]
 [<ffff80000319fe44>] dml21_init+0xa4/0x280 [amdgpu]
 [<ffff8000031a0060>] dml21_create+0x40/0x80 [amdgpu]
 [<ffff8000031cefc0>] dc_state_create+0x100/0x160 [amdgpu]
 [<ffff8000031b4c4c>] dc_create+0x44c/0x640 [amdgpu]
 [<ffff800002ea49d8>] amdgpu_dm_init+0x3f8/0x2060 [amdgpu]
 [<ffff800002ea6658>] dm_hw_init+0x18/0x60 [amdgpu]
 [<ffff800002b16738>] amdgpu_device_init+0x1938/0x27e0 [amdgpu]
 [<ffff800002b18e80>] amdgpu_driver_load_kms+0x20/0xa0 [amdgpu]
 [<ffff800002b0c8f0>] amdgpu_pci_probe+0x1b0/0x580 [amdgpu]
 [<900000000448eae4>] local_pci_probe+0x44/0xc0
 [<9000000003b02b18>] work_for_cpu_fn+0x18/0x40
 [<9000000003b05da0>] process_one_work+0x160/0x300
 [<9000000003b06718>] worker_thread+0x318/0x440
 [<9000000003b11b8c>] kthread+0x12c/0x220
 [<9000000003ac1484>] ret_from_kernel_thread+0x8/0xa4

Unfortunately, protecting dml2_init()/dml21_init() out of DML2 causes
"sleeping function called from invalid context", so protect them with
DC_FP_START() and DC_FP_END() inside.

Fixes: 7da55c27e767 ("drm/amd/display: Remove incorrect FP context start")
Cc: [email protected]
Signed-off-by: Huacai Chen <[email protected]>
Reviewed-by: Aurabindo Pillai <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Aug 23, 2025
 amdgpu_userq_mgr_fini+0x70c
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The issue was reproduced on NV10 using IGT pci_unplug test.
It is expected that `amdgpu_driver_postclose_kms()` is called prior to `amdgpu_drm_release()`.
However, the bug is that `amdgpu_fpriv` was freed in `amdgpu_driver_postclose_kms()`, and then
later accessed in `amdgpu_drm_release()` via a call to `amdgpu_userq_mgr_fini()`.
As a result, KASAN detected a use-after-free condition, as shown in the log below.
The proposed fix is to move the calls to `amdgpu_eviction_fence_destroy()` and
`amdgpu_userq_mgr_fini()` into `amdgpu_driver_postclose_kms()`, so they are invoked before
`amdgpu_fpriv` is freed.

This also ensures symmetry with the initialization path in `amdgpu_driver_open_kms()`,
where the following components are initialized:
- `amdgpu_userq_mgr_init()`
- `amdgpu_eviction_fence_init()`
- `amdgpu_ctx_mgr_init()`

Correspondingly, in `amdgpu_driver_postclose_kms()` we should clean up using:
- `amdgpu_userq_mgr_fini()`
- `amdgpu_eviction_fence_destroy()`
- `amdgpu_ctx_mgr_fini()`

This change eliminates the use-after-free and improves consistency in resource management between open and close paths.

[  +0.094367] ==================================================================
[  +0.000026] BUG: KASAN: slab-use-after-free in amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000866] Write of size 8 at addr ffff88811c068c60 by task amd_pci_unplug/1737
[  +0.000026] CPU: 3 UID: 0 PID: 1737 Comm: amd_pci_unplug Not tainted 6.14.0+ freebsd#2
[  +0.000008] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  dump_stack_lvl+0x76/0xa0
[  +0.000010]  print_report+0xce/0x600
[  +0.000009]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000790]  ? srso_return_thunk+0x5/0x5f
[  +0.000007]  ? kasan_complete_mode_report_info+0x76/0x200
[  +0.000008]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000684]  kasan_report+0xbe/0x110
[  +0.000007]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000601]  __asan_report_store8_noabort+0x17/0x30
[  +0.000007]  amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000801]  ? __pfx_amdgpu_userq_mgr_fini+0x10/0x10 [amdgpu]
[  +0.000819]  ? srso_return_thunk+0x5/0x5f
[  +0.000008]  amdgpu_drm_release+0xa3/0xe0 [amdgpu]
[  +0.000604]  __fput+0x354/0xa90
[  +0.000010]  __fput_sync+0x59/0x80
[  +0.000005]  __x64_sys_close+0x7d/0xe0
[  +0.000006]  x64_sys_call+0x2505/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000004]  ? kasan_record_aux_stack+0xae/0xd0
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? kmem_cache_free+0x398/0x580
[  +0.000006]  ? __fput+0x543/0xa90
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __fput+0x543/0xa90
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000007]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fpregs_assert_state_consistent+0x21/0xb0
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? syscall_exit_to_user_mode+0x4e/0x240
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? irqentry_exit+0x43/0x50
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? exc_page_fault+0x7c/0x110
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000005] RIP: 0033:0x7ffff7b14f67
[  +0.000005] Code: ff e8 0d 16 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 73 ba f7 ff
[  +0.000004] RSP: 002b:00007fffffffe358 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffff7b14f67
[  +0.000003] RDX: 0000000000000000 RSI: 00007ffff7f5755a RDI: 0000000000000003
[  +0.000003] RBP: 00007fffffffe380 R08: 0000555555568170 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffffffe5c8
[  +0.000003] R13: 00005555555552a9 R14: 0000555555557d48 R15: 00007ffff7ffd040
[  +0.000007]  </TASK>

[  +0.000286] Allocated by task 425 on cpu 11 at 29.751192s:
[  +0.000013]  kasan_save_stack+0x28/0x60
[  +0.000008]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000006]  __kasan_kmalloc+0xc1/0xd0
[  +0.000005]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000006]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000521]  drm_file_alloc+0x569/0x9a0
[  +0.000008]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000007]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x50b/0x10d0 [amdgpu]
[  +0.000482]  local_pci_probe+0xe7/0x1b0
[  +0.000008]  pci_device_probe+0x5bf/0x890
[  +0.000005]  really_probe+0x1fd/0x950
[  +0.000007]  __driver_probe_device+0x307/0x410
[  +0.000005]  driver_probe_device+0x4e/0x150
[  +0.000006]  __driver_attach+0x223/0x510
[  +0.000005]  bus_for_each_dev+0x102/0x1a0
[  +0.000006]  driver_attach+0x3d/0x60
[  +0.000005]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo]
[  +0.000008]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000007]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000286] Allocated by task 425 on cpu 11 at 29.751192s:
[  +0.000013]  kasan_save_stack+0x28/0x60
[  +0.000008]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000006]  __kasan_kmalloc+0xc1/0xd0
[  +0.000005]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000006]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000521]  drm_file_alloc+0x569/0x9a0
[  +0.000008]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000007]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x50b/0x10d0 [amdgpu]
[  +0.000482]  local_pci_probe+0xe7/0x1b0
[  +0.000008]  pci_device_probe+0x5bf/0x890
[  +0.000005]  really_probe+0x1fd/0x950
[  +0.000007]  __driver_probe_device+0x307/0x410
[  +0.000005]  driver_probe_device+0x4e/0x150
[  +0.000006]  __driver_attach+0x223/0x510
[  +0.000005]  bus_for_each_dev+0x102/0x1a0
[  +0.000006]  driver_attach+0x3d/0x60
[  +0.000005]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo]
[  +0.000008]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000007]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] The buggy address belongs to the object at ffff88811c068000
               which belongs to the cache kmalloc-rnd-01-4k of size 4096
[  +0.000016] The buggy address is located 3168 bytes inside of
               freed 4096-byte region [ffff88811c068000, ffff88811c069000)

[  +0.000022] The buggy address belongs to the physical page:
[  +0.000010] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88811c06e000 pfn:0x11c068
[  +0.000006] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  +0.000006] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  +0.000007] page_type: f5(slab)
[  +0.000007] raw: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000
[  +0.000005] raw: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000
[  +0.000005] head: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000003 ffffea0004701a01 ffffffffffffffff 0000000000000000
[  +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[  +0.000004] page dumped because: kasan: bad access detected

[  +0.000011] Memory state around the buggy address:
[  +0.000009]  ffff88811c068b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000012]  ffff88811c068b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] >ffff88811c068c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]                                                        ^
[  +0.000010]  ffff88811c068c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]  ffff88811c068d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] ==================================================================

Cc: Alex Deucher <[email protected]>
Cc: Christian König <[email protected]>
Cc: Lijo Lazar <[email protected]>
Cc: Jesse Zhang <[email protected]>
Cc: Arvind Yadav <[email protected]>

v2: drop amdgpu_drm_release() and assign drm_release()
    as the callback directly.(Alex)

Fixes: adba0929736a ("drm/amdgpu: Fix Illegal opcode in command stream Error")
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Vitaly Prosyak <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Aug 24, 2025
[  +0.000020] BUG: KASAN: slab-use-after-free in amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000817] Read of size 8 at addr ffff88812eec8c58 by task amd_pci_unplug/1733

[  +0.000027] CPU: 10 UID: 0 PID: 1733 Comm: amd_pci_unplug Tainted: G        W          6.14.0+ freebsd#2
[  +0.000009] Tainted: [W]=WARN
[  +0.000003] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  dump_stack_lvl+0x76/0xa0
[  +0.000011]  print_report+0xce/0x600
[  +0.000009]  ? srso_return_thunk+0x5/0x5f
[  +0.000006]  ? kasan_complete_mode_report_info+0x76/0x200
[  +0.000007]  ? kasan_addr_to_slab+0xd/0xb0
[  +0.000006]  ? amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000707]  kasan_report+0xbe/0x110
[  +0.000006]  ? amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000541]  __asan_report_load8_noabort+0x14/0x30
[  +0.000005]  amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000535]  ? stop_cpsch+0x396/0x600 [amdgpu]
[  +0.000556]  ? stop_cpsch+0x429/0x600 [amdgpu]
[  +0.000536]  ? __pfx_amdgpu_userq_suspend+0x10/0x10 [amdgpu]
[  +0.000536]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? kgd2kfd_suspend+0x132/0x1d0 [amdgpu]
[  +0.000542]  amdgpu_device_fini_hw+0x581/0xe90 [amdgpu]
[  +0.000485]  ? down_write+0xbb/0x140
[  +0.000007]  ? __mutex_unlock_slowpath.constprop.0+0x317/0x360
[  +0.000005]  ? __pfx_amdgpu_device_fini_hw+0x10/0x10 [amdgpu]
[  +0.000482]  ? __kasan_check_write+0x14/0x30
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? up_write+0x55/0xb0
[  +0.000007]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? blocking_notifier_chain_unregister+0x6c/0xc0
[  +0.000008]  amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
[  +0.000484]  amdgpu_pci_remove+0x93/0x130 [amdgpu]
[  +0.000482]  pci_device_remove+0xae/0x1e0
[  +0.000008]  device_remove+0xc7/0x180
[  +0.000008]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000007]  device_release_driver+0x12/0x20
[  +0.000004]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000005]  remove_store+0xd7/0xf0
[  +0.000005]  ? __pfx_remove_store+0x10/0x10
[  +0.000006]  ? __pfx__copy_from_iter+0x10/0x10
[  +0.000006]  ? __pfx_dev_attr_store+0x10/0x10
[  +0.000006]  dev_attr_store+0x3f/0x80
[  +0.000006]  sysfs_kf_write+0x125/0x1d0
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? __kasan_check_write+0x14/0x30
[  +0.000005]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000005]  ? rw_verify_area+0x70/0x420
[  +0.000005]  ? __pfx_kernfs_fop_write_iter+0x10/0x10
[  +0.000006]  vfs_write+0x90d/0xe70
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? __pfx_vfs_write+0x10/0x10
[  +0.000004]  ? local_clock+0x15/0x30
[  +0.000008]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_slab_free+0x5f/0x80
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fdget_pos+0x1d3/0x500
[  +0.000007]  ksys_write+0x119/0x220
[  +0.000005]  ? putname+0x1c/0x30
[  +0.000006]  ? __pfx_ksys_write+0x10/0x10
[  +0.000007]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __pfx___x64_sys_openat+0x10/0x10
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fpregs_assert_state_consistent+0x21/0xb0
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? syscall_exit_to_user_mode+0x4e/0x240
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? irqentry_exit+0x43/0x50
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? exc_page_fault+0x7c/0x110
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000006] RIP: 0033:0x7480c0b14887
[  +0.000005] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  +0.000005] RSP: 002b:00007fff142b0058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007480c0b14887
[  +0.000003] RDX: 0000000000000001 RSI: 00007480c0e7365a RDI: 0000000000000004
[  +0.000003] RBP: 00007fff142b0080 R08: 0000563b2e73c170 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff142b02f8
[  +0.000003] R13: 0000563b159a72a9 R14: 0000563b159a9d48 R15: 00007480c0f19040
[  +0.000008]  </TASK>

[  +0.000445] Allocated by task 427 on cpu 5 at 29.342331s:
[  +0.000011]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000005]  __kasan_kmalloc+0xc1/0xd0
[  +0.000006]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000007]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000493]  drm_file_alloc+0x569/0x9a0
[  +0.000007]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000006]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x510/0x10c0 [amdgpu]
[  +0.000483]  local_pci_probe+0xe7/0x1b0
[  +0.000006]  pci_device_probe+0x5bf/0x890
[  +0.000006]  really_probe+0x1fd/0x950
[  +0.000005]  __driver_probe_device+0x307/0x410
[  +0.000006]  driver_probe_device+0x4e/0x150
[  +0.000005]  __driver_attach+0x223/0x510
[  +0.000006]  bus_for_each_dev+0x102/0x1a0
[  +0.000005]  driver_attach+0x3d/0x60
[  +0.000006]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  rfcomm_dlc_clear_state+0x69/0x220 [rfcomm]
[  +0.000011]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000006]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] Freed by task 1733 on cpu 5 at 59.907086s:
[  +0.000011]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000005]  kasan_save_free_info+0x3b/0x60
[  +0.000005]  __kasan_slab_free+0x54/0x80
[  +0.000006]  kfree+0x127/0x470
[  +0.000006]  amdgpu_driver_postclose_kms+0x455/0x760 [amdgpu]
[  +0.000493]  drm_file_free.part.0+0x5b1/0xba0
[  +0.000006]  drm_file_free+0x13/0x30
[  +0.000006]  drm_client_release+0x1c4/0x2b0
[  +0.000006]  drm_fbdev_ttm_fb_destroy+0xd2/0x120 [drm_ttm_helper]
[  +0.000007]  put_fb_info+0x97/0xe0
[  +0.000007]  unregister_framebuffer+0x197/0x380
[  +0.000005]  drm_fb_helper_unregister_info+0x94/0x100
[  +0.000005]  drm_fbdev_client_unregister+0x3c/0x80
[  +0.000007]  drm_client_dev_unregister+0x144/0x330
[  +0.000006]  drm_dev_unregister+0x49/0x1b0
[  +0.000006]  drm_dev_unplug+0x4c/0xd0
[  +0.000006]  amdgpu_pci_remove+0x58/0x130 [amdgpu]
[  +0.000484]  pci_device_remove+0xae/0x1e0
[  +0.000008]  device_remove+0xc7/0x180
[  +0.000007]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000006]  device_release_driver+0x12/0x20
[  +0.000007]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000006]  remove_store+0xd7/0xf0
[  +0.000006]  dev_attr_store+0x3f/0x80
[  +0.000005]  sysfs_kf_write+0x125/0x1d0
[  +0.000006]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000006]  vfs_write+0x90d/0xe70
[  +0.000006]  ksys_write+0x119/0x220
[  +0.000006]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000005]  do_syscall_64+0x7c/0x170
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000012] The buggy address belongs to the object at ffff88812eec8000
               which belongs to the cache kmalloc-rnd-07-4k of size 4096
[  +0.000016] The buggy address is located 3160 bytes inside of
               freed 4096-byte region [ffff88812eec8000, ffff88812eec9000)

[  +0.000023] The buggy address belongs to the physical page:
[  +0.000009] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12eec8
[  +0.000007] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  +0.000005] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  +0.000007] page_type: f5(slab)
[  +0.000008] raw: 0017ffffc0000040 ffff888100054500 dead000000000122 0000000000000000
[  +0.000005] raw: 0000000000000000 0000000080040004 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000040 ffff888100054500 dead000000000122 0000000000000000
[  +0.000005] head: 0000000000000000 0000000080040004 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000003 ffffea0004bbb201 ffffffffffffffff 0000000000000000
[  +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[  +0.000005] page dumped because: kasan: bad access detected

[  +0.000010] Memory state around the buggy address:
[  +0.000009]  ffff88812eec8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000012]  ffff88812eec8b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] >ffff88812eec8c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]                                                     ^
[  +0.000010]  ffff88812eec8c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]  ffff88812eec8d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] ==================================================================

The use-after-free occurs because a delayed work item (`suspend_work`) may still
be pending or running when resources it accesses are freed during device removal
or file close. The previous code used `flush_work(&fpriv->evf_mgr.suspend_work.work)`,
which does not wait for delayed work that has not yet started. As a result, the
delayed work could run after its memory was freed, causing a use-after-free.
By switching to `flush_delayed_work(&fpriv->evf_mgr.suspend_work)`, we ensure that
the kernel waits for both queued and delayed work to finish before
freeing memory, closing this race.

Fixes: adba0929736a ("drm/amdgpu: Fix Illegal opcode in command stream Error")
Signed-off-by: Vitaly Prosyak <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
dumbbell pushed a commit to dumbbell/drm-kmod that referenced this pull request Sep 17, 2025
Bos can be put with multiple unrelated dma-resv locks held. But
imported bos attempt to grab the bo dma-resv during dma-buf detach
that typically happens during cleanup. That leads to lockde splats
similar to the below and a potential ABBA deadlock.

Fix this by always taking the delayed workqueue cleanup path for
imported bos.

Requesting stable fixes from when the Xe driver was introduced,
since its usage of drm_exec and wide vm dma_resvs appear to be
the first reliable trigger of this.

[22982.116427] ============================================
[22982.116428] WARNING: possible recursive locking detected
[22982.116429] 6.10.0-rc2+ freebsd#10 Tainted: G     U  W
[22982.116430] --------------------------------------------
[22982.116430] glxgears:sh0/5785 is trying to acquire lock:
[22982.116431] ffff8c2bafa539a8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: dma_buf_detach+0x3b/0xf0
[22982.116438]
               but task is already holding lock:
[22982.116438] ffff8c2d9aba6da8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[22982.116442]
               other info that might help us debug this:
[22982.116442]  Possible unsafe locking scenario:

[22982.116443]        CPU0
[22982.116444]        ----
[22982.116444]   lock(reservation_ww_class_mutex);
[22982.116445]   lock(reservation_ww_class_mutex);
[22982.116447]
                *** DEADLOCK ***

[22982.116447]  May be due to missing lock nesting notation

[22982.116448] 5 locks held by glxgears:sh0/5785:
[22982.116449]  #0: ffff8c2d9aba58c8 (&xef->vm.lock){+.+.}-{3:3}, at: xe_file_close+0xde/0x1c0 [xe]
[22982.116507]  freebsd#1: ffff8c2e28cc8480 (&vm->lock){++++}-{3:3}, at: xe_vm_close_and_put+0x161/0x9b0 [xe]
[22982.116578]  freebsd#2: ffff8c2e31982970 (&val->lock){.+.+}-{3:3}, at: xe_validation_ctx_init+0x6d/0x70 [xe]
[22982.116647]  freebsd#3: ffffacdc469478a8 (reservation_ww_class_acquire){+.+.}-{0:0}, at: xe_vma_destroy_unlocked+0x7f/0xe0 [xe]
[22982.116716]  freebsd#4: ffff8c2d9aba6da8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[22982.116719]
               stack backtrace:
[22982.116720] CPU: 8 PID: 5785 Comm: glxgears:sh0 Tainted: G     U  W          6.10.0-rc2+ freebsd#10
[22982.116721] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
[22982.116723] Call Trace:
[22982.116724]  <TASK>
[22982.116725]  dump_stack_lvl+0x77/0xb0
[22982.116727]  __lock_acquire+0x1232/0x2160
[22982.116730]  lock_acquire+0xcb/0x2d0
[22982.116732]  ? dma_buf_detach+0x3b/0xf0
[22982.116734]  ? __lock_acquire+0x417/0x2160
[22982.116736]  __ww_mutex_lock.constprop.0+0xd0/0x13b0
[22982.116738]  ? dma_buf_detach+0x3b/0xf0
[22982.116741]  ? dma_buf_detach+0x3b/0xf0
[22982.116743]  ? ww_mutex_lock+0x2b/0x90
[22982.116745]  ww_mutex_lock+0x2b/0x90
[22982.116747]  dma_buf_detach+0x3b/0xf0
[22982.116749]  drm_prime_gem_destroy+0x2f/0x40 [drm]
[22982.116775]  xe_ttm_bo_destroy+0x32/0x220 [xe]
[22982.116818]  ? __mutex_unlock_slowpath+0x3a/0x290
[22982.116821]  drm_exec_unlock_all+0xa1/0xd0 [drm_exec]
[22982.116823]  drm_exec_fini+0x12/0xb0 [drm_exec]
[22982.116824]  xe_validation_ctx_fini+0x15/0x40 [xe]
[22982.116892]  xe_vma_destroy_unlocked+0xb1/0xe0 [xe]
[22982.116959]  xe_vm_close_and_put+0x41a/0x9b0 [xe]
[22982.117025]  ? xa_find+0xe3/0x1e0
[22982.117028]  xe_file_close+0x10a/0x1c0 [xe]
[22982.117074]  drm_file_free+0x22a/0x280 [drm]
[22982.117099]  drm_release_noglobal+0x22/0x70 [drm]
[22982.117119]  __fput+0xf1/0x2d0
[22982.117122]  task_work_run+0x59/0x90
[22982.117125]  do_exit+0x330/0xb40
[22982.117127]  do_group_exit+0x36/0xa0
[22982.117129]  get_signal+0xbd2/0xbe0
[22982.117131]  arch_do_signal_or_restart+0x3e/0x240
[22982.117134]  syscall_exit_to_user_mode+0x1e7/0x290
[22982.117137]  do_syscall_64+0xa1/0x180
[22982.117139]  ? lock_acquire+0xcb/0x2d0
[22982.117140]  ? __set_task_comm+0x28/0x1e0
[22982.117141]  ? find_held_lock+0x2b/0x80
[22982.117144]  ? __set_task_comm+0xe1/0x1e0
[22982.117145]  ? lock_release+0xca/0x290
[22982.117147]  ? __do_sys_prctl+0x245/0xab0
[22982.117149]  ? lockdep_hardirqs_on_prepare+0xde/0x190
[22982.117150]  ? syscall_exit_to_user_mode+0xb0/0x290
[22982.117152]  ? do_syscall_64+0xa1/0x180
[22982.117154]  ? __lock_acquire+0x417/0x2160
[22982.117155]  ? reacquire_held_locks+0xd1/0x1f0
[22982.117156]  ? do_user_addr_fault+0x30c/0x790
[22982.117158]  ? lock_acquire+0xcb/0x2d0
[22982.117160]  ? find_held_lock+0x2b/0x80
[22982.117162]  ? do_user_addr_fault+0x357/0x790
[22982.117163]  ? lock_release+0xca/0x290
[22982.117164]  ? do_user_addr_fault+0x361/0x790
[22982.117166]  ? trace_hardirqs_off+0x4b/0xc0
[22982.117168]  ? clear_bhb_loop+0x45/0xa0
[22982.117170]  ? clear_bhb_loop+0x45/0xa0
[22982.117172]  ? clear_bhb_loop+0x45/0xa0
[22982.117174]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[22982.117176] RIP: 0033:0x7f943d267169
[22982.117192] Code: Unable to access opcode bytes at 0x7f943d26713f.
[22982.117193] RSP: 002b:00007f9430bffc80 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[22982.117195] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f943d267169
[22982.117196] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005622f89579d0
[22982.117197] RBP: 00007f9430bffcb0 R08: 0000000000000000 R09: 00000000ffffffff
[22982.117198] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[22982.117199] R13: 0000000000000000 R14: 0000000000000000 R15: 00005622f89579d0
[22982.117202]  </TASK>

Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Christian König <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: <[email protected]> # v6.8+
Signed-off-by: Thomas Hellström <[email protected]>
Reviewed-by: Matthew Brost <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 20, 2025
…void Priority Inversion in SRIOV

RLCG Register Access is a way for virtual functions to safely access GPU
registers in a virtualized environment., including TLB flushes and
register reads. When multiple threads or VFs try to access the same
registers simultaneously, it can lead to race conditions. By using the
RLCG interface, the driver can serialize access to the registers. This
means that only one thread can access the registers at a time,
preventing conflicts and ensuring that operations are performed
correctly. Additionally, when a low-priority task holds a mutex that a
high-priority task needs, ie., If a thread holding a spinlock tries to
acquire a mutex, it can lead to priority inversion. register access in
amdgpu_virt_rlcg_reg_rw especially in a fast code path is critical.

The call stack shows that the function amdgpu_virt_rlcg_reg_rw is being
called, which attempts to acquire the mutex. This function is invoked
from amdgpu_sriov_wreg, which in turn is called from
gmc_v11_0_flush_gpu_tlb.

The [ BUG: Invalid wait context ] indicates that a thread is trying to
acquire a mutex while it is in a context that does not allow it to sleep
(like holding a spinlock).

Fixes the below:

[  253.013423] =============================
[  253.013434] [ BUG: Invalid wait context ]
[  253.013446] 6.12.0-amdstaging-drm-next-lol-050225 freebsd#14 Tainted: G     U     OE
[  253.013464] -----------------------------
[  253.013475] kworker/0:1/10 is trying to lock:
[  253.013487] ffff9f30542e3cf8 (&adev->virt.rlcg_reg_lock){+.+.}-{3:3}, at: amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.013815] other info that might help us debug this:
[  253.013827] context-{4:4}
[  253.013835] 3 locks held by kworker/0:1/10:
[  253.013847]  #0: ffff9f3040050f58 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680
[  253.013877]  freebsd#1: ffffb789c008be40 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680
[  253.013905]  freebsd#2: ffff9f3054281838 (&adev->gmc.invalidate_lock){+.+.}-{2:2}, at: gmc_v11_0_flush_gpu_tlb+0x198/0x4f0 [amdgpu]
[  253.014154] stack backtrace:
[  253.014164] CPU: 0 UID: 0 PID: 10 Comm: kworker/0:1 Tainted: G     U     OE      6.12.0-amdstaging-drm-next-lol-050225 freebsd#14
[  253.014189] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[  253.014203] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/18/2024
[  253.014224] Workqueue: events work_for_cpu_fn
[  253.014241] Call Trace:
[  253.014250]  <TASK>
[  253.014260]  dump_stack_lvl+0x9b/0xf0
[  253.014275]  dump_stack+0x10/0x20
[  253.014287]  __lock_acquire+0xa47/0x2810
[  253.014303]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.014321]  lock_acquire+0xd1/0x300
[  253.014333]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.014562]  ? __lock_acquire+0xa6b/0x2810
[  253.014578]  __mutex_lock+0x85/0xe20
[  253.014591]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.014782]  ? sched_clock_noinstr+0x9/0x10
[  253.014795]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.014808]  ? local_clock_noinstr+0xe/0xc0
[  253.014822]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.015012]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.015029]  mutex_lock_nested+0x1b/0x30
[  253.015044]  ? mutex_lock_nested+0x1b/0x30
[  253.015057]  amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.015249]  amdgpu_sriov_wreg+0xc5/0xd0 [amdgpu]
[  253.015435]  gmc_v11_0_flush_gpu_tlb+0x44b/0x4f0 [amdgpu]
[  253.015667]  gfx_v11_0_hw_init+0x499/0x29c0 [amdgpu]
[  253.015901]  ? __pfx_smu_v13_0_update_pcie_parameters+0x10/0x10 [amdgpu]
[  253.016159]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.016173]  ? smu_hw_init+0x18d/0x300 [amdgpu]
[  253.016403]  amdgpu_device_init+0x29ad/0x36a0 [amdgpu]
[  253.016614]  amdgpu_driver_load_kms+0x1a/0xc0 [amdgpu]
[  253.017057]  amdgpu_pci_probe+0x1c2/0x660 [amdgpu]
[  253.017493]  local_pci_probe+0x4b/0xb0
[  253.017746]  work_for_cpu_fn+0x1a/0x30
[  253.017995]  process_one_work+0x21e/0x680
[  253.018248]  worker_thread+0x190/0x330
[  253.018500]  ? __pfx_worker_thread+0x10/0x10
[  253.018746]  kthread+0xe7/0x120
[  253.018988]  ? __pfx_kthread+0x10/0x10
[  253.019231]  ret_from_fork+0x3c/0x60
[  253.019468]  ? __pfx_kthread+0x10/0x10
[  253.019701]  ret_from_fork_asm+0x1a/0x30
[  253.019939]  </TASK>

v2: s/spin_trylock/spin_lock_irqsave to be safe (Christian).

Fixes: e864180ee49b ("drm/amdgpu: Add lock around VF RLCG interface")
Cc: lin cao <[email protected]>
Cc: Jingwen Chen <[email protected]>
Cc: Victor Skvortsov <[email protected]>
Cc: Zhigang Luo <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Suggested-by: Alex Deucher <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 20, 2025
Commit 7da55c27e76749b9 ("drm/amd/display: Remove incorrect FP context
start") removes the FP context protection of dml2_create(), and it said
"All the DC_FP_START/END should be used before call anything from DML2".

However, dml2_init()/dml21_init() are not protected from their callers,
causing such errors:

 do_fpu invoked from kernel context![freebsd#1]:
 CPU: 0 UID: 0 PID: 239 Comm: kworker/0:5 Not tainted 6.14.0-rc6+ freebsd#2
 Workqueue: events work_for_cpu_fn
 pc ffff80000319de80 ra ffff80000319de5c tp 900000010575c000 sp 900000010575f840
 a0 0000000000000000 a1 900000012f210130 a2 900000012f000000 a3 ffff80000357e268
 a4 ffff80000357e260 a5 900000012ea52cf0 a6 0000000400000004 a7 0000012c00001388
 t0 00001900000015e0 t1 ffff80000379d000 t2 0000000010624dd3 t3 0000006400000014
 t4 00000000000003e8 t5 0000005000000018 t6 0000000000000020 t7 0000000f00000064
 t8 000000000000002f u0 5f5e9200f8901912 s9 900000012d380010 s0 900000012ea51fd8
 s1 900000012f000000 s2 9000000109296000 s3 0000000000000001 s4 0000000000001fd8
 s5 0000000000000001 s6 ffff800003415000 s7 900000012d390000 s8 ffff800003211f80
    ra: ffff80000319de5c dml21_apply_soc_bb_overrides+0x3c/0x960 [amdgpu]
   ERA: ffff80000319de80 dml21_apply_soc_bb_overrides+0x60/0x960 [amdgpu]
  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
  PRMD: 00000004 (PPLV0 +PIE -PWE)
  EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
  ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
 ESTAT: 000f0000 [FPD] (IS= ECode=15 EsubCode=0)
  PRID: 0014d010 (Loongson-64bit, Loongson-3C6000/S)
 Process kworker/0:5 (pid: 239, threadinfo=00000000927eadc6, task=000000008fd31682)
 Stack : 00040dc000003164 0000000000000001 900000012f210130 900000012eabeeb8
         900000012f000000 ffff80000319fe48 900000012f210000 900000012f210130
         900000012f000000 900000012eabeeb8 0000000000000001 ffff8000031a0064
         900000010575f9f0 900000012f210130 900000012eac0000 900000012ea80000
         900000012f000000 ffff8000031cefc4 900000010575f9f0 ffff8000035859c0
         ffff800003414000 900000010575fa78 900000012f000000 ffff8000031b4c50
         0000000000000000 9000000101c9d700 9000000109c40000 5f5e9200f8901912
         900000012d3c4bd0 900000012d3c5000 ffff8000034aed18 900000012d380010
         900000012d3c4bd0 ffff800003414000 900000012d380000 ffff800002ea49dc
         0000000000000001 900000012d3c6000 00000000ffffe423 0000000000010000
         ...
 Call Trace:
 [<ffff80000319de80>] dml21_apply_soc_bb_overrides+0x60/0x960 [amdgpu]
 [<ffff80000319fe44>] dml21_init+0xa4/0x280 [amdgpu]
 [<ffff8000031a0060>] dml21_create+0x40/0x80 [amdgpu]
 [<ffff8000031cefc0>] dc_state_create+0x100/0x160 [amdgpu]
 [<ffff8000031b4c4c>] dc_create+0x44c/0x640 [amdgpu]
 [<ffff800002ea49d8>] amdgpu_dm_init+0x3f8/0x2060 [amdgpu]
 [<ffff800002ea6658>] dm_hw_init+0x18/0x60 [amdgpu]
 [<ffff800002b16738>] amdgpu_device_init+0x1938/0x27e0 [amdgpu]
 [<ffff800002b18e80>] amdgpu_driver_load_kms+0x20/0xa0 [amdgpu]
 [<ffff800002b0c8f0>] amdgpu_pci_probe+0x1b0/0x580 [amdgpu]
 [<900000000448eae4>] local_pci_probe+0x44/0xc0
 [<9000000003b02b18>] work_for_cpu_fn+0x18/0x40
 [<9000000003b05da0>] process_one_work+0x160/0x300
 [<9000000003b06718>] worker_thread+0x318/0x440
 [<9000000003b11b8c>] kthread+0x12c/0x220
 [<9000000003ac1484>] ret_from_kernel_thread+0x8/0xa4

Unfortunately, protecting dml2_init()/dml21_init() out of DML2 causes
"sleeping function called from invalid context", so protect them with
DC_FP_START() and DC_FP_END() inside.

Fixes: 7da55c27e767 ("drm/amd/display: Remove incorrect FP context start")
Cc: [email protected]
Signed-off-by: Huacai Chen <[email protected]>
Reviewed-by: Aurabindo Pillai <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 20, 2025
 amdgpu_userq_mgr_fini+0x70c
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The issue was reproduced on NV10 using IGT pci_unplug test.
It is expected that `amdgpu_driver_postclose_kms()` is called prior to `amdgpu_drm_release()`.
However, the bug is that `amdgpu_fpriv` was freed in `amdgpu_driver_postclose_kms()`, and then
later accessed in `amdgpu_drm_release()` via a call to `amdgpu_userq_mgr_fini()`.
As a result, KASAN detected a use-after-free condition, as shown in the log below.
The proposed fix is to move the calls to `amdgpu_eviction_fence_destroy()` and
`amdgpu_userq_mgr_fini()` into `amdgpu_driver_postclose_kms()`, so they are invoked before
`amdgpu_fpriv` is freed.

This also ensures symmetry with the initialization path in `amdgpu_driver_open_kms()`,
where the following components are initialized:
- `amdgpu_userq_mgr_init()`
- `amdgpu_eviction_fence_init()`
- `amdgpu_ctx_mgr_init()`

Correspondingly, in `amdgpu_driver_postclose_kms()` we should clean up using:
- `amdgpu_userq_mgr_fini()`
- `amdgpu_eviction_fence_destroy()`
- `amdgpu_ctx_mgr_fini()`

This change eliminates the use-after-free and improves consistency in resource management between open and close paths.

[  +0.094367] ==================================================================
[  +0.000026] BUG: KASAN: slab-use-after-free in amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000866] Write of size 8 at addr ffff88811c068c60 by task amd_pci_unplug/1737
[  +0.000026] CPU: 3 UID: 0 PID: 1737 Comm: amd_pci_unplug Not tainted 6.14.0+ freebsd#2
[  +0.000008] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  dump_stack_lvl+0x76/0xa0
[  +0.000010]  print_report+0xce/0x600
[  +0.000009]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000790]  ? srso_return_thunk+0x5/0x5f
[  +0.000007]  ? kasan_complete_mode_report_info+0x76/0x200
[  +0.000008]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000684]  kasan_report+0xbe/0x110
[  +0.000007]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000601]  __asan_report_store8_noabort+0x17/0x30
[  +0.000007]  amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000801]  ? __pfx_amdgpu_userq_mgr_fini+0x10/0x10 [amdgpu]
[  +0.000819]  ? srso_return_thunk+0x5/0x5f
[  +0.000008]  amdgpu_drm_release+0xa3/0xe0 [amdgpu]
[  +0.000604]  __fput+0x354/0xa90
[  +0.000010]  __fput_sync+0x59/0x80
[  +0.000005]  __x64_sys_close+0x7d/0xe0
[  +0.000006]  x64_sys_call+0x2505/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000004]  ? kasan_record_aux_stack+0xae/0xd0
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? kmem_cache_free+0x398/0x580
[  +0.000006]  ? __fput+0x543/0xa90
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __fput+0x543/0xa90
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000007]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fpregs_assert_state_consistent+0x21/0xb0
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? syscall_exit_to_user_mode+0x4e/0x240
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? irqentry_exit+0x43/0x50
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? exc_page_fault+0x7c/0x110
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000005] RIP: 0033:0x7ffff7b14f67
[  +0.000005] Code: ff e8 0d 16 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 73 ba f7 ff
[  +0.000004] RSP: 002b:00007fffffffe358 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffff7b14f67
[  +0.000003] RDX: 0000000000000000 RSI: 00007ffff7f5755a RDI: 0000000000000003
[  +0.000003] RBP: 00007fffffffe380 R08: 0000555555568170 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffffffe5c8
[  +0.000003] R13: 00005555555552a9 R14: 0000555555557d48 R15: 00007ffff7ffd040
[  +0.000007]  </TASK>

[  +0.000286] Allocated by task 425 on cpu 11 at 29.751192s:
[  +0.000013]  kasan_save_stack+0x28/0x60
[  +0.000008]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000006]  __kasan_kmalloc+0xc1/0xd0
[  +0.000005]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000006]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000521]  drm_file_alloc+0x569/0x9a0
[  +0.000008]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000007]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x50b/0x10d0 [amdgpu]
[  +0.000482]  local_pci_probe+0xe7/0x1b0
[  +0.000008]  pci_device_probe+0x5bf/0x890
[  +0.000005]  really_probe+0x1fd/0x950
[  +0.000007]  __driver_probe_device+0x307/0x410
[  +0.000005]  driver_probe_device+0x4e/0x150
[  +0.000006]  __driver_attach+0x223/0x510
[  +0.000005]  bus_for_each_dev+0x102/0x1a0
[  +0.000006]  driver_attach+0x3d/0x60
[  +0.000005]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo]
[  +0.000008]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000007]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000286] Allocated by task 425 on cpu 11 at 29.751192s:
[  +0.000013]  kasan_save_stack+0x28/0x60
[  +0.000008]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000006]  __kasan_kmalloc+0xc1/0xd0
[  +0.000005]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000006]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000521]  drm_file_alloc+0x569/0x9a0
[  +0.000008]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000007]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x50b/0x10d0 [amdgpu]
[  +0.000482]  local_pci_probe+0xe7/0x1b0
[  +0.000008]  pci_device_probe+0x5bf/0x890
[  +0.000005]  really_probe+0x1fd/0x950
[  +0.000007]  __driver_probe_device+0x307/0x410
[  +0.000005]  driver_probe_device+0x4e/0x150
[  +0.000006]  __driver_attach+0x223/0x510
[  +0.000005]  bus_for_each_dev+0x102/0x1a0
[  +0.000006]  driver_attach+0x3d/0x60
[  +0.000005]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo]
[  +0.000008]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000007]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] The buggy address belongs to the object at ffff88811c068000
               which belongs to the cache kmalloc-rnd-01-4k of size 4096
[  +0.000016] The buggy address is located 3168 bytes inside of
               freed 4096-byte region [ffff88811c068000, ffff88811c069000)

[  +0.000022] The buggy address belongs to the physical page:
[  +0.000010] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88811c06e000 pfn:0x11c068
[  +0.000006] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  +0.000006] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  +0.000007] page_type: f5(slab)
[  +0.000007] raw: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000
[  +0.000005] raw: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000
[  +0.000005] head: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000003 ffffea0004701a01 ffffffffffffffff 0000000000000000
[  +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[  +0.000004] page dumped because: kasan: bad access detected

[  +0.000011] Memory state around the buggy address:
[  +0.000009]  ffff88811c068b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000012]  ffff88811c068b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] >ffff88811c068c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]                                                        ^
[  +0.000010]  ffff88811c068c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]  ffff88811c068d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] ==================================================================

Cc: Alex Deucher <[email protected]>
Cc: Christian König <[email protected]>
Cc: Lijo Lazar <[email protected]>
Cc: Jesse Zhang <[email protected]>
Cc: Arvind Yadav <[email protected]>

v2: drop amdgpu_drm_release() and assign drm_release()
    as the callback directly.(Alex)

Fixes: adba0929736a ("drm/amdgpu: Fix Illegal opcode in command stream Error")
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Vitaly Prosyak <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 20, 2025
[  +0.000020] BUG: KASAN: slab-use-after-free in amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000817] Read of size 8 at addr ffff88812eec8c58 by task amd_pci_unplug/1733

[  +0.000027] CPU: 10 UID: 0 PID: 1733 Comm: amd_pci_unplug Tainted: G        W          6.14.0+ freebsd#2
[  +0.000009] Tainted: [W]=WARN
[  +0.000003] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  dump_stack_lvl+0x76/0xa0
[  +0.000011]  print_report+0xce/0x600
[  +0.000009]  ? srso_return_thunk+0x5/0x5f
[  +0.000006]  ? kasan_complete_mode_report_info+0x76/0x200
[  +0.000007]  ? kasan_addr_to_slab+0xd/0xb0
[  +0.000006]  ? amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000707]  kasan_report+0xbe/0x110
[  +0.000006]  ? amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000541]  __asan_report_load8_noabort+0x14/0x30
[  +0.000005]  amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000535]  ? stop_cpsch+0x396/0x600 [amdgpu]
[  +0.000556]  ? stop_cpsch+0x429/0x600 [amdgpu]
[  +0.000536]  ? __pfx_amdgpu_userq_suspend+0x10/0x10 [amdgpu]
[  +0.000536]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? kgd2kfd_suspend+0x132/0x1d0 [amdgpu]
[  +0.000542]  amdgpu_device_fini_hw+0x581/0xe90 [amdgpu]
[  +0.000485]  ? down_write+0xbb/0x140
[  +0.000007]  ? __mutex_unlock_slowpath.constprop.0+0x317/0x360
[  +0.000005]  ? __pfx_amdgpu_device_fini_hw+0x10/0x10 [amdgpu]
[  +0.000482]  ? __kasan_check_write+0x14/0x30
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? up_write+0x55/0xb0
[  +0.000007]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? blocking_notifier_chain_unregister+0x6c/0xc0
[  +0.000008]  amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
[  +0.000484]  amdgpu_pci_remove+0x93/0x130 [amdgpu]
[  +0.000482]  pci_device_remove+0xae/0x1e0
[  +0.000008]  device_remove+0xc7/0x180
[  +0.000008]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000007]  device_release_driver+0x12/0x20
[  +0.000004]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000005]  remove_store+0xd7/0xf0
[  +0.000005]  ? __pfx_remove_store+0x10/0x10
[  +0.000006]  ? __pfx__copy_from_iter+0x10/0x10
[  +0.000006]  ? __pfx_dev_attr_store+0x10/0x10
[  +0.000006]  dev_attr_store+0x3f/0x80
[  +0.000006]  sysfs_kf_write+0x125/0x1d0
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? __kasan_check_write+0x14/0x30
[  +0.000005]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000005]  ? rw_verify_area+0x70/0x420
[  +0.000005]  ? __pfx_kernfs_fop_write_iter+0x10/0x10
[  +0.000006]  vfs_write+0x90d/0xe70
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? __pfx_vfs_write+0x10/0x10
[  +0.000004]  ? local_clock+0x15/0x30
[  +0.000008]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_slab_free+0x5f/0x80
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fdget_pos+0x1d3/0x500
[  +0.000007]  ksys_write+0x119/0x220
[  +0.000005]  ? putname+0x1c/0x30
[  +0.000006]  ? __pfx_ksys_write+0x10/0x10
[  +0.000007]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __pfx___x64_sys_openat+0x10/0x10
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fpregs_assert_state_consistent+0x21/0xb0
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? syscall_exit_to_user_mode+0x4e/0x240
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? irqentry_exit+0x43/0x50
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? exc_page_fault+0x7c/0x110
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000006] RIP: 0033:0x7480c0b14887
[  +0.000005] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  +0.000005] RSP: 002b:00007fff142b0058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007480c0b14887
[  +0.000003] RDX: 0000000000000001 RSI: 00007480c0e7365a RDI: 0000000000000004
[  +0.000003] RBP: 00007fff142b0080 R08: 0000563b2e73c170 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff142b02f8
[  +0.000003] R13: 0000563b159a72a9 R14: 0000563b159a9d48 R15: 00007480c0f19040
[  +0.000008]  </TASK>

[  +0.000445] Allocated by task 427 on cpu 5 at 29.342331s:
[  +0.000011]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000005]  __kasan_kmalloc+0xc1/0xd0
[  +0.000006]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000007]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000493]  drm_file_alloc+0x569/0x9a0
[  +0.000007]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000006]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x510/0x10c0 [amdgpu]
[  +0.000483]  local_pci_probe+0xe7/0x1b0
[  +0.000006]  pci_device_probe+0x5bf/0x890
[  +0.000006]  really_probe+0x1fd/0x950
[  +0.000005]  __driver_probe_device+0x307/0x410
[  +0.000006]  driver_probe_device+0x4e/0x150
[  +0.000005]  __driver_attach+0x223/0x510
[  +0.000006]  bus_for_each_dev+0x102/0x1a0
[  +0.000005]  driver_attach+0x3d/0x60
[  +0.000006]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  rfcomm_dlc_clear_state+0x69/0x220 [rfcomm]
[  +0.000011]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000006]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] Freed by task 1733 on cpu 5 at 59.907086s:
[  +0.000011]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000005]  kasan_save_free_info+0x3b/0x60
[  +0.000005]  __kasan_slab_free+0x54/0x80
[  +0.000006]  kfree+0x127/0x470
[  +0.000006]  amdgpu_driver_postclose_kms+0x455/0x760 [amdgpu]
[  +0.000493]  drm_file_free.part.0+0x5b1/0xba0
[  +0.000006]  drm_file_free+0x13/0x30
[  +0.000006]  drm_client_release+0x1c4/0x2b0
[  +0.000006]  drm_fbdev_ttm_fb_destroy+0xd2/0x120 [drm_ttm_helper]
[  +0.000007]  put_fb_info+0x97/0xe0
[  +0.000007]  unregister_framebuffer+0x197/0x380
[  +0.000005]  drm_fb_helper_unregister_info+0x94/0x100
[  +0.000005]  drm_fbdev_client_unregister+0x3c/0x80
[  +0.000007]  drm_client_dev_unregister+0x144/0x330
[  +0.000006]  drm_dev_unregister+0x49/0x1b0
[  +0.000006]  drm_dev_unplug+0x4c/0xd0
[  +0.000006]  amdgpu_pci_remove+0x58/0x130 [amdgpu]
[  +0.000484]  pci_device_remove+0xae/0x1e0
[  +0.000008]  device_remove+0xc7/0x180
[  +0.000007]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000006]  device_release_driver+0x12/0x20
[  +0.000007]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000006]  remove_store+0xd7/0xf0
[  +0.000006]  dev_attr_store+0x3f/0x80
[  +0.000005]  sysfs_kf_write+0x125/0x1d0
[  +0.000006]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000006]  vfs_write+0x90d/0xe70
[  +0.000006]  ksys_write+0x119/0x220
[  +0.000006]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000005]  do_syscall_64+0x7c/0x170
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000012] The buggy address belongs to the object at ffff88812eec8000
               which belongs to the cache kmalloc-rnd-07-4k of size 4096
[  +0.000016] The buggy address is located 3160 bytes inside of
               freed 4096-byte region [ffff88812eec8000, ffff88812eec9000)

[  +0.000023] The buggy address belongs to the physical page:
[  +0.000009] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12eec8
[  +0.000007] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  +0.000005] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  +0.000007] page_type: f5(slab)
[  +0.000008] raw: 0017ffffc0000040 ffff888100054500 dead000000000122 0000000000000000
[  +0.000005] raw: 0000000000000000 0000000080040004 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000040 ffff888100054500 dead000000000122 0000000000000000
[  +0.000005] head: 0000000000000000 0000000080040004 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000003 ffffea0004bbb201 ffffffffffffffff 0000000000000000
[  +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[  +0.000005] page dumped because: kasan: bad access detected

[  +0.000010] Memory state around the buggy address:
[  +0.000009]  ffff88812eec8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000012]  ffff88812eec8b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] >ffff88812eec8c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]                                                     ^
[  +0.000010]  ffff88812eec8c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]  ffff88812eec8d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] ==================================================================

The use-after-free occurs because a delayed work item (`suspend_work`) may still
be pending or running when resources it accesses are freed during device removal
or file close. The previous code used `flush_work(&fpriv->evf_mgr.suspend_work.work)`,
which does not wait for delayed work that has not yet started. As a result, the
delayed work could run after its memory was freed, causing a use-after-free.
By switching to `flush_delayed_work(&fpriv->evf_mgr.suspend_work)`, we ensure that
the kernel waits for both queued and delayed work to finish before
freeing memory, closing this race.

Fixes: adba0929736a ("drm/amdgpu: Fix Illegal opcode in command stream Error")
Signed-off-by: Vitaly Prosyak <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
dumbbell pushed a commit to dumbbell/drm-kmod that referenced this pull request Sep 22, 2025
Bos can be put with multiple unrelated dma-resv locks held. But
imported bos attempt to grab the bo dma-resv during dma-buf detach
that typically happens during cleanup. That leads to lockde splats
similar to the below and a potential ABBA deadlock.

Fix this by always taking the delayed workqueue cleanup path for
imported bos.

Requesting stable fixes from when the Xe driver was introduced,
since its usage of drm_exec and wide vm dma_resvs appear to be
the first reliable trigger of this.

[22982.116427] ============================================
[22982.116428] WARNING: possible recursive locking detected
[22982.116429] 6.10.0-rc2+ freebsd#10 Tainted: G     U  W
[22982.116430] --------------------------------------------
[22982.116430] glxgears:sh0/5785 is trying to acquire lock:
[22982.116431] ffff8c2bafa539a8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: dma_buf_detach+0x3b/0xf0
[22982.116438]
               but task is already holding lock:
[22982.116438] ffff8c2d9aba6da8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[22982.116442]
               other info that might help us debug this:
[22982.116442]  Possible unsafe locking scenario:

[22982.116443]        CPU0
[22982.116444]        ----
[22982.116444]   lock(reservation_ww_class_mutex);
[22982.116445]   lock(reservation_ww_class_mutex);
[22982.116447]
                *** DEADLOCK ***

[22982.116447]  May be due to missing lock nesting notation

[22982.116448] 5 locks held by glxgears:sh0/5785:
[22982.116449]  #0: ffff8c2d9aba58c8 (&xef->vm.lock){+.+.}-{3:3}, at: xe_file_close+0xde/0x1c0 [xe]
[22982.116507]  freebsd#1: ffff8c2e28cc8480 (&vm->lock){++++}-{3:3}, at: xe_vm_close_and_put+0x161/0x9b0 [xe]
[22982.116578]  freebsd#2: ffff8c2e31982970 (&val->lock){.+.+}-{3:3}, at: xe_validation_ctx_init+0x6d/0x70 [xe]
[22982.116647]  freebsd#3: ffffacdc469478a8 (reservation_ww_class_acquire){+.+.}-{0:0}, at: xe_vma_destroy_unlocked+0x7f/0xe0 [xe]
[22982.116716]  freebsd#4: ffff8c2d9aba6da8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[22982.116719]
               stack backtrace:
[22982.116720] CPU: 8 PID: 5785 Comm: glxgears:sh0 Tainted: G     U  W          6.10.0-rc2+ freebsd#10
[22982.116721] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
[22982.116723] Call Trace:
[22982.116724]  <TASK>
[22982.116725]  dump_stack_lvl+0x77/0xb0
[22982.116727]  __lock_acquire+0x1232/0x2160
[22982.116730]  lock_acquire+0xcb/0x2d0
[22982.116732]  ? dma_buf_detach+0x3b/0xf0
[22982.116734]  ? __lock_acquire+0x417/0x2160
[22982.116736]  __ww_mutex_lock.constprop.0+0xd0/0x13b0
[22982.116738]  ? dma_buf_detach+0x3b/0xf0
[22982.116741]  ? dma_buf_detach+0x3b/0xf0
[22982.116743]  ? ww_mutex_lock+0x2b/0x90
[22982.116745]  ww_mutex_lock+0x2b/0x90
[22982.116747]  dma_buf_detach+0x3b/0xf0
[22982.116749]  drm_prime_gem_destroy+0x2f/0x40 [drm]
[22982.116775]  xe_ttm_bo_destroy+0x32/0x220 [xe]
[22982.116818]  ? __mutex_unlock_slowpath+0x3a/0x290
[22982.116821]  drm_exec_unlock_all+0xa1/0xd0 [drm_exec]
[22982.116823]  drm_exec_fini+0x12/0xb0 [drm_exec]
[22982.116824]  xe_validation_ctx_fini+0x15/0x40 [xe]
[22982.116892]  xe_vma_destroy_unlocked+0xb1/0xe0 [xe]
[22982.116959]  xe_vm_close_and_put+0x41a/0x9b0 [xe]
[22982.117025]  ? xa_find+0xe3/0x1e0
[22982.117028]  xe_file_close+0x10a/0x1c0 [xe]
[22982.117074]  drm_file_free+0x22a/0x280 [drm]
[22982.117099]  drm_release_noglobal+0x22/0x70 [drm]
[22982.117119]  __fput+0xf1/0x2d0
[22982.117122]  task_work_run+0x59/0x90
[22982.117125]  do_exit+0x330/0xb40
[22982.117127]  do_group_exit+0x36/0xa0
[22982.117129]  get_signal+0xbd2/0xbe0
[22982.117131]  arch_do_signal_or_restart+0x3e/0x240
[22982.117134]  syscall_exit_to_user_mode+0x1e7/0x290
[22982.117137]  do_syscall_64+0xa1/0x180
[22982.117139]  ? lock_acquire+0xcb/0x2d0
[22982.117140]  ? __set_task_comm+0x28/0x1e0
[22982.117141]  ? find_held_lock+0x2b/0x80
[22982.117144]  ? __set_task_comm+0xe1/0x1e0
[22982.117145]  ? lock_release+0xca/0x290
[22982.117147]  ? __do_sys_prctl+0x245/0xab0
[22982.117149]  ? lockdep_hardirqs_on_prepare+0xde/0x190
[22982.117150]  ? syscall_exit_to_user_mode+0xb0/0x290
[22982.117152]  ? do_syscall_64+0xa1/0x180
[22982.117154]  ? __lock_acquire+0x417/0x2160
[22982.117155]  ? reacquire_held_locks+0xd1/0x1f0
[22982.117156]  ? do_user_addr_fault+0x30c/0x790
[22982.117158]  ? lock_acquire+0xcb/0x2d0
[22982.117160]  ? find_held_lock+0x2b/0x80
[22982.117162]  ? do_user_addr_fault+0x357/0x790
[22982.117163]  ? lock_release+0xca/0x290
[22982.117164]  ? do_user_addr_fault+0x361/0x790
[22982.117166]  ? trace_hardirqs_off+0x4b/0xc0
[22982.117168]  ? clear_bhb_loop+0x45/0xa0
[22982.117170]  ? clear_bhb_loop+0x45/0xa0
[22982.117172]  ? clear_bhb_loop+0x45/0xa0
[22982.117174]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[22982.117176] RIP: 0033:0x7f943d267169
[22982.117192] Code: Unable to access opcode bytes at 0x7f943d26713f.
[22982.117193] RSP: 002b:00007f9430bffc80 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[22982.117195] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f943d267169
[22982.117196] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005622f89579d0
[22982.117197] RBP: 00007f9430bffcb0 R08: 0000000000000000 R09: 00000000ffffffff
[22982.117198] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[22982.117199] R13: 0000000000000000 R14: 0000000000000000 R15: 00005622f89579d0
[22982.117202]  </TASK>

Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Christian König <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: <[email protected]> # v6.8+
Signed-off-by: Thomas Hellström <[email protected]>
Reviewed-by: Matthew Brost <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 26, 2025
The GGTT is currently a 32 bit address space, but the HW and GuC
support 48b addresses in GGTT-related operations, both to keep the
interface/HW paths common between PPGTT and GGTT and to allow for
future increase of the GGTT size.
This leaves us having to program a 64b field with a 32b offset, which
currently we're in some cases doing this by using an upper_32_bits()
call on a 32b variable, which doesn't make any sense. To do this cleanly
we have 2 options:

1 - Set the upper 32 bits directly to zero.
2 - Use 64b variables for the offset and keep programming the whole thing,
    so we're ready if we ever have bigger offsets.

This patch goes with option freebsd#2 and switches the related variables to u64.

v2: don't change the log ctl flag variable (John)

Signed-off-by: Daniele Ceraolo Spurio <[email protected]>
Cc: John Harrison <[email protected]>
Reviewed-by: John Harrison <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 26, 2025
Bos can be put with multiple unrelated dma-resv locks held. But
imported bos attempt to grab the bo dma-resv during dma-buf detach
that typically happens during cleanup. That leads to lockde splats
similar to the below and a potential ABBA deadlock.

Fix this by always taking the delayed workqueue cleanup path for
imported bos.

Requesting stable fixes from when the Xe driver was introduced,
since its usage of drm_exec and wide vm dma_resvs appear to be
the first reliable trigger of this.

[22982.116427] ============================================
[22982.116428] WARNING: possible recursive locking detected
[22982.116429] 6.10.0-rc2+ freebsd#10 Tainted: G     U  W
[22982.116430] --------------------------------------------
[22982.116430] glxgears:sh0/5785 is trying to acquire lock:
[22982.116431] ffff8c2bafa539a8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: dma_buf_detach+0x3b/0xf0
[22982.116438]
               but task is already holding lock:
[22982.116438] ffff8c2d9aba6da8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[22982.116442]
               other info that might help us debug this:
[22982.116442]  Possible unsafe locking scenario:

[22982.116443]        CPU0
[22982.116444]        ----
[22982.116444]   lock(reservation_ww_class_mutex);
[22982.116445]   lock(reservation_ww_class_mutex);
[22982.116447]
                *** DEADLOCK ***

[22982.116447]  May be due to missing lock nesting notation

[22982.116448] 5 locks held by glxgears:sh0/5785:
[22982.116449]  #0: ffff8c2d9aba58c8 (&xef->vm.lock){+.+.}-{3:3}, at: xe_file_close+0xde/0x1c0 [xe]
[22982.116507]  freebsd#1: ffff8c2e28cc8480 (&vm->lock){++++}-{3:3}, at: xe_vm_close_and_put+0x161/0x9b0 [xe]
[22982.116578]  freebsd#2: ffff8c2e31982970 (&val->lock){.+.+}-{3:3}, at: xe_validation_ctx_init+0x6d/0x70 [xe]
[22982.116647]  freebsd#3: ffffacdc469478a8 (reservation_ww_class_acquire){+.+.}-{0:0}, at: xe_vma_destroy_unlocked+0x7f/0xe0 [xe]
[22982.116716]  freebsd#4: ffff8c2d9aba6da8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[22982.116719]
               stack backtrace:
[22982.116720] CPU: 8 PID: 5785 Comm: glxgears:sh0 Tainted: G     U  W          6.10.0-rc2+ freebsd#10
[22982.116721] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
[22982.116723] Call Trace:
[22982.116724]  <TASK>
[22982.116725]  dump_stack_lvl+0x77/0xb0
[22982.116727]  __lock_acquire+0x1232/0x2160
[22982.116730]  lock_acquire+0xcb/0x2d0
[22982.116732]  ? dma_buf_detach+0x3b/0xf0
[22982.116734]  ? __lock_acquire+0x417/0x2160
[22982.116736]  __ww_mutex_lock.constprop.0+0xd0/0x13b0
[22982.116738]  ? dma_buf_detach+0x3b/0xf0
[22982.116741]  ? dma_buf_detach+0x3b/0xf0
[22982.116743]  ? ww_mutex_lock+0x2b/0x90
[22982.116745]  ww_mutex_lock+0x2b/0x90
[22982.116747]  dma_buf_detach+0x3b/0xf0
[22982.116749]  drm_prime_gem_destroy+0x2f/0x40 [drm]
[22982.116775]  xe_ttm_bo_destroy+0x32/0x220 [xe]
[22982.116818]  ? __mutex_unlock_slowpath+0x3a/0x290
[22982.116821]  drm_exec_unlock_all+0xa1/0xd0 [drm_exec]
[22982.116823]  drm_exec_fini+0x12/0xb0 [drm_exec]
[22982.116824]  xe_validation_ctx_fini+0x15/0x40 [xe]
[22982.116892]  xe_vma_destroy_unlocked+0xb1/0xe0 [xe]
[22982.116959]  xe_vm_close_and_put+0x41a/0x9b0 [xe]
[22982.117025]  ? xa_find+0xe3/0x1e0
[22982.117028]  xe_file_close+0x10a/0x1c0 [xe]
[22982.117074]  drm_file_free+0x22a/0x280 [drm]
[22982.117099]  drm_release_noglobal+0x22/0x70 [drm]
[22982.117119]  __fput+0xf1/0x2d0
[22982.117122]  task_work_run+0x59/0x90
[22982.117125]  do_exit+0x330/0xb40
[22982.117127]  do_group_exit+0x36/0xa0
[22982.117129]  get_signal+0xbd2/0xbe0
[22982.117131]  arch_do_signal_or_restart+0x3e/0x240
[22982.117134]  syscall_exit_to_user_mode+0x1e7/0x290
[22982.117137]  do_syscall_64+0xa1/0x180
[22982.117139]  ? lock_acquire+0xcb/0x2d0
[22982.117140]  ? __set_task_comm+0x28/0x1e0
[22982.117141]  ? find_held_lock+0x2b/0x80
[22982.117144]  ? __set_task_comm+0xe1/0x1e0
[22982.117145]  ? lock_release+0xca/0x290
[22982.117147]  ? __do_sys_prctl+0x245/0xab0
[22982.117149]  ? lockdep_hardirqs_on_prepare+0xde/0x190
[22982.117150]  ? syscall_exit_to_user_mode+0xb0/0x290
[22982.117152]  ? do_syscall_64+0xa1/0x180
[22982.117154]  ? __lock_acquire+0x417/0x2160
[22982.117155]  ? reacquire_held_locks+0xd1/0x1f0
[22982.117156]  ? do_user_addr_fault+0x30c/0x790
[22982.117158]  ? lock_acquire+0xcb/0x2d0
[22982.117160]  ? find_held_lock+0x2b/0x80
[22982.117162]  ? do_user_addr_fault+0x357/0x790
[22982.117163]  ? lock_release+0xca/0x290
[22982.117164]  ? do_user_addr_fault+0x361/0x790
[22982.117166]  ? trace_hardirqs_off+0x4b/0xc0
[22982.117168]  ? clear_bhb_loop+0x45/0xa0
[22982.117170]  ? clear_bhb_loop+0x45/0xa0
[22982.117172]  ? clear_bhb_loop+0x45/0xa0
[22982.117174]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[22982.117176] RIP: 0033:0x7f943d267169
[22982.117192] Code: Unable to access opcode bytes at 0x7f943d26713f.
[22982.117193] RSP: 002b:00007f9430bffc80 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[22982.117195] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f943d267169
[22982.117196] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005622f89579d0
[22982.117197] RBP: 00007f9430bffcb0 R08: 0000000000000000 R09: 00000000ffffffff
[22982.117198] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[22982.117199] R13: 0000000000000000 R14: 0000000000000000 R15: 00005622f89579d0
[22982.117202]  </TASK>

Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Christian König <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: <[email protected]> # v6.8+
Signed-off-by: Thomas Hellström <[email protected]>
Reviewed-by: Matthew Brost <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 26, 2025
 everywhere

This just standardizes the use of MIN() and MAX() macros, with the very
traditional semantics.  The goal is to use these for C constant
expressions and for top-level / static initializers, and so be able to
simplify the min()/max() macros.

These macro names were used by various kernel code - they are very
traditional, after all - and all such users have been fixed up, with a
few different approaches:

 - trivial duplicated macro definitions have been removed

   Note that 'trivial' here means that it's obviously kernel code that
   already included all the major kernel headers, and thus gets the new
   generic MIN/MAX macros automatically.

 - non-trivial duplicated macro definitions are guarded with #ifndef

   This is the "yes, they define their own versions, but no, the include
   situation is not entirely obvious, and maybe they don't get the
   generic version automatically" case.

 - strange use case freebsd#1

   A couple of drivers decided that the way they want to describe their
   versioning is with

        #define MAJ 1
        #define MIN 2
        #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN)

   which adds zero value and I just did my Alexander the Great
   impersonation, and rewrote that pointless Gordian knot as

        #define DRV_VERSION "1.2"

   instead.

 - strange use case freebsd#2

   A couple of drivers thought that it's a good idea to have a random
   'MIN' or 'MAX' define for a value or index into a table, rather than
   the traditional macro that takes arguments.

   These values were re-written as C enum's instead. The new
   function-line macros only expand when followed by an open
   parenthesis, and thus don't clash with enum use.

Happily, there weren't really all that many of these cases, and a lot of
users already had the pattern of using '#ifndef' guarding (or in one
case just using '#undef MIN') before defining their own private version
that does the same thing. I left such cases alone.

Cc: David Laight <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 26, 2025
Remove intel_dp_mst_suspend/resume from runtime suspend resume
sequences. It is incorrect as it depends on AUX transfers which
itself depend on the device being runtime resumed. This is
also why we see a lock_dep splat here.

<4> [76.011119] kworker/4:2/192 is trying to acquire lock:
<4> [76.011122] ffff8881120b3210 (&mgr->lock#2){+.+.}-{3:3}, at:
drm_dp_mst_topology_mgr_suspend+0x33/0xd0 [drm_display_helper]
<4> [76.011142]
but task is already holding lock:
<4> [76.011144] ffffffffa0bc3420
(xe_pm_runtime_lockdep_map){+.+.}-{0:0}, at:
xe_pm_runtime_suspend+0x51/0x3f0 [xe]
<4> [76.011223]
which lock already depends on the new lock.
<4> [76.011226]
the existing dependency chain (in reverse order) is:
<4> [76.011229]
-> freebsd#2 (xe_pm_runtime_lockdep_map){+.+.}-{0:0}:
<4> [76.011233]        pm_runtime_lockdep_prime+0x2f/0x50 [xe]
<4> [76.011306]        xe_pm_runtime_resume_and_get+0x29/0x90 [xe]
<4> [76.011377]        intel_display_power_get+0x24/0x70 [xe]
<4> [76.011466]        intel_digital_port_connected_locked+0x4c/0xf0
[xe]
<4> [76.011551]        intel_dp_aux_xfer+0xb8/0x7c0 [xe]
<4> [76.011633]        intel_dp_aux_transfer+0x166/0x2e0 [xe]
<4> [76.011715]        drm_dp_dpcd_access+0x87/0x150
[drm_display_helper]
<4> [76.011726]        drm_dp_dpcd_probe+0x3d/0xf0 [drm_display_helper]
<4> [76.011737]        drm_dp_dpcd_read+0xdd/0x130 [drm_display_helper]
<4> [76.011747]        intel_dp_get_colorimetry_status+0x3a/0x70 [xe]
<4> [76.011886]        intel_dp_init_connector+0x4ff/0x1030 [xe]
<4> [76.011969]        intel_ddi_init+0xc5b/0x1030 [xe]
<4> [76.012058]        intel_bios_for_each_encoder+0x36/0x60 [xe]
<4> [76.012145]        intel_setup_outputs+0x201/0x460 [xe]
<4> [76.012233]        intel_display_driver_probe_nogem+0x155/0x1e0 [xe]
<4> [76.012320]        xe_display_init_noaccel+0x27/0x70 [xe]

Signed-off-by: Suraj Kandpal <[email protected]>
Reviewed-by: Rodrigo Vivi <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 26, 2025
Do not do intel_fbdev_set_suspend during runtime_suspend/resume
functions. This cause a big circular lock_dep splat.

kworker/0:4/198 is trying to acquire lock:
<4> [77.185594] ffffffff83398500 (console_lock){+.+.}-{0:0}, at:
intel_fbdev_set_suspend+0x169/0x1f0 [xe]
<4> [77.185947]
but task is already holding lock:
<4> [77.185949] ffffffffa09e9460
(xe_pm_runtime_lockdep_map){+.+.}-{0:0}, at:
xe_pm_runtime_suspend+0x51/0x3f0 [xe]
<4> [77.186262]
which lock already depends on the new lock.
<4> [77.186264]
the existing dependency chain (in reverse order) is:
<4> [77.186266]
-> freebsd#2 (xe_pm_runtime_lockdep_map){+.+.}-{0:0}:
<4> [77.186276]        pm_runtime_lockdep_prime+0x2f/0x50 [xe]
<4> [77.186572]        xe_pm_runtime_resume_and_get+0x29/0x90 [xe]
<4> [77.186867]        intelfb_create+0x150/0x390 [xe]
<4> [77.187197]
__drm_fb_helper_initial_config_and_unlock+0x31c/0x5e0 [drm_kms_helper]
<4> [77.187243]        drm_fb_helper_initial_config+0x3d/0x50
[drm_kms_helper]
<4> [77.187274]        intel_fbdev_client_hotplug+0xb1/0x140 [xe]
<4> [77.187603]        drm_client_register+0x87/0xd0 [drm]
<4> [77.187704]        intel_fbdev_setup+0x51c/0x640 [xe]
<4> [77.188033]        intel_display_driver_register+0xb7/0xf0 [xe]
<4> [77.188438]        xe_display_register+0x21/0x40 [xe]
<4> [77.188809]        xe_device_probe+0xa8d/0xbf0 [xe]
<4> [77.189035]        xe_pci_probe+0x333/0x5b0 [xe]
<4> [77.189330]        local_pci_probe+0x48/0xb0
<4> [77.189341]        pci_device_probe+0xc8/0x280
<4> [77.189351]        really_probe+0xf8/0x390
<4> [77.189362]        __driver_probe_device+0x8a/0x170
<4> [77.189373]        driver_probe_device+0x23/0xb0

Signed-off-by: Suraj Kandpal <[email protected]>
Reviewed-by: Rodrigo Vivi <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 26, 2025
This commit addresses a circular locking dependency issue within the GFX
isolation mechanism. The problem was identified by a warning indicating
a potential deadlock due to inconsistent lock acquisition order.

- The `amdgpu_gfx_enforce_isolation_ring_begin_use` and
  `amdgpu_gfx_enforce_isolation_ring_end_use` functions previously
  acquired `enforce_isolation_mutex` and called `amdgpu_gfx_kfd_sch_ctrl`,
  leading to potential deadlocks. ie., If `amdgpu_gfx_kfd_sch_ctrl` is
  called while `enforce_isolation_mutex` is held, and
  `amdgpu_gfx_enforce_isolation_handler` is called while `kfd_sch_mutex` is
  held, it can create a circular dependency.

By ensuring consistent lock usage, this fix resolves the issue:

[  606.297333] ======================================================
[  606.297343] WARNING: possible circular locking dependency detected
[  606.297353] 6.10.0-amd-mlkd-610-311224-lof freebsd#19 Tainted: G           OE
[  606.297365] ------------------------------------------------------
[  606.297375] kworker/u96:3/3825 is trying to acquire lock:
[  606.297385] ffff9aa64e431cb8 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}, at: __flush_work+0x232/0x610
[  606.297413]
               but task is already holding lock:
[  606.297423] ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.297725]
               which lock already depends on the new lock.

[  606.297738]
               the existing dependency chain (in reverse order) is:
[  606.297749]
               -> freebsd#2 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}:
[  606.297765]        __mutex_lock+0x85/0x930
[  606.297776]        mutex_lock_nested+0x1b/0x30
[  606.297786]        amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.298007]        amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.298225]        amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.298412]        amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.298603]        amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.298866]        drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.298880]        process_one_work+0x21e/0x680
[  606.298890]        worker_thread+0x190/0x350
[  606.298899]        kthread+0xe7/0x120
[  606.298908]        ret_from_fork+0x3c/0x60
[  606.298919]        ret_from_fork_asm+0x1a/0x30
[  606.298929]
               -> freebsd#1 (&adev->enforce_isolation_mutex){+.+.}-{3:3}:
[  606.298947]        __mutex_lock+0x85/0x930
[  606.298956]        mutex_lock_nested+0x1b/0x30
[  606.298966]        amdgpu_gfx_enforce_isolation_handler+0x87/0x370 [amdgpu]
[  606.299190]        process_one_work+0x21e/0x680
[  606.299199]        worker_thread+0x190/0x350
[  606.299208]        kthread+0xe7/0x120
[  606.299217]        ret_from_fork+0x3c/0x60
[  606.299227]        ret_from_fork_asm+0x1a/0x30
[  606.299236]
               -> #0 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}:
[  606.299257]        __lock_acquire+0x16f9/0x2810
[  606.299267]        lock_acquire+0xd1/0x300
[  606.299276]        __flush_work+0x250/0x610
[  606.299286]        cancel_delayed_work_sync+0x71/0x80
[  606.299296]        amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu]
[  606.299509]        amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.299723]        amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.299909]        amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.300101]        amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.300355]        drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.300369]        process_one_work+0x21e/0x680
[  606.300378]        worker_thread+0x190/0x350
[  606.300387]        kthread+0xe7/0x120
[  606.300396]        ret_from_fork+0x3c/0x60
[  606.300406]        ret_from_fork_asm+0x1a/0x30
[  606.300416]
               other info that might help us debug this:

[  606.300428] Chain exists of:
                 (work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work) --> &adev->enforce_isolation_mutex --> &adev->gfx.kfd_sch_mutex

[  606.300458]  Possible unsafe locking scenario:

[  606.300468]        CPU0                    CPU1
[  606.300476]        ----                    ----
[  606.300484]   lock(&adev->gfx.kfd_sch_mutex);
[  606.300494]                                lock(&adev->enforce_isolation_mutex);
[  606.300508]                                lock(&adev->gfx.kfd_sch_mutex);
[  606.300521]   lock((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work));
[  606.300536]
                *** DEADLOCK ***

[  606.300546] 5 locks held by kworker/u96:3/3825:
[  606.300555]  #0: ffff9aa5aa1f5d58 ((wq_completion)comp_1.1.0){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680
[  606.300577]  freebsd#1: ffffaa53c3c97e40 ((work_completion)(&sched->work_run_job)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680
[  606.300600]  freebsd#2: ffff9aa64e463c98 (&adev->enforce_isolation_mutex){+.+.}-{3:3}, at: amdgpu_gfx_enforce_isolation_ring_begin_use+0x1c3/0x5d0 [amdgpu]
[  606.300837]  freebsd#3: ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.301062]  freebsd#4: ffffffff8c1a5660 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x70/0x610
[  606.301083]
               stack backtrace:
[  606.301092] CPU: 14 PID: 3825 Comm: kworker/u96:3 Tainted: G           OE      6.10.0-amd-mlkd-610-311224-lof freebsd#19
[  606.301109] Hardware name: Gigabyte Technology Co., Ltd. X570S GAMING X/X570S GAMING X, BIOS F7 03/22/2024
[  606.301124] Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
[  606.301140] Call Trace:
[  606.301146]  <TASK>
[  606.301154]  dump_stack_lvl+0x9b/0xf0
[  606.301166]  dump_stack+0x10/0x20
[  606.301175]  print_circular_bug+0x26c/0x340
[  606.301187]  check_noncircular+0x157/0x170
[  606.301197]  ? register_lock_class+0x48/0x490
[  606.301213]  __lock_acquire+0x16f9/0x2810
[  606.301230]  lock_acquire+0xd1/0x300
[  606.301239]  ? __flush_work+0x232/0x610
[  606.301250]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.301261]  ? mark_held_locks+0x54/0x90
[  606.301274]  ? __flush_work+0x232/0x610
[  606.301284]  __flush_work+0x250/0x610
[  606.301293]  ? __flush_work+0x232/0x610
[  606.301305]  ? __pfx_wq_barrier_func+0x10/0x10
[  606.301318]  ? mark_held_locks+0x54/0x90
[  606.301331]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.301345]  cancel_delayed_work_sync+0x71/0x80
[  606.301356]  amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu]
[  606.301661]  amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.302050]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.302069]  amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.302452]  amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.302862]  ? drm_sched_entity_error+0x82/0x190 [gpu_sched]
[  606.302890]  amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.303366]  drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.303388]  process_one_work+0x21e/0x680
[  606.303409]  worker_thread+0x190/0x350
[  606.303424]  ? __pfx_worker_thread+0x10/0x10
[  606.303437]  kthread+0xe7/0x120
[  606.303449]  ? __pfx_kthread+0x10/0x10
[  606.303463]  ret_from_fork+0x3c/0x60
[  606.303476]  ? __pfx_kthread+0x10/0x10
[  606.303489]  ret_from_fork_asm+0x1a/0x30
[  606.303512]  </TASK>

v2: Refactor lock handling to resolve circular dependency (Alex)

- Introduced a `sched_work` flag to defer the call to
  `amdgpu_gfx_kfd_sch_ctrl` until after releasing
  `enforce_isolation_mutex`.
- This change ensures that `amdgpu_gfx_kfd_sch_ctrl` is called outside
  the critical section, preventing the circular dependency and deadlock.
- The `sched_work` flag is set within the mutex-protected section if
  conditions are met, and the actual function call is made afterward.
- This approach ensures consistent lock acquisition order.

Fixes: afefd6f24502 ("drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization")
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Suggested-by: Alex Deucher <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 0b6b2dd38336d5fd49214f0e4e6495e658e3ab44)
Cc: [email protected]
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 26, 2025
…void Priority Inversion in SRIOV

RLCG Register Access is a way for virtual functions to safely access GPU
registers in a virtualized environment., including TLB flushes and
register reads. When multiple threads or VFs try to access the same
registers simultaneously, it can lead to race conditions. By using the
RLCG interface, the driver can serialize access to the registers. This
means that only one thread can access the registers at a time,
preventing conflicts and ensuring that operations are performed
correctly. Additionally, when a low-priority task holds a mutex that a
high-priority task needs, ie., If a thread holding a spinlock tries to
acquire a mutex, it can lead to priority inversion. register access in
amdgpu_virt_rlcg_reg_rw especially in a fast code path is critical.

The call stack shows that the function amdgpu_virt_rlcg_reg_rw is being
called, which attempts to acquire the mutex. This function is invoked
from amdgpu_sriov_wreg, which in turn is called from
gmc_v11_0_flush_gpu_tlb.

The [ BUG: Invalid wait context ] indicates that a thread is trying to
acquire a mutex while it is in a context that does not allow it to sleep
(like holding a spinlock).

Fixes the below:

[  253.013423] =============================
[  253.013434] [ BUG: Invalid wait context ]
[  253.013446] 6.12.0-amdstaging-drm-next-lol-050225 freebsd#14 Tainted: G     U     OE
[  253.013464] -----------------------------
[  253.013475] kworker/0:1/10 is trying to lock:
[  253.013487] ffff9f30542e3cf8 (&adev->virt.rlcg_reg_lock){+.+.}-{3:3}, at: amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.013815] other info that might help us debug this:
[  253.013827] context-{4:4}
[  253.013835] 3 locks held by kworker/0:1/10:
[  253.013847]  #0: ffff9f3040050f58 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680
[  253.013877]  freebsd#1: ffffb789c008be40 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680
[  253.013905]  freebsd#2: ffff9f3054281838 (&adev->gmc.invalidate_lock){+.+.}-{2:2}, at: gmc_v11_0_flush_gpu_tlb+0x198/0x4f0 [amdgpu]
[  253.014154] stack backtrace:
[  253.014164] CPU: 0 UID: 0 PID: 10 Comm: kworker/0:1 Tainted: G     U     OE      6.12.0-amdstaging-drm-next-lol-050225 freebsd#14
[  253.014189] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[  253.014203] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/18/2024
[  253.014224] Workqueue: events work_for_cpu_fn
[  253.014241] Call Trace:
[  253.014250]  <TASK>
[  253.014260]  dump_stack_lvl+0x9b/0xf0
[  253.014275]  dump_stack+0x10/0x20
[  253.014287]  __lock_acquire+0xa47/0x2810
[  253.014303]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.014321]  lock_acquire+0xd1/0x300
[  253.014333]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.014562]  ? __lock_acquire+0xa6b/0x2810
[  253.014578]  __mutex_lock+0x85/0xe20
[  253.014591]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.014782]  ? sched_clock_noinstr+0x9/0x10
[  253.014795]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.014808]  ? local_clock_noinstr+0xe/0xc0
[  253.014822]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.015012]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.015029]  mutex_lock_nested+0x1b/0x30
[  253.015044]  ? mutex_lock_nested+0x1b/0x30
[  253.015057]  amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.015249]  amdgpu_sriov_wreg+0xc5/0xd0 [amdgpu]
[  253.015435]  gmc_v11_0_flush_gpu_tlb+0x44b/0x4f0 [amdgpu]
[  253.015667]  gfx_v11_0_hw_init+0x499/0x29c0 [amdgpu]
[  253.015901]  ? __pfx_smu_v13_0_update_pcie_parameters+0x10/0x10 [amdgpu]
[  253.016159]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.016173]  ? smu_hw_init+0x18d/0x300 [amdgpu]
[  253.016403]  amdgpu_device_init+0x29ad/0x36a0 [amdgpu]
[  253.016614]  amdgpu_driver_load_kms+0x1a/0xc0 [amdgpu]
[  253.017057]  amdgpu_pci_probe+0x1c2/0x660 [amdgpu]
[  253.017493]  local_pci_probe+0x4b/0xb0
[  253.017746]  work_for_cpu_fn+0x1a/0x30
[  253.017995]  process_one_work+0x21e/0x680
[  253.018248]  worker_thread+0x190/0x330
[  253.018500]  ? __pfx_worker_thread+0x10/0x10
[  253.018746]  kthread+0xe7/0x120
[  253.018988]  ? __pfx_kthread+0x10/0x10
[  253.019231]  ret_from_fork+0x3c/0x60
[  253.019468]  ? __pfx_kthread+0x10/0x10
[  253.019701]  ret_from_fork_asm+0x1a/0x30
[  253.019939]  </TASK>

v2: s/spin_trylock/spin_lock_irqsave to be safe (Christian).

Fixes: e864180ee49b ("drm/amdgpu: Add lock around VF RLCG interface")
Cc: lin cao <[email protected]>
Cc: Jingwen Chen <[email protected]>
Cc: Victor Skvortsov <[email protected]>
Cc: Zhigang Luo <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Suggested-by: Alex Deucher <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 26, 2025
Commit 7da55c27e76749b9 ("drm/amd/display: Remove incorrect FP context
start") removes the FP context protection of dml2_create(), and it said
"All the DC_FP_START/END should be used before call anything from DML2".

However, dml2_init()/dml21_init() are not protected from their callers,
causing such errors:

 do_fpu invoked from kernel context![freebsd#1]:
 CPU: 0 UID: 0 PID: 239 Comm: kworker/0:5 Not tainted 6.14.0-rc6+ freebsd#2
 Workqueue: events work_for_cpu_fn
 pc ffff80000319de80 ra ffff80000319de5c tp 900000010575c000 sp 900000010575f840
 a0 0000000000000000 a1 900000012f210130 a2 900000012f000000 a3 ffff80000357e268
 a4 ffff80000357e260 a5 900000012ea52cf0 a6 0000000400000004 a7 0000012c00001388
 t0 00001900000015e0 t1 ffff80000379d000 t2 0000000010624dd3 t3 0000006400000014
 t4 00000000000003e8 t5 0000005000000018 t6 0000000000000020 t7 0000000f00000064
 t8 000000000000002f u0 5f5e9200f8901912 s9 900000012d380010 s0 900000012ea51fd8
 s1 900000012f000000 s2 9000000109296000 s3 0000000000000001 s4 0000000000001fd8
 s5 0000000000000001 s6 ffff800003415000 s7 900000012d390000 s8 ffff800003211f80
    ra: ffff80000319de5c dml21_apply_soc_bb_overrides+0x3c/0x960 [amdgpu]
   ERA: ffff80000319de80 dml21_apply_soc_bb_overrides+0x60/0x960 [amdgpu]
  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
  PRMD: 00000004 (PPLV0 +PIE -PWE)
  EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
  ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
 ESTAT: 000f0000 [FPD] (IS= ECode=15 EsubCode=0)
  PRID: 0014d010 (Loongson-64bit, Loongson-3C6000/S)
 Process kworker/0:5 (pid: 239, threadinfo=00000000927eadc6, task=000000008fd31682)
 Stack : 00040dc000003164 0000000000000001 900000012f210130 900000012eabeeb8
         900000012f000000 ffff80000319fe48 900000012f210000 900000012f210130
         900000012f000000 900000012eabeeb8 0000000000000001 ffff8000031a0064
         900000010575f9f0 900000012f210130 900000012eac0000 900000012ea80000
         900000012f000000 ffff8000031cefc4 900000010575f9f0 ffff8000035859c0
         ffff800003414000 900000010575fa78 900000012f000000 ffff8000031b4c50
         0000000000000000 9000000101c9d700 9000000109c40000 5f5e9200f8901912
         900000012d3c4bd0 900000012d3c5000 ffff8000034aed18 900000012d380010
         900000012d3c4bd0 ffff800003414000 900000012d380000 ffff800002ea49dc
         0000000000000001 900000012d3c6000 00000000ffffe423 0000000000010000
         ...
 Call Trace:
 [<ffff80000319de80>] dml21_apply_soc_bb_overrides+0x60/0x960 [amdgpu]
 [<ffff80000319fe44>] dml21_init+0xa4/0x280 [amdgpu]
 [<ffff8000031a0060>] dml21_create+0x40/0x80 [amdgpu]
 [<ffff8000031cefc0>] dc_state_create+0x100/0x160 [amdgpu]
 [<ffff8000031b4c4c>] dc_create+0x44c/0x640 [amdgpu]
 [<ffff800002ea49d8>] amdgpu_dm_init+0x3f8/0x2060 [amdgpu]
 [<ffff800002ea6658>] dm_hw_init+0x18/0x60 [amdgpu]
 [<ffff800002b16738>] amdgpu_device_init+0x1938/0x27e0 [amdgpu]
 [<ffff800002b18e80>] amdgpu_driver_load_kms+0x20/0xa0 [amdgpu]
 [<ffff800002b0c8f0>] amdgpu_pci_probe+0x1b0/0x580 [amdgpu]
 [<900000000448eae4>] local_pci_probe+0x44/0xc0
 [<9000000003b02b18>] work_for_cpu_fn+0x18/0x40
 [<9000000003b05da0>] process_one_work+0x160/0x300
 [<9000000003b06718>] worker_thread+0x318/0x440
 [<9000000003b11b8c>] kthread+0x12c/0x220
 [<9000000003ac1484>] ret_from_kernel_thread+0x8/0xa4

Unfortunately, protecting dml2_init()/dml21_init() out of DML2 causes
"sleeping function called from invalid context", so protect them with
DC_FP_START() and DC_FP_END() inside.

Fixes: 7da55c27e767 ("drm/amd/display: Remove incorrect FP context start")
Cc: [email protected]
Signed-off-by: Huacai Chen <[email protected]>
Reviewed-by: Aurabindo Pillai <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 26, 2025
 amdgpu_userq_mgr_fini+0x70c
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The issue was reproduced on NV10 using IGT pci_unplug test.
It is expected that `amdgpu_driver_postclose_kms()` is called prior to `amdgpu_drm_release()`.
However, the bug is that `amdgpu_fpriv` was freed in `amdgpu_driver_postclose_kms()`, and then
later accessed in `amdgpu_drm_release()` via a call to `amdgpu_userq_mgr_fini()`.
As a result, KASAN detected a use-after-free condition, as shown in the log below.
The proposed fix is to move the calls to `amdgpu_eviction_fence_destroy()` and
`amdgpu_userq_mgr_fini()` into `amdgpu_driver_postclose_kms()`, so they are invoked before
`amdgpu_fpriv` is freed.

This also ensures symmetry with the initialization path in `amdgpu_driver_open_kms()`,
where the following components are initialized:
- `amdgpu_userq_mgr_init()`
- `amdgpu_eviction_fence_init()`
- `amdgpu_ctx_mgr_init()`

Correspondingly, in `amdgpu_driver_postclose_kms()` we should clean up using:
- `amdgpu_userq_mgr_fini()`
- `amdgpu_eviction_fence_destroy()`
- `amdgpu_ctx_mgr_fini()`

This change eliminates the use-after-free and improves consistency in resource management between open and close paths.

[  +0.094367] ==================================================================
[  +0.000026] BUG: KASAN: slab-use-after-free in amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000866] Write of size 8 at addr ffff88811c068c60 by task amd_pci_unplug/1737
[  +0.000026] CPU: 3 UID: 0 PID: 1737 Comm: amd_pci_unplug Not tainted 6.14.0+ freebsd#2
[  +0.000008] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  dump_stack_lvl+0x76/0xa0
[  +0.000010]  print_report+0xce/0x600
[  +0.000009]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000790]  ? srso_return_thunk+0x5/0x5f
[  +0.000007]  ? kasan_complete_mode_report_info+0x76/0x200
[  +0.000008]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000684]  kasan_report+0xbe/0x110
[  +0.000007]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000601]  __asan_report_store8_noabort+0x17/0x30
[  +0.000007]  amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000801]  ? __pfx_amdgpu_userq_mgr_fini+0x10/0x10 [amdgpu]
[  +0.000819]  ? srso_return_thunk+0x5/0x5f
[  +0.000008]  amdgpu_drm_release+0xa3/0xe0 [amdgpu]
[  +0.000604]  __fput+0x354/0xa90
[  +0.000010]  __fput_sync+0x59/0x80
[  +0.000005]  __x64_sys_close+0x7d/0xe0
[  +0.000006]  x64_sys_call+0x2505/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000004]  ? kasan_record_aux_stack+0xae/0xd0
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? kmem_cache_free+0x398/0x580
[  +0.000006]  ? __fput+0x543/0xa90
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __fput+0x543/0xa90
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000007]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fpregs_assert_state_consistent+0x21/0xb0
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? syscall_exit_to_user_mode+0x4e/0x240
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? irqentry_exit+0x43/0x50
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? exc_page_fault+0x7c/0x110
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000005] RIP: 0033:0x7ffff7b14f67
[  +0.000005] Code: ff e8 0d 16 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 73 ba f7 ff
[  +0.000004] RSP: 002b:00007fffffffe358 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffff7b14f67
[  +0.000003] RDX: 0000000000000000 RSI: 00007ffff7f5755a RDI: 0000000000000003
[  +0.000003] RBP: 00007fffffffe380 R08: 0000555555568170 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffffffe5c8
[  +0.000003] R13: 00005555555552a9 R14: 0000555555557d48 R15: 00007ffff7ffd040
[  +0.000007]  </TASK>

[  +0.000286] Allocated by task 425 on cpu 11 at 29.751192s:
[  +0.000013]  kasan_save_stack+0x28/0x60
[  +0.000008]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000006]  __kasan_kmalloc+0xc1/0xd0
[  +0.000005]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000006]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000521]  drm_file_alloc+0x569/0x9a0
[  +0.000008]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000007]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x50b/0x10d0 [amdgpu]
[  +0.000482]  local_pci_probe+0xe7/0x1b0
[  +0.000008]  pci_device_probe+0x5bf/0x890
[  +0.000005]  really_probe+0x1fd/0x950
[  +0.000007]  __driver_probe_device+0x307/0x410
[  +0.000005]  driver_probe_device+0x4e/0x150
[  +0.000006]  __driver_attach+0x223/0x510
[  +0.000005]  bus_for_each_dev+0x102/0x1a0
[  +0.000006]  driver_attach+0x3d/0x60
[  +0.000005]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo]
[  +0.000008]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000007]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000286] Allocated by task 425 on cpu 11 at 29.751192s:
[  +0.000013]  kasan_save_stack+0x28/0x60
[  +0.000008]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000006]  __kasan_kmalloc+0xc1/0xd0
[  +0.000005]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000006]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000521]  drm_file_alloc+0x569/0x9a0
[  +0.000008]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000007]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x50b/0x10d0 [amdgpu]
[  +0.000482]  local_pci_probe+0xe7/0x1b0
[  +0.000008]  pci_device_probe+0x5bf/0x890
[  +0.000005]  really_probe+0x1fd/0x950
[  +0.000007]  __driver_probe_device+0x307/0x410
[  +0.000005]  driver_probe_device+0x4e/0x150
[  +0.000006]  __driver_attach+0x223/0x510
[  +0.000005]  bus_for_each_dev+0x102/0x1a0
[  +0.000006]  driver_attach+0x3d/0x60
[  +0.000005]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo]
[  +0.000008]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000007]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] The buggy address belongs to the object at ffff88811c068000
               which belongs to the cache kmalloc-rnd-01-4k of size 4096
[  +0.000016] The buggy address is located 3168 bytes inside of
               freed 4096-byte region [ffff88811c068000, ffff88811c069000)

[  +0.000022] The buggy address belongs to the physical page:
[  +0.000010] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88811c06e000 pfn:0x11c068
[  +0.000006] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  +0.000006] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  +0.000007] page_type: f5(slab)
[  +0.000007] raw: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000
[  +0.000005] raw: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000
[  +0.000005] head: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000003 ffffea0004701a01 ffffffffffffffff 0000000000000000
[  +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[  +0.000004] page dumped because: kasan: bad access detected

[  +0.000011] Memory state around the buggy address:
[  +0.000009]  ffff88811c068b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000012]  ffff88811c068b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] >ffff88811c068c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]                                                        ^
[  +0.000010]  ffff88811c068c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]  ffff88811c068d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] ==================================================================

Cc: Alex Deucher <[email protected]>
Cc: Christian König <[email protected]>
Cc: Lijo Lazar <[email protected]>
Cc: Jesse Zhang <[email protected]>
Cc: Arvind Yadav <[email protected]>

v2: drop amdgpu_drm_release() and assign drm_release()
    as the callback directly.(Alex)

Fixes: adba0929736a ("drm/amdgpu: Fix Illegal opcode in command stream Error")
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Vitaly Prosyak <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 26, 2025
[  +0.000020] BUG: KASAN: slab-use-after-free in amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000817] Read of size 8 at addr ffff88812eec8c58 by task amd_pci_unplug/1733

[  +0.000027] CPU: 10 UID: 0 PID: 1733 Comm: amd_pci_unplug Tainted: G        W          6.14.0+ freebsd#2
[  +0.000009] Tainted: [W]=WARN
[  +0.000003] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  dump_stack_lvl+0x76/0xa0
[  +0.000011]  print_report+0xce/0x600
[  +0.000009]  ? srso_return_thunk+0x5/0x5f
[  +0.000006]  ? kasan_complete_mode_report_info+0x76/0x200
[  +0.000007]  ? kasan_addr_to_slab+0xd/0xb0
[  +0.000006]  ? amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000707]  kasan_report+0xbe/0x110
[  +0.000006]  ? amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000541]  __asan_report_load8_noabort+0x14/0x30
[  +0.000005]  amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000535]  ? stop_cpsch+0x396/0x600 [amdgpu]
[  +0.000556]  ? stop_cpsch+0x429/0x600 [amdgpu]
[  +0.000536]  ? __pfx_amdgpu_userq_suspend+0x10/0x10 [amdgpu]
[  +0.000536]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? kgd2kfd_suspend+0x132/0x1d0 [amdgpu]
[  +0.000542]  amdgpu_device_fini_hw+0x581/0xe90 [amdgpu]
[  +0.000485]  ? down_write+0xbb/0x140
[  +0.000007]  ? __mutex_unlock_slowpath.constprop.0+0x317/0x360
[  +0.000005]  ? __pfx_amdgpu_device_fini_hw+0x10/0x10 [amdgpu]
[  +0.000482]  ? __kasan_check_write+0x14/0x30
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? up_write+0x55/0xb0
[  +0.000007]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? blocking_notifier_chain_unregister+0x6c/0xc0
[  +0.000008]  amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
[  +0.000484]  amdgpu_pci_remove+0x93/0x130 [amdgpu]
[  +0.000482]  pci_device_remove+0xae/0x1e0
[  +0.000008]  device_remove+0xc7/0x180
[  +0.000008]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000007]  device_release_driver+0x12/0x20
[  +0.000004]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000005]  remove_store+0xd7/0xf0
[  +0.000005]  ? __pfx_remove_store+0x10/0x10
[  +0.000006]  ? __pfx__copy_from_iter+0x10/0x10
[  +0.000006]  ? __pfx_dev_attr_store+0x10/0x10
[  +0.000006]  dev_attr_store+0x3f/0x80
[  +0.000006]  sysfs_kf_write+0x125/0x1d0
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? __kasan_check_write+0x14/0x30
[  +0.000005]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000005]  ? rw_verify_area+0x70/0x420
[  +0.000005]  ? __pfx_kernfs_fop_write_iter+0x10/0x10
[  +0.000006]  vfs_write+0x90d/0xe70
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? __pfx_vfs_write+0x10/0x10
[  +0.000004]  ? local_clock+0x15/0x30
[  +0.000008]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_slab_free+0x5f/0x80
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fdget_pos+0x1d3/0x500
[  +0.000007]  ksys_write+0x119/0x220
[  +0.000005]  ? putname+0x1c/0x30
[  +0.000006]  ? __pfx_ksys_write+0x10/0x10
[  +0.000007]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __pfx___x64_sys_openat+0x10/0x10
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fpregs_assert_state_consistent+0x21/0xb0
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? syscall_exit_to_user_mode+0x4e/0x240
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? irqentry_exit+0x43/0x50
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? exc_page_fault+0x7c/0x110
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000006] RIP: 0033:0x7480c0b14887
[  +0.000005] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  +0.000005] RSP: 002b:00007fff142b0058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007480c0b14887
[  +0.000003] RDX: 0000000000000001 RSI: 00007480c0e7365a RDI: 0000000000000004
[  +0.000003] RBP: 00007fff142b0080 R08: 0000563b2e73c170 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff142b02f8
[  +0.000003] R13: 0000563b159a72a9 R14: 0000563b159a9d48 R15: 00007480c0f19040
[  +0.000008]  </TASK>

[  +0.000445] Allocated by task 427 on cpu 5 at 29.342331s:
[  +0.000011]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000005]  __kasan_kmalloc+0xc1/0xd0
[  +0.000006]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000007]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000493]  drm_file_alloc+0x569/0x9a0
[  +0.000007]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000006]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x510/0x10c0 [amdgpu]
[  +0.000483]  local_pci_probe+0xe7/0x1b0
[  +0.000006]  pci_device_probe+0x5bf/0x890
[  +0.000006]  really_probe+0x1fd/0x950
[  +0.000005]  __driver_probe_device+0x307/0x410
[  +0.000006]  driver_probe_device+0x4e/0x150
[  +0.000005]  __driver_attach+0x223/0x510
[  +0.000006]  bus_for_each_dev+0x102/0x1a0
[  +0.000005]  driver_attach+0x3d/0x60
[  +0.000006]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  rfcomm_dlc_clear_state+0x69/0x220 [rfcomm]
[  +0.000011]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000006]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] Freed by task 1733 on cpu 5 at 59.907086s:
[  +0.000011]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000005]  kasan_save_free_info+0x3b/0x60
[  +0.000005]  __kasan_slab_free+0x54/0x80
[  +0.000006]  kfree+0x127/0x470
[  +0.000006]  amdgpu_driver_postclose_kms+0x455/0x760 [amdgpu]
[  +0.000493]  drm_file_free.part.0+0x5b1/0xba0
[  +0.000006]  drm_file_free+0x13/0x30
[  +0.000006]  drm_client_release+0x1c4/0x2b0
[  +0.000006]  drm_fbdev_ttm_fb_destroy+0xd2/0x120 [drm_ttm_helper]
[  +0.000007]  put_fb_info+0x97/0xe0
[  +0.000007]  unregister_framebuffer+0x197/0x380
[  +0.000005]  drm_fb_helper_unregister_info+0x94/0x100
[  +0.000005]  drm_fbdev_client_unregister+0x3c/0x80
[  +0.000007]  drm_client_dev_unregister+0x144/0x330
[  +0.000006]  drm_dev_unregister+0x49/0x1b0
[  +0.000006]  drm_dev_unplug+0x4c/0xd0
[  +0.000006]  amdgpu_pci_remove+0x58/0x130 [amdgpu]
[  +0.000484]  pci_device_remove+0xae/0x1e0
[  +0.000008]  device_remove+0xc7/0x180
[  +0.000007]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000006]  device_release_driver+0x12/0x20
[  +0.000007]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000006]  remove_store+0xd7/0xf0
[  +0.000006]  dev_attr_store+0x3f/0x80
[  +0.000005]  sysfs_kf_write+0x125/0x1d0
[  +0.000006]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000006]  vfs_write+0x90d/0xe70
[  +0.000006]  ksys_write+0x119/0x220
[  +0.000006]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000005]  do_syscall_64+0x7c/0x170
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000012] The buggy address belongs to the object at ffff88812eec8000
               which belongs to the cache kmalloc-rnd-07-4k of size 4096
[  +0.000016] The buggy address is located 3160 bytes inside of
               freed 4096-byte region [ffff88812eec8000, ffff88812eec9000)

[  +0.000023] The buggy address belongs to the physical page:
[  +0.000009] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12eec8
[  +0.000007] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  +0.000005] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  +0.000007] page_type: f5(slab)
[  +0.000008] raw: 0017ffffc0000040 ffff888100054500 dead000000000122 0000000000000000
[  +0.000005] raw: 0000000000000000 0000000080040004 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000040 ffff888100054500 dead000000000122 0000000000000000
[  +0.000005] head: 0000000000000000 0000000080040004 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000003 ffffea0004bbb201 ffffffffffffffff 0000000000000000
[  +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[  +0.000005] page dumped because: kasan: bad access detected

[  +0.000010] Memory state around the buggy address:
[  +0.000009]  ffff88812eec8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000012]  ffff88812eec8b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] >ffff88812eec8c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]                                                     ^
[  +0.000010]  ffff88812eec8c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]  ffff88812eec8d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] ==================================================================

The use-after-free occurs because a delayed work item (`suspend_work`) may still
be pending or running when resources it accesses are freed during device removal
or file close. The previous code used `flush_work(&fpriv->evf_mgr.suspend_work.work)`,
which does not wait for delayed work that has not yet started. As a result, the
delayed work could run after its memory was freed, causing a use-after-free.
By switching to `flush_delayed_work(&fpriv->evf_mgr.suspend_work)`, we ensure that
the kernel waits for both queued and delayed work to finish before
freeing memory, closing this race.

Fixes: adba0929736a ("drm/amdgpu: Fix Illegal opcode in command stream Error")
Signed-off-by: Vitaly Prosyak <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 27, 2025
 amdgpu_userq_mgr_fini+0x70c
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The issue was reproduced on NV10 using IGT pci_unplug test.
It is expected that `amdgpu_driver_postclose_kms()` is called prior to `amdgpu_drm_release()`.
However, the bug is that `amdgpu_fpriv` was freed in `amdgpu_driver_postclose_kms()`, and then
later accessed in `amdgpu_drm_release()` via a call to `amdgpu_userq_mgr_fini()`.
As a result, KASAN detected a use-after-free condition, as shown in the log below.
The proposed fix is to move the calls to `amdgpu_eviction_fence_destroy()` and
`amdgpu_userq_mgr_fini()` into `amdgpu_driver_postclose_kms()`, so they are invoked before
`amdgpu_fpriv` is freed.

This also ensures symmetry with the initialization path in `amdgpu_driver_open_kms()`,
where the following components are initialized:
- `amdgpu_userq_mgr_init()`
- `amdgpu_eviction_fence_init()`
- `amdgpu_ctx_mgr_init()`

Correspondingly, in `amdgpu_driver_postclose_kms()` we should clean up using:
- `amdgpu_userq_mgr_fini()`
- `amdgpu_eviction_fence_destroy()`
- `amdgpu_ctx_mgr_fini()`

This change eliminates the use-after-free and improves consistency in resource management between open and close paths.

[  +0.094367] ==================================================================
[  +0.000026] BUG: KASAN: slab-use-after-free in amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000866] Write of size 8 at addr ffff88811c068c60 by task amd_pci_unplug/1737
[  +0.000026] CPU: 3 UID: 0 PID: 1737 Comm: amd_pci_unplug Not tainted 6.14.0+ freebsd#2
[  +0.000008] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  dump_stack_lvl+0x76/0xa0
[  +0.000010]  print_report+0xce/0x600
[  +0.000009]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000790]  ? srso_return_thunk+0x5/0x5f
[  +0.000007]  ? kasan_complete_mode_report_info+0x76/0x200
[  +0.000008]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000684]  kasan_report+0xbe/0x110
[  +0.000007]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000601]  __asan_report_store8_noabort+0x17/0x30
[  +0.000007]  amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000801]  ? __pfx_amdgpu_userq_mgr_fini+0x10/0x10 [amdgpu]
[  +0.000819]  ? srso_return_thunk+0x5/0x5f
[  +0.000008]  amdgpu_drm_release+0xa3/0xe0 [amdgpu]
[  +0.000604]  __fput+0x354/0xa90
[  +0.000010]  __fput_sync+0x59/0x80
[  +0.000005]  __x64_sys_close+0x7d/0xe0
[  +0.000006]  x64_sys_call+0x2505/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000004]  ? kasan_record_aux_stack+0xae/0xd0
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? kmem_cache_free+0x398/0x580
[  +0.000006]  ? __fput+0x543/0xa90
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __fput+0x543/0xa90
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000007]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fpregs_assert_state_consistent+0x21/0xb0
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? syscall_exit_to_user_mode+0x4e/0x240
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? irqentry_exit+0x43/0x50
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? exc_page_fault+0x7c/0x110
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000005] RIP: 0033:0x7ffff7b14f67
[  +0.000005] Code: ff e8 0d 16 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 73 ba f7 ff
[  +0.000004] RSP: 002b:00007fffffffe358 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffff7b14f67
[  +0.000003] RDX: 0000000000000000 RSI: 00007ffff7f5755a RDI: 0000000000000003
[  +0.000003] RBP: 00007fffffffe380 R08: 0000555555568170 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffffffe5c8
[  +0.000003] R13: 00005555555552a9 R14: 0000555555557d48 R15: 00007ffff7ffd040
[  +0.000007]  </TASK>

[  +0.000286] Allocated by task 425 on cpu 11 at 29.751192s:
[  +0.000013]  kasan_save_stack+0x28/0x60
[  +0.000008]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000006]  __kasan_kmalloc+0xc1/0xd0
[  +0.000005]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000006]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000521]  drm_file_alloc+0x569/0x9a0
[  +0.000008]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000007]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x50b/0x10d0 [amdgpu]
[  +0.000482]  local_pci_probe+0xe7/0x1b0
[  +0.000008]  pci_device_probe+0x5bf/0x890
[  +0.000005]  really_probe+0x1fd/0x950
[  +0.000007]  __driver_probe_device+0x307/0x410
[  +0.000005]  driver_probe_device+0x4e/0x150
[  +0.000006]  __driver_attach+0x223/0x510
[  +0.000005]  bus_for_each_dev+0x102/0x1a0
[  +0.000006]  driver_attach+0x3d/0x60
[  +0.000005]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo]
[  +0.000008]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000007]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000286] Allocated by task 425 on cpu 11 at 29.751192s:
[  +0.000013]  kasan_save_stack+0x28/0x60
[  +0.000008]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000006]  __kasan_kmalloc+0xc1/0xd0
[  +0.000005]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000006]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000521]  drm_file_alloc+0x569/0x9a0
[  +0.000008]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000007]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x50b/0x10d0 [amdgpu]
[  +0.000482]  local_pci_probe+0xe7/0x1b0
[  +0.000008]  pci_device_probe+0x5bf/0x890
[  +0.000005]  really_probe+0x1fd/0x950
[  +0.000007]  __driver_probe_device+0x307/0x410
[  +0.000005]  driver_probe_device+0x4e/0x150
[  +0.000006]  __driver_attach+0x223/0x510
[  +0.000005]  bus_for_each_dev+0x102/0x1a0
[  +0.000006]  driver_attach+0x3d/0x60
[  +0.000005]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo]
[  +0.000008]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000007]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] The buggy address belongs to the object at ffff88811c068000
               which belongs to the cache kmalloc-rnd-01-4k of size 4096
[  +0.000016] The buggy address is located 3168 bytes inside of
               freed 4096-byte region [ffff88811c068000, ffff88811c069000)

[  +0.000022] The buggy address belongs to the physical page:
[  +0.000010] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88811c06e000 pfn:0x11c068
[  +0.000006] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  +0.000006] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  +0.000007] page_type: f5(slab)
[  +0.000007] raw: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000
[  +0.000005] raw: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000
[  +0.000005] head: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000003 ffffea0004701a01 ffffffffffffffff 0000000000000000
[  +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[  +0.000004] page dumped because: kasan: bad access detected

[  +0.000011] Memory state around the buggy address:
[  +0.000009]  ffff88811c068b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000012]  ffff88811c068b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] >ffff88811c068c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]                                                        ^
[  +0.000010]  ffff88811c068c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]  ffff88811c068d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] ==================================================================

Cc: Alex Deucher <[email protected]>
Cc: Christian König <[email protected]>
Cc: Lijo Lazar <[email protected]>
Cc: Jesse Zhang <[email protected]>
Cc: Arvind Yadav <[email protected]>

v2: drop amdgpu_drm_release() and assign drm_release()
    as the callback directly.(Alex)

Fixes: adba0929736a ("drm/amdgpu: Fix Illegal opcode in command stream Error")
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Vitaly Prosyak <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Sep 27, 2025
[  +0.000020] BUG: KASAN: slab-use-after-free in amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000817] Read of size 8 at addr ffff88812eec8c58 by task amd_pci_unplug/1733

[  +0.000027] CPU: 10 UID: 0 PID: 1733 Comm: amd_pci_unplug Tainted: G        W          6.14.0+ freebsd#2
[  +0.000009] Tainted: [W]=WARN
[  +0.000003] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  dump_stack_lvl+0x76/0xa0
[  +0.000011]  print_report+0xce/0x600
[  +0.000009]  ? srso_return_thunk+0x5/0x5f
[  +0.000006]  ? kasan_complete_mode_report_info+0x76/0x200
[  +0.000007]  ? kasan_addr_to_slab+0xd/0xb0
[  +0.000006]  ? amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000707]  kasan_report+0xbe/0x110
[  +0.000006]  ? amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000541]  __asan_report_load8_noabort+0x14/0x30
[  +0.000005]  amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000535]  ? stop_cpsch+0x396/0x600 [amdgpu]
[  +0.000556]  ? stop_cpsch+0x429/0x600 [amdgpu]
[  +0.000536]  ? __pfx_amdgpu_userq_suspend+0x10/0x10 [amdgpu]
[  +0.000536]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? kgd2kfd_suspend+0x132/0x1d0 [amdgpu]
[  +0.000542]  amdgpu_device_fini_hw+0x581/0xe90 [amdgpu]
[  +0.000485]  ? down_write+0xbb/0x140
[  +0.000007]  ? __mutex_unlock_slowpath.constprop.0+0x317/0x360
[  +0.000005]  ? __pfx_amdgpu_device_fini_hw+0x10/0x10 [amdgpu]
[  +0.000482]  ? __kasan_check_write+0x14/0x30
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? up_write+0x55/0xb0
[  +0.000007]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? blocking_notifier_chain_unregister+0x6c/0xc0
[  +0.000008]  amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
[  +0.000484]  amdgpu_pci_remove+0x93/0x130 [amdgpu]
[  +0.000482]  pci_device_remove+0xae/0x1e0
[  +0.000008]  device_remove+0xc7/0x180
[  +0.000008]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000007]  device_release_driver+0x12/0x20
[  +0.000004]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000005]  remove_store+0xd7/0xf0
[  +0.000005]  ? __pfx_remove_store+0x10/0x10
[  +0.000006]  ? __pfx__copy_from_iter+0x10/0x10
[  +0.000006]  ? __pfx_dev_attr_store+0x10/0x10
[  +0.000006]  dev_attr_store+0x3f/0x80
[  +0.000006]  sysfs_kf_write+0x125/0x1d0
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? __kasan_check_write+0x14/0x30
[  +0.000005]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000005]  ? rw_verify_area+0x70/0x420
[  +0.000005]  ? __pfx_kernfs_fop_write_iter+0x10/0x10
[  +0.000006]  vfs_write+0x90d/0xe70
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? __pfx_vfs_write+0x10/0x10
[  +0.000004]  ? local_clock+0x15/0x30
[  +0.000008]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_slab_free+0x5f/0x80
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fdget_pos+0x1d3/0x500
[  +0.000007]  ksys_write+0x119/0x220
[  +0.000005]  ? putname+0x1c/0x30
[  +0.000006]  ? __pfx_ksys_write+0x10/0x10
[  +0.000007]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __pfx___x64_sys_openat+0x10/0x10
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fpregs_assert_state_consistent+0x21/0xb0
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? syscall_exit_to_user_mode+0x4e/0x240
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? irqentry_exit+0x43/0x50
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? exc_page_fault+0x7c/0x110
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000006] RIP: 0033:0x7480c0b14887
[  +0.000005] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  +0.000005] RSP: 002b:00007fff142b0058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007480c0b14887
[  +0.000003] RDX: 0000000000000001 RSI: 00007480c0e7365a RDI: 0000000000000004
[  +0.000003] RBP: 00007fff142b0080 R08: 0000563b2e73c170 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff142b02f8
[  +0.000003] R13: 0000563b159a72a9 R14: 0000563b159a9d48 R15: 00007480c0f19040
[  +0.000008]  </TASK>

[  +0.000445] Allocated by task 427 on cpu 5 at 29.342331s:
[  +0.000011]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000005]  __kasan_kmalloc+0xc1/0xd0
[  +0.000006]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000007]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000493]  drm_file_alloc+0x569/0x9a0
[  +0.000007]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000006]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x510/0x10c0 [amdgpu]
[  +0.000483]  local_pci_probe+0xe7/0x1b0
[  +0.000006]  pci_device_probe+0x5bf/0x890
[  +0.000006]  really_probe+0x1fd/0x950
[  +0.000005]  __driver_probe_device+0x307/0x410
[  +0.000006]  driver_probe_device+0x4e/0x150
[  +0.000005]  __driver_attach+0x223/0x510
[  +0.000006]  bus_for_each_dev+0x102/0x1a0
[  +0.000005]  driver_attach+0x3d/0x60
[  +0.000006]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  rfcomm_dlc_clear_state+0x69/0x220 [rfcomm]
[  +0.000011]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000006]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] Freed by task 1733 on cpu 5 at 59.907086s:
[  +0.000011]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000005]  kasan_save_free_info+0x3b/0x60
[  +0.000005]  __kasan_slab_free+0x54/0x80
[  +0.000006]  kfree+0x127/0x470
[  +0.000006]  amdgpu_driver_postclose_kms+0x455/0x760 [amdgpu]
[  +0.000493]  drm_file_free.part.0+0x5b1/0xba0
[  +0.000006]  drm_file_free+0x13/0x30
[  +0.000006]  drm_client_release+0x1c4/0x2b0
[  +0.000006]  drm_fbdev_ttm_fb_destroy+0xd2/0x120 [drm_ttm_helper]
[  +0.000007]  put_fb_info+0x97/0xe0
[  +0.000007]  unregister_framebuffer+0x197/0x380
[  +0.000005]  drm_fb_helper_unregister_info+0x94/0x100
[  +0.000005]  drm_fbdev_client_unregister+0x3c/0x80
[  +0.000007]  drm_client_dev_unregister+0x144/0x330
[  +0.000006]  drm_dev_unregister+0x49/0x1b0
[  +0.000006]  drm_dev_unplug+0x4c/0xd0
[  +0.000006]  amdgpu_pci_remove+0x58/0x130 [amdgpu]
[  +0.000484]  pci_device_remove+0xae/0x1e0
[  +0.000008]  device_remove+0xc7/0x180
[  +0.000007]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000006]  device_release_driver+0x12/0x20
[  +0.000007]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000006]  remove_store+0xd7/0xf0
[  +0.000006]  dev_attr_store+0x3f/0x80
[  +0.000005]  sysfs_kf_write+0x125/0x1d0
[  +0.000006]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000006]  vfs_write+0x90d/0xe70
[  +0.000006]  ksys_write+0x119/0x220
[  +0.000006]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000005]  do_syscall_64+0x7c/0x170
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000012] The buggy address belongs to the object at ffff88812eec8000
               which belongs to the cache kmalloc-rnd-07-4k of size 4096
[  +0.000016] The buggy address is located 3160 bytes inside of
               freed 4096-byte region [ffff88812eec8000, ffff88812eec9000)

[  +0.000023] The buggy address belongs to the physical page:
[  +0.000009] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12eec8
[  +0.000007] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  +0.000005] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  +0.000007] page_type: f5(slab)
[  +0.000008] raw: 0017ffffc0000040 ffff888100054500 dead000000000122 0000000000000000
[  +0.000005] raw: 0000000000000000 0000000080040004 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000040 ffff888100054500 dead000000000122 0000000000000000
[  +0.000005] head: 0000000000000000 0000000080040004 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000003 ffffea0004bbb201 ffffffffffffffff 0000000000000000
[  +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[  +0.000005] page dumped because: kasan: bad access detected

[  +0.000010] Memory state around the buggy address:
[  +0.000009]  ffff88812eec8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000012]  ffff88812eec8b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] >ffff88812eec8c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]                                                     ^
[  +0.000010]  ffff88812eec8c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]  ffff88812eec8d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] ==================================================================

The use-after-free occurs because a delayed work item (`suspend_work`) may still
be pending or running when resources it accesses are freed during device removal
or file close. The previous code used `flush_work(&fpriv->evf_mgr.suspend_work.work)`,
which does not wait for delayed work that has not yet started. As a result, the
delayed work could run after its memory was freed, causing a use-after-free.
By switching to `flush_delayed_work(&fpriv->evf_mgr.suspend_work)`, we ensure that
the kernel waits for both queued and delayed work to finish before
freeing memory, closing this race.

Fixes: adba0929736a ("drm/amdgpu: Fix Illegal opcode in command stream Error")
Signed-off-by: Vitaly Prosyak <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Nov 21, 2025
On completion of i915_vma_pin_ww(), a synchronous variant of
dma_fence_work_commit() is called.  When pinning a VMA to GGTT address
space on a Cherry View family processor, or on a Broxton generation SoC
with VTD enabled, i.e., when stop_machine() is then called from
intel_ggtt_bind_vma(), that can potentially lead to lock inversion among
reservation_ww and cpu_hotplug locks.

[86.861179] ======================================================
[86.861193] WARNING: possible circular locking dependency detected
[86.861209] 6.15.0-rc5-CI_DRM_16515-gca0305cadc2d+ freebsd#1 Tainted: G     U
[86.861226] ------------------------------------------------------
[86.861238] i915_module_loa/1432 is trying to acquire lock:
[86.861252] ffffffff83489090 (cpu_hotplug_lock){++++}-{0:0}, at: stop_machine+0x1c/0x50
[86.861290]
but task is already holding lock:
[86.861303] ffffc90002e0b4c8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915]
[86.862233]
which lock already depends on the new lock.
[86.862251]
the existing dependency chain (in reverse order) is:
[86.862265]
-> freebsd#5 (reservation_ww_class_mutex){+.+.}-{3:3}:
[86.862292]        dma_resv_lockdep+0x19a/0x390
[86.862315]        do_one_initcall+0x60/0x3f0
[86.862334]        kernel_init_freeable+0x3cd/0x680
[86.862353]        kernel_init+0x1b/0x200
[86.862369]        ret_from_fork+0x47/0x70
[86.862383]        ret_from_fork_asm+0x1a/0x30
[86.862399]
-> freebsd#4 (reservation_ww_class_acquire){+.+.}-{0:0}:
[86.862425]        dma_resv_lockdep+0x178/0x390
[86.862440]        do_one_initcall+0x60/0x3f0
[86.862454]        kernel_init_freeable+0x3cd/0x680
[86.862470]        kernel_init+0x1b/0x200
[86.862482]        ret_from_fork+0x47/0x70
[86.862495]        ret_from_fork_asm+0x1a/0x30
[86.862509]
-> freebsd#3 (&mm->mmap_lock){++++}-{3:3}:
[86.862531]        down_read_killable+0x46/0x1e0
[86.862546]        lock_mm_and_find_vma+0xa2/0x280
[86.862561]        do_user_addr_fault+0x266/0x8e0
[86.862578]        exc_page_fault+0x8a/0x2f0
[86.862593]        asm_exc_page_fault+0x27/0x30
[86.862607]        filldir64+0xeb/0x180
[86.862620]        kernfs_fop_readdir+0x118/0x480
[86.862635]        iterate_dir+0xcf/0x2b0
[86.862648]        __x64_sys_getdents64+0x84/0x140
[86.862661]        x64_sys_call+0x1058/0x2660
[86.862675]        do_syscall_64+0x91/0xe90
[86.862689]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[86.862703]
-> freebsd#2 (&root->kernfs_rwsem){++++}-{3:3}:
[86.862725]        down_write+0x3e/0xf0
[86.862738]        kernfs_add_one+0x30/0x3c0
[86.862751]        kernfs_create_dir_ns+0x53/0xb0
[86.862765]        internal_create_group+0x134/0x4c0
[86.862779]        sysfs_create_group+0x13/0x20
[86.862792]        topology_add_dev+0x1d/0x30
[86.862806]        cpuhp_invoke_callback+0x4b5/0x850
[86.862822]        cpuhp_issue_call+0xbf/0x1f0
[86.862836]        __cpuhp_setup_state_cpuslocked+0x111/0x320
[86.862852]        __cpuhp_setup_state+0xb0/0x220
[86.862866]        topology_sysfs_init+0x30/0x50
[86.862879]        do_one_initcall+0x60/0x3f0
[86.862893]        kernel_init_freeable+0x3cd/0x680
[86.862908]        kernel_init+0x1b/0x200
[86.862921]        ret_from_fork+0x47/0x70
[86.862934]        ret_from_fork_asm+0x1a/0x30
[86.862947]
-> freebsd#1 (cpuhp_state_mutex){+.+.}-{3:3}:
[86.862969]        __mutex_lock+0xaa/0xed0
[86.862982]        mutex_lock_nested+0x1b/0x30
[86.862995]        __cpuhp_setup_state_cpuslocked+0x67/0x320
[86.863012]        __cpuhp_setup_state+0xb0/0x220
[86.863026]        page_alloc_init_cpuhp+0x2d/0x60
[86.863041]        mm_core_init+0x22/0x2d0
[86.863054]        start_kernel+0x576/0xbd0
[86.863068]        x86_64_start_reservations+0x18/0x30
[86.863084]        x86_64_start_kernel+0xbf/0x110
[86.863098]        common_startup_64+0x13e/0x141
[86.863114]
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
[86.863135]        __lock_acquire+0x1635/0x2810
[86.863152]        lock_acquire+0xc4/0x2f0
[86.863166]        cpus_read_lock+0x41/0x100
[86.863180]        stop_machine+0x1c/0x50
[86.863194]        bxt_vtd_ggtt_insert_entries__BKL+0x3b/0x60 [i915]
[86.863987]        intel_ggtt_bind_vma+0x43/0x70 [i915]
[86.864735]        __vma_bind+0x55/0x70 [i915]
[86.865510]        fence_work+0x26/0xa0 [i915]
[86.866248]        fence_notify+0xa1/0x140 [i915]
[86.866983]        __i915_sw_fence_complete+0x8f/0x270 [i915]
[86.867719]        i915_sw_fence_commit+0x39/0x60 [i915]
[86.868453]        i915_vma_pin_ww+0x462/0x1360 [i915]
[86.869228]        i915_vma_pin.constprop.0+0x133/0x1d0 [i915]
[86.870001]        initial_plane_vma+0x307/0x840 [i915]
[86.870774]        intel_initial_plane_config+0x33f/0x670 [i915]
[86.871546]        intel_display_driver_probe_nogem+0x1c6/0x260 [i915]
[86.872330]        i915_driver_probe+0x7fa/0xe80 [i915]
[86.873057]        i915_pci_probe+0xe6/0x220 [i915]
[86.873782]        local_pci_probe+0x47/0xb0
[86.873802]        pci_device_probe+0xf3/0x260
[86.873817]        really_probe+0xf1/0x3c0
[86.873833]        __driver_probe_device+0x8c/0x180
[86.873848]        driver_probe_device+0x24/0xd0
[86.873862]        __driver_attach+0x10f/0x220
[86.873876]        bus_for_each_dev+0x7f/0xe0
[86.873892]        driver_attach+0x1e/0x30
[86.873904]        bus_add_driver+0x151/0x290
[86.873917]        driver_register+0x5e/0x130
[86.873931]        __pci_register_driver+0x7d/0x90
[86.873945]        i915_pci_register_driver+0x23/0x30 [i915]
[86.874678]        i915_init+0x37/0x120 [i915]
[86.875347]        do_one_initcall+0x60/0x3f0
[86.875369]        do_init_module+0x97/0x2a0
[86.875385]        load_module+0x2c54/0x2d80
[86.875398]        init_module_from_file+0x96/0xe0
[86.875413]        idempotent_init_module+0x117/0x330
[86.875426]        __x64_sys_finit_module+0x77/0x100
[86.875440]        x64_sys_call+0x24de/0x2660
[86.875454]        do_syscall_64+0x91/0xe90
[86.875470]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[86.875486]
other info that might help us debug this:
[86.875502] Chain exists of:
  cpu_hotplug_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex
[86.875539]  Possible unsafe locking scenario:
[86.875552]        CPU0                    CPU1
[86.875563]        ----                    ----
[86.875573]   lock(reservation_ww_class_mutex);
[86.875588]                                lock(reservation_ww_class_acquire);
[86.875606]                                lock(reservation_ww_class_mutex);
[86.875624]   rlock(cpu_hotplug_lock);
[86.875637]
 *** DEADLOCK ***
[86.875650] 3 locks held by i915_module_loa/1432:
[86.875663]  #0: ffff888101f5c1b0 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x104/0x220
[86.875699]  freebsd#1: ffffc90002e0b4a0 (reservation_ww_class_acquire){+.+.}-{0:0}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915]
[86.876512]  freebsd#2: ffffc90002e0b4c8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915]
[86.877305]
stack backtrace:
[86.877326] CPU: 0 UID: 0 PID: 1432 Comm: i915_module_loa Tainted: G     U              6.15.0-rc5-CI_DRM_16515-gca0305cadc2d+ freebsd#1 PREEMPT(voluntary)
[86.877334] Tainted: [U]=USER
[86.877336] Hardware name:  /NUC5CPYB, BIOS PYBSWCEL.86A.0079.2020.0420.1316 04/20/2020
[86.877339] Call Trace:
[86.877344]  <TASK>
[86.877353]  dump_stack_lvl+0x91/0xf0
[86.877364]  dump_stack+0x10/0x20
[86.877369]  print_circular_bug+0x285/0x360
[86.877379]  check_noncircular+0x135/0x150
[86.877390]  __lock_acquire+0x1635/0x2810
[86.877403]  lock_acquire+0xc4/0x2f0
[86.877408]  ? stop_machine+0x1c/0x50
[86.877422]  ? __pfx_bxt_vtd_ggtt_insert_entries__cb+0x10/0x10 [i915]
[86.878173]  cpus_read_lock+0x41/0x100
[86.878182]  ? stop_machine+0x1c/0x50
[86.878191]  ? __pfx_bxt_vtd_ggtt_insert_entries__cb+0x10/0x10 [i915]
[86.878916]  stop_machine+0x1c/0x50
[86.878927]  bxt_vtd_ggtt_insert_entries__BKL+0x3b/0x60 [i915]
[86.879652]  intel_ggtt_bind_vma+0x43/0x70 [i915]
[86.880375]  __vma_bind+0x55/0x70 [i915]
[86.881133]  fence_work+0x26/0xa0 [i915]
[86.881851]  fence_notify+0xa1/0x140 [i915]
[86.882566]  __i915_sw_fence_complete+0x8f/0x270 [i915]
[86.883286]  i915_sw_fence_commit+0x39/0x60 [i915]
[86.884003]  i915_vma_pin_ww+0x462/0x1360 [i915]
[86.884756]  ? i915_vma_pin.constprop.0+0x6c/0x1d0 [i915]
[86.885513]  i915_vma_pin.constprop.0+0x133/0x1d0 [i915]
[86.886281]  initial_plane_vma+0x307/0x840 [i915]
[86.887049]  intel_initial_plane_config+0x33f/0x670 [i915]
[86.887819]  intel_display_driver_probe_nogem+0x1c6/0x260 [i915]
[86.888587]  i915_driver_probe+0x7fa/0xe80 [i915]
[86.889293]  ? mutex_unlock+0x12/0x20
[86.889301]  ? drm_privacy_screen_get+0x171/0x190
[86.889308]  ? acpi_dev_found+0x66/0x80
[86.889321]  i915_pci_probe+0xe6/0x220 [i915]
[86.890038]  local_pci_probe+0x47/0xb0
[86.890049]  pci_device_probe+0xf3/0x260
[86.890058]  really_probe+0xf1/0x3c0
[86.890067]  __driver_probe_device+0x8c/0x180
[86.890072]  driver_probe_device+0x24/0xd0
[86.890078]  __driver_attach+0x10f/0x220
[86.890083]  ? __pfx___driver_attach+0x10/0x10
[86.890088]  bus_for_each_dev+0x7f/0xe0
[86.890097]  driver_attach+0x1e/0x30
[86.890101]  bus_add_driver+0x151/0x290
[86.890107]  driver_register+0x5e/0x130
[86.890113]  __pci_register_driver+0x7d/0x90
[86.890119]  i915_pci_register_driver+0x23/0x30 [i915]
[86.890833]  i915_init+0x37/0x120 [i915]
[86.891482]  ? __pfx_i915_init+0x10/0x10 [i915]
[86.892135]  do_one_initcall+0x60/0x3f0
[86.892145]  ? __kmalloc_cache_noprof+0x33f/0x470
[86.892157]  do_init_module+0x97/0x2a0
[86.892164]  load_module+0x2c54/0x2d80
[86.892168]  ? __kernel_read+0x15c/0x300
[86.892185]  ? kernel_read_file+0x2b1/0x320
[86.892195]  init_module_from_file+0x96/0xe0
[86.892199]  ? init_module_from_file+0x96/0xe0
[86.892211]  idempotent_init_module+0x117/0x330
[86.892224]  __x64_sys_finit_module+0x77/0x100
[86.892230]  x64_sys_call+0x24de/0x2660
[86.892236]  do_syscall_64+0x91/0xe90
[86.892243]  ? irqentry_exit+0x77/0xb0
[86.892249]  ? sysvec_apic_timer_interrupt+0x57/0xc0
[86.892256]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[86.892261] RIP: 0033:0x7303e1b2725d
[86.892271] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b bb 0d 00 f7 d8 64 89 01 48
[86.892276] RSP: 002b:00007ffddd1fdb38 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[86.892281] RAX: ffffffffffffffda RBX: 00005d771d88fd90 RCX: 00007303e1b2725d
[86.892285] RDX: 0000000000000000 RSI: 00005d771d893aa0 RDI: 000000000000000c
[86.892287] RBP: 00007ffddd1fdbf0 R08: 0000000000000040 R09: 00007ffddd1fdb80
[86.892289] R10: 00007303e1c03b20 R11: 0000000000000246 R12: 00005d771d893aa0
[86.892292] R13: 0000000000000000 R14: 00005d771d88f0d0 R15: 00005d771d895710
[86.892304]  </TASK>

Call asynchronous variant of dma_fence_work_commit() in that case.

v3: Provide more verbose in-line comment (Andi),
  - mention target environments in commit message.

Fixes: 7d1c2618eac59 ("drm/i915: Take reservation lock around i915_vma_pin.")
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14985
Cc: Andi Shyti <[email protected]>
Signed-off-by: Janusz Krzysztofik <[email protected]>
Reviewed-by: Sebastian Brzezinka <[email protected]>
Reviewed-by: Krzysztof Karas <[email protected]>
Acked-by: Andi Shyti <[email protected]>
Signed-off-by: Andi Shyti <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit 648ef1324add1c2e2b6041cdf0b28d31fbca5f13)
Signed-off-by: Rodrigo Vivi <[email protected]>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Nov 21, 2025
When a connector is connected but inactive (e.g., disabled by desktop
environments), pipe_ctx->stream_res.tg will be destroyed. Then, reading
odm_combine_segments causes kernel NULL pointer dereference.

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0000 [freebsd#1] SMP NOPTI
 CPU: 16 UID: 0 PID: 26474 Comm: cat Not tainted 6.17.0+ freebsd#2 PREEMPT(lazy)  e6a17af9ee6db7c63e9d90dbe5b28ccab67520c6
 Hardware name: LENOVO 21Q4/LNVNB161216, BIOS PXCN25WW 03/27/2025
 RIP: 0010:odm_combine_segments_show+0x93/0xf0 [amdgpu]
 Code: 41 83 b8 b0 00 00 00 01 75 6e 48 98 ba a1 ff ff ff 48 c1 e0 0c 48 8d 8c 07 d8 02 00 00 48 85 c9 74 2d 48 8b bc 07 f0 08 00 00 <48> 8b 07 48 8b 80 08 02 00>
 RSP: 0018:ffffd1bf4b953c58 EFLAGS: 00010286
 RAX: 0000000000005000 RBX: ffff8e35976b02d0 RCX: ffff8e3aeed052d8
 RDX: 00000000ffffffa1 RSI: ffff8e35a3120800 RDI: 0000000000000000
 RBP: 0000000000000000 R08: ffff8e3580eb0000 R09: ffff8e35976b02d0
 R10: ffffd1bf4b953c78 R11: 0000000000000000 R12: ffffd1bf4b953d08
 R13: 0000000000040000 R14: 0000000000000001 R15: 0000000000000001
 FS:  00007f44d3f9f740(0000) GS:ffff8e3caa47f000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000006485c2000 CR4: 0000000000f50ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  seq_read_iter+0x125/0x490
  ? __alloc_frozen_pages_noprof+0x18f/0x350
  seq_read+0x12c/0x170
  full_proxy_read+0x51/0x80
  vfs_read+0xbc/0x390
  ? __handle_mm_fault+0xa46/0xef0
  ? do_syscall_64+0x71/0x900
  ksys_read+0x73/0xf0
  do_syscall_64+0x71/0x900
  ? count_memcg_events+0xc2/0x190
  ? handle_mm_fault+0x1d7/0x2d0
  ? do_user_addr_fault+0x21a/0x690
  ? exc_page_fault+0x7e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x6c/0x74
 RIP: 0033:0x7f44d4031687
 Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00>
 RSP: 002b:00007ffdb4b5f0b0 EFLAGS: 00000202 ORIG_RAX: 0000000000000000
 RAX: ffffffffffffffda RBX: 00007f44d3f9f740 RCX: 00007f44d4031687
 RDX: 0000000000040000 RSI: 00007f44d3f5e000 RDI: 0000000000000003
 RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000202 R12: 00007f44d3f5e000
 R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000040000
  </TASK>
 Modules linked in: tls tcp_diag inet_diag xt_mark ccm snd_hrtimer snd_seq_dummy snd_seq_midi snd_seq_oss snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device x>
  snd_hda_codec_atihdmi snd_hda_codec_realtek_lib lenovo_wmi_helpers think_lmi snd_hda_codec_generic snd_hda_codec_hdmi snd_soc_core kvm snd_compress uvcvideo sn>
  platform_profile joydev amd_pmc mousedev mac_hid sch_fq_codel uinput i2c_dev parport_pc ppdev lp parport nvme_fabrics loop nfnetlink ip_tables x_tables dm_cryp>
 CR2: 0000000000000000
 ---[ end trace 0000000000000000 ]---
 RIP: 0010:odm_combine_segments_show+0x93/0xf0 [amdgpu]
 Code: 41 83 b8 b0 00 00 00 01 75 6e 48 98 ba a1 ff ff ff 48 c1 e0 0c 48 8d 8c 07 d8 02 00 00 48 85 c9 74 2d 48 8b bc 07 f0 08 00 00 <48> 8b 07 48 8b 80 08 02 00>
 RSP: 0018:ffffd1bf4b953c58 EFLAGS: 00010286
 RAX: 0000000000005000 RBX: ffff8e35976b02d0 RCX: ffff8e3aeed052d8
 RDX: 00000000ffffffa1 RSI: ffff8e35a3120800 RDI: 0000000000000000
 RBP: 0000000000000000 R08: ffff8e3580eb0000 R09: ffff8e35976b02d0
 R10: ffffd1bf4b953c78 R11: 0000000000000000 R12: ffffd1bf4b953d08
 R13: 0000000000040000 R14: 0000000000000001 R15: 0000000000000001
 FS:  00007f44d3f9f740(0000) GS:ffff8e3caa47f000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000006485c2000 CR4: 0000000000f50ef0
 PKRU: 55555554

Fix this by checking pipe_ctx->stream_res.tg before dereferencing.

Fixes: 07926ba8a44f ("drm/amd/display: Add debugfs interface for ODM combine info")
Signed-off-by: Rong Zhang <[email protected]>
Reviewed-by: Mario Limoncello <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit f19bbecd34e3c15eed7e5e593db2ac0fc7a0e6d8)
Cc: [email protected]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants