fix: high CPU usage (200-600%) on EVDI/DisplayLink outputs #2109

Open

pcortellezzi wants to merge 1 commit into pop-os:master from pcortellezzi:feat/evdi-v2
Conversation

@pcortellezzi
Problem

EVDI/DisplayLink outputs use llvmpipe (software OpenGL) for rendering, causing 200-600% CPU usage with simple mouse movements on high-resolution displays (e.g. 3440×1440). This makes EVDI-connected monitors essentially unusable for desktop use.

The root cause is twofold:

  1. The EVDI virtual GPU has no hardware renderer — smithay initializes it with llvmpipe, so all rendering is CPU-bound.
  2. When cosmic-comp detects that the render node differs from the target node, it uses MultiRenderer (cross-device path), which copies pixels back via glReadPixels — another expensive CPU operation on top of software rendering.

Solution

Replace the EVDI swapchain allocator with the primary (hardware) GPU's GBM device using DrmCompositor::set_format() with Modifier::Linear, then render using single_renderer on the primary GPU. This is similar to how niri handles display-only devices.

After initialize_output(), for software targets:

  1. set_format() swaps the swapchain allocator from EVDI's GBM (llvmpipe) to the primary GPU's GBM with Linear modifier
  2. The surface thread uses (primary_node, primary_node) as (render_node, effective_target), keeping everything in the single_renderer path
  3. GbmFramebufferExporter detects the buffer as "foreign" and imports it via dmabuf into EVDI's GBM for DRM framebuffer creation — no CPU-side pixel copy
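The import decision in step 3 can be sketched as follows. This is an illustrative stand-in, not cosmic-comp's actual `GbmFramebufferExporter` API: the type and function names here are hypothetical, and only the decision logic reflects the behavior described above.

```rust
// Hypothetical sketch: a buffer allocated on a device other than the
// display device counts as "foreign" and is imported via dmabuf into
// the display device's GBM, avoiding a CPU-side glReadPixels copy.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct DrmNode(u32);

#[derive(Debug, PartialEq, Eq)]
enum ExportPath {
    Direct,       // buffer was allocated on the display device itself
    DmabufImport, // foreign buffer: zero-copy dmabuf import for fb creation
}

fn export_path(allocating_device: DrmNode, display_device: DrmNode) -> ExportPath {
    if allocating_device == display_device {
        ExportPath::Direct
    } else {
        ExportPath::DmabufImport
    }
}
```

With the swapchain moved to the primary GPU, every buffer handed to the EVDI output takes the `DmabufImport` branch.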

The render path change is guarded by a shared AtomicBool (swapchain_on_primary) that is only set to true when set_format() succeeds. If it fails, the surface thread falls back to the original render path.
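The guard pattern can be sketched in stdlib Rust as follows. The helper names (`try_move_swapchain_to_primary`, `pick_render_path`) are hypothetical stand-ins for `set_format()` and the surface thread's branch, included only to show how the flag keeps the allocator and renderer in agreement:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Stand-in for DrmCompositor::set_format(): succeeds only when the
// primary GPU's Linear-modifier swapchain was accepted. (Hypothetical.)
fn try_move_swapchain_to_primary(accepts_linear: bool) -> Result<(), ()> {
    if accepts_linear { Ok(()) } else { Err(()) }
}

fn init_output(swapchain_on_primary: &Arc<AtomicBool>, accepts_linear: bool) {
    // The flag starts false and is raised only after set_format()
    // succeeds, so the surface thread never assumes a primary-GPU
    // swapchain that was never installed.
    if try_move_swapchain_to_primary(accepts_linear).is_ok() {
        swapchain_on_primary.store(true, Ordering::Release);
    }
}

fn pick_render_path(swapchain_on_primary: &Arc<AtomicBool>, is_software: bool) -> &'static str {
    if is_software && swapchain_on_primary.load(Ordering::Acquire) {
        "single_renderer on primary GPU"
    } else {
        "original render path"
    }
}
```

Because the flag defaults to false, a failed `set_format()` requires no extra handling: the surface thread simply keeps the original path.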

Important implementation note

The compositor is accessed directly through drm.compositors().get(crtc) instead of DrmOutput::with_compositor() because LockedDrmOutputManager already holds a write lock on the compositor RwLock — calling with_compositor() would deadlock trying to acquire a read lock on it.

Test plan

  • Tested on a 3440×1440 EVDI/DisplayLink monitor connected via USB — CPU usage dropped from 200-600% to normal levels
  • Non-EVDI outputs (direct GPU) are unaffected (code path only activates when is_software == true AND set_format() succeeds)
  • Fallback path when set_format() fails (not tested — would require a setup where the DRM test commit rejects Linear buffers from the primary GPU)

Checklist

  • I have disclosed use of any AI generated code in my commit messages.
  • I understand these changes in full and will be able to respond to review comments.
  • My change is accurately described in the commit message.
  • My contribution is tested and working as described.
  • I have read the Developer Certificate of Origin and certify my contribution under its conditions.

Replace llvmpipe (CPU) rendering on EVDI/DisplayLink outputs with
hardware GPU rendering using a niri-style swapchain replacement.

After initialize_output(), DrmCompositor::set_format() swaps the EVDI
swapchain allocator with the primary GPU's GBM using Linear modifier.
The surface thread then uses single_renderer with the primary GPU node,
keeping everything on one device. The GbmFramebufferExporter detects
the buffer as foreign and imports it via dmabuf into EVDI's GBM for
DRM framebuffer creation — no CPU-side pixel copy.

The render path change is guarded by a shared AtomicBool flag
(swapchain_on_primary) that is only set to true when set_format()
succeeds. If set_format() fails, the surface thread falls back to
the original render path, avoiding a mismatch between the swapchain
allocator and the renderer.

Key changes:
- mod.rs: Extract primary_gbm before device loop, call set_format()
  on software targets after initialize_output(), set swapchain_on_primary
  flag on success
- surface/mod.rs: Thread is_software and swapchain_on_primary flags,
  use (primary_node, primary_node) as render/target for software outputs
  only when swapchain was successfully moved to primary GPU
- device.rs: Pass is_software to Surface::new()

Note: The compositor is accessed directly through
drm.compositors().get(crtc) instead of DrmOutput::with_compositor()
to avoid deadlocking on the RwLock that LockedDrmOutputManager
already holds as a write guard.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@Drakulix
Member

The analysis of the AI agent here is wrong/incomplete.

The core issue is that the cursor is composited using llvmpipe, because the cursor planes of the userspace evdi driver show corrupted buffer contents on some systems.
This means we need to re-copy the whole framebuffer whenever the mouse is moved, causing the lag you are seeing.

We are deliberately not using any GPU's GBM pipeline to allocate buffers, as that causes performance issues on other setups, while the copy operation doesn't; the copy would happen with the GBM pipeline inside the driver as well.

The correct solution is to get rid of this part of the code: https://github.com/pop-os/cosmic-comp/blob/master/src/backend/kms/mod.rs#L887-L897, then figure out what causes the corrupted cursor frames, and test and verify on various machines with different GPUs and drivers in use.

@pcortellezzi
Author

Thanks for the detailed feedback. I agree that disabling cursor planes is a significant contributor to the mouse movement CPU spikes, and fixing cursor plane corruption would help.

However, I'm also observing 100-120% sustained CPU usage at idle with just a few terminals and a browser open — no mouse movement at all. Any screen update (cursor blinking, web content, window redraws) goes through llvmpipe, which is inherently expensive on high-resolution displays.

So it seems like there are two separate issues:

  1. Cursor plane disabled → full framebuffer recomposition on every mouse move (what you described, fixable by restoring cursor planes)
  2. All rendering through llvmpipe → high baseline CPU for any frame update (not addressed by the cursor plane fix)

For reference, I'm currently running this patch on my daily setup (two 3440×1440 EVDI/DisplayLink monitors). In the same conditions (terminals + browser, normal usage), CPU usage dropped from 100-120% at idle (200-600% with mouse movement) down to 10-15%. It's been a game changer for usability.

You mentioned that using the primary GPU's GBM pipeline causes performance issues on other setups — could you elaborate on what issues you've seen? That would help me understand whether this approach is viable or if there's a better path.

@Drakulix
Member

> Any screen update (cursor blinking, web content, window redraws) goes through llvmpipe, which is inherently expensive on high-resolution displays.

It only does so because we need to composite the cursor on top of every new frame inside the DrmCompositor. Otherwise we wouldn't be drawing with llvmpipe at all, but purely copying buffers around with glReadPixels, since the MultiRenderer is constructed with a real render node.

If the buffer were instead allocated via GBM, the system would have to migrate the buffer into system memory so that the evdi driver can read it. This means many drivers could no longer render directly into the buffer, but would internally copy from device memory into system memory — potentially with less information, because they don't know which regions changed (we already limit the copy path as much as we can).

So your proposed change wouldn't eliminate a copy, nor any costly llvmpipe render operations, once the cursor plane is fixed. The only reason it performs better at the moment is that it composites the cursor before copying.

@pcortellezzi
Author

Thank you for the detailed explanation, that really clarifies things.
I'll explore the cursor plane fix direction and report back with the findings.
