Very high disk read #9825
Replies: 16 comments 3 replies
-
        
📦 Deployment environment
Docker
📦 Deployment mode
Server-side mode (lobe-chat-database image)
📌 Software version
v1.116.0
💻 System environment
Ubuntu
🌐 Browser
Chrome
🐛 Problem description
This has been an issue for a long time. Ever since a certain version update (right around the time MinIO pulled its stunt), starting LobeChat would freeze the server. Monitoring showed disk reads staying very high (around 130 MB/s), which used up the disk's read quota. Because MinIO happened to be misbehaving at the time, I assumed it was a MinIO problem, but downgrading to a MinIO version from before that didn't help. Then I noticed the problem disappeared when I turned off my proxy, so I stopped worrying about it. I suspected this might be a time bomb planted by the MinIO authors: after a certain date your server becomes unusable, and with no proxy there is no outbound communication to set it off. That seemed truly malicious. Today I finally replaced MinIO entirely with Garage, so there is no MinIO at all, yet when I brought lobe-chat-db back up the server froze again with the same 130+ MB/s disk reads, and iotop does not show what is doing the reading. Am I the only one hitting this bug? It feels like a genuinely severe one. Here is my configuration file:
📷 Reproduction steps
No response
🚦 Expected results
No response
📝 Supplementary information
No response
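For anyone trying to narrow down where the reads come from, a minimal host-level diagnostic sketch (it assumes `iotop` and the `sysstat` package are installed; `<PID>` is a placeholder):

```bash
# Only processes currently doing I/O, with accumulated totals per process
sudo iotop -aoP

# Per-process disk read/write rates, sampled every 5 seconds (sysstat)
pidstat -d 5

# Per-device utilisation and read throughput, to confirm which disk is saturated
iostat -dx 5

# Raw read/write byte counters for a suspicious process
sudo cat /proc/<PID>/io
```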
-
👀 @CoREse Thank you for raising this issue. We will look into it and get back to you as soon as possible.
-
So far no other users appear to have reported the kind of extremely high disk reads (130 MB/s) you describe, and there is no official documentation or recent PR covering a similar bug or fix. Based on code analysis, the file service in lobe-chat-database frequently reads file contents or byte streams through the S3-compatible interface (MinIO, Garage, etc.). If there are a large number of files, or some logic repeatedly triggers these operations, disk reads can indeed spike; in particular, when the S3 backend is misconfigured or the network is unreliable, the application may keep retrying or doing full scans of files, further increasing I/O pressure. Similar cases show that when switching storage backends (e.g. from MinIO to Garage or Cloudflare R2), all related environment variables must be updated thoroughly, otherwise uploads/downloads break; this usually does not directly cause high disk reads unless a compatibility or configuration issue makes the application repeatedly hit local or remote storage [related discussion]. There is currently no dedicated performance-diagnostics or verbose-logging switch in the code or docs. A few things you could try:
If you can share further logs or reproduction details, please add them; they will help pinpoint the problem more precisely. To reply, just mention @dosu.
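Following up on the point about updating every storage-related variable when switching backends, one quick check is what the running container actually sees. This is only a sketch: the container name `lobe-chat-database` is a placeholder, and the `S3_*` variable names should be verified against the official LobeChat docs and your own compose file.

```bash
# Show the S3/MinIO-related environment the LobeChat container was started with
docker exec lobe-chat-database env | grep -iE 's3|minio'

# Variables typically involved when migrating from MinIO to Garage or R2
# (confirm the exact names for your version):
#   S3_ENDPOINT, S3_BUCKET, S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY, S3_PUBLIC_DOMAIN
```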
-
If you think it might be a MinIO problem, you could try switching S3 to a third-party service such as Cloudflare R2. If the problem still occurs, it may be something related to pg. So far you seem to be the only one in the community who has hit this; I don't recall seeing similar reports before. Feedback after you've verified is welcome.
-
        
Update: I later split pgvector and MinIO out separately. pgvector has no effect on the result; MinIO and (lobe + casdoor) each run fine on their own, but running them together triggers the bug. Then I shut down the other services on the server, and running them together worked fine, so I suspect it's a Docker issue? Does Docker cause massive disk reads when memory usage is high (but not full)? It behaves the same whether or not a swapfile is configured. This would also explain why enabling v2ray earlier triggered the bug, since v2ray seems to use quite a lot of memory. After that I replaced MinIO with Garage; since Garage's memory footprint is much smaller than MinIO's, it ran fine, which matched my expectation. But when I browsed images in Lobe, it froze again on the second image, and sure enough it was the same bug. So I can think of two possibilities:
I haven't tried a commercial S3 solution yet.
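One way to test the memory-pressure hypothesis is to watch whether the read storm tracks shrinking page cache, and to cap the heaviest container's memory and see whether the behaviour changes. A rough sketch (`<container-name>` and the 2g limit are placeholders):

```bash
# Blocks read in (bi) alongside free/buff/cache memory, every 5 seconds;
# reads that climb as the cache shrinks point at page-cache thrashing
vmstat 5

# Memory usage and cumulative block I/O per container
docker stats --no-stream

# Temporarily cap a running container's memory
docker update --memory 2g --memory-swap 2g <container-name>
```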
-
        
While browsing images the server died, but not completely. Looking at top, the next-server (v15.3.5) process uses a lot of memory: over 600 MB of resident memory and 22 GB of virtual memory. I'm wondering whether it is the culprit: it drives memory usage up, which then triggers the Docker bug?
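A 22 GB virtual size for a Node.js process is mostly reserved address space, so the ~600 MB resident figure is the one to watch. To see whether that container is also the one generating the reads, and optionally to cap its heap, something like the following may help (a sketch: whether the image honours `NODE_OPTIONS`, and the service name `lobe`, are assumptions to verify):

```bash
# Resident memory and cumulative block I/O per container, sampled once
docker stats --no-stream

# Optionally cap the Node.js heap via the container environment, e.g. in compose:
#   services:
#     lobe:
#       environment:
#         - NODE_OPTIONS=--max-old-space-size=1536
```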
-
The next-server process should be LobeChat's, so that points to the source of the freeze. But the question is what circumstances trigger it. May I ask on which page you performed this operation?
-
        
Just the ordinary chat page. I have now split the Lobe frontend, the database, S3 and so on onto separate machines, with Lobe alone on a machine with plenty of memory, and the problem is gone. Of course that only means it's gone for me; whether the underlying problem still exists, I don't know.
-
Then I'll convert this into a Discussion for now, and we'll see whether other users in the community run into it in the future.
-
Update: it may well be MinIO after all; maybe I didn't wrongly blame it. Today a server of mine that doesn't have LobeChat installed, but does have MinIO, hit this problem again, and after rebooting I found that MinIO had crashed. I have no proof it was the cause, but the correlation is now very strong.
-
Update: confirmed, it is MinIO! My S3 has already been switched to rustfs, and I only wanted to look at how MinIO's policies were written. The moment I started MinIO, the server went down, and the server monitoring showed extremely high disk reads. After a forced reboot, opening MinIO again triggers it within about a minute (after I connect to the console and look at the policies, viewing roughly one policy is enough to bring the server down).
-
I ran into this problem too, and it also looks like a MinIO issue. I'll run some tests soon and report back. BTW, it may be related to the MinIO version I'm using: the last release that still ships the web management console, minio/minio:RELEASE.2025-04-22T22-12-26Z.
-
Progress update: after switching from minio/minio:RELEASE.2025-04-22T22-12-26Z to minio/minio:RELEASE.2025-04-08T15-41-24Z, the whole service has stabilized. I'll keep testing and plan to post another update in about a week. For now it looks like the version is the cause; using a slightly earlier release works.
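For reference, pinning the earlier release mentioned above just means using the explicit image tag. A minimal sketch (ports, credentials and the data path are placeholders for your own deployment):

```bash
docker run -d --name minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=<user> \
  -e MINIO_ROOT_PASSWORD=<password> \
  -v /srv/minio/data:/data \
  minio/minio:RELEASE.2025-04-08T15-41-24Z server /data --console-address ":9001"
```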