Bug Report: Server OOM When Copying Large CSV into Table
Description
When copying a large CSV file into a table, the server unexpectedly encounters an OutOfMemory (OOM) error. This issue occurs during the data ingestion process and prevents successful completion of large file imports.
Root Cause Analysis
The OOM error stems from how the CSV reading process handles large files. The current implementation uses concurrent reading that keeps fetching and buffering data until EOF even when no consumer is actively draining it, leading to unbounded memory growth:
Relevant code section (databend/src/query/storages/stage/src/read/row_based/processors/reader.rs, lines 132 to 140 in 6cbbbf7):

```rust
let reader = self
    .op
    .reader_with(&file.path)
    .chunk(self.io_size)
    // TODO: Use 4 concurrent for test, let's extract as a new setting.
    .concurrent(4)
    .await?
    .into_futures_async_read(0..file.size as u64)
    .await?;
```
This behavior was introduced in PR #15442, which implemented concurrent reading via OpenDAL. OpenDAL's concurrent option initiates multiple read operations that continue buffering data regardless of whether it's being consumed, eventually exhausting available memory for large files.
OpenDAL has addressed this issue in their PR #6449 by introducing a new prefetch option that provides better control over data buffering.
Proposed Solutions
We have two potential approaches to resolve this issue:
- Implement prefetch/buffering within the pipeline source processor (I prefer this)
  - Effectively equivalent to using OpenDAL's `concurrent=1, prefetch=N` configuration
  - Advantage: more direct control over memory usage, with all read operations utilizing the pipeline's runtime and no hidden mechanisms
  - Note: the OpenDAL reader with `concurrent=1` does not seem to support prefetch: https://github.com/apache/opendal/blob/4fe414170bf670a952c4e48e8b78db5b3a865b8d/core/src/raw/futures_util.rs#L230
- Upgrade OpenDAL and utilize the new `prefetch` option
  - Advantage: allows maintaining higher `concurrent` values, though this only helps at the start of reading a file
  - Potential problems: multiple (`concurrent * max_threads`) concurrent read tasks could compete for bandwidth, and head-chunk processing could slow down, which might outweigh the performance benefit since we need to process chunks in order; maybe `concurrent=2, prefetch=2` would work
  - Need to make sure the new version of OpenDAL is stable