Bug Report: Server OOM When Copying Large CSV into Table
Description
When copying a large CSV file into a table, the server unexpectedly encounters an OutOfMemory (OOM) error. This issue occurs during the data ingestion process and prevents successful completion of large file imports.
Root Cause Analysis
The OOM error stems from how the CSV reading process handles large files. The current implementation uses concurrent reading that keeps fetching and buffering data until EOF even when no consumer is actively draining it, leading to unbounded memory growth:
Relevant code section (databend/src/query/storages/stage/src/read/row_based/processors/reader.rs, lines 132 to 140 in 6cbbbf7):

```rust
let reader = self
    .op
    .reader_with(&file.path)
    .chunk(self.io_size)
    // TODO: Use 4 concurrent for test, let's extract as a new setting.
    .concurrent(4)
    .await?
    .into_futures_async_read(0..file.size as u64)
    .await?;
```
This behavior was introduced in PR #15442, which implemented concurrent reading via OpenDAL. OpenDAL's concurrent option initiates multiple read operations that continue buffering data regardless of whether it's being consumed, eventually exhausting available memory for large files.
OpenDAL has addressed this issue in their PR #6449 by introducing a new prefetch option that provides better control over data buffering.
Proposed Solutions
We have two potential approaches to resolve this issue:
- Implement prefetch/buffering within the pipeline source processor (I prefer this)
  - Effectively equivalent to using OpenDAL's `concurrent=1, prefetch=N` configuration
  - Advantage: more direct control over memory usage, with all read operations utilizing the pipeline's runtime and no hidden mechanisms
  - Note: the OpenDAL reader with `concurrent=1` does not seem to support prefetch: https://github.com/apache/opendal/blob/4fe414170bf670a952c4e48e8b78db5b3a865b8d/core/src/raw/futures_util.rs#L230
- Upgrade OpenDAL and utilize the new `prefetch` option
  - Advantage: allows maintaining higher `concurrent` values, though this only helps at the start of reading a file
  - Potential problems: multiple (`concurrent * max_threads`) concurrent read tasks could compete for bandwidth, and head-chunk processing could slow down, which might outweigh the performance benefit since we need to process chunks in order; maybe `concurrent=2, prefetch=2` would work
  - Need to make sure the new version of OpenDAL is stable