bug: Server OOM When Copying Large CSV into Table #18829

@youngsofun

Bug Report: Server OOM When Copying Large CSV into Table

Description

When copying a large CSV file into a table, the server unexpectedly encounters an OutOfMemory (OOM) error. This issue occurs during the data ingestion process and prevents successful completion of large file imports.

Root Cause Analysis

The OOM error stems from how the CSV reading process handles large files. The current implementation uses concurrent reading that keeps fetching until EOF even when no consumer is actively reading the data, so fetched chunks accumulate in memory.

Relevant code section:

    let reader = self
        .op
        .reader_with(&file.path)
        .chunk(self.io_size)
        // TODO: Use 4 concurrent for test, let's extract as a new setting.
        .concurrent(4)
        .await?
        .into_futures_async_read(0..file.size as u64)
        .await?;

This behavior was introduced in PR #15442, which implemented concurrent reading via OpenDAL. OpenDAL's concurrent option initiates multiple read operations that continue buffering data regardless of whether it's being consumed, eventually exhausting available memory for large files.
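
To make the memory growth concrete, here is a minimal, deterministic sketch. It is a toy model rather than OpenDAL's actual implementation: the function name, the "tick" abstraction, and the rates are all assumptions chosen for illustration. It compares the peak number of buffered chunks when the reader runs ahead until EOF against a reader whose buffer is capped.

```rust
/// Toy model (hypothetical, not OpenDAL code): each tick the reader fetches up
/// to `fetch_per_tick` chunks, then the consumer drains up to
/// `consume_per_tick` chunks. Returns the peak number of chunks held in memory.
/// `prefetch_limit = None` models reading ahead until EOF;
/// `Some(n)` models a buffer capped at `n` chunks.
fn peak_buffered_chunks(
    total_chunks: usize,
    fetch_per_tick: usize,
    consume_per_tick: usize,
    prefetch_limit: Option<usize>,
) -> usize {
    // Guard against configurations that could never make progress.
    assert!(fetch_per_tick > 0 && consume_per_tick > 0);
    assert!(prefetch_limit != Some(0));

    let mut remaining = total_chunks; // chunks not yet fetched
    let mut buffered = 0usize; // chunks fetched but not yet consumed
    let mut peak = 0usize;

    while remaining > 0 || buffered > 0 {
        // How many more chunks the buffer is allowed to take this tick.
        let room = match prefetch_limit {
            Some(limit) => limit.saturating_sub(buffered),
            None => usize::MAX,
        };
        let fetched = fetch_per_tick.min(remaining).min(room);
        remaining -= fetched;
        buffered += fetched;
        peak = peak.max(buffered);

        // The consumer processes chunks more slowly than they arrive.
        buffered -= consume_per_tick.min(buffered);
    }
    peak
}
```

In this model, with 1000 chunks, 4 fetched per tick, and 1 consumed per tick, the uncapped reader peaks at 751 buffered chunks, while a cap of 8 keeps the peak at 8. The uncapped peak scales with file size, which matches the OOM pattern described above.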

OpenDAL has addressed this issue in their PR #6449 by introducing a new prefetch option that provides better control over data buffering.

Proposed Solutions

We have two potential approaches to resolve this issue:

  1. Implement prefetch/buffering within the pipeline source processor (my preferred approach)

  2. Upgrade OpenDAL and utilize the new prefetch option

    • Advantage: Allows maintaining higher concurrent values (note: this is only better at the start of reading a file)
    • Potential problems: up to (concurrent * max_threads) read tasks may run at once and compete for bandwidth. This could slow down processing of the head chunk, which might outweigh the performance benefits since we need to process chunks in order. Perhaps we could use concurrent=2, prefetch=2.
    • We need to make sure the new version of OpenDAL is stable.
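
As a rough illustration of the first approach, the sketch below bounds read-ahead with backpressure. It is not Databend's pipeline code: `copy_with_bounded_prefetch` and its parameters are hypothetical, and a plain thread with a bounded `std::sync::mpsc::sync_channel` stands in for the async source processor. The point is that the reader blocks once `prefetch` chunks are queued, so memory stays bounded regardless of file size.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::sync_channel;
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a source processor with bounded prefetch.
/// A reader thread produces chunks in order and hands them to the consumer
/// through a channel that holds at most `prefetch` chunks.
/// Returns (bytes consumed, peak number of chunks alive at once).
fn copy_with_bounded_prefetch(
    chunk_count: usize,
    chunk_size: usize,
    prefetch: usize,
) -> (usize, usize) {
    let (tx, rx) = sync_channel::<Vec<u8>>(prefetch);
    let alive = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let reader = {
        let (alive, peak) = (Arc::clone(&alive), Arc::clone(&peak));
        thread::spawn(move || {
            for i in 0..chunk_count {
                // Stands in for one storage read of `chunk_size` bytes.
                let chunk = vec![i as u8; chunk_size];
                let now = alive.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                // Blocks while `prefetch` chunks are already queued:
                // this is the backpressure that caps memory use.
                if tx.send(chunk).is_err() {
                    break; // the consumer went away
                }
            }
            // `tx` is dropped here, which ends the consumer loop below.
        })
    };

    let mut bytes = 0;
    for chunk in rx {
        bytes += chunk.len(); // stands in for in-order CSV parsing
        drop(chunk);
        alive.fetch_sub(1, Ordering::SeqCst);
    }
    reader.join().expect("reader thread panicked");
    (bytes, peak.load(Ordering::SeqCst))
}
```

In this sketch at most `prefetch + 2` chunks exist at once: the queued chunks, one held by the reader while it waits to send, and one being processed by the consumer. Because chunks are delivered in order through a single queue, the head chunk is never starved by reads further into the file.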
