
[SUPPORT] java.lang.OutOfMemoryError: Requested array size exceeds VM limit on data ingestion to COW table #11122

@TarunMootala

Description


Describe the problem you faced
We have a Spark Structured Streaming job that reads data from an input stream and appends it to a COW table partitioned on subject area. This streaming job has a batch interval of 120 seconds.
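
For reference, a minimal sketch of the write path described above (a sketch only; the stand-in source, bucket paths, and abbreviated option set are placeholders, not the actual job):

    # Sketch of the streaming append described above; names and paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.appName("cow-ingest-sketch").getOrCreate()

    # Stand-in source; the real job reads from the actual input stream.
    stream_df = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
                 .withColumn("entity_name", lit("demo")))

    hudi_options = {
        "hoodie.table.name": "table_name",
        "hoodie.datasource.write.operation": "insert",
        "hoodie.datasource.write.partitionpath.field": "entity_name",
        # ... remaining options as listed under "Hudi Config" below
    }

    def write_batch(batch_df, batch_id):
        # Append each micro-batch to the partitioned COW table on S3.
        (batch_df.write.format("hudi")
            .options(**hudi_options)
            .mode("append")
            .save("s3://bucket/prefix/table_name"))

    (stream_df.writeStream
        .foreachBatch(write_batch)
        .option("checkpointLocation", "s3://bucket/prefix/checkpoints/table_name")
        .trigger(processingTime="120 seconds")   # the 120-second batch interval
        .start())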

Intermittently, the job fails with the error:

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

To Reproduce

No specific steps.

Expected behavior

The job should commit the data successfully and continue with the next micro-batch.

Environment Description

  • Hudi version : 0.12.1 (Glue 4.0)

  • Spark version : Spark 3.3.0

  • Hive version : N/A

  • Hadoop version : N/A

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

We are not sure of the exact root cause or fix. However, the workaround (not ideal) is to manually delete (archive) a few of the oldest Hudi metadata files from the active timeline (the .hoodie folder) and reduce hoodie.keep.max.commits. This only works when we reduce max commits; each time we reduce them, the job runs fine for a few months before failing again. A rough sketch of the manual step follows.
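
For illustration, a rough sketch of that manual "archive" step, assuming the timeline instant files sit directly under .hoodie/ on S3 (bucket, prefix, and the number of instants moved are placeholders; this mirrors the stop-gap above and is not a recommended practice):

    # Sketch of the manual workaround: move the oldest instant files out of the active timeline.
    # Bucket/prefix names and the instant count are placeholders.
    import boto3

    s3 = boto3.client("s3")
    bucket = "bucket"
    timeline_prefix = "prefix/table_name/.hoodie/"
    archive_prefix = "prefix/table_name/manual_archive/"

    # Collect instant files (completed/inflight/requested) directly under .hoodie/.
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=timeline_prefix, Delimiter="/"):
        for obj in page.get("Contents", []):
            name = obj["Key"].rsplit("/", 1)[-1]
            if name and name[0].isdigit():   # instants are named by their timestamp
                keys.append(obj["Key"])

    # Move the oldest 50 instants aside (count chosen arbitrarily for the sketch).
    for key in sorted(keys)[:50]:
        dest = archive_prefix + key.rsplit("/", 1)[-1]
        s3.copy_object(Bucket=bucket, Key=dest, CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)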

Our requirement is to retain 1500 commits to enable incremental queries over the last 2 days of changes. We initially started with a max commits of 1500 and have gradually come down to 400.
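
For context, the incremental reads that depend on those retained commits look roughly like this (table path and begin instant are placeholders):

    # Sketch of the incremental query that motivates retaining ~2 days of commits.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("incremental-read-sketch").getOrCreate()

    incremental_df = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        # Fetch everything committed after this instant (yyyyMMddHHmmss), e.g. 2 days ago.
        .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
        .load("s3://bucket/prefix/table_name"))

    incremental_df.show()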

Hudi Config

            "hoodie.table.name": "table_name",
            "hoodie.datasource.write.keygenerator.type": "COMPLEX",
            "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
            "hoodie.datasource.write.partitionpath.field": "entity_name",
            "hoodie.datasource.write.recordkey.field": "partition_key,sequence_number",
            "hoodie.datasource.write.precombine.field": "approximate_arrival_timestamp",
            "hoodie.datasource.write.operation": "insert",
            "hoodie.insert.shuffle.parallelism": 10,
            "hoodie.bulkinsert.shuffle.parallelism": 10,
            "hoodie.upsert.shuffle.parallelism": 10,
            "hoodie.delete.shuffle.parallelism": 10,
            "hoodie.metadata.enable": "false",
            "hoodie.datasource.hive_sync.use_jdbc": "false",
            "hoodie.datasource.hive_sync.enable": "false",
            "hoodie.datasource.hive_sync.database": "database_name",
            "hoodie.datasource.hive_sync.table": "table_name",
            "hoodie.datasource.hive_sync.partition_fields": "entity_name",
            "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
            "hoodie.datasource.hive_sync.support_timestamp": "true",
            "hoodie.keep.min.commits": 450,  # to preserve commits for at least 2 days with processingTime="120 seconds"
            "hoodie.keep.max.commits": 480,  # to preserve commits for at least 2 days with processingTime="120 seconds"
            "hoodie.cleaner.commits.retained": 449,

Stacktrace


java.lang.OutOfMemoryError: Requested array size exceeds VM limit

We debugged multiple failure logs; the job always fails at the stage collect at HoodieSparkEngineContext.java:118 (CleanPlanActionExecutor).

Labels

    engine:spark (Spark integration)
    priority:critical (Production degraded; pipelines stalled)
    priority:high (Significant impact; potential bugs)
