Describe the problem you faced
We have a Spark Streaming job that reads data from an input stream and appends the data to a COW table partitioned on subject area. This streaming job has a batch interval of 120 seconds.
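For illustration, a minimal sketch of the job's shape (the "rate" source, paths, and checkpoint location are placeholders, and hudi_options stands for the full config listed under Hudi Config below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-streaming-append").getOrCreate()

# Placeholder source; the real job reads from our input stream.
source_df = (spark.readStream
             .format("rate")  # stand-in for the actual stream source
             .load())

hudi_options = {"hoodie.table.name": "table_name"}  # plus the full config listed below

def write_batch(batch_df, batch_id):
    # Append each micro-batch to the Hudi COW table
    (batch_df.write
     .format("hudi")
     .options(**hudi_options)
     .mode("append")
     .save("s3://bucket/path/to/table"))  # placeholder base path

query = (source_df.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "s3://bucket/checkpoints/table")  # placeholder
         .trigger(processingTime="120 seconds")
         .start())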
Intermittently the job fails with the error:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
To Reproduce
No specific steps.
Expected behavior
The job should commit the data successfully and continue with the next micro-batch.
Environment Description
- Hudi version : 0.12.1 (Glue 4.0)
- Spark version : Spark 3.3.0
- Hive version : N/A
- Hadoop version : N/A
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : no
Additional context
We are not sure of the exact root cause or fix. However, a workaround (not ideal) is to manually delete (archive) a few of the oldest Hudi metadata files from the active timeline (the .hoodie folder) and reduce hoodie.keep.max.commits. This only works when we reduce max commits, and each time we reduce them the job runs fine for a few months before failing again.
Our requirement is to retain 1500 commits to enable incremental queries over the last 2 days of changes. We initially started with a max of 1500 commits and have gradually come down to 400.
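For reference, a quick sketch we use to gauge how large the active timeline has grown before applying the workaround; it counts completed commit instants directly under .hoodie on S3 (bucket and prefix are placeholders):

import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "path/to/table/.hoodie/"  # placeholders

# Count completed commit instants (files ending in ".commit") on the active timeline
count = 0
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".commit"):
            count += 1

# 2 days of 120-second micro-batches is roughly 1440 commits,
# which is why we originally targeted a max of 1500 commits.
print(f"active timeline holds {count} completed commits")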
Hudi Config
"hoodie.table.name": "table_name",
"hoodie.datasource.write.keygenerator.type": "COMPLEX",
"hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.partitionpath.field": "entity_name",
"hoodie.datasource.write.recordkey.field": "partition_key,sequence_number",
"hoodie.datasource.write.precombine.field": "approximate_arrival_timestamp",
"hoodie.datasource.write.operation": "insert",
"hoodie.insert.shuffle.parallelism": 10,
"hoodie.bulkinsert.shuffle.parallelism": 10,
"hoodie.upsert.shuffle.parallelism": 10,
"hoodie.delete.shuffle.parallelism": 10,
"hoodie.metadata.enable": "false",
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.datasource.hive_sync.enable": "false",
"hoodie.datasource.hive_sync.database": "database_name",
"hoodie.datasource.hive_sync.table": "table_name",
"hoodie.datasource.hive_sync.partition_fields": "entity_name",
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.support_timestamp": "true",
"hoodie.keep.min.commits": 450, # to preserve commits for at least 2 days with processingTime="120 seconds"
"hoodie.keep.max.commits": 480, # to preserve commits for at least 2 days with processingTime="120 seconds"
"hoodie.cleaner.commits.retained": 449,
Stacktrace
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
We debugged multiple failure logs; the job always fails at the stage collect at HoodieSparkEngineContext.java:118 (CleanPlanActionExecutor).
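For anyone trying to reproduce the timeline state, this is roughly how we inspect it from a PySpark shell via py4j against Hudi's Java API (a sketch assuming the Hudi 0.12.x API; the base path is a placeholder):

# Build a HoodieTableMetaClient for the table (base path is a placeholder)
meta_client = (spark._jvm.org.apache.hudi.common.table.HoodieTableMetaClient
               .builder()
               .setConf(spark._jsc.hadoopConfiguration())
               .setBasePath("s3://my-bucket/path/to/table")
               .build())

timeline = meta_client.getActiveTimeline()
print("instants on active timeline:", timeline.countInstants())
print("clean instants:", timeline.getCleanerTimeline().countInstants())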