Describe the problem you faced
We have a Spark Streaming job that reads data from an input stream and appends the data to a COW table partitioned on subject area. This streaming job has a batch interval of 120 seconds.
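For illustration, a minimal sketch of the job's shape (the "rate" source, paths, and checkpoint location are placeholders, and hudi_options stands for the full config listed under Hudi Config below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-streaming-append").getOrCreate()

# Placeholder source; the real job reads from our input stream.
source_df = (spark.readStream
             .format("rate")  # stand-in for the actual stream source
             .load())

hudi_options = {"hoodie.table.name": "table_name"}  # plus the full config listed below

def write_batch(batch_df, batch_id):
    # Append each micro-batch to the Hudi COW table
    (batch_df.write
     .format("hudi")
     .options(**hudi_options)
     .mode("append")
     .save("s3://bucket/path/to/table"))  # placeholder base path

query = (source_df.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "s3://bucket/checkpoints/table")  # placeholder
         .trigger(processingTime="120 seconds")
         .start())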
Intermittently the job fails with the error:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
To Reproduce
No specific steps.
Expected behavior
The job should commit the data successfully and continue with the next micro-batch.
Environment Description
- Hudi version : 0.12.1 (Glue 4.0)
- Spark version : Spark 3.3.0
- Hive version : N/A
- Hadoop version : N/A
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : no
Additional context
We are not sure of the exact root cause or fix. However, a workaround (not ideal) is to manually delete (archive) a few of the oldest Hudi metadata files from the active timeline (the .hoodie folder) and reduce hoodie.keep.max.commits. This only works when we reduce max commits, and each time we reduce them the job runs fine for a few months before failing again.
Our requirement is to retain 1500 commits to enable incremental queries over the last 2 days of changes. We initially started with a max of 1500 commits and have gradually come down to 400.
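For reference, a quick sketch we use to gauge how large the active timeline has grown before applying the workaround; it counts completed commit instants directly under .hoodie on S3 (bucket and prefix are placeholders):

import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "path/to/table/.hoodie/"  # placeholders

# Count completed commit instants (files ending in ".commit") on the active timeline
count = 0
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".commit"):
            count += 1

# 2 days of 120-second micro-batches is roughly 1440 commits,
# which is why we originally targeted a max of 1500 commits.
print(f"active timeline holds {count} completed commits")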
Hudi Config
"hoodie.table.name": "table_name",
"hoodie.datasource.write.keygenerator.type": "COMPLEX",
"hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.partitionpath.field": "entity_name",
"hoodie.datasource.write.recordkey.field": "partition_key,sequence_number",
"hoodie.datasource.write.precombine.field": "approximate_arrival_timestamp",
"hoodie.datasource.write.operation": "insert",
"hoodie.insert.shuffle.parallelism": 10,
"hoodie.bulkinsert.shuffle.parallelism": 10,
"hoodie.upsert.shuffle.parallelism": 10,
"hoodie.delete.shuffle.parallelism": 10,
"hoodie.metadata.enable": "false",
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.datasource.hive_sync.enable": "false",
"hoodie.datasource.hive_sync.database": "database_name",
"hoodie.datasource.hive_sync.table": "table_name",
"hoodie.datasource.hive_sync.partition_fields": "entity_name",
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.support_timestamp": "true",
"hoodie.keep.min.commits": 450, # to preserve commits for at least 2 days with processingTime="120 seconds"
"hoodie.keep.max.commits": 480, # to preserve commits for at least 2 days with processingTime="120 seconds"
"hoodie.cleaner.commits.retained": 449,
Stacktrace
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
We debugged multiple failure logs; the job always fails at the stage collect at HoodieSparkEngineContext.java:118 (CleanPlanActionExecutor).
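For anyone trying to reproduce the timeline state, this is roughly how we inspect it from a PySpark shell via py4j against Hudi's Java API (a sketch assuming the Hudi 0.12.x API; the base path is a placeholder):

# Build a HoodieTableMetaClient for the table (base path is a placeholder)
meta_client = (spark._jvm.org.apache.hudi.common.table.HoodieTableMetaClient
               .builder()
               .setConf(spark._jsc.hadoopConfiguration())
               .setBasePath("s3://my-bucket/path/to/table")
               .build())

timeline = meta_client.getActiveTimeline()
print("instants on active timeline:", timeline.countInstants())
print("clean instants:", timeline.getCleanerTimeline().countInstants())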