Conversation

@xiarixiaoyao
Contributor

Tips

What is the purpose of the pull request

(For example: This pull request adds a quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

// TODO Unify the logic of rewriteRecordWithMetadata and rewriteEvolutionRecordWithMetadata, and delete this function.
public static GenericRecord rewriteEvolutionRecordWithMetadata(GenericRecord genericRecord, Schema newSchema, String fileName) {
  GenericRecord newRecord = HoodieAvroUtils.rewriteRecordWithNewSchema(genericRecord, newSchema, new HashMap<>());
  // do not preserve FILENAME_METADATA_FIELD
Contributor Author

This is just copied from rewriteRecordWithMetadata,
but I don't know why we need to rewrite genericRecord; it costs some time.
Can we modify genericRecord directly, e.g. genericRecord.put(HoodieRecord.FILENAME_METADATA_FIELD_POS, fileName)?

Contributor

Records should be immutable by default, with only limited scopes where treating them as mutable is acceptable

Contributor Author

Thanks. After this operation we write the parquet files directly, so the life cycle of genericRecord has come to an end. I think we can try treating these records as mutable at this point. Of course, let me try this in the next version.

Contributor

These records might actually be used upstream in some follow-up operations, hence it's preferred to keep them immutable since at this level we don't control their lifecycle

Contributor

Yes, let's revisit after 0.11 to see if we can avoid the full rewrite in some cases. I understand the intent to recreate a new record to avoid mutations, but it does incur perf hits. I should have thought about this when we fixed HoodieMergeHandle for the commit-time fix in an earlier patch; I missed bringing it up.
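
A minimal sketch of the two options discussed above; the variables (genericRecord, newSchema, fileName) are taken from the hunk, and this is not verified against the final code:

  // (a) Immutable style (what the code does today): allocate a rewritten copy,
  // then stamp the filename metadata field on the copy only.
  GenericRecord newRecord = HoodieAvroUtils.rewriteRecordWithNewSchema(genericRecord, newSchema, new HashMap<>());
  newRecord.put(HoodieRecord.FILENAME_METADATA_FIELD_POS, fileName);

  // (b) Mutable style (the author's suggestion): patch the record in place.
  // Cheaper, but only safe if no downstream consumer still holds a reference to it.
  genericRecord.put(HoodieRecord.FILENAME_METADATA_FIELD_POS, fileName);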

xiarixiaoyao added the priority:critical (Production degraded; pipelines stalled) label Apr 20, 2022
@xiarixiaoyao
Contributor Author

@alexeykudinkin @bvaradar @xushiyan
Could you please help me review this PR? Thanks.

xushiyan added the priority:blocker (Production down; release blocker) label and removed the priority:critical (Production degraded; pipelines stalled) label Apr 20, 2022
@alexeykudinkin
Contributor

@xiarixiaoyao left some comments.

Can you please add a description for this PR? There's very little context in the PR itself, and not a lot in the Jira issue either, so it's hard to understand how exactly HUDI-3855 is related to it.

protected final String writeToken;
protected final TaskContextSupplier taskContextSupplier;
// For full schema evolution
protected final boolean schemaOnReadEnable;
Contributor

nit: better to suffix it w/ Enabled

Contributor Author

fixed

if (!renameCols.isEmpty() && oldSchema.getField(field.name()) == null) {
  String fieldName = field.name();
  for (Map.Entry<String, String> entry : renameCols.entrySet()) {
    List<String> nameParts = Arrays.asList(entry.getKey().split("\\."));
Contributor

If we're accepting the dot-path specification, why are we looking at the last element of the chain?
Since we're doing a top-down traversal, we should look at the first element, and we also need to either keep an index of what level we're at or trim the column names as we traverse.
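
A minimal sketch of the level-aware matching described here; matchesAtLevel is a hypothetical helper, not code from this PR:

  // Match one level of a dot-path rename spec (e.g. "outer.inner.col") against the
  // field currently being visited, given how deep the top-down traversal is.
  static boolean matchesAtLevel(String renamePath, int level, String fieldName) {
    String[] parts = renamePath.split("\\.");
    return level < parts.length && parts[level].equals(fieldName);
  }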

@xiarixiaoyao
Contributor Author

@alexeykudinkin
Thank you very much for your review. Addressed all comments and added more tests for the nested rename operation.

About HUDI-3855: we now rewrite the old record before writing it to the parquet file.
In the schema-evolution rename scenario, the old parquet file carries the old column name, so when we rewrite the old record against the new schema, the value belonging to the old name is lost, which leads to a serious problem.
For example:
1) A COW Hudi table has an old parquet file whose schema is: a int, b string.
2) We rename a -> aa; the new table schema is: aa int, b string.
3) We insert new data into the table; during the insert we need to read the old records from the old parquet file.
Before HUDI-3855: we could read an old record and write it to the new parquet file directly; the rename had no influence on it.
After HUDI-3855: before writing an old record we must rewrite it with the new schema. The old record's schema is a int, b string, but the new schema is aa int, b string; if we rewrite the old record blindly, we lose the value of column a, since that name does not exist in the new schema.
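
To make the failure mode concrete, here is a minimal sketch in plain Avro (hypothetical values; the rename-map direction of new full name -> old name follows how this PR appears to thread renameCols through):

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericRecord;

  Schema oldSchema = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"t\",\"fields\":["
          + "{\"name\":\"a\",\"type\":\"int\"},{\"name\":\"b\",\"type\":\"string\"}]}");
  Schema newSchema = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"t\",\"fields\":["
          + "{\"name\":\"aa\",\"type\":\"int\"},{\"name\":\"b\",\"type\":\"string\"}]}");

  GenericRecord oldRecord = new GenericData.Record(oldSchema);
  oldRecord.put("a", 1);
  oldRecord.put("b", "x");

  // Naive rewrite: copy fields by name into the new schema. "aa" does not exist
  // in the old schema, so the value of "a" is silently dropped.
  GenericRecord naive = new GenericData.Record(newSchema);
  for (Schema.Field f : newSchema.getFields()) {
    Schema.Field old = oldSchema.getField(f.name()); // null for "aa"
    if (old != null) {
      naive.put(f.pos(), oldRecord.get(old.pos()));
    }
  }
  // naive is now {"aa": null, "b": "x"} -- column a's value is lost

  // Rename-aware rewrite: fall back to the old name when the new name is missing.
  java.util.Map<String, String> renameCols = java.util.Collections.singletonMap("aa", "a");
  GenericRecord fixed = new GenericData.Record(newSchema);
  for (Schema.Field f : newSchema.getFields()) {
    String sourceName = oldSchema.getField(f.name()) != null
        ? f.name()
        : renameCols.getOrDefault(f.name(), f.name());
    Schema.Field old = oldSchema.getField(sourceName);
    if (old != null) {
      fixed.put(f.pos(), oldRecord.get(old.pos()));
    }
  }
  // fixed is now {"aa": 1, "b": "x"} -- the renamed column keeps its value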

@xiarixiaoyao
Contributor Author

@nsivabalan @alexeykudinkin could you please review again?

@hudi-bot
Collaborator

CI report:

Contributor

@alexeykudinkin left a comment

LGTM

@xiarixiaoyao I left a bunch of comments; it would be great if you could follow up on them in a separate PR.


// TODO Unify the logic of rewriteRecordWithMetadata and rewriteEvolutionRecordWithMetadata, and delete this function.
public static GenericRecord rewriteEvolutionRecordWithMetadata(GenericRecord genericRecord, Schema newSchema, String fileName) {
  GenericRecord newRecord = HoodieAvroUtils.rewriteRecordWithNewSchema(genericRecord, newSchema, new HashMap<>());
Contributor

@xiarixiaoyao in general, instead of doing new HashMap<>() let's do Collections.emptyMap() to avoid allocating unnecessary objects on the hot path.
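
A minimal sketch of the suggested change (assuming java.util.Collections is imported):

  // Before: allocates a fresh, never-mutated map for every record on the write hot path.
  GenericRecord r1 = HoodieAvroUtils.rewriteRecordWithNewSchema(genericRecord, newSchema, new HashMap<>());
  // After: reuses the shared immutable empty map; safe because the rename map is only read here.
  GenericRecord r2 = HoodieAvroUtils.rewriteRecordWithNewSchema(genericRecord, newSchema, Collections.emptyMap());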

Contributor Author

fixed


- public static GenericRecord rewriteRecordWithNewSchema(IndexedRecord oldRecord, Schema newSchema) {
-   Object newRecord = rewriteRecordWithNewSchema(oldRecord, oldRecord.getSchema(), newSchema);
+ public static GenericRecord rewriteRecordWithNewSchema(IndexedRecord oldRecord, Schema newSchema, Map<String, String> renameCols) {
+   Object newRecord = rewriteRecordWithNewSchema(oldRecord, oldRecord.getSchema(), newSchema, renameCols, new LinkedList<>());
Contributor

Would suggest using ArrayDeque instead (it's more performant than LinkedList under most loads).
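
A minimal sketch of the swap, assuming the fieldNames parameter is (or can be) typed as java.util.Deque<String>; not verified against the final signature:

  // ArrayDeque is array-backed: no per-element node allocation and better cache
  // locality than LinkedList, which allocates a node object per push/pop.
  Deque<String> fieldNames = new ArrayDeque<>();
  Object newRecord = rewriteRecordWithNewSchema(oldRecord, oldRecord.getSchema(), newSchema, renameCols, fieldNames);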

Contributor Author

fixed

  if (oldSchema.getField(field.name()) != null) {
    Schema.Field oldField = oldSchema.getField(field.name());
-   helper.put(i, rewriteRecordWithNewSchema(indexedRecord.get(oldField.pos()), oldField.schema(), fields.get(i).schema()));
+   helper.put(i, rewriteRecordWithNewSchema(indexedRecord.get(oldField.pos()), oldField.schema(), fields.get(i).schema(), renameCols, fieldNames));
Contributor

Why do we need helper? We can just insert into the target record right away, right?
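
A sketch of that restructuring, assuming newRecord is the GenericData.Record being built (hypothetical, not the PR's final code):

  for (int i = 0; i < fields.size(); i++) {
    Schema.Field oldField = oldSchema.getField(fields.get(i).name());
    if (oldField != null) {
      // write straight into the target record instead of staging values in `helper`
      newRecord.put(i, rewriteRecordWithNewSchema(
          indexedRecord.get(oldField.pos()), oldField.schema(), fields.get(i).schema(), renameCols, fieldNames));
    }
  }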

    helper.put(i, rewriteRecordWithNewSchema(indexedRecord.get(oldField.pos()), oldField.schema(), fields.get(i).schema(), renameCols, fieldNames));
  } else {
    String fieldFullName = createFullName(fieldNames);
    String[] colNamePartsFromOldSchema = renameCols.getOrDefault(fieldFullName, "").split("\\.");
Contributor

We don't actually need to split; we just need to find the part after the last "." (this will reduce the amount of memory churn).
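
A minimal sketch, reusing the names from the hunk above:

  String renamed = renameCols.getOrDefault(fieldFullName, "");
  // lastIndexOf returns -1 when there is no '.', so +1 falls back to the whole string
  String lastColNameFromOldSchema = renamed.substring(renamed.lastIndexOf('.') + 1);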

  } else {
    String fieldFullName = createFullName(fieldNames);
    String[] colNamePartsFromOldSchema = renameCols.getOrDefault(fieldFullName, "").split("\\.");
    String lastColNameFromOldSchema = colNamePartsFromOldSchema[colNamePartsFromOldSchema.length - 1];
Contributor

nit: fieldNameFromOldSchema

  if (newType.isNestedType()) {
-   return Types.Field.get(oldField.fieldId(), oldField.isOptional(), nameFromFileSchema, newType, oldField.doc());
+   return Types.Field.get(oldField.fieldId(), oldField.isOptional(),
+       useColNameFromFileSchema ? nameFromFileSchema : nameFromQuerySchema, newType, oldField.doc());
Contributor

Please inline as a var and reuse
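
I.e., something along these lines (sketch only):

  String name = useColNameFromFileSchema ? nameFromFileSchema : nameFromQuerySchema;
  return Types.Field.get(oldField.fieldId(), oldField.isOptional(), name, newType, oldField.doc());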

Contributor Author

fixed

return colNamesFromWriteSchema.stream().filter(f -> {
  int filedIdFromWriteSchema = oldSchema.findIdByName(f);
  // try to find the cols which have the same id but a different colName
  return newSchema.getAllIds().contains(filedIdFromWriteSchema) && !newSchema.findfullName(filedIdFromWriteSchema).equalsIgnoreCase(f);
Contributor

Instead of duplicating the code, just do a map first, where you map the name if it's a rename and otherwise return null, then filter out the nulls.
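
A sketch of that shape, reusing the names from the hunk above (assumes java.util.Objects and the usual stream collectors; mapping each renamed column to its new full name is an assumption about what the caller wants):

  return colNamesFromWriteSchema.stream()
      .map(f -> {
        int fieldId = oldSchema.findIdByName(f);
        // a rename: the same field id resolves to a different full name in the new schema
        if (newSchema.getAllIds().contains(fieldId) && !newSchema.findfullName(fieldId).equalsIgnoreCase(f)) {
          return newSchema.findfullName(fieldId); // map straight to the new name
        }
        return null; // not a rename
      })
      .filter(Objects::nonNull)
      .collect(Collectors.toList());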

  InternalSchema newSchema = SchemaChangeUtils.applyTableChanges2Schema(internalSchema, updateChange);
  Schema newAvroSchema = AvroInternalSchemaConverter.convert(newSchema, avroSchema.getName());
- GenericRecord newRecord = HoodieAvroUtils.rewriteRecordWithNewSchema(avroRecord, newAvroSchema);
+ GenericRecord newRecord = HoodieAvroUtils.rewriteRecordWithNewSchema(avroRecord, newAvroSchema, new HashMap<>());
Contributor

Same comment as above

Contributor Author

fixed


  Schema newAvroSchema = AvroInternalSchemaConverter.convert(newRecord, schema.getName());
- GenericRecord newAvroRecord = HoodieAvroUtils.rewriteRecordWithNewSchema(avroRecord, newAvroSchema);
+ GenericRecord newAvroRecord = HoodieAvroUtils.rewriteRecordWithNewSchema(avroRecord, newAvroSchema, new HashMap<>());
Contributor

Here as well

Contributor Author

fixed

  Seq(null),
  Seq(Map("t1" -> 10.0d))
)
spark.sql(s"alter table ${tableName} rename column members to mem")
Contributor

In addition to these, let's also add tests for the record-rewriting utils in HoodieAvroUtils.

Contributor Author

Thanks, already added.

@nsivabalan
Contributor

@xiarixiaoyao: please address the feedback in a follow-up PR. I am going ahead and landing this.

nsivabalan merged commit 037f89e into apache:master Apr 21, 2022
nsivabalan pushed a commit that referenced this pull request Apr 22, 2022
- When column names are renamed (schema evolution enabled), renamed columns weren't handled well while copying records from the old data file with HoodieMergeHandle.
@xiarixiaoyao
Contributor Author

@nsivabalan @alexeykudinkin thanks for your review; let me put up another PR to optimize the code.

xushiyan pushed a commit that referenced this pull request Apr 22, 2022
- When column names are renamed (schema evolution enabled), renamed columns weren't handled well while copying records from the old data file with HoodieMergeHandle.
xushiyan changed the title from "[HUDI-3921] Fixed schema evolution cannot work with HUDI-3855" to "[HUDI-3921] Reconcile schema evolution logic with base file re-writing" May 23, 2022