[HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation #5470
Conversation
...-spark-common/src/main/java/org/apache/hudi/internal/BulkInsertDataInternalWriterHelper.java
| "Key-generator class name is required") | ||
|
|
||
| val prependedRdd: RDD[InternalRow] = | ||
| df.queryExecution.toRdd.mapPartitions { iter => |
Does this toRdd incur any perf hit? If yes, can you do a benchmark comparing the UDF-based approach vs this and report what you see? Alternatively, you could run a benchmark with a raw parquet write using the bulk insert row writer (non-partitioned, no sort mode) and ensure we see comparable numbers with this patch.
I am seeing some perf hit w/ this code change. Will wait to sync up with Alexey on this.
toRdd is how Datasets are eventually executed in Spark anyway; there's no perf hit from using it.
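For illustration, a minimal sketch of the approach under discussion, with a hypothetical computeMetaFields stand-in for the key-generator logic (not the actual Hudi implementation):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.catalyst.InternalRow

    // toRdd exposes the Dataset's physical-plan output as RDD[InternalRow], so the
    // meta-field computation can run as a plain partition-level transformation
    // instead of per-column UDFs.
    def prependMetaFields(df: DataFrame,
                          computeMetaFields: InternalRow => InternalRow): RDD[InternalRow] =
      df.queryExecution.toRdd.mapPartitions { iter =>
        iter.map(computeMetaFields)
      }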
This PR should only be applied in conjunction with #5523.
Can you confirm this issue has been resolved in the master branch?
I think the issue @nsivabalan is referring to is that this PR shouldn't be measured in isolation, but only together with #5523 (which has landed as well).
val updatedDF = HoodieUnsafeRDDUtils.createDataFrame(df.sparkSession, prependedRdd, updatedSchema)

if (!populateMetaFields) {
  updatedDF
It was probably a gap before, but we may not have to support dropPartitionColumns even with the virtual-key code path. Can we fix that, please?
}

private def dedupeRows(df: DataFrame, preCombineFieldRef: String, isGlobalIndex: Boolean): DataFrame = {
  val recordKeyMetaFieldOrd = df.schema.fieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD)
This may also need to be fixed for the virtual-key path, or we can call out that it's not supported for now. Even prior to this patch, we did not support de-duping in the virtual-key flow in the row writer.
Yes, de-duping with virtual keys isn't supported currently.
@Override
@PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
public String getRecordKey(InternalRow internalRow, StructType schema) {
  try {
With the changes in my other patch, we don't need to deserialize to Row to fetch the value. Can you take a look?
These are temporary changes that are addressed in #5523.
Any resolution on this? Did you end up backing out the temporary changes?
Correct, these were revisited.
@Test
def testGetNestedRowValue(): Unit = {
  val schema = StructType(Seq(
Minor: if you intend to use the same schema for many tests, we can make it an instance variable instead of declaring it in every test. It's immutable, so we could even make it static final.
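A minimal sketch of that suggestion, with illustrative field names, assuming the tests use JUnit 5 from Scala (as the quoted test method suggests):

    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
    import org.junit.jupiter.api.Test

    class TestRowKeyGeneration {
      // Declared once for the whole test class; StructType is immutable, so sharing is safe.
      private val schema: StructType = StructType(Seq(
        StructField("key", StringType, nullable = false),
        StructField("nested", StructType(Seq(StructField("value", IntegerType))))
      ))

      @Test
      def testGetNestedRowValue(): Unit = {
        // ... assertions against `schema` ...
      }
    }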
// NOTE: Helper keeps track of [[lastKnownPartitionPath]] as [[UTF8String]] to avoid
//       conversion from Catalyst internal representation into a [[String]]
partitionPath = row.getString(
    HoodieRecord.HOODIE_META_COLUMNS_NAME_TO_POS.get(HoodieRecord.PARTITION_PATH_METADATA_FIELD));
We can directly use 3 here instead of looking it up in the hashmap.
}

@Override
public String getRecordKey(Row row) {
Shouldn't we migrate this fix to SimpleKeyGen, if you feel the existing impl in SimpleKeyGen could be fixed? Why make changes to NonPartitionedKeyGen only?
This is addressed in #5523
...nt/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java
String writeToken = getWriteToken(taskPartitionId, taskId, taskEpochId);
String fileName = FSUtils.makeDataFileName(instantTime, writeToken, fileId,
    table.getMetaClient().getTableConfig().getBaseFileFormat().getFileExtension());
table.getBaseFileExtension()
yihua
left a comment
A few notable improvements and refactorings come from this PR. @alexeykudinkin have you benchmarked the performance of this set of changes?
};

this.row = row;
this.containsMetaFields = containsMetaFields;
If containsMetaFields is false, should the length of metaFields be 0?
There's some confusion: containsMetaFields relates to whether the inner row contains the meta-fields itself. However, HoodieInternalRow will always override the meta-fields by overlaying them on top of whatever the source row contains (this is necessary b/c UnsafeRow can't be updated).
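A simplified sketch of that overlay idea (hypothetical, not the actual HoodieInternalRow, which extends Spark's InternalRow and delegates many more methods):

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.unsafe.types.UTF8String

    // Meta-field values live in a small mutable array "on top of" the wrapped source row.
    // UnsafeRow can't be mutated in place, so writer-side updates go to the overlay,
    // reads of meta-field ordinals are served from it, and everything else is delegated.
    class MetaFieldOverlay(metaFields: Array[UTF8String],
                           sourceRow: InternalRow,
                           sourceContainsMetaFields: Boolean) {

      def updateMetaField(ordinal: Int, value: UTF8String): Unit =
        metaFields(ordinal) = value

      def getUTF8String(ordinal: Int): UTF8String =
        if (ordinal < metaFields.length) metaFields(ordinal)
        else sourceRow.getUTF8String(rebaseOrdinal(ordinal))

      // If the source row doesn't carry meta columns itself, shift the ordinal back
      // by their count before delegating.
      private def rebaseOrdinal(ordinal: Int): Int =
        if (sourceContainsMetaFields) ordinal else ordinal - metaFields.length
    }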
I'm gonna update the docs to make this clearer.
Sg.
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java
private int rebaseOrdinal(int ordinal) {
  // NOTE: In cases when source row does not contain meta fields, we will have to
  //       rebase ordinal onto its indexes
  return containsMetaFields ? ordinal : ordinal - metaFields.length;
If the source row does not contain meta fields (containsMetaFields is false), and assuming metaFields is empty, is the logic here for adjusting the ordinal unnecessary?
Please check my comments above -- we always overlay meta-fields, since we need them to be mutable (they're updated dynamically in the writer).
 * @throws IOException on any exception while writing.
 */
void writeRow(String key, InternalRow row) throws IOException;
void writeRow(UTF8String key, InternalRow row) throws IOException;
Is the usage of UTF8String type for performance?
Correct -- to avoid conversion b/w String and UTF8String
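A small sketch of what that avoids, with a hypothetical recordKeyOrdinal used purely for illustration:

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.unsafe.types.UTF8String

    // The record key already sits inside the InternalRow as a UTF8String; reading and
    // passing it through as-is skips a UTF8String -> String -> UTF8String round trip
    // per record on the write path.
    def extractKey(row: InternalRow, recordKeyOrdinal: Int): UTF8String =
      row.getUTF8String(recordKeyOrdinal)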
public HoodieTimer(boolean shouldStart) {
  if (shouldStart) {
    startTimer();
  }
}
This is not obvious (timer = new HoodieTimer(true)) compared to the existing way (timer = new HoodieTimer().startTimer()). Should we revert the change?
The old semantics are still preserved: it works as it has been; this just adds a new way for when you don't want to invoke startTimer explicitly.
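For reference, a usage sketch of the two equivalent forms (assuming HoodieTimer's usual location in hudi-common):

    import org.apache.hudi.common.util.HoodieTimer

    // Existing style: construct, then start explicitly.
    val explicitTimer = new HoodieTimer().startTimer()

    // Style added by this PR: the boolean constructor starts the timer for you.
    val implicitTimer = new HoodieTimer(true)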
Understood. I'm saying that timer = new HoodieTimer().startTimer() reads as obviously starting the timer, instead of having to look up what the boolean represents.
Fair enough
} else {
  minRecordKey = recordKey;
public void add(UTF8String recordKey) {
  this.bloomFilter.add(recordKey.getBytes());
So the byte-array content of the String and UTF8String instances should be the same here, right? So that the bloom filter lookup based on the String key is not affected.
The bloom filter always ingests UTF-8 (Java by default encodes Strings in UTF-16).
Sg, I want to make sure there is no gap between Spark's UTF8String and Java's UTF-8 encoding, since this is going to affect the Bloom Index.
BloomFilter add does …, so we seem to be fine. It's good to trust-but-verify once, though, that recordKey.getBytes() is equal to string.getBytes(StandardCharsets.UTF_8). @alexeykudinkin you probably checked that during development?
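A quick spot-check along those lines (an illustrative snippet, not from the PR):

    import java.nio.charset.StandardCharsets
    import org.apache.spark.unsafe.types.UTF8String

    // Verify that Spark's UTF8String bytes match Java's UTF-8 encoding for a sample key,
    // so bloom-filter hashing is unaffected by the String -> UTF8String switch.
    val key = "partition/2022-05-01_key-42"
    val sparkBytes = UTF8String.fromString(key).getBytes
    val javaBytes  = key.getBytes(StandardCharsets.UTF_8)
    assert(java.util.Arrays.equals(sparkBytes, javaBytes))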
Correct
if (keyGeneratorOpt.isPresent() && keyGeneratorOpt.get() instanceof SimpleKeyGenerator) {
  this.simpleKeyGen = true;
  this.simplePartitionFieldIndex = (Integer) structType.getFieldIndex(keyGeneratorOpt.get().getPartitionPathFields().get(0)).get();
  this.simplePartitionFieldDataType = structType.fields()[simplePartitionFieldIndex].dataType();
} else {
  this.simpleKeyGen = false;
  this.simplePartitionFieldIndex = -1;
  this.simplePartitionFieldDataType = null;
This question is out of scope for this PR, but why do we need special-case handling for the simple key generator here?
We actually do not (it was done for perf reasons before). This is addressed in #5523
import scala.collection.JavaConverters.asScalaBufferConverter

object HoodieDatasetBulkInsertHelper extends Logging {
Is this refactored based on hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java?
Correct. It's a simplified version converted into Scala (to handle RDDs)
Trying to understand this better: why did this need to be in Scala?
To avoid back-n-forth Java/Scala conversions
  }
}

val metaFields = Seq(
Nit: use HoodieRecord.HOODIE_META_COLUMNS and a transformation to form the meta fields?
Good call. The reason I didn't do it in the first place is that order is critical here, and even though we're using a list, I didn't want this constraint to end up obscured in another class (where order actually might not matter at all).
That's fair. But shouldn't all the places use the same order, so that we can maintain it in one place like HoodieRecord.HOODIE_META_COLUMNS and avoid discrepancies?
Fair point. The problem is that ordering only matters in a handful of contexts (compared to all usages of this list), and it's harder to justify why ordering matters when you're looking at just the HoodieRecord class.
OK, for this one we can keep it. Something to think about for consistency, to avoid bugs.
We should follow up and consolidate into one list in HoodieRecord. +1. Unless the other usages break with a different ordering, I don't see any reason why we won't.
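A possible shape of that consolidation, sketched under the assumption that HoodieRecord.HOODIE_META_COLUMNS is the ordered list of meta column names (the field type and nullability shown here are illustrative):

    import org.apache.hudi.common.model.HoodieRecord
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import scala.collection.JavaConverters.asScalaBufferConverter

    // Derive the meta-field StructFields from the single ordered list in HoodieRecord,
    // instead of re-declaring the column names (and their order) locally.
    val metaFields: Seq[StructField] = HoodieRecord.HOODIE_META_COLUMNS.asScala
      .map(name => StructField(name, StringType, nullable = false))

    def addMetaFields(schema: StructType): StructType =
      StructType(metaFields ++ schema.fields)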
}
// NOTE: It's critical whenever we keep the reference to the row, to make a copy
//       since Spark might be providing us with a mutable copy (updated during the iteration)
(rowKey, row.copy())
Is row.copy() needed here for reduceByKey?
We can only get away without copying when we do one-pass (streaming-like) processing. If at any point we need to hold a reference to the row, we have to make a copy (it's gonna fail otherwise).
OK, later on I think we need to revisit this pattern of copy() in the DAG to make sure the copies are needed. Could you create a ticket?
Note, what I'm saying is only applicable to InternalRow, which doesn't copy by default and instead points into a shared, mutable underlying buffer (actually holding what's been read).
Got it.
What exact scenarios cause Spark to fail without the copy? Could you please expand on that?
This exact code will fail if we remove the copy, b/c the InternalRow is often a mutable instance that Spark updates during iteration. That's safe when we only access the row currently under the pointer, but in the subsequent reduceByKey we access 2 rows at the same time.
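A condensed sketch of that hazard, with a hypothetical keyOf extractor standing in for the record-key lookup:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow

    def dedupe(rdd: RDD[InternalRow], keyOf: InternalRow => String): RDD[InternalRow] =
      rdd
        .mapPartitions { iter =>
          // copy(): Spark may reuse one mutable row buffer while iterating the partition
          iter.map(row => (keyOf(row), row.copy()))
        }
        .reduceByKey((left, _) => left) // holds two rows at once; only safe thanks to the copies
        .values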
…line the code and avoid unnecessary conversions
Added more tests;
Tidying up;
- Removing incorrect exclusions - Adding "hadoop-hdfs" test dependency
…verride `InternalRow` methods; Fixed tests;
…w` to make sure Avro/Spark key-gens are in-sync; Added `RowKeyGenUtils`;
@hudi-bot run azure

@hudi-bot run azure
…unnecessary `DataFrame` > `RDD` conversions
yihua
left a comment
LGTM
vinothchandar
left a comment
@alexeykudinkin could you please respond to / follow up on the minor comments. This PR by itself seems to incur RDD conversion; will look into #5523 as pointed out in the review responses.
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructType;
import scala.Function1;
I don't really like Scala imports in Java (it becomes an issue for us one day when we want to shrink the Scala spread in the code). Any way we can avoid this?
This is removed in #5523
| "Key-generator class name is required") | ||
|
|
||
| val prependedRdd: RDD[InternalRow] = | ||
| df.queryExecution.toRdd.mapPartitions { iter => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you confirm this issue has been resolved in Master branch
I need to convince myself of the RDD conversion in place, so this is marked "major concerns" until then.
@alexeykudinkin I want to get my understanding straight, as well as make sure we have an explanation for how these factors play out with the new changes.
I want to understand how we are avoiding the RDD conversion costs in the current approach. This cost becomes obvious when you have records with a large number of columns (due to the per-record overhead).
@vinothchandar TL;DR is the difference b/w …
What is the purpose of the pull request
Replacing UDF in Bulk Insert w/ RDD transformation.
Brief change log
TBD
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.