[SPARK-33314][SQL] Avoid dropping rows in Avro reader #30221

bersprockets · 2020-11-02T01:03:53Z

What changes were proposed in this pull request?

This PR adds a check to RowReader#hasNextRow such that multiple calls to RowReader#hasNextRow with no intervening call to RowReader#nextRow will avoid consuming more than 1 record.

This PR also modifies RowReader#nextRow such that consecutive calls will return new rows (previously consecutive calls would return the same row).

Why are the changes needed?

SPARK-32346 slightly refactored the AvroFileFormat and AvroPartitionReaderFactory to use a new iterator-like trait called AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and stores the deserialized row for the next call to RowReader#nextRow. Unfortunately, sometimes hasNextRow is called twice before nextRow is called, resulting in a lost row.

For example (assuming the V1 Avro reader):

val df = spark.range(0, 25).toDF("index")
df.write.mode("overwrite").format("avro").save("index_avro")
val loaded = spark.read.format("avro").load("index_avro")
// The following will give the expected size
loaded.collect.size
// The following will give the wrong size
loaded.orderBy("index").collect.size
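The lost row can be reproduced outside Spark with a toy reader. This is only a sketch of the failure mode described above: `EagerRowReader`, `EagerRowReaderDemo`, and the `Int` "records" are hypothetical stand-ins, not the actual AvroUtils#RowReader code.

```scala
// Toy stand-in for the pre-fix behavior: hasNextRow always pulls a new record.
// EagerRowReader and the Int "records" are hypothetical, not Spark classes.
class EagerRowReader(records: Iterator[Int]) {
  private var currentRow: Option[Int] = None

  // Each call consumes one record from the source, even if the previously
  // stored row was never handed out through nextRow.
  def hasNextRow: Boolean = {
    if (records.hasNext) {
      currentRow = Some(records.next())
      true
    } else {
      currentRow = None
      false
    }
  }

  // Returns whatever hasNextRow stored last.
  def nextRow: Int = currentRow.getOrElse(
    throw new NoSuchElementException("next on empty iterator"))
}

object EagerRowReaderDemo {
  // Drains a reader over 25 records, optionally making one extra hasNextRow
  // call up front, as a caller that probes for emptiness first might do.
  def drain(extraProbe: Boolean): List[Int] = {
    val reader = new EagerRowReader(Iterator.range(0, 25))
    if (extraProbe) reader.hasNextRow
    val out = List.newBuilder[Int]
    while (reader.hasNextRow) out += reader.nextRow
    out.result()
  }
}
```

With `extraProbe = true`, the first record is consumed by the probe and overwritten by the next hasNextRow call before anyone reads it, so the drained list is one row short.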

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added tests, which fail without the fix.

HyukjinKwon

The fix seems good

external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala

SparkQA · 2020-11-02T01:40:16Z

Test build #130511 has finished for PR 30221 at commit 9597080.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-02T03:02:45Z

Test build #130512 has finished for PR 30221 at commit 57e10c6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-11-02T03:29:45Z

cc @gengliangwang FYI

viirya · 2020-11-02T08:03:34Z

external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala

+      if (!interveningNext) {
+        // until a row is consumed, return previous result of hasNextRow
+        return prevHasNextRow
+      }


Can't we just reset currentRow in nextRow and check currentRow.isDefined here?

HeartSaVioR · Nov 2, 2020

I also feel @viirya's suggestion would be simpler.

In addition, it looks like the implementation didn't respect the Iterator contract: calling hasNextRow explicitly shouldn't be a prerequisite for calling nextRow.

The code below would fix the original issue (it passes the new test), and it would also work for a caller that only calls nextRow and handles NoSuchElementException.

def hasNextRow: Boolean = {
  while (!completed && currentRow.isEmpty) {
    val r = fileReader.hasNext && !fileReader.pastSync(stopPosition)
    if (!r) {
      fileReader.close()
      completed = true
      currentRow = None
    } else {
      val record = fileReader.next()
      currentRow = deserializer.deserialize(record).asInstanceOf[Option[InternalRow]]
    }
  }
  currentRow.isDefined
}

def nextRow: InternalRow = {
  if (currentRow.isEmpty) {
    if (!hasNextRow) {
      throw new NoSuchElementException("next on empty iterator")
    }
  }
  val row = currentRow.get
  currentRow = None
  row
}

HyukjinKwon · Nov 2, 2020

I had the same thought initially, but then I realised that @bersprockets might have wanted to take the least aggressive approach. For example, previously we could get the same row from the next call of nextRow. With the approach above we cannot, although I think that's fine. I don't mind either way.

bersprockets · Nov 2, 2020

Yeah, this approach is better because it also handles the case where nextRow is called multiple times. Previously that would return the same row over and over; as @HyukjinKwon pointed out, leaving it that way would keep the status quo, but it probably wouldn't be correct, since the users of this code are Iterator implementations.

viirya · Nov 2, 2020

From the API semantics perspective, shouldn't nextRow return the next row? It looks okay if hasNextRow has been called multiple times before nextRow is called. But it sounds weird that nextRow will be called with the same row. As a fix this looks fine, but an API that can be called that way seems like a weird design, particularly since it is documented as an iterator-like interface.

bersprockets · Nov 2, 2020

> But it sounds weird that nextRow will be called with the same row.

Yes, I agree. As I mentioned above, leaving it that way would keep the status quo, but probably wouldn't be correct since the users of this RowReader are Iterator implementations.

bersprockets · Nov 2, 2020

Follow-up to my previous comment: As @HyukjinKwon pointed out, I was earlier trying to fix the stated bug without changing other behavior. But I should probably fix nextRow while I am at it, which makes RowReader follow a more recognized pattern and makes the actual Iterators that use it more correct.
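The pattern this review thread converged on (buffer one row, pull from the source only when the buffer is empty, clear the buffer in nextRow) can be sketched on a toy source. This is an illustration under assumptions, not the merged Spark code: `BufferedRowReader`, `BufferedRowReaderDemo`, and the `evensOnly` filter are hypothetical names, and the `Option` returned by `deserialize` stands in for a row that a pushed-down filter may reject.

```scala
// Hypothetical stand-in for the buffered pattern; not the Spark class itself.
class BufferedRowReader[A, B](records: Iterator[A], deserialize: A => Option[B]) {
  private var completed = false
  private var currentRow: Option[B] = None

  // Idempotent: pulls from the source only while no row is buffered.
  def hasNextRow: Boolean = {
    while (!completed && currentRow.isEmpty) {
      if (records.hasNext) {
        currentRow = deserialize(records.next()) // None means "filtered out"
      } else {
        completed = true
      }
    }
    currentRow.isDefined
  }

  // Works with or without a preceding hasNextRow call, and clears the
  // buffer so that the next call returns a new row.
  def nextRow: B = {
    if (!hasNextRow) throw new NoSuchElementException("next on empty iterator")
    val row = currentRow.get
    currentRow = None
    row
  }
}

object BufferedRowReaderDemo {
  // Keeps even numbers only, mimicking a filter applied during deserialization.
  private def evensOnly(i: Int): Option[Int] = if (i % 2 == 0) Some(i) else None

  // Probes hasNextRow twice before every nextRow.
  def repeatedProbes(): List[Int] = {
    val reader = new BufferedRowReader[Int, Int](Iterator.range(0, 6), evensOnly)
    val out = List.newBuilder[Int]
    while (reader.hasNextRow && reader.hasNextRow) out += reader.nextRow
    out.result()
  }

  // Never calls hasNextRow explicitly.
  def nextOnly(): List[Int] = {
    val reader = new BufferedRowReader[Int, Int](Iterator.range(0, 6), evensOnly)
    List(reader.nextRow, reader.nextRow, reader.nextRow)
  }
}
```

Both call patterns yield the same rows, which is the property the thread asks for from an iterator-like interface.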

gengliangwang · 2020-11-02T10:56:03Z

So, if there is a scenario that calls hasNextRow multiple times, the code changes in #29145 also introduce a performance regression:

Before [SPARK-32346][SQL] Support filters pushdown in Avro datasource #29145, there was no deserialization in the method hasNext: https://github.com/apache/spark/pull/29145/files#diff-a70c279fcf47a5f521902ddd83b7f4dd0f594c14b0f85240705faa83da79d859L104
After [SPARK-32346][SQL] Support filters pushdown in Avro datasource #29145, there is always deserialization if there is a next row: https://github.com/apache/spark/pull/29145/files#diff-22181c0e0050f9694efac388063535cf77e92a82dd962fec3f8507dfae45e52cR185

I am sorry but shall we consider reverting #29145? CC @MaxGekk @cloud-fan

bersprockets · 2020-11-02T17:08:46Z

> After [SPARK-32346][SQL] Support filters pushdown in Avro datasource #29145, there is always deserialization if there is a next row: https://github.com/apache/spark/pull/29145/files#diff-22181c0e0050f9694efac388063535cf77e92a82dd962fec3f8507dfae45e52cR185
>
> I am sorry but shall we consider reverting #29145? CC @MaxGekk @cloud-fan

@gengliangwang I noted only a single extra call to hasNextRow per task, so the issue was not performance but dropped records (I suppose there could be some scenario I don't know about where hasNextRow is called many extra times).

Anyway, both the fix I proposed and the suggested improvements to my proposed fix would alleviate that concern, since deserialization would be called only once per Avro record (regardless of how many times hasNextRow is called).
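The "once per record" claim can be checked with a small counter. This is a sketch under assumptions: `DeserializeCountDemo` and its local helpers are hypothetical, and the buffered lookahead only mirrors the shape of the fix, not the Spark code.

```scala
object DeserializeCountDemo {
  // Counts deserialize invocations when hasNextRow() is probed `probes`
  // times (at least once) before each nextRow() over `n` toy records.
  def deserializeCalls(n: Int, probes: Int): Int = {
    require(probes >= 1, "need at least one probe per row")
    val source = Iterator.range(0, n)
    var buffered: Option[Int] = None
    var calls = 0

    def deserialize(record: Int): Option[Int] = { calls += 1; Some(record) }

    // Buffered lookahead: deserializes only when nothing is pending.
    def hasNextRow(): Boolean = {
      if (buffered.isEmpty && source.hasNext) buffered = deserialize(source.next())
      buffered.isDefined
    }

    def nextRow(): Int = { val row = buffered.get; buffered = None; row }

    // Probe several times, then consume one row, until the source is drained.
    while ((1 to probes).forall(_ => hasNextRow())) nextRow()
    calls
  }
}
```

The count depends only on the number of records, not on how often hasNextRow() is probed.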

SparkQA · 2020-11-03T03:35:55Z

Test build #130545 has finished for PR 30221 at commit 6d1b468.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2020-11-03T03:54:31Z

I think I'm not qualified to give +1 (as the code change is now closer to my suggestion), but I think this should be OK.

Regarding the revert of #29145, I'm not sure how much it hurts to deserialize one element ahead of time, as it will need to be deserialized anyway in nextRow. If there are cases where the caller calls hasNext multiple times and expects each call to point to a "different" row (for any reason, such as performance), that no longer matches the contract of Iterator, and we'd be better off defining another interface for that.

gengliangwang · 2020-11-03T09:06:44Z

FYI I just tried and can't find a scenario that has multiple method calls on hasNext() without next(). Now I think we can merge this one instead of reverting #29145.

bersprockets · 2020-11-03T15:51:42Z

> FYI I just tried and can't find a scenario that has multiple method calls on hasNext() without next().

@gengliangwang My repro case is such an example. When BypassMergeSortShuffleWriter#write is driving the scan, there will be multiple consecutive calls to hasNext (at the start of each task). This causes trouble only with V1 Avro. In datasource V2, there seems to be some intervening iterator that properly handles the multiple hasNext calls, thereby protecting the iterator in AvroPartitionReaderFactory from them.

I know of no case where there are consecutive calls to next without an intervening hasNext, but the latest commit to this PR handles it.
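How an intervening iterator can absorb repeated hasNext calls is sketched below. The actual datasource V2 iterator is not shown in this thread, so `LookaheadGuard`, `LookaheadGuardDemo`, and the `advance`/`get` function pair are assumptions made for illustration only.

```scala
// Hypothetical guard; the real intervening iterator in datasource V2 is not
// shown in this thread, so this only illustrates the general idea.
class LookaheadGuard[T](advance: () => Boolean, get: () => T) extends Iterator[T] {
  private var valuePrepared = false

  // Calls advance() at most once per consumed value, however often hasNext
  // is probed. After exhaustion, advance() may be called again.
  override def hasNext: Boolean = {
    if (!valuePrepared) valuePrepared = advance()
    valuePrepared
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("End of stream")
    valuePrepared = false
    get()
  }
}

object LookaheadGuardDemo {
  def collectWithExtraProbe(): List[Int] = {
    val source = Iterator.range(0, 25)
    var current = -1
    // Eager "advance": every call consumes a record, like the pre-fix hasNextRow.
    val guarded = new LookaheadGuard[Int](
      () => if (source.hasNext) { current = source.next(); true } else false,
      () => current)
    guarded.hasNext // extra probe, absorbed by the guard
    guarded.toList
  }
}
```

Even though the wrapped `advance` consumes a record on every call, the guard forwards only one call per row, so the extra probe loses nothing.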

external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala

SparkQA · 2020-11-04T03:18:50Z

Test build #130585 has finished for PR 30221 at commit 134c12c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-11-05T02:49:53Z

Merged to master.

MaxGekk · 2021-05-17T15:51:21Z

external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala

  }
 }
+
+class AvroRowReaderSuite


@bersprockets @HyukjinKwon I just noticed that this suite is in AvroSuite.scala. Is there any specific reason not to place it in AvroRowReaderSuite.scala?

cloud-fan · May 17, 2021

I don't think so. Feel free to move it there.

MaxGekk · May 20, 2021

Please review PR #32607.

bersprockets added 2 commits November 1, 2020 11:21

Initial attempt

0e9764b

Add test

9597080

HyukjinKwon reviewed Nov 2, 2020

Address review comments

57e10c6

bersprockets changed the title ~~[SPARK-33314][SQL][WIP] Avoid dropping rows in Avro reader~~ [SPARK-33314][SQL] Avoid dropping rows in Avro reader Nov 2, 2020

HyukjinKwon approved these changes Nov 2, 2020

viirya reviewed Nov 2, 2020

review feedback; add test

6d1b468

bersprockets commented Nov 3, 2020

A little cleanup of new test

134c12c

HyukjinKwon closed this in 7e8eb04 Nov 5, 2020

MaxGekk reviewed May 17, 2021

bersprockets deleted the avro_iterator_play branch November 2, 2022 00:25
