Spark: Fix IllegalArgumentException when filtering on BinaryType column #3460
Conversation
Related issue: #2934
kbendick left a comment
Thanks for your contribution @izchen!
Left some feedback. I have some concerns around the removal of the sqlString function for literals, as I don't see any corresponding changes for strings.
Eventually, we'll need to update more than just Spark v3.2. Additionally, it would be nice if Flink and even MR could be handled in this PR. I'm especially interested in Flink, as its documentation suggests it uses a constructor-looking syntax. I admit I'm not familiar enough with using byte literals in Flink SQL to be sure, though.
```java
byte[] binary = new byte[value().remaining()];
value().duplicate().get(binary);
```
Same note as above.
```java
public String toString() {
  byte[] binary = new byte[value().remaining()];
  value().duplicate().get(binary);
  return "0x" + BaseEncoding.base16().encode(binary);
```
I'm not sure if we have any other places where we convert a ByteBuffer to a string.
If there are any, it seems like potentially we should add the stringify function to the ByteBuffer utility class and then try to use it uniformly everywhere. You don't necessarily need to update the other places to use them in this PR, but would you mind taking a look and seeing if you find any?
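A minimal sketch of what such a shared helper could look like, assuming it sits in a ByteBuffers-style utility class (the name stringify and the plain Guava import are assumptions, not code from this PR):

```java
import java.nio.ByteBuffer;
import com.google.common.io.BaseEncoding;

public class ByteBufferStrings {
  // Hypothetical shared helper: hex-encode a buffer without consuming it
  public static String stringify(ByteBuffer buffer) {
    byte[] bytes = new byte[buffer.remaining()];
    buffer.duplicate().get(bytes); // duplicate so the caller's position is untouched
    return "0x" + BaseEncoding.base16().encode(bytes);
  }
}
```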
Follow-up question: are there cases where people might not use base16 for the representation, or where the leading 0x might not be used? I don't think so for Spark, but I'm less sure about Flink, for example.
For Spark, looking at the documentation, it seems to be just X'...', without the 0x prefix: https://spark.apache.org/docs/latest/sql-ref-literals.html#binary-literal
I'm not as sure about Trino, Flink, and other systems though.
Trino also looks to start with a capital X (no 0x): https://trino.io/docs/current/language/types.html#varbinary
Looking it up for Flink, it may actually be represented as BINARY(n) or VARBINARY(n); this is the one I'm least sure about: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/table/types/#binary
This should just be for debugging. If we are converting a literal to a string and back, then that's a problem.
This should just be for debugging.
In the base class, the toString method returns String.valueOf(value) (in StringLiteral, it is "\"" + value() + "\"").
For other literal types, the return value of toString is clear, but for a ByteBuffer it is something like java.nio.HeapByteBuffer[pos=0 lim=16 cap=16].
So I implemented toString for the ByteBuffer-typed literal, for logging.
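For illustration, a small standalone sketch of the difference (the 0x formatting mirrors the diff above; BaseEncoding is Guava's, which Iceberg may import under a relocated package):

```java
import java.nio.ByteBuffer;
import com.google.common.io.BaseEncoding;

public class ToStringDemo {
  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.wrap(new byte[] {0x11, 0x0F});
    // Default Buffer#toString: java.nio.HeapByteBuffer[pos=0 lim=2 cap=2]
    System.out.println(buf);
    // With the override from the diff above: 0x110F
    byte[] binary = new byte[buf.remaining()];
    buf.duplicate().get(binary);
    System.out.println("0x" + BaseEncoding.base16().encode(binary));
  }
}
```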
```diff
   return pred.ref().name() + " = " + pred.literal();
 case NOT_EQ:
-  return pred.ref().name() + " != " + sqlString(pred.literal());
+  return pred.ref().name() + " != " + pred.literal();
```
Why did we get rid of the sqlString function here? Is there a change that removes the need to quote strings for example?
Also, since the representation of ByteBuffer is potentially engine specific, would it be better to add the conversion to the sqlString here for Spark (e.g. using the leading capital X, like X'123456')?
This would allow other engines to handle it themselves (e.g. if Flink doesn't use leading X format).
@kbendick is right. This should not remove the sqlString function. That converts a literal into a form that is usable by Spark. Instead, you should update that function to produce Spark's binary literal format.
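A sketch of that suggested update to sqlString, reusing the ByteBuffers utility mentioned elsewhere in this review (imports omitted; the method shape follows the else branch quoted later in this thread, not final code):

```java
private static String sqlString(org.apache.iceberg.expressions.Literal<?> lit) {
  if (lit.value() instanceof String) {
    return "'" + lit.value() + "'"; // keep quoting string literals
  } else if (lit.value() instanceof ByteBuffer) {
    byte[] bytes = ByteBuffers.toByteArray((ByteBuffer) lit.value());
    // Spark's binary literal syntax, e.g. X'110F'
    return "X'" + BaseEncoding.base16().encode(bytes) + "'";
  } else {
    return lit.value().toString();
  }
}
```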
sqlString is implemented and used in the DescribeExpressionVisitor class.
I think this class is just for generating description information, and the description is displayed in logs and the Spark UI for debugging. So in my opinion it is acceptable to use Literal.toString directly, which keeps the description consistent with the style of other logs.
In fact, apart from StringLiteral using single versus double quotes, the current sqlString and Literal.toString are the same.
But if you think keeping sqlString is the better way, I can modify the code.
In addition, the description format here differs somewhat from Spark's; for example, Spark's string literals have no quotation marks. If you think this should match Spark's behavior, I can also modify the code.
```java
import org.junit.AfterClass;
import org.junit.Assert;
import org.junit.BeforeClass;
import org.junit.internal.ExactComparisonCriteria;
```
Is it normal to use a class that has internal in its package name? Do we have to worry about behavioral changes if we upgrade JUnit?
Thanks for your review, I have modified the code
```java
byte[] binary = new byte[value().remaining()];
value().duplicate().get(binary);
```
There's an existing ByteBuffers utility class, org.apache.iceberg.util.ByteBuffers, which you can use here. Can you please use that for consistency? It's also more efficient in some cases (doesn't always necessarily allocate, handles offsets more thoroughly, etc).
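A sketch of the suggested substitution (assuming the utility's toByteArray(ByteBuffer) signature, which matches its use later in this PR):

```java
import java.nio.ByteBuffer;
import org.apache.iceberg.util.ByteBuffers;

class BinaryValueExample {
  private final ByteBuffer value = ByteBuffer.wrap(new byte[] {0x11, 0x0F});

  byte[] binary() {
    // Replaces the manual allocate/duplicate/get pattern quoted above;
    // per the review comment, the utility can also skip the copy for
    // fully array-backed buffers
    return ByteBuffers.toByteArray(value);
  }
}
```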
Thanks for your review, I have modified the code
```diff
-} else {
-  return lit.value().toString();
-}
+return literals.stream().map(Object::toString).collect(Collectors.joining(", "));
```
This seems possibly incorrect or a deviation from the behavior before.
It looks like this code is calling Object.toString on Literal<T>, whereas before, the toString call came from Literal<T>::value. Is that the intended change?
Thanks for your review. After testing, I think this is correct.
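To make the before/after distinction concrete, a hedged sketch (the surrounding class is illustrative; Literal is org.apache.iceberg.expressions.Literal):

```java
import java.util.List;
import java.util.stream.Collectors;
import org.apache.iceberg.expressions.Literal;

class JoinLiterals {
  static String describe(List<Literal<?>> literals) {
    // Before: stringified the raw value, so a ByteBuffer rendered as
    //   java.nio.HeapByteBuffer[pos=0 lim=2 cap=2]
    // return literals.stream()
    //     .map(lit -> lit.value().toString())
    //     .collect(Collectors.joining(", "));

    // After: stringify the Literal itself, which this PR teaches to
    // render binary values as 0x110F
    return literals.stream().map(Object::toString).collect(Collectors.joining(", "));
  }
}
```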
```java
new ExactComparisonCriteria().arrayEquals(newContext, expectedValue, actualValue);
} else {
  assertEquals(newContext, (Object[]) expectedValue, (Object[]) actualValue);
}
```
Why was this needed?
Thanks for your review, I have modified the code
| sql("CREATE TABLE %s (id bigint, data string, float float) USING iceberg", tableName); | ||
| sql("INSERT INTO %s VALUES (1, 'a', 1.0), (2, 'b', 2.0), (3, 'c', float('NaN'))", tableName); | ||
| sql("CREATE TABLE %s (id bigint, data string, float float, binary binary) USING iceberg", tableName); | ||
| sql("INSERT INTO %s VALUES (1, 'a', 1.0, X''), (2, 'b', 2.0, X'11'), (3, 'c', float('NaN'), X'1111')", tableName); |
Rather than repurposing old test cases, can you create new ones? We want to avoid mixing tests together in confusing ways that look like other failures.
I'd also add that repurposing existing test cases makes it harder for people who maintain forks and might not cherry-pick every commit.
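A sketch of what a dedicated test could look like instead, following the sql(...) harness from the diff above (the method name, the row(...) helper, and the assertion shape are assumptions based on the surrounding test style):

```java
@Test
public void testFilterOnBinaryColumn() {
  sql("CREATE TABLE %s (id bigint, binary binary) USING iceberg", tableName);
  sql("INSERT INTO %s VALUES (1, X''), (2, X'11'), (3, X'1111')", tableName);

  // This predicate previously threw IllegalArgumentException when the
  // filter was pushed down to Iceberg
  assertEquals(
      "Should filter on a binary column without error",
      ImmutableList.of(row(2L)),
      sql("SELECT id FROM %s WHERE binary = X'11'", tableName));
}
```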
Thanks for your review, I have modified the code
Force-pushed from 24e6298 to e78abd7.
```java
}

public static String encodeHexString(ByteBuffer buffer) {
  byte[] bytes = toByteArray(buffer.duplicate());
```
No need for duplicate here because toByteArray won't modify the buffer.
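A sketch of the simplified version (assuming, per the comment above, that toByteArray reads through its own duplicate or absolute indexing, so the caller's buffer position is never moved):

```java
public static String encodeHexString(ByteBuffer buffer) {
  // toByteArray does not modify the buffer, so no defensive duplicate() is needed
  return BaseEncoding.base16().encode(toByteArray(buffer));
}
```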
Thanks for your review, I have modified the code
Flink: Executing this SQL test case in Flink reports an error. It seems to be an internal Flink error that has nothing to do with Iceberg. I can add the following code to Flink to solve the problem (I will try to submit a PR to the Flink community):

```scala
case BINARY =>
  // convert to byte[]
  literal.getValueAs(classOf[Array[Byte]])
```

After applying this fix to Flink, the SQL test case runs normally, returns the correct result, and the SQL description string in the Flink UI is the correct X'110F'. I think this problem does not exist in the Flink runtime.

Hive: According to the Hive documentation, Hive has no binary-type literal, so I think this problem does not exist in the Hive runtime.

Trino: I have not actually used Trino. Judging from the Trino code, Trino may not have the problem this PR fixes, but I do not have a Trino test environment, and we cannot modify the Trino runtime code in this PR.
I found a related Trino community PR via the mailing list. That PR's unit tests rely on the current behavior of Spark with Iceberg.
```java
} else if (value instanceof scala.collection.Map) {
  return row.getJavaMap(pos);
} else if (value.getClass().isArray() && value.getClass().getComponentType().isPrimitive()) {
  return IntStream.range(0, Array.getLength(value)).mapToObj(i -> Array.get(value, i)).toArray();
```
I think this is correct. The toJava method assumed that the elements of the returned array are all wrapper types, which is a false assumption: Spark's Java API can return an array of primitive types.
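A standalone sketch of why the per-element boxing is needed (names are illustrative):

```java
import java.lang.reflect.Array;
import java.util.stream.IntStream;

public class BoxPrimitiveArray {
  static Object[] box(Object array) {
    // A byte[] or int[] cannot be cast to Object[], so (Object[]) array would
    // throw ClassCastException; java.lang.reflect.Array boxes element by element
    return IntStream.range(0, Array.getLength(array))
        .mapToObj(i -> Array.get(array, i))
        .toArray();
  }

  public static void main(String[] args) {
    Object binaryColumn = new byte[] {0x11, 0x0F}; // e.g. from Spark's Java API
    Object[] boxed = box(binaryColumn);            // two Byte elements: 17, 15
    System.out.println(boxed.length);              // prints 2
  }
}
```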
This is fixed by #3728, which I have been reviewing lately and just committed. I didn't realize this was a separate PR until yesterday, so sorry that I hadn't come back to this one (I actually thought I was reviewing the same one!). In any case, the fix is nearly identical, so this should be resolved by that other PR. Thanks, @izchen!
The test case reproducing the exception:
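(The original snippet did not survive extraction here; the following is an illustrative reconstruction in the style of the tests above, with a hypothetical table and literal.)

```java
// A pushed-down filter on a binary column of roughly this shape triggered
// the IllegalArgumentException before this fix:
sql("SELECT id FROM %s WHERE binary = X'11'", tableName);
```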
For SQL like this, BinaryLiteral(ByteBuffer) is not an illegal value.
In addition, the BinaryLiteral string format in this PR is the same as Spark's:
https://github.com/apache/spark/blob/b874bf5dca4f1b7272f458350eb153e7b272f8c8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L347