15 changes: 15 additions & 0 deletions api/src/main/java/org/apache/iceberg/expressions/Literals.java
@@ -35,6 +35,7 @@
import java.util.Objects;
import java.util.UUID;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
import org.apache.iceberg.relocated.com.google.common.io.BaseEncoding;
import org.apache.iceberg.types.Comparators;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Type;
@@ -599,6 +600,13 @@ protected Type.TypeID typeId() {
Object writeReplace() throws ObjectStreamException {
return new SerializationProxies.FixedLiteralProxy(value());
}

@Override
public String toString() {
byte[] binary = new byte[value().remaining()];
value().duplicate().get(binary);
Contributor:
There's an existing ByteBuffers utility class, org.apache.iceberg.util.ByteBuffers, which you can use here. Can you please use that for consistency? It's also more efficient in some cases (it doesn't always allocate, and it handles offsets more thoroughly).

Contributor Author:
Thanks for your review; I have modified the code.

return "0x" + BaseEncoding.base16().encode(binary);
}
}

static class BinaryLiteral extends BaseLiteral<ByteBuffer> {
@@ -639,5 +647,12 @@ Object writeReplace() throws ObjectStreamException {
protected Type.TypeID typeId() {
return Type.TypeID.BINARY;
}

@Override
public String toString() {
byte[] binary = new byte[value().remaining()];
value().duplicate().get(binary);
Contributor:
Same note as above.

return "0x" + BaseEncoding.base16().encode(binary);
Contributor:
I'm not sure if we have any other places where we convert a ByteBuffer to a string.

If there are any, it seems like we should add the stringify function to the ByteBuffers utility class and then use it uniformly everywhere. You don't necessarily need to update the other places in this PR, but would you mind taking a look and seeing if you find any?

Contributor:
Follow-up question: are there cases where engines might not use base16 for the representation, or where the leading zero might not be used? I don't think so for Spark, but I'm less sure with Flink, for example.

For Spark, the documentation suggests it's just X, without the leading zero: https://spark.apache.org/docs/latest/sql-ref-literals.html#binary-literal

Trino also appears to start with the capital X (without the zero): https://trino.io/docs/current/language/types.html#varbinary

For Flink, it looks like binary types may actually be represented as BINARY(n) or VARBINARY(n); this is the one I'm least sure about: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/table/types/#binary

I'm not as sure about other systems.
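For reference, a small plain-Java sketch contrasting the 0x form this PR emits with the leading-X style the Spark and Trino docs describe (the per-engine behavior itself is the open question above, which this sketch does not settle):

```java
public class BinaryLiteralFormats {
    // Hand-rolled uppercase hex encoder, standing in for Guava's
    // BaseEncoding.base16() so the sketch has no dependencies.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] value = {0x12, 0x34, 0x56};
        System.out.println("0x" + hex(value));       // style emitted by this PR's toString
        System.out.println("X'" + hex(value) + "'"); // Spark/Trino SQL binary literal style
    }
}
```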

Contributor:
This should just be for debugging. If we are converting a literal to a string and back, then that's a problem.

Contributor Author:
> This should just be for debugging.

In the base class, the toString method is String.valueOf(value) (in StringLiteral, it is "\"" + value() + "\""). For other types of literals, the return value of toString is clear, but for ByteBuffer it is something like java.nio.HeapByteBuffer[pos=0 lim=16 cap=16]. So I implemented the toString method for ByteBuffer-typed literals, for logging.

}
}
}
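The duplicate-and-copy pattern in the two toString overrides above can be exercised in isolation. This is a dependency-free sketch: toByteArray mirrors value().duplicate().get(binary) from the diff, and base16 is a stand-in for Guava's BaseEncoding.base16().encode:

```java
import java.nio.ByteBuffer;

public class ByteBufferToString {
    // Copy the buffer's remaining bytes without moving its position,
    // mirroring value().duplicate().get(binary) in the diff.
    static byte[] toByteArray(ByteBuffer buffer) {
        byte[] bytes = new byte[buffer.remaining()];
        buffer.duplicate().get(bytes);
        return bytes;
    }

    // Stand-in for BaseEncoding.base16().encode(...), which emits uppercase hex.
    static String base16(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ByteBuffer value = ByteBuffer.wrap(new byte[] {0x11, 0x11});
        System.out.println("0x" + base16(toByteArray(value)));
        // Because we read through a duplicate, the source buffer's
        // position is untouched: toString has no side effects.
        System.out.println(value.remaining());
    }
}
```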
@@ -19,7 +19,6 @@

package org.apache.iceberg.spark;

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
@@ -636,17 +635,17 @@ public <T> String predicate(UnboundPredicate<T> pred) {
case NOT_NAN:
return "not_nan(" + pred.ref().name() + ")";
case LT:
return pred.ref().name() + " < " + sqlString(pred.literal());
return pred.ref().name() + " < " + pred.literal();
case LT_EQ:
return pred.ref().name() + " <= " + sqlString(pred.literal());
return pred.ref().name() + " <= " + pred.literal();
case GT:
return pred.ref().name() + " > " + sqlString(pred.literal());
return pred.ref().name() + " > " + pred.literal();
case GT_EQ:
return pred.ref().name() + " >= " + sqlString(pred.literal());
return pred.ref().name() + " >= " + pred.literal();
case EQ:
return pred.ref().name() + " = " + sqlString(pred.literal());
return pred.ref().name() + " = " + pred.literal();
case NOT_EQ:
return pred.ref().name() + " != " + sqlString(pred.literal());
return pred.ref().name() + " != " + pred.literal();
Contributor (@kbendick, Nov 3, 2021):
Why did we get rid of the sqlString function here? Is there a change that removes the need to quote strings, for example?

Also, since the representation of ByteBuffer is potentially engine-specific, would it be better to add the conversion to sqlString here for Spark (e.g. using the leading capital X, like X'123456')? This would allow other engines to handle it themselves (e.g. if Flink doesn't use the leading-X format).

Contributor:
@kbendick is right. This should not remove the sqlString function; that converts a literal into a form that is usable by Spark. Instead, you should update that function to produce Spark's binary literal format.

Contributor Author (@izchen, Nov 16, 2021):
sqlString is implemented and used in the DescribeExpressionVisitor class.

I think this class is just for generating description information, and the description is displayed in the log and the Spark UI for debugging. So in my opinion it is acceptable to use Literal.toString directly, which keeps the description consistent with the style of other logs. In fact, except for StringLiteral using double quotes instead of single quotes, the current sqlString and Literal.toString produce the same output.

But if you think keeping sqlString is the better approach, I can modify the code. In addition, the description here differs somewhat from Spark's own format; for example, Spark's StringLiteral is not quoted. If you think this should match Spark's behavior, I can also modify the code.
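A sketch of the alternative the reviewers suggest: keep sqlString and teach it Spark's binary literal format instead of deleting it. The method below is hypothetical and operates on raw values rather than Iceberg's Literal type so that it runs standalone:

```java
import java.nio.ByteBuffer;

public class SqlStringSketch {
    // Hypothetical sqlString: quote strings, render binary as X'...'
    // (Spark's binary literal format), and fall back to toString otherwise.
    static String sqlString(Object value) {
        if (value instanceof String) {
            return "'" + value + "'";
        } else if (value instanceof ByteBuffer) {
            // Read through a duplicate so the caller's buffer is untouched.
            ByteBuffer buf = ((ByteBuffer) value).duplicate();
            StringBuilder sb = new StringBuilder("X'");
            while (buf.hasRemaining()) {
                sb.append(String.format("%02X", buf.get()));
            }
            return sb.append('\'').toString();
        }
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(sqlString("NaN"));
        System.out.println(sqlString(ByteBuffer.wrap(new byte[] {0x11, 0x01})));
        System.out.println(sqlString(3L));
    }
}
```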

case STARTS_WITH:
return pred.ref().name() + " LIKE '" + pred.literal() + "%'";
case IN:
@@ -659,17 +658,7 @@ public <T> String predicate(UnboundPredicate<T> pred) {
}

private static <T> String sqlString(List<org.apache.iceberg.expressions.Literal<T>> literals) {
return literals.stream().map(DescribeExpressionVisitor::sqlString).collect(Collectors.joining(", "));
}

private static String sqlString(org.apache.iceberg.expressions.Literal<?> lit) {
if (lit.value() instanceof String) {
return "'" + lit.value() + "'";
} else if (lit.value() instanceof ByteBuffer) {
throw new IllegalArgumentException("Cannot convert bytes to SQL literal: " + lit);
} else {
return lit.value().toString();
}
return literals.stream().map(Object::toString).collect(Collectors.joining(", "));
Contributor:
This seems possibly incorrect, or a deviation from the previous behavior.

It looks like this code is calling Object.toString on Literal<T>, whereas before, the toString call came from Literal<T>::value. Is that the intended change?

Contributor Author:
Thanks for your review. After testing, I think this is correct.

}
}

@@ -39,6 +39,7 @@
import org.junit.AfterClass;
import org.junit.Assert;
import org.junit.BeforeClass;
import org.junit.internal.ExactComparisonCriteria;
Contributor:
Is it normal to use this class, which has "internal" in the package name? Do we have to worry about behavioral changes if we upgrade JUnit?

Contributor Author:
Thanks for your review; I have modified the code.


import static org.apache.hadoop.hive.conf.HiveConf.ConfVars.METASTOREURIS;

@@ -157,7 +158,11 @@ private void assertEquals(String context, Object[] expectedRow, Object[] actualRow) {
Object actualValue = actualRow[col];
if (expectedValue != null && expectedValue.getClass().isArray()) {
String newContext = String.format("%s (nested col %d)", context, col + 1);
assertEquals(newContext, (Object[]) expectedValue, (Object[]) actualValue);
if (expectedValue.getClass().getComponentType().isPrimitive()) {
new ExactComparisonCriteria().arrayEquals(newContext, expectedValue, actualValue);
} else {
assertEquals(newContext, (Object[]) expectedValue, (Object[]) actualValue);
}
Contributor:
Why was this needed?

Contributor Author:
Thanks for your review; I have modified the code.

} else if (expectedValue != ANY) {
Assert.assertEquals(context + " contents should match", expectedValue, actualValue);
}
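For context on why the test helper above special-cases primitive component types: a byte[] cannot be cast to Object[], and arrays do not override equals, so the recursive assertEquals path cannot handle the new binary column. A minimal illustration:

```java
import java.util.Arrays;

public class PrimitiveArrayEquality {
    public static void main(String[] args) {
        byte[] expected = {0x11, 0x11};
        byte[] actual = {0x11, 0x11};
        // Arrays use reference equality, so equal contents still compare false.
        System.out.println(expected.equals(actual));
        // Content comparison needs an element-wise check such as Arrays.equals
        // (which is effectively what ExactComparisonCriteria performs).
        System.out.println(Arrays.equals(expected, actual));
    }
}
```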
@@ -49,8 +49,8 @@ public TestSelect(String catalogName, String implementation, Map<String, String>

@Before
public void createTables() {
sql("CREATE TABLE %s (id bigint, data string, float float) USING iceberg", tableName);
sql("INSERT INTO %s VALUES (1, 'a', 1.0), (2, 'b', 2.0), (3, 'c', float('NaN'))", tableName);
sql("CREATE TABLE %s (id bigint, data string, float float, binary binary) USING iceberg", tableName);
sql("INSERT INTO %s VALUES (1, 'a', 1.0, X''), (2, 'b', 2.0, X'11'), (3, 'c', float('NaN'), X'1111')", tableName);
Contributor:
Rather than repurposing old test cases, can you create new ones? We want to avoid mixing tests together in confusing ways that look like other failures.

Contributor:
I'd also add that repurposing existing test cases makes it harder for people who maintain forks and might not cherry-pick every commit.

Contributor Author:
Thanks for your review; I have modified the code.


this.scanEventCount = 0;
this.lastScanEvent = null;
@@ -64,14 +64,16 @@ public void removeTables() {
@Test
public void testSelect() {
List<Object[]> expected = ImmutableList.of(
row(1L, "a", 1.0F), row(2L, "b", 2.0F), row(3L, "c", Float.NaN));
row(1L, "a", 1.0F, new byte[]{}),
row(2L, "b", 2.0F, new byte[]{0x11}),
row(3L, "c", Float.NaN, new byte[]{0x11, 0x11}));

assertEquals("Should return all expected rows", expected, sql("SELECT * FROM %s", tableName));
}

@Test
public void testSelectRewrite() {
List<Object[]> expected = ImmutableList.of(row(3L, "c", Float.NaN));
List<Object[]> expected = ImmutableList.of(row(3L, "c", Float.NaN, new byte[]{0x11, 0x11}));

assertEquals("Should return all expected rows", expected,
sql("SELECT * FROM %s where float = float('NaN')", tableName));
@@ -120,4 +122,12 @@ public void testMetadataTables() {
ImmutableList.of(row(ANY, ANY, null, "append", ANY, ANY)),
sql("SELECT * FROM %s.snapshots", tableName));
}

@Test
public void testFilterBinary() {
List<Object[]> expected = ImmutableList.of(row(3L, "c", Float.NaN, new byte[]{0x11, 0x11}));

assertEquals("Should return all expected rows", expected,
sql("SELECT * FROM %s where binary > X'1101'", tableName));
}
}
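The testFilterBinary predicate relies on lexicographic byte comparison: X'1111' > X'1101' because the first bytes tie and the second byte 0x11 exceeds 0x01. With non-negative bytes this matches ByteBuffer's built-in ordering, as a quick check shows (for bytes above 0x7F, signed vs. unsigned comparison would differ, which this sketch glosses over):

```java
import java.nio.ByteBuffer;

public class BinaryFilterCheck {
    public static void main(String[] args) {
        ByteBuffer rowValue = ByteBuffer.wrap(new byte[] {0x11, 0x11}); // X'1111'
        ByteBuffer bound = ByteBuffer.wrap(new byte[] {0x11, 0x01});    // X'1101'
        // First bytes tie; 0x11 > 0x01 decides the comparison, so only the
        // third test row satisfies binary > X'1101'.
        System.out.println(rowValue.compareTo(bound) > 0);
    }
}
```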