Skip to content
Closed
162 changes: 162 additions & 0 deletions core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java
Original file line number Diff line number Diff line change
@@ -0,0 +1,162 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.apache.iceberg.util;

import java.nio.ByteBuffer;
import java.util.Arrays;

/**
* Within Z-Ordering the byte representations of objects being compared must be ordered,
* this requires several types to be transformed when converted to bytes. The goal is to
* map object's whose byte representation are not lexicographically ordered into representations
* that are lexicographically ordered.
* Most of these techniques are derived from
* https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-2/
*
* Some implementation is taken from
* https://github.com/apache/hbase/blob/master/hbase-common/src/main/java/org/apache/hadoop/hbase/util/OrderedBytes.java
*/
public class ZOrderByteUtils {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be documented somewhere the produced byte[] should be compared lexicographically, with bytes as unsigned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I will add this


private ZOrderByteUtils() {

}

/**
* Signed ints do not have their bytes in magnitude order because of the sign bit.
* To fix this, flip the sign bit so that all negatives are ordered before positives. This essentially
* shifts the 0 value so that we don't break our ordering when we cross the new 0 value.
*/
public static byte[] intToOrderedBytes(int val) {
ByteBuffer bytes = ByteBuffer.allocate(Integer.BYTES);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with this if you want to get this through quickly, but in the long term I think we should avoid the allocation cost. What I'd recommend since this is a util method is to add a reuse argument:

  // this should probably live in the ByteBuffers util class
  public static ByteBuffer reuseOrAllocate(ByteBuffer reuse, int length) {
    if (reuse.hasArray() && reuse.arrayOffset() == 0 && reuse.capacity() == length) {
      reuse.position(0);
      reuse.limit(length);
      return reuse;
    } else {
      return ByteBuffer.allocate(length);
    }
  }

  public static ByteBuffer intToOrderedBytes(int val, ByteBuffer reused) {
    ByteBuffer bytes = reuseOrAllocate(reused, 4);
    bytes.putInt(val ^ 0x80000000);
    return bytes;
  }

Then the caller can always call ByteBuffer.array() to get the underlying bytes, but it can keep reusing the space:

  v1Bytes = intToOrderedBytes(v1, v1Bytes);
  v2Bytes = intToOrderedBytes(v2, v2Bytes);
  zorderBytes = zorder(v1Bytes, v2Bytes, zorderBytes);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is worthwhile to get in now, it isn't a huge change and should let us think a bit more about the Spark implementation's memory efficiency

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I wonder if we should have reuseOrAllocate or just reuse and throw an error if it cannot. Basically give two versions of the orderedByteFunctions. One which attempts to reuse and throws an error if the Buffer cannot be reused and one that allocates a new buffer if no buffer is passed. That way the user knows their buffer is being re-used and the behavior isn't changing at runtime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented this but didn't allow for the "orAllocate" option. Caller needs to create the new allocation if they want to use it.

bytes.putInt(val ^ 0x80000000);
return bytes.array();
}

/**
* Signed longs are treated the same as the signed ints
*/
public static byte[] longToOrderBytes(long val) {
ByteBuffer bytes = ByteBuffer.allocate(Long.BYTES);
bytes.putLong(val ^ 0x8000000000000000L);
return bytes.array();
}

/**
* Signed shorts are treated the same as the signed ints
*/
public static byte[] shortToOrderBytes(short val) {
ByteBuffer bytes = ByteBuffer.allocate(Short.BYTES);
bytes.putShort((short) (val ^ (0x8000)));
return bytes.array();
}

/**
* Signed tiny ints are treated the same as the signed ints
*/
public static byte[] tinyintToOrderedBytes(byte val) {
ByteBuffer bytes = ByteBuffer.allocate(Byte.BYTES);
bytes.put((byte) (val ^ (0x80)));
return bytes.array();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we return the ByteBuffer instead since we're passing in a ByteBuffer?

}

/**
* IEEE 754 :
* “If two floating-point numbers in the same format are ordered (say, x {@literal <} y),
* they are ordered the same way when their bits are reinterpreted as sign-magnitude integers.”
*
* Which means floats can be treated as sign magnitude integers which can then be converted into lexicographically
* comparable bytes
*/
public static byte[] floatToOrderedBytes(float val) {
ByteBuffer bytes = ByteBuffer.allocate(Integer.BYTES);
int ival = Float.floatToIntBits(val);
ival ^= ((ival >> (Integer.SIZE - 1)) | Integer.MIN_VALUE);
bytes.putInt(ival);
return bytes.array();
}

/**
* Doubles are treated the same as floats
*/
public static byte[] doubleToOrderedBytes(double val) {
ByteBuffer bytes = ByteBuffer.allocate(Long.BYTES);
long lng = Double.doubleToLongBits(val);
lng ^= ((lng >> (Long.SIZE - 1)) | Long.MIN_VALUE);
bytes.putLong(lng);
return bytes.array();
}

/**
* Strings are lexicographically sortable BUT if different byte array lengths will
* ruin the Z-Ordering. (ZOrder requires that a given column contribute the same number of bytes every time).
* This implementation just uses a set size to for all output byte representations. Truncating longer strings
* and right padding 0 for shorter strings.
*/
public static byte[] stringToOrderedBytes(String val, int length) {
ByteBuffer bytes = ByteBuffer.allocate(length);
if (val != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you update this to accept a reused buffer, then we need to remember to zero out the bytes.

int maxLength = Math.min(length, val.length());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

val.length() is not "encoding-aware". Maybe add a comment that it doesn't matter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, don't add a comment. We must take byte-length into account, otherwise we would be truncating short non-all-ASCII string values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... an example would be String "ą", single character, so length=1.
We would copy only the first byte of the two-byte sequence (0xC4 0x85) for the letter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OH sorry I thought you were talking about something else, yeah let me switch this to the byte length of the encoded string

bytes.put(val.getBytes(), 0, maxLength);
Copy link
Member

@findepi findepi Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The encoding must be fixed, and it must be UTF-8, if we want the ordering of the produced bytes be coherent with value ordering.

Also byte order will match string/varchar order only if bytes are treated as unsigned.
I guess he unsignedness applies to other types, so maybe it's documented already, i didn't notice.

}
return bytes.array();
}

/**
* Interleave bits using a naive loop.
* @param columnsBinary an array of byte arrays, none of which are empty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is the none of which are empty commit still valid given that there are tests for empty arrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix up this comment, it's pretty unclear at the moment

* @return their bits interleaved
*/
public static byte[] interleaveBits(byte[][] columnsBinary) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mention it at first because it is an optimization but an alternative to this is a builder pattern that has specific methods for int/long/float/double/byte[] and a predertmined number of dimensions and a bit-width for each dimension. then the addFloat/addInt methods could be called in the correct sequence. I don't know if the JVM is smart enough to compile the ByteBuffer.of(...).put pattern to simple casts, but with the builder methods these could be avoided.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be a good thing for a future improvement. If you see the Spark request has some benchmarks and even with wrapping all these functions in UDFS and applying them to rows that way, it's still only about 2~3 x slower than sort with as many expressions. So I think the perf is probably ok to start with.

int interleavedSize = Arrays.stream(columnsBinary).mapToInt(a -> a.length).sum();
byte[] interleavedBytes = new byte[interleavedSize];
int sourceBit = 7;
int sourceByte = 0;
int sourceColumn = 0;
int interleaveBit = 7;
int interleaveByte = 0;
while (interleaveByte < interleavedSize) {
// Take what we have, Get the source Bit of the source Byte, move it to the interleaveBit position
interleavedBytes[interleaveByte] =
(byte) (interleavedBytes[interleaveByte] |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use |= instead of adding this every time?

(columnsBinary[sourceColumn][sourceByte] & 1 << sourceBit) >> sourceBit << interleaveBit);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this use >>> sourceBit so that the sign bit isn't propagated? or are we guaranteed that won't happen because this gets converted into an int so it is too sparse to worry?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually haven't thought this through, This is one of the reasons I added the big fuzzing test of the interleave code since I have a hard time thinking through byte path in my head. I can switch to >>> though since I think that is correct


--interleaveBit;
if (interleaveBit == -1) {
// Finished a byte in our interleave byte array start a new byte
interleaveByte++;
interleaveBit = 7;
}

// Find next column with a byte we can use
do {
++sourceColumn;
if (sourceColumn == columnsBinary.length) {
sourceColumn = 0;
if (--sourceBit == -1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's much easier to read and reason about algorithms like this if you don't use the return value of ++ or -- operators.

sourceByte++;
sourceBit = 7;
}
}
} while (columnsBinary[sourceColumn].length <= sourceByte && interleaveByte < interleavedSize);
Copy link
Contributor

@rdblue rdblue Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to have interleaveBytes < interleavedSize here? All that's going to happen is the first time through the loop, sourceColumn will be checked. But after that, if the interleave is finished then this will exit and the outer loop will exit. Then sourceColumn isn't used because the bytes are returned. So I think it would be equivalent to do this check above the do/while:

  if (interleaveBit == -1) {
    ...
  }

  if (interleaveByte < interleavedSize) {
    break;
  }

  do { ... } while (columnsBinary[sourceColumn].length <= sourceByte)

You could also use an if block to avoid the break if that's your thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm not a big break person, but I think it complicated the code a bit since I wrote without them. Let me see what I can do to clean this up a bit

}
return interleavedBytes;
}
}
Loading