HBASE-26353 Support loadable dictionaries in hbase-compression-zstd #3748
Conversation
Haven't read the code yet, but is it possible to copy the dict into the HBase storage so it is controlled by us?
    return size > 0 ? size : 256 * 1024; // Don't change this default
  }

  static LoadingCache<Configuration,byte[]> CACHE = CacheBuilder.newBuilder()
Using Configuration as the key makes me a bit nervous. After checking the code, there are no hashCode and equals methods in Configuration, so it will perform like an IdentityHashMap...
So is it possible to use the file name as the map key here? I suppose different tables could use the same dict.
This is definitely a concern.
In the latest version of the patch I override hashCode in CompoundConfiguration so we are doing something better than object identity when caching the dictionaries for the store writer case. It is kind of expensive to compute the hashCode given how CompoundConfiguration works but at least we do not do it that often, and not in performance critical code. Once a compressor or decompressor is created it is reused for the lifetime of the reader or writer. Otherwise we are using object identity. That is not the worst thing, at least. The cache is capped at 100 and will also expire entries if they are not used for one hour.
Let me try your suggestion. I was thinking we could avoid doing two lookups into the Configuration -- to get the boolean, and then the path, for the key -- but that hashCode calculation is pretty expensive. Getting the path from the configuration object and using that as the key would be less expensive.
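For illustration, a minimal sketch of the path-keyed cache being suggested here. The property name (hbase.io.compress.zstd.dictionary), the class names, and the readDictionary helper (sketched later in this thread next to the size-limit discussion) are assumptions for the example, not necessarily what the patch merged; HBase itself would use its shaded thirdparty Guava rather than plain Guava.

  import java.io.IOException;
  import java.util.concurrent.ExecutionException;
  import java.util.concurrent.TimeUnit;
  import com.google.common.cache.CacheBuilder;
  import com.google.common.cache.CacheLoader;
  import com.google.common.cache.LoadingCache;
  import org.apache.hadoop.conf.Configuration;

  public final class DictionaryCacheSketch {
    // Hypothetical property naming the dictionary file for a column family.
    static final String ZSTD_DICTIONARY_KEY = "hbase.io.compress.zstd.dictionary";

    // Keyed by the path string, so identical dictionaries shared by different tables
    // (and different Configuration instances) resolve to a single cached entry.
    static final LoadingCache<String, byte[]> DICTIONARY_CACHE = CacheBuilder.newBuilder()
        .maximumSize(100)
        .expireAfterAccess(1, TimeUnit.HOURS)
        .build(new CacheLoader<String, byte[]>() {
          @Override
          public byte[] load(String path) throws Exception {
            // readDictionary is sketched in the size-limit example further down.
            // A real implementation would thread the relevant Configuration through.
            return DictionaryLoader.readDictionary(new Configuration(), path);
          }
        });

    static byte[] getDictionary(final Configuration conf) throws IOException {
      final String path = conf.get(ZSTD_DICTIONARY_KEY);
      if (path == null) {
        return null; // no dictionary configured for this family
      }
      try {
        return DICTIONARY_CACHE.get(path);
      } catch (ExecutionException e) {
        throw new IOException("Unable to load dictionary " + path, e);
      }
    }
  }

Keying by path also sidesteps the relatively expensive CompoundConfiguration hashCode computation mentioned above.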
      final Path p = new Path(s);
      final ByteArrayOutputStream baos = new ByteArrayOutputStream();
      final byte[] buffer = new byte[8192];
      try (final FSDataInputStream in = FileSystem.get(p.toUri(), conf).open(p)) {
Do we need to limit the max dict size here? If a user creates a table with a very large dict file, it could bring down the whole cluster if we do not truncate it here.
Yes. If there is a size limit and it is exceeded, the codec load should probably be rejected by throwing a RuntimeException.
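A minimal sketch of such a check, assuming a hypothetical MAX_DICTIONARY_SIZE cap and helper class name; the real limit, and whether it is configurable, would be up to the patch:

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public final class DictionaryLoader {
    // Hypothetical cap on dictionary size; the real value is up to the patch.
    static final int MAX_DICTIONARY_SIZE = 10 * 1024 * 1024; // 10 MB

    static byte[] readDictionary(final Configuration conf, final String s) throws IOException {
      final Path p = new Path(s);
      final ByteArrayOutputStream baos = new ByteArrayOutputStream();
      final byte[] buffer = new byte[8192];
      try (FSDataInputStream in = FileSystem.get(p.toUri(), conf).open(p)) {
        int n;
        while ((n = in.read(buffer)) > 0) {
          baos.write(buffer, 0, n);
          if (baos.size() > MAX_DICTIONARY_SIZE) {
            // Reject oversized dictionaries rather than buffering them into heap.
            throw new IllegalArgumentException("Dictionary " + s + " is larger than "
                + MAX_DICTIONARY_SIZE + " bytes");
          }
        }
      }
      return baos.toByteArray();
    }
  }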
Let me re-run the performance evaluation from HBASE-26259, but with synthetic small-value data, and compare speed and efficiency with a precomputed dictionary vs. without. Gains are expected, but I'd like to present some hard comparison data here.
@Apache9 I was thinking about writing the dictionary used to compress values in an HFile or WAL into the HFile or WAL itself, in the metadata section, but there would need to be format extensions to the WAL (perhaps just an extra field in the header and/or trailer PB). Hopefully there can be some re-use of meta blocks for HFiles. But this raises questions. There should be some way for a codec to read and write metadata into the container of the thing it is processing, but we don't have API support for that. I would consider it future work, but definitely of interest. The interest is in ensuring that HFiles have all of the information they need to read themselves added at write time.

Otherwise I think the current scheme is ok. The operator is already in charge of their table schema and compression codec dependencies (like deployment of native link libraries). This is an incremental responsibility... if you put a compression dictionary attribute into your schema, don't lose the dictionary.

Mostly it is already true that HFiles carry, within their trailer or meta blocks, all of the information a reader requires to process them. I can think of one exception, that being encryption, where the data encryption key (DEK) is stored in the HFile, but the master encryption key (MEK) used to encrypt the DEK is by design kept in a trust store or HSM, and if the MEK is lost the data is not decryptable. There are some parallels between external MEK data and external compression dictionary data. One could claim the same general rules for managing them apply. The difference is that the dictionary is not sensitive and can be copied into the file, whereas the master encryption key must be carefully guarded and not written colocated with the data.
Apache9 left a comment:
Overall LGTM.
Just some simple nits, and please fix the checkstyle and javac issues if possible.
      if (DICTIONARY_CACHE == null) {
        synchronized (ZstdCodec.class) {
          if (DICTIONARY_CACHE == null) {
            DICTIONARY_CACHE = CacheBuilder.newBuilder()
nits: better to abstract the creation code into a separate method? It could make the code easier to read.
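For illustration, a sketch of the suggested refactor as it might look inside ZstdCodec, reusing the imports from the cache sketch above; getDictionaryCache and createDictionaryCache are assumed names, not necessarily the patch's:

  // Lazily create the dictionary cache; volatile so double-checked locking is safe.
  private static volatile LoadingCache<String, byte[]> DICTIONARY_CACHE;

  static LoadingCache<String, byte[]> getDictionaryCache() {
    if (DICTIONARY_CACHE == null) {
      synchronized (ZstdCodec.class) {
        if (DICTIONARY_CACHE == null) {
          DICTIONARY_CACHE = createDictionaryCache();
        }
      }
    }
    return DICTIONARY_CACHE;
  }

  // Construction extracted into a helper so the locking code above stays readable.
  private static LoadingCache<String, byte[]> createDictionaryCache() {
    return CacheBuilder.newBuilder()
        .maximumSize(100)
        .expireAfterAccess(1, TimeUnit.HOURS)
        .build(new CacheLoader<String, byte[]>() {
          @Override
          public byte[] load(String path) throws Exception {
            return DictionaryLoader.readDictionary(new Configuration(), path);
          }
        });
  }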
        n = in.read(buffer);
        if (n > 0) {
          baos.write(buffer, 0, n);
        }
nits: indent
Here is the performance test result. I wrote an integration test that simulates a location data tracking use case. It writes 10 million rows; each row has a 64-bit random row key (not important) and one column family with four qualifiers, one each for: first name, last name, latitude (encoded as an integer with a scale of 3), and longitude (also encoded as an integer with a scale of 3). Details aren't really important except to say the character strings are short, corresponding with typical lengths for English first and last names, and there are two 32-bit integer values. The 32-bit integer values are generated with a zipfian distribution to reduce entropy and allow for potentially successful dictionary compression, but they are also short. When creating the table the IT specified a block size of 1K -- perhaps not unreasonable for a heavily indexed use case with short values. I could have achieved a higher compression ratio if the row keys were sequential instead of completely random, but this is not really important.

I also wrote a simple utility that iterates over an HFile and saves each DATA or ENCODED_DATA block as a separate file somewhere else, just the block data. These files were used as the training set for the dictionary trainer.

The results demonstrate compression speed improvements as expected (a 22-33% improvement), as described by the ZStandard documentation. They also demonstrate efficiency gains (a modest 6-8%), especially in combination with higher levels, where even modest gains are meaningful at scale. Specifying higher levels is more affordable because of the relative speedups at each level. There is a demonstration of meaningful gains in just this simple case, with potential for more benefits when applied by someone with expert knowledge. It seems reasonable to support this feature.

No Dictionary
With Dictionary
Let me clean up checkstyle, address other review feedback, and merge this after merging the prerequisite PR for HBASE-26316 first.
Just to double check, I re-ran the test described earlier, except when generating the test data it only emitted the integer values (the zipfian latitude and longitude columns).
When training the dictionary I gave the trainer the parameters k=32 (bit width to enter into the dictionary) and d=8 (stride for walking over content, in bits). This is a good approximation of designing these parameters with intent in a real use case. The result demonstrates significant speedups in compression, as advertised, and allows for achieving better overall compression by enabling higher compression levels within a time budget equivalent to the no-dictionary case.

Integers Only, No Dictionary
Integers Only, With Dictionary (k=32,d=8)
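For reference, a training run with explicit cover parameters looks roughly like this with the zstd command line tool; the input and output file names are placeholders, and the exact --train-cover option syntax should be checked against the zstd manual for your version:

  zstd --train-cover=k=32,d=8 extracted-blocks/* -o integers-only.dict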
This will cause a small merge conflict between apache#3730 and apache#3748 because we need CanReinit here too.
Rebase
ZStandard supports initialization of compressors and decompressors with a precomputed dictionary, which can dramatically improve and speed up compression of tables with small values. For more details, please see The Case For Small Data Compression https://github.com/facebook/zstd#the-case-for-small-data-compression
Rebase to resolve expected conflicts after #3730
@Apache9 I merged based on your prior approval. If you disagree with this action, please let me know and I will revert/restart this PR.
HBASE-26353 Support loadable dictionaries in hbase-compression-zstd (#3748)

ZStandard supports initialization of compressors and decompressors with a precomputed dictionary, which can dramatically improve and speed up compression of tables with small values. For more details, please see The Case For Small Data Compression: https://github.com/facebook/zstd#the-case-for-small-data-compression

Signed-off-by: Duo Zhang <[email protected]>
[ Requires #3730 ]
ZStandard supports initialization of compressors and decompressors with a precomputed dictionary, which can dramatically improve and speed up compression of tables with small values. For more details, please see The Case For Small Data Compression
Example:
Training:
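A sketch of the training step using the zstd CLI; the sample files (for example, data blocks extracted from representative HFiles) and the output dictionary name are placeholders:

  zstd --train sample-blocks/* -o mytable.dict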
Deploy the dictionary file to HDFS, or S3, etc.
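For example, to copy the trained dictionary into HDFS (paths are placeholders):

  hdfs dfs -mkdir -p /compression/dictionaries
  hdfs dfs -put mytable.dict /compression/dictionaries/mytable.dict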
Create the table:
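A hedged HBase shell sketch; the per-family configuration key shown (hbase.io.compress.zstd.dictionary) is assumed from the codec's property naming and should be verified against the merged code:

  create 'mytable', { NAME => 'f', COMPRESSION => 'ZSTD',
    CONFIGURATION => { 'hbase.io.compress.zstd.dictionary' =>
      'hdfs://namenode:8020/compression/dictionaries/mytable.dict' } }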
Now start storing data. Compression results even for small values will be excellent.
Note: Beware, if the dictionary is lost, the data will not be decompressible.