-
Notifications
You must be signed in to change notification settings - Fork 3.5k
pq: activate stringref extension for more-compact PQ representation #17849
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pq: activate stringref extension for more-compact PQ representation #17849
Conversation
When serializing a non-primitive value, CBOR encodes a two-element tuple containing the class name and the class-specific serialized value, which results in a significant amount of overhead in the form of frequently- repeated strings. Jackson CBOR supports the stringref extension, which allows it to avoid repeating the actual bytes of a string, and instead keeps track of the strings it has encountered and _referencing_ those strings by the index in which they occur. For example, the first `org.jruby.RubyString` looks like: ~~~ 74 # text(20) 6f72672e6a727562792e52756279537472696e67 # "org.jruby.RubyString" ~~~ While each subsequent string looks like: ~~~ d8 19 # tag(25) 05 # unsigned(5) ~~~ Enabling this extension allows us to save: - ~18 bytes from each `org.jruby.RubyString` - ~23 bytes from each `org.logstash.ConvertedMap` - ~24 bytes from each `org.logstash.ConvertedList` - ...etc. The CBOR implementation in Jackson _appears_ to support reading stringrefs regardless of whether this feature is enabled for serializing, which means that this change is not a rollback-barrier.
🤖 GitHub commentsExpand to view the GitHub comments
Just comment with:
|
|
This pull request does not have a backport label. Could you fix it @yaauie? 🙏
|
…nsion With the CBOR stringref extension enabled, we add a 3-byte overhead to each event to activate the extension, and eliminate 24 bytes of overhead for each event's secondary instances of `org.logstash.ConvertedMap`. Since the events under test have exactly two instances of `org.logstash.ConvertedMap`, this is a net reduction of 21 bytes of overhead. This changes the specifically-constructed events to have the intended lengths to test their specific edge-cases.
7647be4 to
44692e9
Compare
|
I have manually validated that a stringref-enabled event serialized with this patch can be deserialized in unpatched logstash. I plan to add tests with fixture data for the releasable branches that we are not backporting to, to ensure that this change is not a rollback barrier. [EDIT] Backport test-only PR's to ensure activating the extension doesn't create a rollback boundary:
[/EDIT] |
|
💛 Build succeeded, but was flaky
Failed CI StepsHistory
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really good find! Thank you.
It makes sense to me to set string ref to true by default rather than feature flag.
So, LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is amazing enhancement for PQ. LGTM
I tested both the upgrade and downgrade paths, and they work smoothly without requiring any user intervention.
I used the tool lsq-pagedump to print the page. Instead of getting CBOR(stringref) , I got CBOR(assumed D9019F716A617661)
$ lsq-pagedump "${LOGSTASH_PATH}/data/queue/main/page.0"
1 277 094395CD page.0 CBOR(assumed D9019F716A617661)
...
100 277 F7EE8590 page.0 CBOR(assumed D9019F716A617661)
|
[EDIT] |
|
@kaisecheng after fixing my dump tool, your page file looks like:
|
The tool works as expected now 👍 |





Release notes
What does this PR do?
When serializing a non-primitive value, CBOR encodes a two-element tuple containing the class name and the class-specific serialized value, which results in a significant amount of overhead in the form of frequently-repeated strings.
Jackson CBOR supports the stringref extension, which allows it to avoid repeating the actual bytes of a string, and instead keeps track of the strings it has encountered and referencing those strings by the index in which they occur.
For example, the first
org.jruby.RubyStringlooks like:While each subsequent string looks like:
Enabling this extension allows us to save:
org.jruby.RubyStringorg.logstash.ConvertedMaporg.logstash.ConvertedListPractical example: a 9183-byte complex JSON that contains an
event.original, consumed through the stdin input with json codec (adding fields like@timestamp,@versionper normal) resulted in a 22% reduction in serialized size:The CBOR implementation in Jackson appears to support reading stringrefs regardless of whether this feature is enabled for serializing, which means that this change is not a rollback-barrier.
Why is it important/What is the impact to the user?
Reduces size-on-disk for PQ, enabling more events to fit on each page and reducing disk IO
Checklist
[ ] I have commented my code, particularly in hard-to-understand areas[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files (and/or docker env variables)How to test this PR locally
CBOR(known)(indicating that they start with the9F716Asequence that events have historically started with)CBOR(stringref)(indicating that they start with the stringref tagD90100)Related issues
Use cases