@yaauie yaauie commented Jul 22, 2025

Release notes

  • Persisted Queue: improved serialization to be more compact by default (note: queues containing these compact events can be processed by Logstash v8.10.0 and later)

What does this PR do?

When serializing a non-primitive value, CBOR encodes a two-element tuple containing the class name and the class-specific serialized value, which results in a significant amount of overhead in the form of frequently-repeated strings.

Jackson CBOR supports the stringref extension, which avoids repeating the actual bytes of a string: the encoder keeps track of the strings it has already emitted and references each repeat by the index at which the string first occurred.

For example, the first org.jruby.RubyString looks like:

74                                            # text(20)
   6f72672e6a727562792e52756279537472696e67   #   "org.jruby.RubyString"

While each subsequent string looks like:

d8 19                                         # tag(25)
   05                                         #   unsigned(5)
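To make the mechanism concrete, here is a minimal pure-Python sketch of how a stringref-aware encoder behaves. It is not Jackson's implementation: the function names are hypothetical, and it assumes an array of fewer than 24 strings, each shorter than 256 bytes, so that every header and index fits the short forms shown above.

```python
def encode_text(s: str) -> bytes:
    """Encode a CBOR text string (major type 3) whose UTF-8 form is under 256 bytes."""
    data = s.encode("utf-8")
    if len(data) < 24:
        return bytes([0x60 | len(data)]) + data
    if len(data) < 256:
        return bytes([0x78, len(data)]) + data
    raise ValueError("sketch only handles strings shorter than 256 bytes")


def encode_with_stringrefs(strings: list[str]) -> bytes:
    """Encode a short array of strings inside a stringref namespace."""
    if len(strings) >= 24:
        raise ValueError("sketch only handles arrays of fewer than 24 items")
    table: dict[str, int] = {}            # string -> index of first occurrence
    out = bytearray(b"\xd9\x01\x00")      # tag(256): opens the stringref namespace
    out.append(0x80 | len(strings))       # definite-length array header
    for s in strings:
        if s in table:
            out += b"\xd8\x19"            # tag(25): reference to an earlier string
            out.append(table[s])          # unsigned index, always < 24 here
        else:
            out += encode_text(s)
            # Strings shorter than 3 bytes are not indexed, because a
            # reference to them would be no shorter than the string itself.
            if len(s.encode("utf-8")) >= 3:
                table[s] = len(table)
    return bytes(out)


print(encode_with_stringrefs(["org.jruby.RubyString"] * 2).hex(" "))
```

In this toy case the reference index is 0 rather than the 5 shown above, because no other strings precede the class name.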

Enabling this extension allows us to save:

  • ~18 bytes from each secondary org.jruby.RubyString
  • ~23 bytes from each secondary org.logstash.ConvertedMap
  • ~24 bytes from each secondary org.logstash.ConvertedList
  • ...etc.
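The savings follow from CBOR's header sizes, and a short calculation shows why they are approximate. The sketch below uses hypothetical helper names and assumes the class name is plain ASCII; the exact figure depends on the reference index, since an index of 24 or more needs one extra byte.

```python
def head_size(value: int) -> int:
    """Bytes used by a CBOR initial byte plus its length or value argument."""
    if value < 24:
        return 1
    if value < 0x100:
        return 2
    if value < 0x10000:
        return 3
    if value < 0x100000000:
        return 5
    return 9


def bytes_saved(class_name: str, ref_index: int = 0) -> int:
    """Bytes saved when a repeated string is replaced by a tag(25) reference."""
    length = len(class_name.encode("utf-8"))
    full = head_size(length) + length               # text header + string bytes
    ref = head_size(25) + head_size(ref_index)      # tag(25) head + unsigned index
    return full - ref


for name in ("org.jruby.RubyString",
             "org.logstash.ConvertedMap",
             "org.logstash.ConvertedList"):
    print(name, bytes_saved(name, ref_index=5), bytes_saved(name, ref_index=24))
```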

Practical example: a 9183-byte complex JSON that contains an event.original, consumed through the stdin input with json codec (adding fields like @timestamp, @version per normal) resulted in a 22% reduction in serialized size:

| CBOR unpatched | CBOR patched |
|----------------|--------------|
| 11218 bytes    | 8728 bytes   |

The CBOR implementation in Jackson appears to support reading stringrefs regardless of whether this feature is enabled for serializing, which means that this change is not a rollback-barrier.
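One way to see why a reader can be indifferent to the writer's setting: stringref decoding is driven entirely by tags present in the data. The following toy decoder is a sketch under narrow assumptions (a short array of short text strings, reference indexes below 24) and is not how Jackson is implemented; it accepts the same logical content with or without the extension.

```python
def decode_strings(buf: bytes) -> list[str]:
    """Decode a short CBOR array of text strings, with or without stringrefs."""
    pos = 0
    namespaced = buf[:3] == b"\xd9\x01\x00"      # tag(256) opens a namespace
    if namespaced:
        pos = 3
    if buf[pos] >> 5 != 4 or (buf[pos] & 0x1F) >= 24:
        raise ValueError("sketch expects an array of fewer than 24 items")
    count = buf[pos] & 0x1F
    pos += 1

    table: list[str] = []                        # strings seen so far, by index
    out: list[str] = []
    for _ in range(count):
        if buf[pos:pos + 2] == b"\xd8\x19":      # tag(25): back-reference
            if not namespaced:
                raise ValueError("reference outside a stringref namespace")
            index = buf[pos + 2]
            if index >= 24:
                raise ValueError("sketch only handles indexes below 24")
            out.append(table[index])
            pos += 3
            continue
        first = buf[pos]
        if first >> 5 != 3:
            raise ValueError("expected a text string")
        info = first & 0x1F
        if info < 24:
            length, pos = info, pos + 1
        elif info == 24:
            length, pos = buf[pos + 1], pos + 2
        else:
            raise ValueError("sketch only handles strings shorter than 256 bytes")
        text = buf[pos:pos + length].decode("utf-8")
        pos += length
        out.append(text)
        if namespaced and length >= 3:           # short strings are not indexed
            table.append(text)
    return out


plain = bytes.fromhex("82" "63616263" "63616263")
compact = bytes.fromhex("d90100" "82" "63616263" "d81900")
print(decode_strings(plain), decode_strings(compact))
```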

Why is it important/What is the impact to the user?

Reduces size-on-disk for PQ, enabling more events to fit on each page and reducing disk IO
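As a rough illustration of the page-level effect, the sketch below applies the sizes from the practical example above to a page. The 64 MiB page capacity is an assumption for illustration, and per-record framing overhead inside the page file is ignored, so the numbers are indicative only.

```python
PAGE_CAPACITY = 64 * 1024 * 1024  # assumed page capacity in bytes (64 MiB)


def reduction_percent(before: int, after: int) -> float:
    """Relative size reduction, as a percentage of the original size."""
    return 100 * (before - after) / before


def events_per_page(event_size: int, page_capacity: int = PAGE_CAPACITY) -> int:
    """Whole events that fit in one page, ignoring per-record framing."""
    return page_capacity // event_size


print(f"{reduction_percent(11218, 8728):.1f}% smaller")
print(events_per_page(11218), "events per page unpatched")
print(events_per_page(8728), "events per page patched")
```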

Checklist

  • [x] My code follows the style guidelines of this project
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files (and/or docker env variables)
  • [x] I have added tests that prove my fix is effective or that my feature works

How to test this PR locally

  • remove your queue data dir
    rm -rf "${LOGSTASH_HOME:?}/data/queue"
    
  • run unpatched LS to inject 100 events into the queue
    git checkout main && git pull && ./gradlew clean assemble installDefaultGems
    bin/logstash -Squeue.type=persisted -e 'input { generator { count => 100 } } output { stdout { codec => json_lines } }'
    
  • after the process stops, inspect the page files to see events tagged as CBOR(known) (indicating that they start with the 9F716A sequence that events have historically started with)
  • run patched LS to inject 100 more events
    git fetch yaauie && git checkout yaauie/pq-activate-cbor-stringref-extension && ./gradlew clean assemble installDefaultGems
    bin/logstash -Squeue.type=persisted -e 'input { generator { count => 100 } } output { stdout { codec => json_lines } }'
    
  • after process stops, inspect page files to see additional events tagged as CBOR(stringref) (indicating that they start with the stringref tag D90100)
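The tagging in these steps comes down to a prefix check on each serialized event. A minimal sketch of that check follows; the function name and the fallback label format are hypothetical, and extracting each event's payload from the page file is out of scope here.

```python
KNOWN_PREFIX = bytes.fromhex("9f716a")       # historical start of a serialized event
STRINGREF_PREFIX = bytes.fromhex("d90100")   # tag(256): stringref namespace


def classify(payload: bytes) -> str:
    """Label a serialized event by its leading bytes."""
    if payload.startswith(STRINGREF_PREFIX):
        return "CBOR(stringref)"
    if payload.startswith(KNOWN_PREFIX):
        return "CBOR(known)"
    # Fallback: show the first eight bytes so an unexpected header is visible.
    return f"CBOR(assumed {payload[:8].hex().upper()})"


print(classify(bytes.fromhex("9f716a617661")))
print(classify(bytes.fromhex("d901009f716a617661")))
```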

Related issues

Use cases

  • Constrained or metered IO


mergify bot commented Jul 22, 2025

This pull request does not have a backport label. Could you fix it @yaauie? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • If no backport is necessary, please add the backport-skip label

…nsion

With the CBOR stringref extension enabled, we add a 3-byte overhead to each
event to activate the extension, and eliminate 24 bytes of overhead for each
event's secondary instances of `org.logstash.ConvertedMap`. Since the events
under test have exactly two instances of `org.logstash.ConvertedMap`, this
is a net reduction of 21 bytes of overhead.

This changes the specifically-constructed events to have the intended lengths
to test their specific edge-cases.
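The arithmetic in this commit message can be checked with a small helper. This is a sketch with a hypothetical function name; the 3-byte activation cost and the 24 bytes saved per reference are taken from the message above.

```python
NAMESPACE_TAG_BYTES = 3  # the d9 01 00 header added to every event


def net_reduction(occurrences: int, saved_per_reference: int) -> int:
    """Net bytes saved for one string that occurs `occurrences` times in an event."""
    secondary = max(occurrences - 1, 0)      # only repeats become references
    return secondary * saved_per_reference - NAMESPACE_TAG_BYTES


print(net_reduction(2, 24))   # two ConvertedMap instances, as in the events under test
print(net_reduction(1, 24))   # no repeats: the event grows by the 3-byte header
```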
@yaauie yaauie force-pushed the pq-activate-cbor-stringref-extension branch from 7647be4 to 44692e9 Compare July 22, 2025 23:37

yaauie commented Jul 24, 2025

I have manually validated that a stringref-enabled event serialized with this patch can be deserialized in unpatched logstash. I plan to add tests with fixture data for the releasable branches that we are not backporting to, to ensure that this change is not a rollback barrier.

[EDIT]

Backport test-only PRs to ensure activating the extension doesn't create a rollback boundary:

[/EDIT]

@elasticmachine

💛 Build succeeded, but was flaky

@mashhurs mashhurs left a comment

Really good find! Thank you.
It makes sense to me to enable stringref by default rather than putting it behind a feature flag.
So, LGTM!

@kaisecheng kaisecheng left a comment

This is an amazing enhancement for PQ. LGTM
I tested both the upgrade and downgrade paths, and they work smoothly without requiring any user intervention.

I used the lsq-pagedump tool to print the page. Instead of getting CBOR(stringref), I got CBOR(assumed D9019F716A617661):

$ lsq-pagedump "${LOGSTASH_PATH}/data/queue/main/page.0"
1	277	094395CD	page.0	CBOR(assumed D9019F716A617661)
...
100	277	F7EE8590	page.0	CBOR(assumed D9019F716A617661)

@yaauie yaauie marked this pull request as draft September 2, 2025 20:48

yaauie commented Sep 2, 2025

Converting to draft until we chase down the issue where Kaise's machine was encoding with D9019F (tag(415)) instead of the expected D90100 (tag(256)).

[EDIT]
My dump tool was broken, and was losing the null-byte at the tail of the D90100 header.
[/EDIT]
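To illustrate how a dropped byte produces the misleading header, the sketch below decodes the tag number from the leading bytes before and after removing the null byte. How the dump tool actually lost the byte is not stated here, so the `replace` call only simulates the effect; `read_tag` is a hypothetical helper.

```python
def read_tag(buf: bytes) -> int:
    """Return the tag number encoded by the CBOR tag head at the start of buf."""
    first = buf[0]
    if first >> 5 != 6:
        raise ValueError("not a CBOR tag")
    info = first & 0x1F
    if info < 24:
        return info
    width = {24: 1, 25: 2, 26: 4, 27: 8}.get(info)
    if width is None:
        raise ValueError("unsupported additional information")
    return int.from_bytes(buf[1:1 + width], "big")


event_start = bytes.fromhex("d901009f716a617661")
broken = event_start.replace(b"\x00", b"", 1)   # simulate losing the null byte

print(read_tag(event_start), event_start[:3].hex().upper())
print(read_tag(broken), broken[:8].hex().upper())
```

With the null byte gone, the two bytes after D9 are read as 019F, which is why the header appeared to be tag(415) followed by 716A617661.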

@yaauie yaauie marked this pull request as ready for review September 3, 2025 17:35

yaauie commented Sep 3, 2025

@kaisecheng after fixing my dump tool, your page file looks like:

╭─{ rye@perhaps:~/REDACTED }
╰─○ lsq-pagedump ./page.0 | head -n10
1	277	094395CD	page.0	CBOR(stringref)
2	277	57DE3747	page.0	CBOR(stringref)
3	277	01449EC8	page.0	CBOR(stringref)
4	277	628F121B	page.0	CBOR(stringref)
5	277	3385FFA0	page.0	CBOR(stringref)
6	277	9AA0980D	page.0	CBOR(stringref)
7	277	C4B79D86	page.0	CBOR(stringref)
8	277	5DE6E9DE	page.0	CBOR(stringref)
9	277	E9D533F5	page.0	CBOR(stringref)
10	277	D2BE9979	page.0	CBOR(stringref)
[success]

@yaauie yaauie mentioned this pull request Sep 3, 2025
3 tasks
@kaisecheng

> @kaisecheng after fixing my dump tool, your page file looks like:

The tool works as expected now 👍
