Skip to content

Add detailed view of latency#10173

Merged
emilk merged 37 commits into
mainfrom
emilk/improve-latency-measurements
Jun 12, 2025
Merged

Add detailed view of latency#10173
emilk merged 37 commits into
mainfrom
emilk/improve-latency-measurements

Conversation

@emilk
Copy link
Copy Markdown
Member

@emilk emilk commented Jun 9, 2025

Related

What

This adds a breakdown of where the end-to-end latency is coming from. It automatically appears in the top panel when latency is high. Hover it to see details:

image

Notable changes

The latency metrics are no longer opt-in. We always show them now if they get surprisingly high.

The measurements are made by adding extra metadata to the record batches at IPC encode/decode time.

We already have accurate timestamps at log-time (RowId) and batch-time (ChunkId).

Testing

Apply this diff:

diff --git a/tests/rust/log_benchmark/src/image.rs b/tests/rust/log_benchmark/src/image.rs
index e36478262a1..877c3e0bce4 100644
--- a/tests/rust/log_benchmark/src/image.rs
+++ b/tests/rust/log_benchmark/src/image.rs
@@ -26,8 +26,7 @@ fn prepare() -> Vec<u8> {
 fn execute(mut raw_image_data: Vec<u8>) -> anyhow::Result<()> {
     re_tracing::profile_function!();
 
-    let (rec, _storage) =
-        rerun::RecordingStreamBuilder::new("rerun_example_benchmark_").memory()?;
+    let rec = rerun::RecordingStreamBuilder::new("rerun_example_benchmark_").connect_grpc()?;
 
     for i in 0..NUM_LOG_CALLS {
         raw_image_data[i] += 1; // Change a single pixel of the image data, just to make sure we transmit something different each time.

The run this in one terminal:

pixi run rerun-release 

and in another:

cargo run -p log_benchmark --release -- --benchmarks image --profile

Findings

In a test where I encode a bunch of uncompressed 1024² RGBA images (tests/rust/log_benchmark/src/image.rs).
I found that by far the slowest part is the IPC encoding of arrow data (yes, this is a release build):

Image

zooming in:

Image

This suggests that if we simply encode the chunks in parallel instead on N cores, we will get a N-times speedup (when the actual logging thread is not the bottleneck).

This also gives a likely reason for the recent regression: we switched from arrow2 to arrow-rs.

Perhaps we could also spend some time optimizing arrow::ipc::writer::StreamWriter

@emilk emilk added 🧑‍💻 dev experience developer experience (excluding CI) 📉 performance Optimization, memory use, etc include in changelog labels Jun 9, 2025
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 9, 2025

Web viewer built successfully. If applicable, you should also test it:

  • I have tested the web viewer
Result Commit Link Manifest
a558811 https://rerun.io/viewer/pr/10173 +nightly +main

Note: This comment is updated whenever you push a commit.

@nikolausWest
Copy link
Copy Markdown
Member

Really great stuff! So great to see this breakdown so easily.

This suggests that if we simply encode the chunks in parallel instead on N cores, we will get a N-times speedup (when the actual logging thread is not the bottleneck).

We should keep in mind here though that in the user's system, it may not be desirable for the logging to take up all the resources. Don't know what the answer is but this might need to be a setting (using many threads to improve latency is obviously great ion an off-device bridge but perhaps not on device in production).

Perhaps we could also spend some time optimizing arrow::ipc::writer::StreamWriter

Any improvements we can do to the single threaded case would be great in both cases =)

@jprochazk
Copy link
Copy Markdown
Member

jprochazk commented Jun 9, 2025

I wonder how the changes from this draft affect the measurements here:

Will try it myself soonish

@jprochazk
Copy link
Copy Markdown
Member

The latency metrics are no longer opt-in. We always show them now if they get surprisingly high.
The measurements are made by adding extra metadata to the record batches at IPC encode/decode time.

I would expect this to be part of the "show performance metrics" setting. How much overhead does this add to encode/decode time?

@emilk
Copy link
Copy Markdown
Member Author

emilk commented Jun 9, 2025

The latency metrics are no longer opt-in. We always show them now if they get surprisingly high.
The measurements are made by adding extra metadata to the record batches at IPC encode/decode time.

I would expect this to be part of the "show performance metrics" setting. How much overhead does this add to encode/decode time?

We need to insert the measurement on the SDK side, without knowing the settings for the viewer. But this is only a single time measurement per chunk, so the added overhead is minimal.

Comment on lines -431 to -450
// Show queue latency on hover, as that is part of this.
// For instance, if the framerate is really bad we have less time to ingest incoming data,
// leading to an ever-increasing input queue.
let rx = app.msg_receive_set();
let queue_len = rx.queue_len();
let latency_sec = rx.latency_nanos() as f32 / 1e9;
// empty queue == unreliable latency
if 0 < queue_len {
response.on_hover_ui(|ui| {
ui.label(format!(
"Queue latency: {}, length: {}",
latency_text(latency_sec),
format_uint(queue_len),
));

ui.label(
"When more data is arriving over network than the Rerun Viewer can ingest, a queue starts building up, leading to latency and increased RAM use.\n\
We call this the queue latency.");
});
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the old "queue latency", which was just measuring the latency from receiving data until ingesting it; which we now cover in another way with the new code

@nikolausWest
Copy link
Copy Markdown
Member

nikolausWest commented Jun 9, 2025

Perhaps a bit silly but did a Chatgpt deep research to look for opportunities to improve performance. May be something in there of interest

Performance of Arrow StreamWriter in Rerun.pdf

@emilk
Copy link
Copy Markdown
Member Author

emilk commented Jun 11, 2025

image

@Wumpf Wumpf self-requested a review June 11, 2025 11:59
@emilk emilk merged commit c618728 into main Jun 12, 2025
47 checks passed
@emilk emilk deleted the emilk/improve-latency-measurements branch June 12, 2025 16:37
grtlr added a commit that referenced this pull request Jun 13, 2025
This was introduced via #10173 and
led to breakage on the dataplatform side.
grtlr added a commit that referenced this pull request Jun 13, 2025
### Related

* #10173

### What

The removal of this variant led to breakage on the dataplatform side.
grtlr pushed a commit that referenced this pull request Jun 13, 2025
## Related
* Investigates #9812
* Closes #9973

## What
This adds a breakdown of where the end-to-end latency is coming from. It
automatically appears in the top panel when latency is high. Hover it to
see details:


![image](https://github.com/user-attachments/assets/f73a0b3a-365a-412b-aa7e-f6e4d3a95620)



## Notable changes
The latency metrics are no longer opt-in. We always show them now if
they get surprisingly high.

The measurements are made by adding extra metadata to the record batches
at IPC encode/decode time.

We already have accurate timestamps at log-time (`RowId`) and batch-time
(`ChunkId`).

## Testing

Apply this diff:

```
diff --git a/tests/rust/log_benchmark/src/image.rs b/tests/rust/log_benchmark/src/image.rs
index e364782..877c3e0bce4 100644
--- a/tests/rust/log_benchmark/src/image.rs
+++ b/tests/rust/log_benchmark/src/image.rs
@@ -26,8 +26,7 @@ fn prepare() -> Vec<u8> {
 fn execute(mut raw_image_data: Vec<u8>) -> anyhow::Result<()> {
     re_tracing::profile_function!();
 
-    let (rec, _storage) =
-        rerun::RecordingStreamBuilder::new("rerun_example_benchmark_").memory()?;
+    let rec = rerun::RecordingStreamBuilder::new("rerun_example_benchmark_").connect_grpc()?;
 
     for i in 0..NUM_LOG_CALLS {
         raw_image_data[i] += 1; // Change a single pixel of the image data, just to make sure we transmit something different each time.
```

The run this in one terminal:

```sh
pixi run rerun-release 
```

and in another:

```sh
cargo run -p log_benchmark --release -- --benchmarks image --profile
```

## Findings
In a test where I encode a bunch of uncompressed 1024² RGBA images
(`tests/rust/log_benchmark/src/image.rs`).
I found that by far the slowest part is the IPC encoding of arrow data
(yes, this is a release build):

<img width="727" alt="Image"
src="https://github.com/user-attachments/assets/2f9b4856-f9ae-48b2-abb7-2dcefafc17b8"
/>

zooming in:

<img width="679" alt="Image"
src="https://github.com/user-attachments/assets/f0dcd198-ec22-4655-90ce-36eed1aa318e"
/>

This suggests that if we simply encode the chunks in parallel instead on
N cores, we will get a N-times speedup (when the actual logging thread
is not the bottleneck).

This also gives a likely reason for the recent regression: we switched
from `arrow2` to `arrow-rs`.

Perhaps we could also spend some time optimizing
`arrow::ipc::writer::StreamWriter`

---------

Co-authored-by: Andreas Reich <andreas@rerun.io>
grtlr added a commit that referenced this pull request Jun 13, 2025
### Related

* #10173

### What

The removal of this variant led to breakage on the dataplatform side.
teh-cmc pushed a commit that referenced this pull request Jun 13, 2025
#10222)

### Related

* Fixes test regression introduced in
#10173

### What

Related internal slack discussion
https://rerunio.slack.com/archives/C07T0LFT5BQ/p1749799544492489

* [x] pass full check ci

cc: @emilk
@emilk emilk mentioned this pull request Jun 13, 2025
1 task
emilk added a commit that referenced this pull request Jun 17, 2025
### Related
* #10222
* #10173
* #9812

### What
My previous PR added timestamps to the arrow metadata whenever we
encoded or decoded arrow IPC. This caused all sorts of test failures.

In this PR, we instead only insert it in two places:
* The gRPC log sink, used by the SDK
* The gRPC client used by the viewer

### How to test
```sh
pixi run rerun-release
```
```sh
cargo run -p log_benchmark --release -- --benchmarks image --connect
```

### TODO
* [x] Run `main` CI

### Future work
* Add timestamp for when the messages hits the proxy server
(`re_grpc_server`)
* Or better yet: #10229 🤦
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🧑‍💻 dev experience developer experience (excluding CI) include in changelog 📉 performance Optimization, memory use, etc

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate e2e performance of rr.log(…, rr.Image(numpy))

4 participants