Add detailed view of latency#10173
Conversation
|
Web viewer built successfully. If applicable, you should also test it:
Note: This comment is updated whenever you push a commit. |
|
Really great stuff! So great to see this breakdown so easily.
We should keep in mind here though that in the user's system, it may not be desirable for the logging to take up all the resources. Don't know what the answer is but this might need to be a setting (using many threads to improve latency is obviously great ion an off-device bridge but perhaps not on device in production).
Any improvements we can do to the single threaded case would be great in both cases =) |
|
I wonder how the changes from this draft affect the measurements here: Will try it myself soonish |
I would expect this to be part of the "show performance metrics" setting. How much overhead does this add to encode/decode time? |
We need to insert the measurement on the SDK side, without knowing the settings for the viewer. But this is only a single time measurement per chunk, so the added overhead is minimal. |
| // Show queue latency on hover, as that is part of this. | ||
| // For instance, if the framerate is really bad we have less time to ingest incoming data, | ||
| // leading to an ever-increasing input queue. | ||
| let rx = app.msg_receive_set(); | ||
| let queue_len = rx.queue_len(); | ||
| let latency_sec = rx.latency_nanos() as f32 / 1e9; | ||
| // empty queue == unreliable latency | ||
| if 0 < queue_len { | ||
| response.on_hover_ui(|ui| { | ||
| ui.label(format!( | ||
| "Queue latency: {}, length: {}", | ||
| latency_text(latency_sec), | ||
| format_uint(queue_len), | ||
| )); | ||
|
|
||
| ui.label( | ||
| "When more data is arriving over network than the Rerun Viewer can ingest, a queue starts building up, leading to latency and increased RAM use.\n\ | ||
| We call this the queue latency."); | ||
| }); | ||
| } |
There was a problem hiding this comment.
I removed the old "queue latency", which was just measuring the latency from receiving data until ingesting it; which we now cover in another way with the new code
|
Perhaps a bit silly but did a Chatgpt deep research to look for opportunities to improve performance. May be something in there of interest |
Co-authored-by: Andreas Reich <andreas@rerun.io>
Co-authored-by: Andreas Reich <andreas@rerun.io>
This was introduced via #10173 and led to breakage on the dataplatform side.
### Related * #10173 ### What The removal of this variant led to breakage on the dataplatform side.
## Related * Investigates #9812 * Closes #9973 ## What This adds a breakdown of where the end-to-end latency is coming from. It automatically appears in the top panel when latency is high. Hover it to see details:  ## Notable changes The latency metrics are no longer opt-in. We always show them now if they get surprisingly high. The measurements are made by adding extra metadata to the record batches at IPC encode/decode time. We already have accurate timestamps at log-time (`RowId`) and batch-time (`ChunkId`). ## Testing Apply this diff: ``` diff --git a/tests/rust/log_benchmark/src/image.rs b/tests/rust/log_benchmark/src/image.rs index e364782..877c3e0bce4 100644 --- a/tests/rust/log_benchmark/src/image.rs +++ b/tests/rust/log_benchmark/src/image.rs @@ -26,8 +26,7 @@ fn prepare() -> Vec<u8> { fn execute(mut raw_image_data: Vec<u8>) -> anyhow::Result<()> { re_tracing::profile_function!(); - let (rec, _storage) = - rerun::RecordingStreamBuilder::new("rerun_example_benchmark_").memory()?; + let rec = rerun::RecordingStreamBuilder::new("rerun_example_benchmark_").connect_grpc()?; for i in 0..NUM_LOG_CALLS { raw_image_data[i] += 1; // Change a single pixel of the image data, just to make sure we transmit something different each time. ``` The run this in one terminal: ```sh pixi run rerun-release ``` and in another: ```sh cargo run -p log_benchmark --release -- --benchmarks image --profile ``` ## Findings In a test where I encode a bunch of uncompressed 1024² RGBA images (`tests/rust/log_benchmark/src/image.rs`). I found that by far the slowest part is the IPC encoding of arrow data (yes, this is a release build): <img width="727" alt="Image" src="https://github.com/user-attachments/assets/2f9b4856-f9ae-48b2-abb7-2dcefafc17b8" /> zooming in: <img width="679" alt="Image" src="https://github.com/user-attachments/assets/f0dcd198-ec22-4655-90ce-36eed1aa318e" /> This suggests that if we simply encode the chunks in parallel instead on N cores, we will get a N-times speedup (when the actual logging thread is not the bottleneck). This also gives a likely reason for the recent regression: we switched from `arrow2` to `arrow-rs`. Perhaps we could also spend some time optimizing `arrow::ipc::writer::StreamWriter` --------- Co-authored-by: Andreas Reich <andreas@rerun.io>
### Related * #10173 ### What The removal of this variant led to breakage on the dataplatform side.
#10222) ### Related * Fixes test regression introduced in #10173 ### What Related internal slack discussion https://rerunio.slack.com/archives/C07T0LFT5BQ/p1749799544492489 * [x] pass full check ci cc: @emilk
### Related * #10222 * #10173 * #9812 ### What My previous PR added timestamps to the arrow metadata whenever we encoded or decoded arrow IPC. This caused all sorts of test failures. In this PR, we instead only insert it in two places: * The gRPC log sink, used by the SDK * The gRPC client used by the viewer ### How to test ```sh pixi run rerun-release ``` ```sh cargo run -p log_benchmark --release -- --benchmarks image --connect ``` ### TODO * [x] Run `main` CI ### Future work * Add timestamp for when the messages hits the proxy server (`re_grpc_server`) * Or better yet: #10229 🤦

Related
rr.log(…, rr.Image(numpy))#9973What
This adds a breakdown of where the end-to-end latency is coming from. It automatically appears in the top panel when latency is high. Hover it to see details:
Notable changes
The latency metrics are no longer opt-in. We always show them now if they get surprisingly high.
The measurements are made by adding extra metadata to the record batches at IPC encode/decode time.
We already have accurate timestamps at log-time (
RowId) and batch-time (ChunkId).Testing
Apply this diff:
The run this in one terminal:
and in another:
Findings
In a test where I encode a bunch of uncompressed 1024² RGBA images (
tests/rust/log_benchmark/src/image.rs).I found that by far the slowest part is the IPC encoding of arrow data (yes, this is a release build):
zooming in:
This suggests that if we simply encode the chunks in parallel instead on N cores, we will get a N-times speedup (when the actual logging thread is not the bottleneck).
This also gives a likely reason for the recent regression: we switched from
arrow2toarrow-rs.Perhaps we could also spend some time optimizing
arrow::ipc::writer::StreamWriter