Added producer-side SMB scraping for remote producers #3841
Conversation
Producers using shmem emulation won't be SMB-scraped by the tracing service, since the buffer isn't shared. To handle this scenario, such producers, after receiving a flush request from the tracing service, will self-scrape their buffers, mark the chunks with uncommitted data as complete, and send the trace data back to the tracing service via a commit data request.

Also, the tracing service has been refactored so that it always emits a flush request for producers using shmem emulation, and it no longer does service-side SMB scraping for such producers.

Lastly, the tracing service was fixed so that it now properly sets use_shmem_emulation on the shmem_abi when setting up the shared memory for producers using shmem emulation.

Bug: 456746462
Bug: 462494442
Change-Id: If3db3ac50df4e9fb1b573c1604b829fb519c9c95
🎨 Perfetto UI Build
✅ UI build is ready: https://storage.googleapis.com/perfetto-ci-artifacts/gh-19661063236-ui/ui/index.html
Jahdiel, wow. You managed to figure out one of the darkest corners of our codebase.
I left some comments, but none of it is major.
I think you figured out all the right bits.
I'm truly impressed.
```cpp
size_t page_size() const { return page_size_; }
size_t num_pages() const { return num_pages_; }
bool is_valid() { return num_pages() > 0; }
bool is_shmem_emulated() { return use_shmem_emulation_; }
```
Can you plz call it use_shmem_emulation()? When doing code searches it's quite nice when the names of private variables and accessors match, as you can see in one go how things are plumbed rather than following renames.
```cpp
void SharedMemoryArbiterImpl::ScrapeEmulatedSharedMemoryBuffer(
    const std::map<WriterID, BufferID>& buffer_for_writers) {
  if (!use_shmem_emulation_)
```
I think this should be a PERFETTO_CHECK(). We should never get here without shmem emulation; if we do, it's a bug and we should fail fast, not hide it.
```cpp
if (!use_shmem_emulation_)
  return;

for (size_t page_idx = 0; page_idx < shmem_abi_.num_pages(); page_idx++) {
```
So most of this code is identical to the one in TracingServiceImpl::ScrapeSharedMemoryBuffers... but also cannot be 100% identical.
WDYT if we factored out an inline function in shared_memory_arbiter(_impl.h) along the lines of

```cpp
// F is a lambda with signature void(WriterID, SharedMemoryABI::Chunk*)
template <typename F>
void ForEachScrapableChunk(F lambda) {
  ...
}
```

which essentially moves all the code up to chunk.writer_id(), so then both TSImpl and your code can diverge there and do

```cpp
ForEachScrapableChunk([&](WriterID writer_id, SharedMemoryABI::Chunk* chunk) {
  const auto writer = buffer_for_writers.find(writer_id);
  if (writer == buffer_for_writers.end())
    return;
  BufferID target_buffer_id = writer->second;
  PatchList ignored;
  ReturnCompletedChunk(std::move(*chunk), target_buffer_id, &ignored);
});
```
```cpp
if (use_shmem_emulation_ &&
    smb_scraping_mode_ ==
        TracingService::ProducerSMBScrapingMode::kEnabled) {
  // Producer-side SMB scraping should occur after the flush is complete.
```
Hmm, I am a bit conflicted here. I went back and forth ~30 mins while editing this comment and reasoning about it. Please double-check below and don't blindly accept it; this is tricky.
Here's my reasoning in summary, and then details:

- I see you are trying to stay consistent with the same order in which TracingServiceImpl does things.
- But TracingServiceImpl has the benefit of being able to handle the sequence of async events linearly (because ScrapeSharedMemoryBuffers is synchronous within TSImpl::CompleteFlush).
- But you cannot guarantee the same linearity. So there is a chance that your code might be racy, and by the time you commit the scraped data, the producer has already acknowledged the flush and the service might have moved on.
- As such, I think you should do the scraping synchronously (i.e. NOT posted) but move it into ProducerIPCClientImpl::NotifyFlushComplete, because that's the equivalent linearization that TSImpl does, on the client side.

WDYT?
(Some further notes below, just out of interest. You have no idea how many times I rewrote this comment XD)
Scraping has been designed to be idempotent. As in, because it happens under the hood, you can first scrape, and then the producer can do the real commit, and TraceBuffer is designed to deal with it, i.e. it allows "recommit" and checks that you are only "growing" payload and flags. (*)
BUT THEN, I looked at the code in TracingServiceImpl, and realized that scraping has been implemented only AFTER NotifyFlushDoneForProducer. And here I realize you are trying to be consistent with it.
I think in practice both will work. If you commit, and then scrape, the data should be gone already (and even if you accidentally commit twice, TraceBuffer should be able to handle that).
If you look at the code in trace_buffer_v1.cc that says:

```cpp
// Check whether we have already copied the same chunk previously. This may
// happen if the service scrapes chunks in a potentially incomplete state
// before receiving commit requests for them from the producer. Note that the
// service may scrape and thus override chunks in arbitrary order since the
// chunks aren't ordered in the SMB.
```
```cpp
// needs the mapping of writer to buffer in order to do producer-side
// SMB scraping as it cannot rely on the map stored in the tracing
// service.
std::map<WriterID, BufferID> writers_;
```
Call this writers_for_scraping_ so it's easier to reason about?
```cpp
// with the tracing service, therefore we NEED to trigger a flush request
// so that the producer can self-scrape their emulated SMB. Trigger the
// flush by inserting an empty vector if the producer is not in the map
// already.
```
Since you are here and you put in the effort of writing an RFC, link it:
"See RFC-0010"
```cpp
// so that the producer can self-scrape their emulated SMB. Trigger the
// flush by inserting an empty vector if the producer is not in the map
// already.
data_source_instances[producer_id];
```
This looks like a bug (I know it's not; I mean, if you just look at the code without thinking too much).
Do data_source_instances.emplace(producer_id) instead, as it looks more explicit and deliberate?
```cpp
producer->name_.c_str());
// TODO(jahdiel): We are creating shared memory buffers on the tracing
// service side even if the producer is using shmem emulation, which is
// wasting memory on the tracing service's machine.
```
TBH it might NOT be the case in practice.
If we page-fault only on reading (i.e. if the service never tries to change the SMB), in reality it will cost 0 physical memory:

- The SMB is allocated using mmap() (well, base::SharedMemory -> PagedMemory -> mmap) and not malloc.
- If you page-fault read-only on MAP_ANON memory, on Linux it will map it to the zero page (a read-only physical page filled with zeros), which is one for the whole system.
- Actual memory usage happens only if you try to write, when copy-on-write kicks in.

Now, I would have to audit the code, but I believe that TSImpl in the case of an unused SMB will only page-fault read-only. It will try to read the page headers, read 0x00000000 (which means "page not formatted") and move on.
So, all in all, I think we can live with this.