Conversation

@jahdiel-alvarez
Contributor

Producers using shmem emulation won't be SMB-scraped by the tracing service, since the buffer isn't actually shared. To handle this scenario, such producers, after receiving a flush request from the tracing service, self-scrape their buffers, mark the chunks with uncommitted data as complete, and send the trace data back to the tracing service via a commit-data request.

Also, the tracing service has been refactored so that it always emits a flush request to producers using shmem emulation, and it no longer performs service-side SMB scraping for such producers.

Lastly, the tracing service was fixed so that it now properly sets use_shmem_emulation on the shmem_abi when setting up the shared memory for producers using shmem emulation.
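The flush-triggered self-scrape described above can be modeled with a minimal sketch. All types and names here (Chunk, ChunkState, CommitDataRequest, SelfScrape) are hypothetical stand-ins for illustration, not Perfetto's actual API:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration; not Perfetto's real types.
enum class ChunkState { kFree, kBeingWritten, kComplete };

struct Chunk {
  uint32_t writer_id = 0;
  ChunkState state = ChunkState::kFree;
  std::vector<uint8_t> payload;
};

struct CommitDataRequest {
  std::vector<Chunk> chunks_to_move;
};

// On a flush request, a producer using shmem emulation walks its local
// (non-shared) buffer, marks chunks holding uncommitted data as complete,
// and batches them into a commit-data request for the tracing service.
CommitDataRequest SelfScrape(std::vector<Chunk>& local_buffer) {
  CommitDataRequest req;
  for (Chunk& chunk : local_buffer) {
    if (chunk.state == ChunkState::kBeingWritten && !chunk.payload.empty()) {
      chunk.state = ChunkState::kComplete;  // Mark uncommitted data complete.
      req.chunks_to_move.push_back(chunk);
    }
  }
  return req;
}
```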

Bug: 456746462
Bug: 462494442

Change-Id: If3db3ac50df4e9fb1b573c1604b829fb519c9c95
@github-actions

🎨 Perfetto UI Build

✅ UI build is ready: https://storage.googleapis.com/perfetto-ci-artifacts/gh-19661063236-ui/ui/index.html

Member

@primiano primiano left a comment


Jahdiel, wow. You managed to figure out one of the darkest corners of our codebase.

I left some comments, but none of them are major.
I think you figured out all the right bits.

I'm truly impressed.

size_t page_size() const { return page_size_; }
size_t num_pages() const { return num_pages_; }
bool is_valid() { return num_pages() > 0; }
bool is_shmem_emulated() { return use_shmem_emulation_; }

Can you please call it use_shmem_emulation()? When doing code searches, it's quite nice when the names of private variables and accessors match: you can see in one go how things are plumbed, rather than following renames.
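The suggested rename, sketched below. The class shape is illustrative only; the real SharedMemoryABI carries much more state:

```cpp
#include <cstddef>

// Illustrative shape only; the real SharedMemoryABI has more state.
class SharedMemoryABI {
 public:
  size_t page_size() const { return page_size_; }
  size_t num_pages() const { return num_pages_; }
  bool is_valid() const { return num_pages() > 0; }
  // Accessor named after the member it returns, so a code search for
  // "use_shmem_emulation" finds both the field and all its call sites.
  bool use_shmem_emulation() const { return use_shmem_emulation_; }

 private:
  size_t page_size_ = 0;
  size_t num_pages_ = 0;
  bool use_shmem_emulation_ = false;
};
```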


void SharedMemoryArbiterImpl::ScrapeEmulatedSharedMemoryBuffer(
const std::map<WriterID, BufferID>& buffer_for_writers) {
if (!use_shmem_emulation_)

I think this should be a PERFETTO_CHECK(). We should never get here without shmem emulation; if we do, it's a bug and we should fail fast, not hide it.
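A sketch of the suggested change, with PERFETTO_CHECK modeled as a plain assert so the example is self-contained (in Perfetto it aborts with a log message; the Arbiter shape here is a stand-in):

```cpp
#include <cassert>

// Stand-in for Perfetto's PERFETTO_CHECK, which crashes the process with
// a log message when the condition is false.
#define PERFETTO_CHECK(x) assert(x)

struct Arbiter {
  bool use_shmem_emulation_ = true;

  void ScrapeEmulatedSharedMemoryBuffer() {
    // Fail fast instead of silently returning: reaching this code
    // without shmem emulation would be a bug in the caller.
    PERFETTO_CHECK(use_shmem_emulation_);
    // ... scrape pages ...
  }
};
```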

if (!use_shmem_emulation_)
return;

for (size_t page_idx = 0; page_idx < shmem_abi_.num_pages(); page_idx++) {

So most of this code is identical to the one in TracingServiceImpl::ScrapeSharedMemoryBuffers... but it also cannot be 100% identical.

WDYT if we factored out an inline function in shared_memory_arbiter(_impl.h) along the lines of

// F is a lambda with signature void(WriterID, SharedMemoryABI::Chunk*)
template<typename F>
void ForEachScrapableChunk(F lambda) {
  ...
}

which essentially moves all the code up to chunk.writer_id(),

so then both TSImpl and your code can diverge there and do

ForEachScrapableChunk([&](WriterID writer_id, Chunk* chunk) {
  const auto writer = buffer_for_writers.find(writer_id);
  if (writer == buffer_for_writers.end())
    return;  // A lambda can't `continue`; return skips this chunk.

  BufferID target_buffer_id = writer->second;
  PatchList ignored;

  ReturnCompletedChunk(std::move(*chunk), target_buffer_id, &ignored);
});
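A compilable toy version of the factoring idea, with simplified stand-in types (the real Chunk, WriterID, and scrapability rules live in shared_memory_abi.h and are more involved):

```cpp
#include <cstdint>
#include <vector>

using WriterID = uint16_t;

struct Chunk {
  WriterID writer_id = 0;
  bool complete = false;
};

class Arbiter {
 public:
  std::vector<Chunk> chunks;

  // F is a callable with signature void(WriterID, Chunk*). The shared
  // iteration logic lives here once; TSImpl and the producer-side
  // scraper diverge only in the lambda body they pass in.
  template <typename F>
  void ForEachScrapableChunk(F lambda) {
    for (Chunk& chunk : chunks) {
      if (!chunk.complete)  // Toy rule: incomplete chunks are scrapable.
        lambda(chunk.writer_id, &chunk);
    }
  }
};
```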

if (use_shmem_emulation_ &&
smb_scraping_mode_ ==
TracingService::ProducerSMBScrapingMode::kEnabled) {
// Producer-side SMB scraping should occur after the flush is complete.

Hmm, I am a bit conflicted here. I went back and forth for ~30 minutes while editing this comment and reasoning about it. Please double-check below and don't blindly accept it; this is tricky.

Here's my reasoning in summary, and then details:

  • I see you are trying to stay consistent with the same order in which TracingServiceImpl does things

  • But TracingServiceImpl has the benefit of being able to handle the sequence of async events linearly (because ScrapeSharedMemoryBuffers is synchronous within TSImpl::CompleteFlush)

  • But you cannot guarantee the same linearity. So there is a chance that your code might be racy: by the time you commit the scraped data, the producer has already acknowledged the flush and the service might have moved on.

  • As such I think you should make the scraping synchronous (i.e. NOT posted) but move it into ProducerIPCClientImpl::NotifyFlushComplete, because that's the equivalent linearization that TSImpl does, on the client side.

WDYT?
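The suggested linearization, sketched abstractly. The names below mirror those in the discussion, but the class shape and message strings are hypothetical:

```cpp
#include <string>
#include <vector>

// Hypothetical model of the client-side flush path. The point: the
// scrape runs synchronously inside NotifyFlushComplete, so the
// commit-data request is sent BEFORE the flush acknowledgment. This
// matches the linearization TracingServiceImpl gets for free on the
// service side, where scraping is synchronous within CompleteFlush.
struct ProducerClient {
  std::vector<std::string> sent_messages;

  void ScrapeEmulatedSharedMemoryBuffer() {
    sent_messages.push_back("CommitDataRequest");  // Scraped chunks.
  }

  void NotifyFlushComplete() {
    ScrapeEmulatedSharedMemoryBuffer();  // Synchronous, not posted.
    sent_messages.push_back("FlushAck");
  }
};
```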

(some further notes below, just out of interest. You have no idea how many times I rewrote this comment XD)

Scraping has been designed to be idempotent. As in, because it happens under the hood, you can first scrape, and then the producer can do the real commit, and TraceBuffer is designed to deal with it, i.e. it allows "recommit" and checks that you are only "growing" payload and flags. (*)

BUT THEN, I looked at the code in TracingServiceImpl and realized that scraping has been implemented only AFTER NotifyFlushDoneForProducer. And here I realize you are trying to be consistent with it.

I think in practice both will work. If you commit, and then scrape, the data should be gone already (and even if you accidentally commit twice, the TraceBuffer should be able to handle that)

If you look at the code in trace_buffer_v1.cc that says

  // Check whether we have already copied the same chunk previously. This may
  // happen if the service scrapes chunks in a potentially incomplete state
  // before receiving commit requests for them from the producer. Note that the
  // service may scrape and thus override chunks in arbitrary order since the
  // chunks aren't ordered in the SMB.

// needs the mapping of writer to buffer in order to do producer-side
// SMB scraping as it cannot rely on the map stored in the tracing
// service.
std::map<WriterID, BufferID> writers_;

Can you call this writers_for_scraping_, so it's easier to reason about?

// with the tracing service, therefore we NEED to trigger a flush request
// so that the producer can self-scrape their emulated SMB. Trigger the
// flush by inserting an empty vector if the producer is not in the map
// already.

Since you are here and you put in the effort of writing an RFC, link it:
"See RFC-0010"

// so that the producer can self-scrape their emulated SMB. Trigger the
// flush by inserting an empty vector if the producer is not in the map
// already.
data_source_instances[producer_id];

This looks like a bug (I know it's not; I mean, if you just look at the code without thinking too much).
Do data_source_instances.emplace(producer_id) instead, as it looks more explicit and deliberate?
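For context, a self-contained sketch of why the bare subscript reads oddly: std::map::operator[] default-constructs the mapped value as a side effect of lookup. Note that std::map::emplace needs arguments for both pair members, so std::map::try_emplace(key) (C++17) is the closest explicit spelling of "insert a default value if absent". The map shape below is a simplified stand-in:

```cpp
#include <map>
#include <vector>

using DataSourceInstanceList = std::vector<int>;  // Placeholder element type.

std::map<int, DataSourceInstanceList> MakeMapWithProducer(int producer_id) {
  std::map<int, DataSourceInstanceList> data_source_instances;

  // Looks like a discarded read, but operator[] inserts a
  // default-constructed (empty) vector if the key is absent.
  data_source_instances[producer_id];

  // Equivalent but explicit: insert an empty vector only if absent.
  data_source_instances.try_emplace(producer_id);

  return data_source_instances;
}
```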

producer->name_.c_str());
// TODO(jahdiel): We are creating shared memory buffers on the tracing
// service side even if the producer is using shmem emulation, which is
// wasting memory on the tracing service's machine.

TBH it might NOT be the case in practice.
If we page-fault only on reading (i.e. if the service never tries to change the SMB), in reality it will cost 0 physical memory:

  • The SMB is allocated using mmap() (well, base::SharedMemory -> PagedMemory -> mmap) and not malloc
  • If you page-fault read-only on MAP_ANON memory, on Linux it will map it to the zero page (a read-only physical page filled with zeros), of which there is one for the whole system
  • Actual memory usage happens only if you try to write, when copy-on-write kicks in.

Now, I would have to audit the code, but I believe that TSImpl, in the case of an unused SMB, will only page-fault read-only: it will try to read the page headers, read 0x00000000 (which means "page not formatted"), and move on.

So, all in all, I think we can live with this.
