Added producer-side SMB scraping for remote producers #3841
Conversation
Producers using shmem emulation won't be SMB-scraped by the tracing service, since the buffer isn't shared. To handle this scenario, such producers, after receiving a flush request from the tracing service, will self-scrape their buffers, mark the chunks with uncommitted data as complete, and send the trace data back to the tracing service via a commit data request.

Also, the tracing service has been refactored so that it always emits a flush request for producers using shmem emulation, and it no longer does service-side SMB scraping for such producers.

Lastly, the tracing service was fixed so that it now properly sets use_shmem_emulation on the shmem_abi when setting up the shared memory for producers using shmem emulation.

Bug: 456746462
Bug: 462494442
Change-Id: If3db3ac50df4e9fb1b573c1604b829fb519c9c95
🎨 Perfetto UI Build
✅ UI build is ready: https://storage.googleapis.com/perfetto-ci-artifacts/gh-19661063236-ui/ui/index.html
Jahdiel, wow. You managed to figure out one of the darkest corners of our codebase.
I left some comments, but none of it is major.
I think you figured out all the right bits.
I'm truly impressed.
```cpp
size_t page_size() const { return page_size_; }
size_t num_pages() const { return num_pages_; }
bool is_valid() { return num_pages() > 0; }
bool is_shmem_emulated() { return use_shmem_emulation_; }
```
Can you plz call it use_shmem_emulation()? When doing code searches it's quite nice when the names of private variables and accessors match, as you can see in one go how things are plumbed rather than following renames.
```cpp
void SharedMemoryArbiterImpl::ScrapeEmulatedSharedMemoryBuffer(
    const std::map<WriterID, BufferID>& buffer_for_writers) {
  if (!use_shmem_emulation_)
```
I think this should be a PERFETTO_CHECK(). We should never get here without shmem emulation; if we do, it's a bug and we should fail fast, not hide it.
```cpp
if (!use_shmem_emulation_)
  return;

for (size_t page_idx = 0; page_idx < shmem_abi_.num_pages(); page_idx++) {
```
So most of this code is identical to the one in TracingServiceImpl::ScrapeSharedMemoryBuffers... but also cannot be 100% identical.
WDYT if we factored out an inline function in shared_memory_arbiter(_impl.h) along the lines of

```cpp
// F is a lambda with signature void(WriterID, SharedMemoryABI::Chunk*)
template <typename F>
void ForEachScrapableChunk(F lambda) {
  ...
}
```

which essentially moves all the code up to chunk.writer_id(), so then both TSImpl and your code can diverge there and do

```cpp
ForEachScrapableChunk([&](WriterID writer_id, SharedMemoryABI::Chunk* chunk) {
  const auto writer = buffer_for_writers.find(writer_id);
  if (writer == buffer_for_writers.end())
    return;
  BufferID target_buffer_id = writer->second;
  PatchList ignored;
  ReturnCompletedChunk(std::move(*chunk), target_buffer_id, &ignored);
});
```
```cpp
if (use_shmem_emulation_ &&
    smb_scraping_mode_ ==
        TracingService::ProducerSMBScrapingMode::kEnabled) {
  // Producer-side SMB scraping should occur after the flush is complete.
```
Hmm, I am a bit conflicted here. I went back and forth ~30 mins while editing this comment and reasoning about it. Please double-check below and don't blindly accept it; this is tricky.
Here's my reasoning in summary, and then details:

- I see you are trying to stay consistent with the same order in which TracingServiceImpl does things.
- But TracingServiceImpl has the benefit of being able to handle the sequence of async events linearly (because ScrapeSharedMemoryBuffers is synchronous within TSImpl::CompleteFlush).
- But you cannot guarantee the same linearity. So there is a chance that your code might be racy, and by the time you commit the scraped data, the producer has already acknowledged the flush and the service might have moved on.
- As such, I think you should do the scraping synchronously (i.e. NOT posted) but move it into ProducerIPCClientImpl::NotifyFlushComplete, because that's the equivalent linearization that TSImpl does, on the client side.

WDYT?
(Some further notes below, just out of interest. You have no idea how many times I rewrote this comment XD)
Scraping has been designed to be idempotent. As in, because it happens under the hood, you can first scrape, and then the producer can do the real commit, and TraceBuffer is designed to deal with it, i.e. it allows "recommit" and checks that you are only "growing" payload and flags. (*)
BUT THEN, I looked at the code in TracingServiceImpl, and realized that scraping has been implemented only AFTER NotifyFlushDoneForProducer. And here I realize you are trying to be consistent with it.
I think in practice both will work. If you commit, and then scrape, the data should be gone already (and even if you accidentally commit twice, TraceBuffer should be able to handle that).
If you look at the code in trace_buffer_v1.cc that says:

```cpp
// Check whether we have already copied the same chunk previously. This may
// happen if the service scrapes chunks in a potentially incomplete state
// before receiving commit requests for them from the producer. Note that the
// service may scrape and thus override chunks in arbitrary order since the
// chunks aren't ordered in the SMB.
```
```cpp
// needs the mapping of writer to buffer in order to do producer-side
// SMB scraping as it cannot rely on the map stored in the tracing
// service.
std::map<WriterID, BufferID> writers_;
```
Call this writers_for_scraping_ so it's easier to reason about?
```cpp
// with the tracing service, therefore we NEED to trigger a flush request
// so that the producer can self-scrape their emulated SMB. Trigger the
// flush by inserting an empty vector if the producer is not in the map
// already.
```
Since you are here and you put in the effort of writing an RFC, link it:
"See RFC-0010"
```cpp
// so that the producer can self-scrape their emulated SMB. Trigger the
// flush by inserting an empty vector if the producer is not in the map
// already.
data_source_instances[producer_id];
```
This looks like a bug (I know it's not; I mean, if you just look at the code without thinking too much).
Do data_source_instances.emplace(producer_id) instead, as it looks more explicit and deliberate?
```cpp
producer->name_.c_str());
// TODO(jahdiel): We are creating shared memory buffers on the tracing
// service side even if the producer is using shmem emulation, which is
// wasting memory on the tracing service's machine.
```
TBH it might NOT be the case in practice.
If we page-fault only on reading (i.e. if the service never tries to change the SMB), in reality it will cost 0 physical memory:

- The SMB is allocated using mmap() (well, base::SharedMemory -> PagedMemory -> mmap) and not malloc.
- If you page-fault read-only on MAP_ANON memory, on Linux it will map it to the zero page (a read-only physical page filled with zeros), which is one for the whole system.
- Actual memory usage happens only if you try to write, when copy-on-write kicks in.

Now, I would have to audit the code, but I believe that TSImpl in the case of an unused SMB will only page-fault read-only. It will try to read the page headers, read 0x00000000 (which means "page not formatted") and move on.
So, all in all, I think we can live with this.