fix: reduce the number of memory allocations and the latency overhead. #983
Conversation
/gcbrun
enocom
left a comment
Thanks for this PR.
I'm surprised by the arguments to RecordBytesReceived escaping the stack. Do you understand why this is happening?
In any case, you're right that we don't need to schedule goroutines with every read and write. So this is a big improvement. To avoid a breaking change in the metrics, I think we need to make a small adjustment to the code here. See below for my suggestion.
Thanks again for sending this.
dialer.go
Outdated
bytesRead, err := i.Conn.Read(b)
if err == nil {
	go trace.RecordBytesReceived(context.Background(), int64(bytesRead), i.connName, i.dialerID)
	atomic.AddInt64(&i.bytesRead, int64(bytesRead))
Shall we use https://pkg.go.dev/sync/atomic#Int64.Add instead here?
Ditto below.
That does not make a difference, except that a plain atomic.AddInt64() does not rely on the inliner.
There's apparently a bug around AddInt64 for 32-bit platforms, but we can address this in a separate PR.
The lifetime of a goroutine is not limited to the lifetime of the stack frame where it was created, so all arguments of a goroutine have to be heap-allocated. The compiler allocates one object to hold all parameters, but it still has to allocate. Another problem is the use …
enocom
left a comment
Really appreciate this PR. This is a big win for anyone using the Go Connector.
@hessjcg Would you like to take a pass on this as well, now that I'm technically not a code owner anymore?
I will have time to fix the tests like …
@hessjcg looks like the test just needs to properly initialize an instrumentedConn based on the changes here.
Struct instrumentedConn would trace every call to Read() and Write(). This incurs significant overhead because the variadic arguments of trace.RecordBytesReceived() escape and need to be heap-allocated. Also, every trace.RecordBytesReceived() would be called in a new goroutine, which makes the call yet more expensive.

It makes no sense to update the performance counters this often. The default scraping interval in Prometheus is 1 minute. Google Cloud Monitoring had the same interval before it added high-resolution counters that scrape services every 10 seconds.

Only update integer counters in the hot path, and update OpenCensus' counters once every 5 seconds.

I have a test that just loops through `SELECT * FROM t WHERE id = $1` and has response sizes ranging from several dozen bytes to several kilobytes. At 64 connections to Postgres and 10k requests per connection, the test makes approx. 25.5M allocations before this patch and 7.2M after it. The CPU time goes down accordingly because OpenCensus and the garbage collector have less work to do.
…ons. Creating a goroutine introduces latency of its own. Moreover, a Dial() that makes a TLS connection will not benefit from any possible microsecond-level savings.
Fixed
Nice work. I'll review the code more carefully today. I noticed this unusual error in the unit test results, only for the i386 architecture. Any idea what to do about it?
See GoogleCloudPlatform/alloydb-go-connector#686 where we ported this change and used …
Unlike int64 struct fields, atomic.Int64 fields are guaranteed to be aligned to 8 bytes on 32-bit platforms. This alignment is required for atomic ops to work. Also, replace atomic int32 vars with atomic.Int32 for consistency.
Ping. I've replaced plain …
@hessjcg to review. LGTM. Thanks again for this.
Hey @hessjcg! Is anyone out there to read PRs?
hessjcg
left a comment
Thank you for this improvement. This looks great!
@hessjcg please make a release so that I can pull in this fix without referencing some random git commit.
Hi @dominiquelefevre, thanks for this great contribution! We have a monthly release scheduled, and we will release a new version with this PR included in the next 1-2 weeks.