fix: Ensure connection pool metrics stay consistent #99
Conversation
This started as a bugfix for several Prisma issues regarding incorrect metrics:

- prisma/prisma#25177
- prisma/prisma#23525

And a couple more discovered during testing, but not reported in the issues.

There were several causes for this:

1. The following pattern appears quite a lot in the mobc code:

```rust
gauge!("something").increment(1.0);
do_a_thing_that_could_fail()?;
gauge!("something").decrement(1.0);
```

So, in case `do_a_thing_that_could_fail` actually fails, the gauge will get incremented but never decremented.

2. A couple of metrics were relying on `Conn::close` being manually called, and that was not the case every once in a while.

To prevent both of those problems, I rewrote the internals of the library to rely on RAII rather than manual counters and resource management. The `Conn` struct is now split into two:

- `ActiveConn` - represents a currently checked out connection that is being actively used by the client. Holds onto a semaphore permit and can be converted into `IdleConn`. Doing so will free the permit.
- `IdleConn` - represents an idle connection, currently checked into the pool. Can be converted to `ActiveConn` by providing a valid permit.

`ConnState` represents the shared state of the connection that is retained between different activity states.

Both `IdleConn` and `ActiveConn` manage their corresponding gauges - they increment them on creation and decrement them during drop. `ConnState` manages the `CONNECTIONS_OPEN` gauge and the `CONNECTIONS_TOTAL` and `CLOSED_TOTAL` counters in the same way.

This system ensures that metrics stay consistent: since metrics are automatically incremented and decremented on state conversions, we can always be sure that:

- A connection is always either idle or active; there is no in-between state.
- The idle connections and active connections gauges will always add up to the currently open connections gauge.
- The total connections opened counter minus the total connections closed counter will always equal the number of currently open connections.

Since resources are now managed by `Drop::drop` implementations, that removes the need for a manual `close` method and simplifies the code in quite a few places, also making it safer against future changes.
```rust
drop(v);
delay_for(Duration::from_millis(800)).await;
```

```rust
let mut v = vec![];
```
This assertion was incorrect: the connection lifetime is 1 sec; by that point 1800ms had passed but the connections were not freed for some reason.
mobc update includes fixes from importcjj/mobc#99. The rest of the PR is an update to the new API of the `metrics` crate. Close prisma/team-orm#1317
```diff
 num_open: Arc<AtomicU64>,
 max_lifetime_closed: AtomicU64,
-max_idle_closed: AtomicU64,
+max_idle_closed: Arc<AtomicU64>,
```
Why are the atomics wrapped in an Arc?
Arc<T> is Send + Sync as long as T: Send + Sync, and AtomicU64 is already Send + Sync.
What do we need the Arc for?
I think I was wrong.
From https://doc.rust-lang.org/std/sync/atomic/:

> Atomic variables are safe to share between threads (they implement Sync) but they do not themselves provide the mechanism for sharing and follow the threading model of Rust. The most common way to share an atomic variable is to put it into an Arc (an atomically-reference-counted shared pointer).
Yep, exactly. Those two need to be shared because the drop for ConnState needs to increment one and decrement the other.
@jkomyno atomics are safely shareable (and you can even safely concurrently mutate them through a shared reference) but they are not magically shared just by cloning them (that would violate the idea of ownership), you need some kind of a reference or a pointer to be able to access the same location in memory from multiple places. If you don't want an Arc here, the only other option is for ConnState to borrow from the Pool and that is not necessarily practical.
aqrln
left a comment
beautiful
```rust
internals.wait_duration += wait_guard.elapsed();
drop(wait_guard);
```
A bit of an edge case and I'm not sure how critical it is here, but if the thread is preempted by the kernel between these two statements, the counters in `internals` and the histogram may diverge a bit. This could be especially sensitive when running on a cloud VM with something like a 100m CPU quota.
True. I think instead of a manual drop we can add a method to `HistogramGuard` that takes ownership of `self` and returns the elapsed time.
Yep, `internals.wait_duration += wait_guard.into_elapsed()` sounds great.
```rust
impl Drop for ConnState {
    fn drop(&mut self) {
        self.total_connections_open.fetch_sub(1, Ordering::Relaxed);
        // ...
    }
}
```
Is Relaxed sound here and no happens-before relationship is necessary, or should the loads of PoolState::num_open form an acquire-release pair with this store?
I've just moved existing code around, but I agree with you here; acquire-release makes more sense in this case.
Co-authored-by: Alexey Orlenko <[email protected]>
garrensmith
left a comment
Hey @SevInf
Thanks for the PR. There are a lot of changes here and I haven't looked at this code for a while, so I'm not entirely sure about a few things.

> So, in case `do_a_thing_that_could_fail` actually fails, gauge will get incremented, but never will get decremented.

Can you give an example of this? Why is adding the GaugeGuard your solution to this?

> Couple of metrics were relying on `Conn::close` being manually called and that was not the case every once in a while.

What causes `Conn::close` not to be called?

I don't understand why there are two connection states; I can't see what the difference is between them.

The error in metrics you are trying to fix makes sense. Are you 100% certain this fixes them?

Have you measured the performance impact this has on the library?

I know you say it makes the code simpler, but I don't see how; it seems to add a lot more to it.

Sorry for the lot of questions, I'm just trying to get some context here.
```diff
 manager,
 internals,
-semaphore: Semaphore::new(max_open),
+semaphore: Arc::new(Semaphore::new(max_open)),
```
Why do you need the Arc around the Semaphore?
What are the performance implications of this?
The Arc is needed to get an owned permit; that method (`acquire_owned`) exists only on `Arc<Semaphore>`.
No noticeable negative performance implications; if anything, mobc's own test suite consistently runs 10-15s faster on my machine after my changes.
```rust
Self {
    inner,
    state,
    _permit: permit,
```
Why do you keep the permit instead of forgetting it? I can't see you using it anywhere.
If I forget it, I will then need to add it back manually when the connection drops. If I keep it on the struct, it automatically gets returned when the struct drops; we don't need to think about doing manual management, and we can't get it wrong in case the struct drops earlier than we thought.

Generally, it's a safety mechanism to ensure connections are used correctly when checked in and out of the pool. For example, the type system now guarantees that:
Not 100% certain, but I went from being able to reproduce them consistently to not being able to reproduce them at all (autocannon with 10k connections against a test service).

mobc's own test suite consistently runs faster for me after my changes.

Sure, it's more code, but resource management no longer relies on us calling the right cleanup method at the right time; Rust does it for us in a safer and less brittle way.
Awesome, thanks for the work and explanations.

@SevInf you should have commit access and be able to push a release.

Thank you @garrensmith!

Also, @garrensmith, I do have access on GitHub but not on crates.io, so I can't publish a release right now.
FYI when building prisma-engines 5.14.0 w/ mobc 0.8.5 I'm not getting any values for most metrics

@tmm1 yeah, that's a known issue in Prisma (one problem was masking another), will be fixed in 5.22.0

@tmm1 actually, on second thought, it wouldn't have led to not getting any values at all. In your case it's probably just because the version of the

Good catch! Can confirm things work as expected with 5.14.0 + prisma/prisma-engines#5015