use diskcache to memoize check results #1467
Conversation
PR is not yet complete but more of a PoC to demonstrate the implementation. Will need some inputs on what TTL to use for each check.
Force-pushed 8984fbe to 84eb987
rgaudin left a comment
Have some minor questions inline.
My main concern is that it's not clear from the code what we're caching, what we're retrieving and what we're sending back.
There are tests so I guess it's working, but it's unclear to me how we're caching the Result object. Is it serialized? How? How are we deserializing it?
This is handled by the library. The default implementation stores natively supported types (bytes, strings, integers, floats) directly and uses pickle for everything else. This is how we store the pydantic objects and retrieve them directly, since each check is a
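As a stdlib-only illustration of that round-trip (using a plain dataclass as a hypothetical stand-in for the project's pydantic Result model; diskcache performs the same pickle round-trip internally when writing to SQLite):

```python
import pickle
from dataclasses import dataclass

# Hypothetical stand-in for the project's pydantic Result model;
# the name and fields are illustrative only.
@dataclass
class Result:
    check: str
    healthy: bool

# diskcache pickles values it cannot store natively before writing
# them to its SQLite store, and unpickles them on read:
blob = pickle.dumps(Result(check="zimfarm", healthy=True))
restored = pickle.loads(blob)

print(restored)  # Result(check='zimfarm', healthy=True)
```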
I opted to use it because it's the same as what the DjangoCache uses. Figured since it is used in a web framework, it should suffice here. It has support for sharding, unlike the default Cache. The docs say: "While readers and writers do not block each other, writers block other writers. Therefore a shard for every concurrent writer is suggested. This will depend on your scenario." So, I figured since we call the results with
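To illustrate what sharding buys here, a conceptual stdlib sketch (this only demonstrates the idea of hashing keys into independent shards so writers touching different shards don't contend; it is not FanoutCache's actual internals, and the shard count and key names are made up):

```python
import hashlib

SHARDS = 8  # illustrative shard count

def shard_for(key: str) -> int:
    # Stable hash so the same key always lands in the same shard.
    digest = hashlib.md5(key.encode()).digest()
    return digest[0] % SHARDS

# One independent store per shard; in FanoutCache each shard is
# its own SQLite file, so writers to different shards don't block.
shards = [dict() for _ in range(SHARDS)]

def put(key: str, value) -> None:
    shards[shard_for(key)][key] = value

def get(key: str):
    return shards[shard_for(key)].get(key)

put("check:zimfarm", "healthy")
print(get("check:zimfarm"))  # healthy
```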
Good ; please add a small comment to make this clear |
Good, a comment with this would be perfect |
benoit74 left a comment
I'm sorry but I'm not convinced at all by this PR.
I don't see why we want to store data on disk using an SQLite DB. Check results do not need to be persisted, from my understanding so far, and adding such a persistence layer is only unneeded accidental complexity from my perspective (which can only lead to useless bugs if we confirm it does not add value). Can you explain why we would need this?
Despite the explanations, I still don't get why we use a FanoutCache. From what I read, this shards the DB to reduce the chance of blocking writes. This also seems like unneeded complexity. We have only a few checks (10 maybe at some point; we will probably never reach 100), and we plan to update them only once per minute or so. So it is not as if we have a significant chance of so many writes per second that we might run into a performance issue.
I furthermore do not see how this PR fulfills the requirements I've expressed. With this PR, as soon as the cache has expired, if we have multiple parallel calls to the check, the code will still run multiple checks in parallel (the cache is expired, so the code will run). I would expect checks to be updated periodically while we continue to serve the old status until the new status has been retrieved.
I also don't get why we cache only on success. In case of failure, this is where everyone will "rush" (to a moderate extent; the service is not that popular, but still) to check the status, and we do not want parallel or too many repeated calls to the check, which would only overwhelm an upstream system that is already failing.
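The behaviour asked for in the last two points (keep serving the last known status, successful or failed, while at most one caller refreshes it) could be sketched like this; the class and names are hypothetical, not code from the PR:

```python
import threading
import time

class StaleWhileRefresh:
    """Serve the cached status; let only one caller refresh it at a time."""

    def __init__(self, fetch, ttl: float):
        self._fetch = fetch            # callable running the real check
        self._ttl = ttl
        self._lock = threading.Lock()  # single-flight guard
        self._value = None
        self._fetched_at = 0.0

    def get(self):
        fresh = self._value is not None and (
            time.monotonic() - self._fetched_at < self._ttl
        )
        if fresh:
            return self._value
        # Only one caller refreshes; concurrent callers fall through
        # and keep serving the stale value instead of re-running the check.
        if self._lock.acquire(blocking=False):
            try:
                self._value = self._fetch()  # cached whether good or bad
                self._fetched_at = time.monotonic()
            finally:
                self._lock.release()
        return self._value

calls = []
def check():
    calls.append(1)
    return "healthy"

cache = StaleWhileRefresh(check, ttl=60.0)
print(cache.get(), len(calls))  # healthy 1
print(cache.get(), len(calls))  # healthy 1  (served from cache)
```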
Finally, I miss the capability to force-refresh the check status.
After some thought, I'm also not really convinced we are using the right approach at a higher level. Nothing very precise yet, but I'm still puzzled by the special HTTP return code for each check, and by the fact that we still have no history of check statuses over time except bare information in Slack (which is really far from the system that is supposed to hold this history). This deserves some thought; sorry for not having precise input here at the moment.
Am I missing the point of this PR, or are my points valid and worth analyzing? Do not rush into changing code yet; I feel we need to better align first.
It seems to be an implementation detail of the library and I'm not really in control of it.
Do you mean this should be run periodically by a cron job and not from the HTTP request? Because it's probably only a caching mechanism (even if not this one) that prevents the checks from happening each time a request is made. The checks can be configured with different intervals, but at some future time all the intervals will coincide and the checks will run in parallel. Running them "periodically" would still be susceptible to parallelism.
This is quite parametrizable and I can flip the switch to false if that shouldn't be the norm.
I will add a query parameter to the checks that will force a refresh.
@elfkuzco please wait for discussion on openzim/overview#72 to settle, we've discussed a bit with @rgaudin and confirm we miss the proper big picture to correctly implement things in this PR. |
@benoit74 Just want to say that I mostly share the opinion here. Should be possible to have a better software architecture than this. |
Rationale
The healthcheck service currently performs all the checks on each request, which might overwhelm the zimfarm and its components. By caching the result of each check with a TTL, we avoid running checks in quick succession.
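diskcache offers this pattern through `Cache.memoize(expire=...)`; as a self-contained illustration of the TTL idea, an in-memory equivalent might look like the sketch below (decorator and check names are illustrative, not the PR's code):

```python
import functools
import time

def memoize_ttl(ttl: float):
    """Cache a zero-argument function's result in memory for `ttl` seconds."""
    def decorator(func):
        cached = {"at": 0.0, "value": None, "filled": False}

        @functools.wraps(func)
        def wrapper():
            now = time.monotonic()
            if not cached["filled"] or now - cached["at"] >= ttl:
                cached["value"] = func()  # run the real check
                cached["at"] = now
                cached["filled"] = True
            return cached["value"]        # otherwise serve the cached result
        return wrapper
    return decorator

runs = 0

@memoize_ttl(ttl=60.0)
def check_zimfarm() -> str:
    global runs
    runs += 1
    return "healthy"

print(check_zimfarm(), runs)  # healthy 1
print(check_zimfarm(), runs)  # healthy 1  (within TTL, not re-run)
```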
Changes
This closes #1455