Conversation

@elfkuzco
Collaborator

Rationale

The healthcheck service performs all the checks each time it is called, which might be overwhelming for the Zimfarm and its components. By caching the results of each check with a TTL, we can avoid running checks in quick succession.

Changes

  • use diskcache's FanoutCache to store the result of each check. As per the docs, results are cached using pickle
  • add a decorator function that caches check results only if they are successful
  • provide a default TTL while allowing different checks to set different TTLs for their results

This closes #1455

@elfkuzco elfkuzco requested review from benoit74 and rgaudin October 24, 2025 13:15
@elfkuzco
Collaborator Author

This PR is not yet complete; it is more of a PoC to demonstrate the implementation. I will need some input on what TTL to use for each check.

@elfkuzco elfkuzco force-pushed the healthcheck-diskcache branch from 8984fbe to 84eb987 on October 24, 2025 13:28
Member

@rgaudin rgaudin left a comment


Have some minor questions inline.

My main concern is that it's not clear from the code what we're caching, what we're retrieving and what we're sending back.
There are tests, so I guess it's working, but it's unclear to me how we're caching the Result object. Is it serialized? How? How are we deserializing it?

@elfkuzco
Collaborator Author

but it's unclear to me how we're caching the Result object. Is it serialized? How? How are we deserializing it?

This is handled by the library. By default, diskcache natively stores types like bytes, strings, integers, and floats, and uses pickle for everything else. This is how we store the pydantic objects and retrieve them directly, since each check returns a Result object.

@elfkuzco
Collaborator Author

Why a FanoutCache and not another one? We should comment here to record the rationale.

I opted to use it because it's the same as what the DjangoCache uses. I figured that since it is used in a web framework, it should suffice here. It supports sharding, unlike the default Cache. The docs say:

"While readers and writers do not block each other, writers block other writers. Therefore a shard for every concurrent writer is suggested. This will depend on your scenario."

So, since we run the checks with asyncio.gather, we don't want one writer to block another. I suppose I should add the rationale in the code too.

@elfkuzco elfkuzco requested a review from rgaudin October 24, 2025 17:21
@rgaudin
Member

rgaudin commented Oct 24, 2025

but it's unclear to me how we're caching the Result object. Is it serialized? How? How are we deserializing it?

This is handled by the library. By default, diskcache natively stores types like bytes, strings, integers, and floats, and uses pickle for everything else. This is how we store the pydantic objects and retrieve them directly, since each check returns a Result object.

Good; please add a small comment to make this clear.

@rgaudin
Member

rgaudin commented Oct 24, 2025

Why a FanoutCache and not another one? We should comment here to record the rationale.

I opted to use it because it's the same as what the DjangoCache uses. I figured that since it is used in a web framework, it should suffice here. It supports sharding, unlike the default Cache. The docs say:

"While readers and writers do not block each other, writers block other writers. Therefore a shard for every concurrent writer is suggested. This will depend on your scenario."

So, since we run the checks with asyncio.gather, we don't want one writer to block another. I suppose I should add the rationale in the code too.

Good, a comment with this would be perfect

Collaborator

@benoit74 benoit74 left a comment


I'm sorry but I'm not convinced at all by this PR.

I don't see why we want to store data on disk using an SQLite DB. Check results do not need to be persisted, from my understanding so far, and adding such a persistence layer is only unneeded accidental complexity from my perspective (which can only lead to useless bugs if we confirm it does not add value). Can you explain why we would need this?

Despite the explanations, I still don't get why we use a FanoutCache. From what I read, this shards the DB to reduce the chance of blocking writes. This also seems like unneeded complexity. We have only a few checks (10 maybe at some point; we will probably never reach 100). We plan to update them only once per minute or so. So it is not as if we have a significant chance of so many writes per second that we might run into a performance issue.

I also do not get how this PR fulfills the requirements I've expressed. With this PR, as soon as the cache has expired, if we have multiple parallel calls to the check, I feel like the code will still run multiple checks in parallel (the cache is expired, so the code will be run). I would expect that checks are periodically updated but that we continue to serve the old status until the new status has been retrieved.

I also don't get why we cache only on success. Failure is when everyone will "rush" (to a moderate extent; we do not have that much traffic, but still) to check the status, and we do not want too many parallel or repeated calls to the check, which would only overwhelm the upstream system that is already failing.

Finally, I miss the capability to force-refresh the check status.

After some thought, I'm also not really convinced we are using the right approach at a higher level. Nothing very precise yet, but I'm still puzzled by the special HTTP return code for each check, and by the fact that we still have no history of check statuses over time except bare information in Slack (which is really far away from the system supposed to hold this history)... This deserves some thought; sorry for not having precise input here ATM.

Am I missing the point of this PR, or are my points valid for analysis? Do not rush into changing the code yet; I feel we need to align better first.
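For clarity, the serve-old-status-until-refreshed behaviour described above might look roughly like this (all names illustrative; this is not the PR's implementation, just a sketch of the expectation):

```python
# A background task refreshes each check periodically, while requests
# only ever read the last stored status (stale until refresh completes).
import asyncio

_status: dict[str, str] = {}


async def run_check(name: str) -> str:
    await asyncio.sleep(0)  # stand-in for the real check
    return "ok"


async def refresh_forever(name: str, interval: float) -> None:
    while True:
        _status[name] = await run_check(name)  # update in the background
        await asyncio.sleep(interval)


def get_status(name: str) -> str:
    # requests never trigger a check; they read the last known value
    return _status.get(name, "unknown")
```

With this shape, a cache expiry never causes a stampede: requests keep reading the old value while a single background task replaces it.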

@elfkuzco
Collaborator Author

I don't see why we want to store data on disk using an SQLite DB. Check results do not need to be persisted, from my understanding so far, and adding such a persistence layer is only unneeded accidental complexity from my perspective (which can only lead to useless bugs if we confirm it does not add value). Can you explain why we would need this?

It seems to be an implementation detail of the library, and I'm not really in control of it.

@elfkuzco
Collaborator Author

I also do not get how this PR fulfills the requirements I've expressed. With this PR, as soon as the cache has expired, if we have multiple parallel calls to the check, I feel like the code will still run multiple checks in parallel (the cache is expired, so the code will be run). I would expect that checks are periodically updated but that we continue to serve the old status until the new status has been retrieved.

Do you mean this should be run periodically by a cron job and not from the HTTP request? Because it's really only a caching mechanism (even if not this one) that prevents the checks from happening each time a request is made. The checks can be configured with different intervals, but at some future time all the intervals will coincide and the checks will run in parallel. Running them "periodically" would still be susceptible to parallelism.

@elfkuzco
Collaborator Author

I also don't get why we cache only on success. Failure is when everyone will "rush" (to a moderate extent; we do not have that much traffic, but still) to check the status, and we do not want too many parallel or repeated calls to the check, which would only overwhelm the upstream system that is already failing.

This is quite parametrizable, and I can flip the switch to false if that shouldn't be the default.

@elfkuzco
Collaborator Author

Finally, I miss the capability to force-refresh the check status.

I will add a query parameter to the checks that will force a refresh.

@benoit74
Collaborator

@elfkuzco please wait for the discussion on openzim/overview#72 to settle; we've discussed a bit with @rgaudin and confirmed we lack the proper big picture needed to correctly implement things in this PR.

@kelson42
Contributor

@benoit74 Just want to say that I mostly share this opinion. It should be possible to have a better software architecture than this.
