Conversation

@elfkuzco
Collaborator

Rationale

The healthcheck service performs all the checks each time it is called, which might be overwhelming for the Zimfarm and its components. By caching the results of each check with a TTL, we can avoid running checks in quick succession.

Changes

  • use diskcache's FanoutCache to store the result of each check. As per the docs, results are cached using pickle
  • add a decorator function that caches check results only if they are successful
  • provide a default TTL while allowing different checks to set different TTLs for their results

This closes #1455

@elfkuzco elfkuzco requested review from benoit74 and rgaudin October 24, 2025 13:15
@elfkuzco
Collaborator Author

This PR is not yet complete; it is more of a PoC to demonstrate the implementation. I will need some input on what TTL to use for each check.

@elfkuzco elfkuzco force-pushed the healthcheck-diskcache branch from 8984fbe to 84eb987 on October 24, 2025 13:28
Member

@rgaudin rgaudin left a comment


Have some minor questions inline.

My main concern is that it's not clear from the code what we're caching, what we're retrieving and what we're sending back.
There are tests, so I guess it's working, but it's unclear to me how we're caching the Result object. Is it serialized? How? How are we deserializing it?

@elfkuzco
Collaborator Author

but it's unclear to me how we're caching the Result object. Is it serialized? How? How are we deserializing it?

This is handled by the library. By default, diskcache natively stores types like bytes, strings, integers, and floats, and uses pickle for everything else. This is how we store the pydantic objects and retrieve them directly, since each check returns a Result object.

@elfkuzco
Collaborator Author

Why a FanoutCache and not another one? We should comment here to record the rationale.

I opted to use it because it's the same as what the DjangoCache uses. I figured that since it is used in a web framework, it should suffice here. It supports sharding, unlike the default Cache. The docs say:

"While readers and writers do not block each other, writers block other writers. Therefore a shard for every concurrent writer is suggested. This will depend on your scenario."

So, since we run the checks with asyncio.gather, we don't want one writer to block another. I suppose I should add the rationale in the code too.

@elfkuzco elfkuzco requested a review from rgaudin October 24, 2025 17:21
@rgaudin
Member

rgaudin commented Oct 24, 2025

but it's unclear to me how we're caching the Result object. Is it serialized? How? How are we deserializing it?

This is handled by the library. By default, diskcache natively stores types like bytes, strings, integers, and floats, and uses pickle for everything else. This is how we store the pydantic objects and retrieve them directly, since each check returns a Result object.

Good; please add a small comment to make this clear.

@rgaudin
Member

rgaudin commented Oct 24, 2025

Why a FanoutCache and not another one? We should comment here to record the rationale.

I opted to use it because it's the same as what the DjangoCache uses. I figured that since it is used in a web framework, it should suffice here. It supports sharding, unlike the default Cache. The docs say:

"While readers and writers do not block each other, writers block other writers. Therefore a shard for every concurrent writer is suggested. This will depend on your scenario."

So, since we run the checks with asyncio.gather, we don't want one writer to block another. I suppose I should add the rationale in the code too.

Good, a comment with this would be perfect

Collaborator

@benoit74 benoit74 left a comment


I'm sorry but I'm not convinced at all by this PR.

I don't see why we want to store data on disk using an SQLite DB. Check results do not need to be persisted, from my understanding so far, and adding such a persistence layer is only unneeded accidental complexity from my perspective (which can only lead to useless bugs if we confirm it does not add value). Can you explain why we would need this?

Despite the explanations, I still don't get why we use a FanoutCache. From what I read, this shards the DB to reduce the chance of blocking writes. This also seems like unneeded complexity. We have only a few checks (10 maybe at some point; we will probably never reach 100). We plan to update them only once per minute or so. So it is not as if we have a significant chance of so many writes per second that we might run into a performance issue.

I also do not get how this PR fulfills the requirements I've expressed. With this PR, as soon as the cache has expired, if we have multiple parallel calls to the check, I feel like the code will still run multiple checks in parallel (the cache is expired, so the code will be run). I would expect that checks are periodically updated but that we continue to serve the old status until the new status has been retrieved.

I also don't get why we cache only on success. Failure is when everyone will "rush" (to a moderate extent; we do not have that much traffic, but still) to check the status, and we do not want too many parallel or repeated calls to the check, which would only overwhelm the upstream system that is already failing.

Finally, I miss the capability to force-refresh the check status.

After some thought, I'm also not really convinced we are using the right approach at a higher level. Nothing very precise yet, but I'm still puzzled by the special HTTP return code for each check, and by the fact that we still have no history of check statuses over time except bare information in Slack (which is really far away from the system supposed to hold this history)... This deserves some thought; sorry for not having precise input here ATM.

Am I missing the point of this PR, or are my points valid for analysis? Do not rush into changing the code yet; I feel we need to align better first.
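For clarity, the serve-old-status-until-refreshed behaviour described above might look roughly like this (all names illustrative; this is not the PR's implementation, just a sketch of the expectation):

```python
# A background task refreshes each check periodically, while requests
# only ever read the last stored status (stale until refresh completes).
import asyncio

_status: dict[str, str] = {}


async def run_check(name: str) -> str:
    await asyncio.sleep(0)  # stand-in for the real check
    return "ok"


async def refresh_forever(name: str, interval: float) -> None:
    while True:
        _status[name] = await run_check(name)  # update in the background
        await asyncio.sleep(interval)


def get_status(name: str) -> str:
    # requests never trigger a check; they read the last known value
    return _status.get(name, "unknown")
```

With this shape, a cache expiry never causes a stampede: requests keep reading the old value while a single background task replaces it.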

@elfkuzco
Collaborator Author

I don't see why we want to store data on disk using an SQLite DB. Check results do not need to be persisted, from my understanding so far, and adding such a persistence layer is only unneeded accidental complexity from my perspective (which can only lead to useless bugs if we confirm it does not add value). Can you explain why we would need this?

It seems to be an implementation detail of the library, and I'm not really in control of it.

@elfkuzco
Collaborator Author

I also do not get how this PR fulfills the requirements I've expressed. With this PR, as soon as the cache has expired, if we have multiple parallel calls to the check, I feel like the code will still run multiple checks in parallel (the cache is expired, so the code will be run). I would expect that checks are periodically updated but that we continue to serve the old status until the new status has been retrieved.

Do you mean this should be run periodically by a cron job and not from the HTTP request? Because it's really only a caching mechanism (even if not this one) that prevents the checks from happening each time a request is made. The checks can be configured with different intervals, but at some future time all the intervals will coincide and the checks will run in parallel. Running them "periodically" would still be susceptible to parallelism.

@elfkuzco
Collaborator Author

I also don't get why we cache only on success. Failure is when everyone will "rush" (to a moderate extent; we do not have that much traffic, but still) to check the status, and we do not want too many parallel or repeated calls to the check, which would only overwhelm the upstream system that is already failing.

This is quite parametrizable, and I can flip the switch to false if that shouldn't be the default.

@elfkuzco
Collaborator Author

Finally, I miss the capability to force-refresh the check status.

I will add a query parameter to the checks that will force a refresh.

@benoit74
Collaborator

@elfkuzco please wait for the discussion on openzim/overview#72 to settle; we've discussed a bit with @rgaudin and confirmed we lack the proper big picture needed to correctly implement things in this PR.

@kelson42
Contributor

@benoit74 Just want to say that I mostly share this opinion. It should be possible to have a better software architecture than this.
