
Enhance and standardize solution for healthcheck monitoring #72

@benoit74

Description

Current situation

We currently use UptimeRobot to check the status of our services (imager-service, zimfarm, ...). UptimeRobot sends notifications to a dedicated Slack channel whenever a status changes, and records historical service status.

In imager-service, we have implemented an endpoint which, when called, checks the status of many imager-service components and dependencies, returns a single status to UptimeRobot, and displays details as an HTML page.

In zimfarm, we have two endpoints which help monitor the queue size directly in UptimeRobot (bare JSON content), and we are close to deploying the same kind of endpoint that imager-service has implemented.
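
For reference, here is a minimal sketch of what such an aggregated healthcheck endpoint could look like. This is only an illustration, not the actual imager-service or zimfarm code; the check functions and route name are assumptions:

```python
# Hypothetical sketch only, not the actual imager-service code.
# Run every dependency check, return a single overall status for UptimeRobot,
# and expose per-check details for humans.
from flask import Flask, jsonify

app = Flask(__name__)


def check_database() -> bool:
    # placeholder: e.g. run a trivial query against the database
    return True


def check_wasabi() -> bool:
    # placeholder: e.g. HEAD a known object on the S3-compatible storage
    return True


CHECKS = {
    "database": check_database,
    "wasabi": check_wasabi,
}


@app.route("/health")
def health():
    results = {}
    for name, check in CHECKS.items():
        try:
            results[name] = "up" if check() else "down"
        except Exception:
            results[name] = "down"
    overall_ok = all(status == "up" for status in results.values())
    # one status code for UptimeRobot, the details payload for humans
    return jsonify({"status": "up" if overall_ok else "down", "checks": results}), (
        200 if overall_ok else 503
    )
```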

Questions

We have identified the following shortcomings in the current implementations and would like to define a common solution to fix them.

Details in notifications

Some dependencies are known to fail every now and then and to recover without our intervention. For instance, Wasabi has regular downtimes. Currently, with the imager-service implementation, Ops receives an UptimeRobot notification saying that the service is down, without details about which component / dependency is down. This is a problem because the urgency of investigating differs from one component to another. For our Wasabi example, we do not need to worry or act immediately, but we still need to be informed every time this happens.
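
One possible direction, sketched below: attach a severity to each component in the check results so that the notification names the failing component and tells Ops how urgent it is. Component names and severity assignments here are purely illustrative assumptions:

```python
# Illustrative sketch: component names and severity assignments are assumptions.
from enum import Enum


class Severity(Enum):
    INFO = "info"          # known-flaky dependency, no immediate action needed
    CRITICAL = "critical"  # core component, needs immediate attention


COMPONENT_SEVERITY = {
    "wasabi": Severity.INFO,
    "database": Severity.CRITICAL,
}


def build_notification(results: dict[str, str]) -> str:
    """Turn per-component results ({"wasabi": "down", ...}) into a message
    that names the failing components and their severity."""
    down = [name for name, status in results.items() if status != "up"]
    if not down:
        return "All components are up"
    return "\n".join(
        f"{name} is down (severity: {COMPONENT_SEVERITY.get(name, Severity.CRITICAL).value})"
        for name in down
    )
```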

Persistent status

Status needs to be checked only periodically. With the current imager-service implementation, every time something / someone calls the special status endpoint, all checks are run. This causes the following issues:

  • it is very slow to get details about the current status since all checks need to complete; when something is down, a check usually waits for a timeout, which is big enough to avoid false alarms
  • it can cause multiple checks of the same components to run in parallel, which is never a good thing, especially when systems are trying to recover
  • in case of intermittent issues, two users might get two different statuses for the same service, which is confusing

But we still need to be able to force a check refresh on demand (probably only if we are an "admin").
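
A possible approach, sketched under the assumption of a CHECKS mapping like the one above: run all checks periodically in a single background loop, serve only the cached result, and let an authenticated caller force a refresh. The lock also prevents parallel runs of the same checks:

```python
# Sketch of a cached-status approach; CHECKS is a placeholder for the real checks.
import threading
import time

CHECKS = {
    "database": lambda: True,  # placeholder check functions
    "wasabi": lambda: True,
}

REFRESH_INTERVAL = 300  # seconds between background refreshes

_lock = threading.Lock()
_last_results: dict[str, str] = {}
_last_checked: float = 0.0


def refresh(force: bool = False) -> dict[str, str]:
    """Refresh the cached status; the lock prevents parallel check runs."""
    global _last_results, _last_checked
    with _lock:
        if force or time.time() - _last_checked >= REFRESH_INTERVAL:
            _last_results = {
                name: ("up" if check() else "down") for name, check in CHECKS.items()
            }
            _last_checked = time.time()
        return dict(_last_results)


def background_refresher() -> None:
    # periodic refresh so status endpoints only ever serve cached results
    while True:
        refresh()
        time.sleep(REFRESH_INTERVAL)


threading.Thread(target=background_refresher, daemon=True).start()

# an "admin" endpoint could simply call refresh(force=True)
```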

Historical status

With the current solutions, we miss the historical status of individual components. This information can however be valuable for understanding a complex incident, by being sure about what went down first and what followed as a consequence. We can usually rebuild this information by digging into the logs, but keeping historical details about each service component's status would help.
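
A minimal sketch of what recording that history could look like: only store a row when a component's status changes, so the incident timeline can be replayed later. The in-memory storage is a placeholder assumption; a real implementation would persist this somewhere:

```python
# Sketch: record a row only when a component's status changes, so a timeline
# of an incident can be reconstructed. The in-memory list is a placeholder
# for a real store (database table, time-series DB, ...).
from datetime import datetime, timezone

_history: list[tuple[datetime, str, str]] = []  # (timestamp, component, status)
_current: dict[str, str] = {}


def record_results(results: dict[str, str]) -> None:
    now = datetime.now(timezone.utc)
    for component, status in results.items():
        if _current.get(component) != status:
            _history.append((now, component, status))
            _current[component] = status
```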

Different needs

We have basically one system for two needs.

On one side, management and our users and clients need an overview of each Kiwix service, without details.

On the other side, Ops needs details about every component. These details should probably not be exposed to the public; at least, most other service providers do not expose them, probably both because they would be confusing and because they contain sensitive information.
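
One way both audiences could be served from the same underlying data, as a rough sketch (the service/component structure and the admin check are assumptions): a public overview collapses components into a single status per service, while the detailed view stays restricted:

```python
# Sketch: derive both views from the same per-component results.
# The service/component structure and the admin check are assumptions.
def public_overview(results_by_service: dict[str, dict[str, str]]) -> dict[str, str]:
    """Collapse per-component details into one status per service."""
    return {
        service: ("up" if all(s == "up" for s in components.values()) else "down")
        for service, components in results_by_service.items()
    }


def ops_details(
    results_by_service: dict[str, dict[str, str]], is_admin: bool
) -> dict[str, dict[str, str]]:
    """Full per-component details, restricted to authenticated Ops users."""
    if not is_admin:
        raise PermissionError("detailed status is restricted to Ops")
    return results_by_service
```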

Solution

No solution has been clearly defined yet, and we should work on one so that we have a common target to aim for.
