Current situation
We currently use UptimeRobot to check the status of our services (imager-service, zimfarm, ...). UptimeRobot sends notifications to a dedicated Slack channel whenever a status changes, and it records historical service status.
In imager-service, we have implemented an endpoint which, when called, checks the status of many imager-service components and dependencies, returns a single status to UptimeRobot, and displays the details as an HTML page.
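As a rough illustration of this pattern (not the actual imager-service code), a Flask-style endpoint could aggregate individual component checks into a single HTTP status for UptimeRobot while rendering per-component details for humans; the check names below are purely hypothetical.

```python
from flask import Flask

app = Flask(__name__)

def check_database():
    # placeholder: a real check would open a connection and run a trivial query
    return True

def check_wasabi():
    # placeholder: a real check would e.g. fetch a known object from the bucket
    return True

CHECKS = {"database": check_database, "wasabi": check_wasabi}

@app.route("/status")
def status():
    results = {name: check() for name, check in CHECKS.items()}
    all_up = all(results.values())
    # UptimeRobot only needs the HTTP code; humans get the per-component details
    html = "<br/>".join(f"{name}: {'OK' if up else 'DOWN'}" for name, up in results.items())
    return html, (200 if all_up else 503)
```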
In zimfarm, we have two endpoints which help monitor the queue size directly in UptimeRobot (bare JSON content), and we are close to deploying the same kind of endpoint that imager-service has implemented.
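For comparison, the bare-JSON style of endpoint could look like the sketch below (the queue-counting function is an assumption, not the real zimfarm implementation); UptimeRobot keyword or threshold monitors can then read the value directly.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def pending_tasks_count():
    # placeholder: the real implementation would query the zimfarm database
    return 42

@app.route("/queue-size")
def queue_size():
    # bare JSON body that UptimeRobot can parse with a keyword/threshold monitor
    return jsonify({"queue_size": pending_tasks_count()})
```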
Questions
We have identified the following shortcomings in the current implementations and would like to define a common solution to address them.
Details in notifications
Some dependencies are known to fail every now and then and to recover without our intervention. For instance, Wasabi has regular downtimes. Currently, with the imager-service implementation, Ops receives an UptimeRobot notification saying that the service is down, without details about which component or dependency is down. This is a problem because the urgency of investigating differs from one component to another. For our Wasabi example, we do not need to worry or act immediately, but we still need to be informed every time this happens.
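One way to address this (a sketch, assuming each check can be tagged with a severity; the component names and severities below are illustrative) is to attach a severity to every component so a notification can say what is down and how urgent it is:

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # page Ops immediately
    WARNING = "warning"    # inform only, no immediate action (e.g. Wasabi blips)

@dataclass
class ComponentStatus:
    name: str
    up: bool
    severity: Severity

def notification_text(statuses):
    down = [s for s in statuses if not s.up]
    if not down:
        return "All components OK"
    return "; ".join(f"{s.name} DOWN ({s.severity.value})" for s in down)

# Example: only Wasabi is down -> "wasabi DOWN (warning)"
print(notification_text([
    ComponentStatus("database", True, Severity.CRITICAL),
    ComponentStatus("wasabi", False, Severity.WARNING),
]))
```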
Persistent status
Status only needs to be checked periodically. With the current imager-service implementation, every time something or someone calls the special status endpoint, all checks are run. This causes the following issues:
- it is very slow to get details about the current status, since all checks need to complete ... and when something is down, a check usually waits for a timeout ... which is big enough to avoid false alarms
- it can cause multiple checks of the same component to run in parallel ... which is never a good thing, especially when systems are trying to recover
- in case of intermittent issues, two users might get two different statuses for the same service ... kinda confusing
But we still need to be able to force a check refresh on demand (probably restricted to "admin" users).
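A minimal sketch of what such a persistent status could look like, assuming checks are refreshed by a background loop and served from a cache, with an on-demand refresh gated behind an admin flag (the interval, check functions and auth handling are assumptions, not a proposal for a specific framework):

```python
import threading
import time
from datetime import datetime, timezone

CHECK_INTERVAL = 300  # seconds between periodic check runs (assumption)
_last_result = {"checked_at": None, "components": {}}
_lock = threading.Lock()

def run_all_checks():
    # placeholder: would run the real component checks (database, wasabi, ...)
    return {"database": True, "wasabi": True}

def refresh_status():
    results = run_all_checks()
    with _lock:
        _last_result["checked_at"] = datetime.now(timezone.utc)
        _last_result["components"] = results

def background_loop():
    # single periodic runner, so checks never run twice in parallel
    while True:
        refresh_status()
        time.sleep(CHECK_INTERVAL)

threading.Thread(target=background_loop, daemon=True).start()

def get_status(force=False, is_admin=False):
    # readers always get the cached result; only admins can force a fresh run
    if force and is_admin:
        refresh_status()
    with _lock:
        return dict(_last_result)
```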
Historical status
With the current solutions, we miss the historical status of individual components. This information can however be valuable for understanding a complex incident, by being sure about what went down first and what followed as a consequence. We can usually rebuild this information by looking into the logs, but historical details about each service component's status would help.
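For illustration, a simple append-only record of check results would be enough to answer "what went down first?" later on; the SQLite table below is just one possible shape, not a proposal for a specific storage backend.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("status_history.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS status_history (
        checked_at TEXT NOT NULL,   -- ISO timestamp of the check
        component  TEXT NOT NULL,   -- e.g. "wasabi", "database"
        is_up      INTEGER NOT NULL -- 1 = up, 0 = down
    )
""")

def record(component, is_up):
    conn.execute(
        "INSERT INTO status_history VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), component, int(is_up)),
    )
    conn.commit()

# Later, "what went down first?" becomes a simple query:
# SELECT component, MIN(checked_at) FROM status_history
# WHERE is_up = 0 GROUP BY component;
```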
Different needs
We basically have one system for two needs.
On one side, management, our users and our clients need an overview of each Kiwix service, without details.
On the other side, Ops needs details about every component. These details should probably not be exposed to the public; at least, most other service providers don't expose them, probably both because they are too confusing and because they contain sensitive information.
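A sketch of how the same status data could feed both audiences, assuming a per-service, per-component data shape (the service and component names are invented):

```python
STATUS = {
    "imager-service": {"database": True, "wasabi": False, "scheduler": True},
    "zimfarm": {"api": True, "queue": True},
}

def public_overview():
    # one coarse status per service, no component details
    return {service: all(components.values()) for service, components in STATUS.items()}

def ops_details():
    # full per-component breakdown, restricted to Ops
    return STATUS
```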
Solution
No solution has been clearly defined yet, and we should work on one so that we have a common target to aim for.