@elfkuzco elfkuzco commented Nov 6, 2025

Rationale

Monitor (used to monitor the resource consumption of Zimfarm tasks) is still using Netdata 1.38 (Feb 2023), while Netdata is now at 2.8.0. Also, monitor is currently not in the dev docker graph, making it hard to view the statistics for local tasks. This PR aims to address these points.

Changes

  • upgrade netdata version to 2.8.0
  • add monitor service to dev docker graph with profile worker
  • uncomment worker-related components and rely on docker profiles to selectively include them
  • include netdata subfolder in dev directory. This is what would be hosted on https://github.com/kiwix/container-images/blob/main/netdata/netdata.conf
  • use urllib in regen-stream because the netdata base image doesn't have pip installed, so we can't install requests
  • netdata v2.8.0 doesn't come with wget/killall installed, so I installed them via the package manager, as they are used in either the Dockerfile or the shell scripts
  • disable all forms of metrics collection for the parent
  • allow only container metrics collection, with a filter to send charts only for zimscraper and zimtask containers
  • change the format for cronjobs to debian style
  • use uuid5 function to build worker stream key from SHA256 fingerprint
  • use the scraper name directly in development mode since we cannot retrieve its IP address. This works because the containers all share the same network, and the go.d plugin, which communicates with the redis instance of the scraper, can connect to it via its name
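
As a rough illustration of the urllib and uuid5 points in the list above, here is a minimal Python sketch. It is not the code from this PR: the namespace constant, the function names, and the bearer-token header are assumptions made for the example, and only standard-library calls are used, since the base image has no pip.

```python
import json
import urllib.request
import uuid
from typing import Any, Optional

# Assumption: the namespace actually used by the PR is not stated here;
# NAMESPACE_URL is only a placeholder for the example.
STREAM_KEY_NAMESPACE = uuid.NAMESPACE_URL


def stream_key_from_fingerprint(fingerprint: str) -> str:
    """Derive a deterministic, UUID-formatted key from a SHA256 fingerprint."""
    # uuid5 is name-based: the same namespace and name always give the same UUID.
    return str(uuid.uuid5(STREAM_KEY_NAMESPACE, fingerprint))


def build_json_request(url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for JSON using only the standard library."""
    headers = {"Accept": "application/json"}
    if token:
        # Hypothetical auth scheme, included only to show how headers are set.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_json(url: str, token: Optional[str] = None, timeout: float = 10.0) -> Any:
    """Rough stand-in for requests.get(url).json() without third-party packages."""
    request = build_json_request(url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because uuid5 is deterministic, a given fingerprint always maps to the same stream key, so the key can be recomputed rather than stored.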
Screenshot_20251127_141848

This closes #1102

@elfkuzco elfkuzco self-assigned this Nov 6, 2025
@elfkuzco elfkuzco requested a review from benoit74 November 6, 2025 06:30

elfkuzco commented Nov 6, 2025

It seems the previous way of building the chart name doesn't work anymore. When I click on it, it says "Chart Not Found". So I removed it from the URL. The image in the PR description is what will be shown on page view. I haven't figured out how to construct the query param to point to a specific chart, and the docs/communities haven't been helpful on it.

@benoit74 benoit74 left a comment

I'm not sure I get where I should look in the dashboard to see only stats from the "current node" (scraper). Everywhere I go, I only see stats from my host machine, not just from the scraper container.

There is also a bad warning (probably because I did not get how to tweak things to have the task manager start my local monitor image instead of the officially published one):

image

And finally, it looks like I was not precise enough. It is important to also update the image at https://github.com/kiwix/container-images/tree/main/netdata since this is the one we use to run https://monitoring.openzim.org. Note that it might be easier to host this image's Dockerfile inside the zimfarm repo since it is so tightly coupled to the Zimfarm. At the very least, I don't mind if you temporarily host it in the zimfarm repo for the review to complete, and we then move it back to container-images.

@elfkuzco
Collaborator Author

I have updated the PR description. Still, it appears there are significant changes in the way netdata operates now. I have tried disabling their cloud offering, but it doesn't seem to respect that setting. Hence, I had to configure the UI to point to /v3, a UI that doesn't show the cloud pop-up (netdata/netdata#15640).

@elfkuzco
Collaborator Author

Also had to rebase the branch on top of main because of #1521

@benoit74
Collaborator

I still don't get what I'm seeing in the graphs. I expect to see resource usage (mostly CPU and memory) of the scraper container itself. Here it looks like I see resource usage of my whole machine, not only the scraper. Am I missing something?

@elfkuzco
Collaborator Author

> I still don't get what I'm seeing in the graphs. I expect to see resource usage (mostly CPU and memory) of the scraper container itself. Here it looks like I see resource usage of my whole machine, not only the scraper. Am I missing something?

Could you show a picture?

@benoit74
Collaborator

Sure. For instance, I'm quite sure the scraper I started did not consume 16G of used memory and 42G of cache.

image

@elfkuzco
Collaborator Author

Applied some updates. See the updated README. Notable changes since the last update are:

  • upgraded to 2.8.0
  • only show container-level metrics, sending only zimscraper* and zimtask* metrics:
    [stream]
    enabled = yes
    destination = ${MONITORING_DEST}
    api key = ${MONITORING_KEY}
    send charts matching = cgroup_zimscraper* cgroup_zimtask*
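
To make the effect of the `send charts matching` filter above more concrete, here is a small Python sketch that approximates Netdata's simple-pattern matching (space-separated patterns, `*` as wildcard, a leading `!` for a negative match, first match wins). It is an illustration only, not Netdata's implementation, and the chart names used with it are hypothetical.

```python
import re

# The filter from the [stream] section above.
SEND_CHARTS_MATCHING = "cgroup_zimscraper* cgroup_zimtask*"


def _pattern_to_regex(pattern):
    """Turn a pattern where '*' matches any run of characters into a regex."""
    escaped_parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(escaped_parts))


def chart_is_sent(chart, patterns=SEND_CHARTS_MATCHING):
    """Return True if the first matching pattern is a positive one."""
    for raw in patterns.split():
        negative = raw.startswith("!")
        body = raw[1:] if negative else raw
        if _pattern_to_regex(body).fullmatch(chart):
            return not negative
    # No pattern matched: the chart is not streamed.
    return False


if __name__ == "__main__":
    # Hypothetical chart names, for illustration only.
    for name in ("cgroup_zimscraper_abc.cpu", "cgroup_zimtask_xyz.mem", "system.cpu"):
        print(name, chart_is_sent(name))
```

Under this approximation, host-level charts such as a hypothetical `system.cpu` would not match either pattern and so would not be streamed to the parent.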

The system-level collection gathers metrics for my entire host machine, making it hard to discern worker-related metrics. Should we use this as an opportunity to make all containers started by the manager use a known prefix, and collect only their stats? Collecting stats of the zimfarm backend and postgresdb (because I am in development) is quite unnecessary, and reducing the amount of data collected could speed up streaming. What do you think?

@elfkuzco
Collaborator Author

You will of course need to build the monitor image with the same name as the one set as MONITOR_IMAGE in the compose file.

@benoit74
Collaborator

I'm now a bit confused. How did it work in the past? Did we collect metrics from all containers on the worker host, no matter which task they belong to? And also host metrics?
