update #481

davedavedavid · 2025-08-12T08:16:39Z

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

* 'main' of https://github.com/kubeedge/sedna: examples: fix the image name prefix IL: decouple eval task and deploy task Fix IL bug: job misses first data when reads data. Fix IL bug: TrainTriggerStatus used in eval task buildx: add build speed for LC buildx: disable build speed for LC buildx: speed the language having builtin build

* 'main' of https://github.com/davedavedavid/sedna: Provide num class arguments Move model intialization out of estimator Change input model from estimator on the server side

* 'main' of https://github.com/davedavedavid/sedna: update update # Conflicts: # examples/build_agg.sh # examples/federated-learning-mistnet-yolo-aggregator.Dockerfile # examples/mistnet-yolo-aggregator.Dockerfile # examples/mistnet-yolo-server.Dockerfile

* 'main' of https://github.com/kubeedge/sedna: fix grammar error in api module

kubeedge-bot · 2025-08-12T08:16:49Z

@davedavedavid: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kubeedge-bot · 2025-08-12T08:16:50Z

Welcome @davedavedavid! It looks like this is your first PR to kubeedge/sedna 🎉

kubeedge-bot · 2025-08-12T08:16:52Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign moorezheng after the PR has been reviewed.
You can assign the PR to them by writing /assign @moorezheng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist

Summary of Changes

Hello @davedavedavid, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly updates the Federated Learning framework within the sedna project, primarily by integrating support for MindSpore and Huawei Ascend NPUs, specifically for YOLOv5 models. It involves a comprehensive refactoring of Custom Resource Definitions (CRDs) for datasets, models, and federated learning jobs, alongside updates to the core training and aggregation logic to leverage the new hardware and software stack. Additionally, new tooling for building Docker images and a Prometheus exporter for monitoring federated learning job metrics have been introduced.

Highlights

MindSpore & Ascend NPU Integration: This pull request introduces deep integration with MindSpore and Huawei Ascend NPUs for federated learning, enabling specialized hardware acceleration for YOLOv5 models. This significantly enhances the performance and capabilities of AI workloads at the edge.
Comprehensive CRD Refactoring: The Custom Resource Definitions (CRDs) for Dataset, Model, and FederatedLearningJob have been thoroughly refactored and updated. This involves removing older, deprecated versions and introducing new, streamlined configurations that better support the evolving federated learning architecture.
YOLOv5 MistNet Example Enhancement: The existing YOLOv5 MistNet federated learning example has been substantially enhanced to leverage the newly integrated MindSpore/Ascend NPU backend. This includes critical changes to model paths, dataset parameters, and the client selection logic, optimizing the example for the new hardware and software stack.
Prometheus Exporter for FL Jobs: A new Go-based Prometheus exporter has been added to facilitate the monitoring of Federated Learning job statuses and metrics. This provides improved observability into the lifecycle and performance of distributed AI training tasks.
Improved Docker Build System: New Dockerfiles (baseagg.Dockerfile, baseclient.Dockerfile) and associated build scripts (build_agg.sh, build_train.sh) have been introduced. These additions streamline the process of creating specialized Docker images for both aggregation and training workers, ensuring consistent and efficient deployment.
Bug Fix in MistNet Aggregation Algorithm: A subtle but important bug in the MistNet aggregation algorithm has been corrected. Previously, the epsilon parameter was incorrectly assigned, which could lead to suboptimal or incorrect aggregation results. This fix ensures the algorithm behaves as intended.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a significant number of changes, including adding new examples, Dockerfiles, and a Prometheus exporter, while removing old ones. The core logic for federated learning is also updated. My review focuses on improving code quality, maintainability, and fixing several critical issues. Key areas for improvement include removing hardcoded values in scripts and Dockerfiles, fixing a critical typo and a deepcopy issue in the new exporter, and improving code structure for better maintainability.

gemini-code-assist · 2025-08-12T08:19:05Z

examples/baseagg.Dockerfile

+ENV https_proxy http://1.1.1.153:3128
+#ENV no_proxy localhost,127.0.0.1,.huawei.com
+
+RUN sed -i 's/mirrors.aliyun.com/mirrors.tools.huawei.com/g' /etc/apt/sources.list && chmod 777 /tmp


Using chmod 777 on /tmp is a significant security risk as it gives all users read, write, and execute permissions. Please use more restrictive permissions or explain why such open permissions are necessary.

gemini-code-assist · 2025-08-12T08:19:05Z

test/ct_exporter/clientset/v1alpha1/projects.go

+	return c.restClient.
+		Get().
+		Namespace(c.ns).
+		Resource("federatedllearningjobs").


There is a typo in the resource name: federatedllearningjobs should be federatedlearningjobs. This will cause the watch request to fail.

Suggested change

Resource("federatedllearningjobs").

Resource("federatedlearningjobs").

gemini-code-assist · 2025-08-12T08:19:05Z

test/ct_exporter/api/types/v1alpha1/zz_generated.deepcopy.go

+	*out = *in
+	out.TypeMeta = in.TypeMeta
+	in.ObjectMeta.DeepCopyInto(&out.ObjectMeta)
+	//in.Spec.DeepCopyInto(&out.Spec)


The deepcopy for FederatedLearningJob.Spec is commented out. This is a critical issue that can lead to race conditions and unpredictable behavior in a concurrent environment, as the spec will not be properly copied. This line must be uncommented and the deepcopy generation must be fixed if it's not working correctly.

Suggested change

//in.Spec.DeepCopyInto(&out.Spec)

in.Spec.DeepCopyInto(&out.Spec)

gemini-code-assist · 2025-08-12T08:19:05Z

examples/pip.conf

+index-url = http://mirrors.tools.huawei.com/pypi/simple/
+trusted-host = mirrors.tools.huawei.com


This file hardcodes a private pip index URL (mirrors.tools.huawei.com). This will prevent users outside of that network from building the associated Docker images. Please use the public PyPI index or make the index configurable.

index-url = https://pypi.org/simple/ trusted-host = pypi.org

gemini-code-assist · 2025-08-12T08:19:06Z

test/ct_exporter/main.go

+
+func (e *Exporter) Collect(ch chan<- prometheus.Metric) {
+	ctx := context.Background()
+	federatedRes, err := e.client.FederatedClient("default").Get(ctx, "ct-yolo-v5", metav1.GetOptions{})


The job name (ct-yolo-v5) and namespace (default) are hardcoded. This makes the exporter specific to a single job and not reusable. These values should be configurable, for example, via command-line flags.

gemini-code-assist · 2025-08-12T08:19:06Z

pkg/globalmanager/controllers/federatedlearning/federatedlearningjob.go

 	}

-	aggWorkerParam.WorkerType = jobStageAgg
+	aggWorkerParam.WorkerType = "server"


Using magic strings like "server" and "client" for WorkerType can lead to typos and makes the code harder to maintain. It's better to define these as constants and reuse them. For example, you could add jobStageServer = "server" and jobStageClient = "client" constants and use them here and in other related places.

Suggested change

aggWorkerParam.WorkerType = "server"

aggWorkerParam.WorkerType = jobStageServer

gemini-code-assist · 2025-08-12T08:19:07Z

examples/federated_learning/yolov5_coco128_mistnet/interface.py

+class Estimator_edge:
+    def __init__(self) -> None:
+        # initialize inference object with deviceID, om path, image height and width
+        self.model = None
+        self.trainer = None
+        self.algorithm = None
+        self.pretrained = None
+        self.saved = None
+        self.hyperparameters = {
+            "type": "yolov5",
+            "rounds": 1,
+            "target_accuracy": 0.99,
+            "epochs": int(Context.get_parameters("EPOCHS", 500)),
+            "batch_size": int(Context.get_parameters("BATCH_SIZE", 16)),
+            "optimizer": "SGD",
+            "linear_lr": False,
+            # The machine learning model
+            "model_name": "yolov5",
+            "model_config": "./yolov5x.yaml",
+            "train_params": "./hyp.scratch.yaml",
+            "learning_rate": 0.002
+        }
+class Estimator_server:
    def __init__(self) -> None:
+        # initialize inference object with deviceID, om path, image height and width
        self.model = None
+        self.trainer = None
+        self.algorithm = None
        self.pretrained = None
        self.saved = None
        self.hyperparameters = {
+            "use_mindspore": True,
            "type": "yolov5",
            "rounds": 1,
            "target_accuracy": 0.99,
            "epochs": int(Context.get_parameters("EPOCHS", 500)),
            "batch_size": int(Context.get_parameters("BATCH_SIZE", 16)),
+            "repeat_epoch": 200,        #repeat_epoch=200(epoch)/aug_num       ####new adding params
+            "per_batch_size": 8,                                        ####new adding params
+            "group_size": 1,     ###Be consistent with aug_num          ####new adding params
            "optimizer": "SGD",
            "linear_lr": False,
            # The machine learning model
            "model_name": "yolov5",
-            "model_config": "./yolov5s.yaml",
-            "train_params": "./hyp.scratch.yaml"
+            "model_config": "./yolov5x.yaml",
+            "train_params": "./hyp.scratch.yaml",
+            "learning_rate": 0.002
        }


The classes Estimator_edge and Estimator_server share a lot of common attributes and structure in their __init__ methods. To improve maintainability and reduce code duplication, consider creating a base Estimator class that these two classes can inherit from.

gemini-code-assist · 2025-08-12T08:19:07Z

build/crd-samples/sedna/federatedlearningjob.yaml

+              path: /usr/local/Ascend/driver
+        - name: add-ons
+          hostPath:
+               path: /usr/local/Ascend/add-ons


The indentation in this file is inconsistent. For example, path here is indented with 15 spaces, while other similar keys are indented differently. Consistent indentation is important for readability and to prevent potential parsing issues in YAML files. Please format the file consistently.

gemini-code-assist · 2025-08-12T08:19:07Z

build/crd-samples/sedna/start_ct.sh

+kubectl create -f - <<EOF
+apiVersion: sedna.io/v1alpha1
+kind: Model
+metadata:
+  name: "yolo-v5-model"
+spec:
+  url: "/model/yolov5.pth"
+  format: "pth"
+EOF
+
+kubectl create -f - <<EOF
+apiVersion: sedna.io/v1alpha1
+kind: Model
+metadata:
+  name: "yolo-v5-pretrained-model"
+spec:
+  url: "/pretrained/yolov5.pth"
+  format: "pth"
+EOF


This script inconsistently uses kubectl create here and on line 40, while using kubectl $action in other places. For consistency and to allow the script to handle other actions like delete, it would be better to use kubectl $action for all resource manipulations.

Suggested change

kubectl create -f - <<EOF

apiVersion: sedna.io/v1alpha1

kind: Model

metadata:

name: "yolo-v5-model"

spec:

url: "/model/yolov5.pth"

format: "pth"

EOF

kubectl create -f - <<EOF

apiVersion: sedna.io/v1alpha1

kind: Model

metadata:

name: "yolo-v5-pretrained-model"

spec:

url: "/pretrained/yolov5.pth"

format: "pth"

EOF

kubectl $action -f - <<EOF

apiVersion: sedna.io/v1alpha1

kind: Model

metadata:

name: "yolo-v5-model"

spec:

url: "/model/yolov5.pth"

format: "pth"

EOF

kubectl $action -f - <<EOF

apiVersion: sedna.io/v1alpha1

kind: Model

metadata:

name: "yolo-v5-pretrained-model"

spec:

url: "/pretrained/yolov5.pth"

format: "pth"

EOF

gemini-code-assist · 2025-08-12T08:19:07Z

build/crd-samples/sedna/federatedlearningjob.yaml

+               #- name: slog-conf
+          #hostPath:
+              # path: /var/log/npu/conf/slog/slog.conf
+               #- name: slog
+               #hostPath:
+               #path: /var/log/npu/slog
+               #- name: profiling
+          #hostPath:
+              #path: /var/log/npu/profiling
+               #- name: dump
+               #hostPath:
+               #path: /var/log/npu/dump
+               #- name: user-slog
+               #hostPath:
+               #path: /var/log/npu/


There are several large blocks of commented-out code in this example file. To improve clarity and avoid confusion, it's best to remove this commented-out code.

davedavedavid and others added 30 commits November 12, 2021 10:27

同步张顺康适配nnrt

1d6b8c8

修改边侧dockfile

0912734

bugfix

0b9368b

bugfix

3ffe3c5

bugfix

2807a7b

bugfix

05e303b

update

c6694f1

调整dockfile

4b944d7

update dockfile

8098d12

update

2eae814

Change input model from estimator on the server side

41a407c

Move model intialization out of estimator

80f4075

Provide num class arguments

f351e41

提交启动脚本

f332ca0

Merge branch 'main' of https://github.com/davedavedavid/sedna into main

b171674

* 'main' of https://github.com/davedavedavid/sedna: Provide num class arguments Move model intialization out of estimator Change input model from estimator on the server side

update

8b6eb2b

update

a3e6e40

update

b51ebf3

update

d24f072

Merge branch 'main' of https://github.com/kubeedge/sedna into main

c199444

* 'main' of https://github.com/kubeedge/sedna: fix grammar error in api module

update

8b667b6

update

b6b3348

update

73d6de8

update

600b7ca

update

de04036

update

46a7532

update

9cd7ff9

Move initialization of trainer and algorithm out estimator

68fad45

c59900052 and others added 19 commits April 25, 2022 16:56

test

8b4d60a

test

523cf0c

TEST

eb43b3b

TEST

f1c2ba5

test

7ba3da9

test

e5270ff

test

fff0335

test

a6d87e1

test

abd1c6a

test

5015ba6

test

59a0694

test

76b2157

TEST

5ca3a73

test

31621ca

TEST

501f91c

test

317fe43

test

a34fe76

TEST

1cb0dad

刷新exporter

885a4bc

kubeedge-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 12, 2025

kubeedge-bot requested review from JimmyYang20, TymonXie, jaypume and tangming1996 August 12, 2025 08:16

kubeedge-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 12, 2025

gemini-code-assist bot reviewed Aug 12, 2025

View reviewed changes

	Resource("federatedllearningjobs").
	Resource("federatedlearningjobs").

	//in.Spec.DeepCopyInto(&out.Spec)
	in.Spec.DeepCopyInto(&out.Spec)

		index-url = http://mirrors.tools.huawei.com/pypi/simple/
		trusted-host = mirrors.tools.huawei.com

	aggWorkerParam.WorkerType = "server"
	aggWorkerParam.WorkerType = jobStageServer

update #481

Are you sure you want to change the base?

update #481

Uh oh!

Conversation

davedavedavid commented Aug 12, 2025

Uh oh!

kubeedge-bot commented Aug 12, 2025

Uh oh!

kubeedge-bot commented Aug 12, 2025

Uh oh!

kubeedge-bot commented Aug 12, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants