Skip to content

Conversation

@davedavedavid
Copy link

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

davedavedavid and others added 30 commits November 12, 2021 10:27
* 'main' of https://github.com/kubeedge/sedna:
  examples: fix the image name prefix
  IL: decouple eval task and deploy task
  Fix IL bug: job misses first data when reads data.
  Fix IL bug: TrainTriggerStatus used in eval task
  buildx: add build speed for LC
  buildx: disable build speed for LC
  buildx: speed the language having builtin build
* 'main' of https://github.com/davedavedavid/sedna:
  Provide num class arguments
  Move model intialization out of estimator
  Change input model from estimator on the server side
* 'main' of https://github.com/davedavedavid/sedna:
  update
  update

# Conflicts:
#	examples/build_agg.sh
#	examples/federated-learning-mistnet-yolo-aggregator.Dockerfile
#	examples/mistnet-yolo-aggregator.Dockerfile
#	examples/mistnet-yolo-server.Dockerfile
@kubeedge-bot kubeedge-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 12, 2025
@kubeedge-bot
Copy link
Collaborator

@davedavedavid: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubeedge-bot
Copy link
Collaborator

Welcome @davedavedavid! It looks like this is your first PR to kubeedge/sedna 🎉

@kubeedge-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign moorezheng after the PR has been reviewed.
You can assign the PR to them by writing /assign @moorezheng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubeedge-bot kubeedge-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 12, 2025
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @davedavedavid, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly updates the Federated Learning framework within the sedna project, primarily by integrating support for MindSpore and Huawei Ascend NPUs, specifically for YOLOv5 models. It involves a comprehensive refactoring of Custom Resource Definitions (CRDs) for datasets, models, and federated learning jobs, alongside updates to the core training and aggregation logic to leverage the new hardware and software stack. Additionally, new tooling for building Docker images and a Prometheus exporter for monitoring federated learning job metrics have been introduced.

Highlights

  • MindSpore & Ascend NPU Integration: This pull request introduces deep integration with MindSpore and Huawei Ascend NPUs for federated learning, enabling specialized hardware acceleration for YOLOv5 models. This significantly enhances the performance and capabilities of AI workloads at the edge.
  • Comprehensive CRD Refactoring: The Custom Resource Definitions (CRDs) for Dataset, Model, and FederatedLearningJob have been thoroughly refactored and updated. This involves removing older, deprecated versions and introducing new, streamlined configurations that better support the evolving federated learning architecture.
  • YOLOv5 MistNet Example Enhancement: The existing YOLOv5 MistNet federated learning example has been substantially enhanced to leverage the newly integrated MindSpore/Ascend NPU backend. This includes critical changes to model paths, dataset parameters, and the client selection logic, optimizing the example for the new hardware and software stack.
  • Prometheus Exporter for FL Jobs: A new Go-based Prometheus exporter has been added to facilitate the monitoring of Federated Learning job statuses and metrics. This provides improved observability into the lifecycle and performance of distributed AI training tasks.
  • Improved Docker Build System: New Dockerfiles (baseagg.Dockerfile, baseclient.Dockerfile) and associated build scripts (build_agg.sh, build_train.sh) have been introduced. These additions streamline the process of creating specialized Docker images for both aggregation and training workers, ensuring consistent and efficient deployment.
  • Bug Fix in MistNet Aggregation Algorithm: A subtle but important bug in the MistNet aggregation algorithm has been corrected. Previously, the epsilon parameter was incorrectly assigned, which could lead to suboptimal or incorrect aggregation results. This fix ensures the algorithm behaves as intended.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant number of changes, including adding new examples, Dockerfiles, and a Prometheus exporter, while removing old ones. The core logic for federated learning is also updated. My review focuses on improving code quality, maintainability, and fixing several critical issues. Key areas for improvement include removing hardcoded values in scripts and Dockerfiles, fixing a critical typo and a deepcopy issue in the new exporter, and improving code structure for better maintainability.

ENV https_proxy http://1.1.1.153:3128
#ENV no_proxy localhost,127.0.0.1,.huawei.com

RUN sed -i 's/mirrors.aliyun.com/mirrors.tools.huawei.com/g' /etc/apt/sources.list && chmod 777 /tmp

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Using chmod 777 on /tmp is a significant security risk as it gives all users read, write, and execute permissions. Please use more restrictive permissions or explain why such open permissions are necessary.

return c.restClient.
Get().
Namespace(c.ns).
Resource("federatedllearningjobs").

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There is a typo in the resource name: federatedllearningjobs should be federatedlearningjobs. This will cause the watch request to fail.

Suggested change
Resource("federatedllearningjobs").
Resource("federatedlearningjobs").

*out = *in
out.TypeMeta = in.TypeMeta
in.ObjectMeta.DeepCopyInto(&out.ObjectMeta)
//in.Spec.DeepCopyInto(&out.Spec)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The deepcopy for FederatedLearningJob.Spec is commented out. This is a critical issue that can lead to race conditions and unpredictable behavior in a concurrent environment, as the spec will not be properly copied. This line must be uncommented and the deepcopy generation must be fixed if it's not working correctly.

Suggested change
//in.Spec.DeepCopyInto(&out.Spec)
in.Spec.DeepCopyInto(&out.Spec)

Comment on lines +2 to +3
index-url = http://mirrors.tools.huawei.com/pypi/simple/
trusted-host = mirrors.tools.huawei.com

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This file hardcodes a private pip index URL (mirrors.tools.huawei.com). This will prevent users outside of that network from building the associated Docker images. Please use the public PyPI index or make the index configurable.

index-url = https://pypi.org/simple/
trusted-host = pypi.org


func (e *Exporter) Collect(ch chan<- prometheus.Metric) {
ctx := context.Background()
federatedRes, err := e.client.FederatedClient("default").Get(ctx, "ct-yolo-v5", metav1.GetOptions{})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The job name (ct-yolo-v5) and namespace (default) are hardcoded. This makes the exporter specific to a single job and not reusable. These values should be configurable, for example, via command-line flags.

}

aggWorkerParam.WorkerType = jobStageAgg
aggWorkerParam.WorkerType = "server"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using magic strings like "server" and "client" for WorkerType can lead to typos and makes the code harder to maintain. It's better to define these as constants and reuse them. For example, you could add jobStageServer = "server" and jobStageClient = "client" constants and use them here and in other related places.

Suggested change
aggWorkerParam.WorkerType = "server"
aggWorkerParam.WorkerType = jobStageServer

Comment on lines +141 to 188
class Estimator_edge:
def __init__(self) -> None:
# initialize inference object with deviceID, om path, image height and width
self.model = None
self.trainer = None
self.algorithm = None
self.pretrained = None
self.saved = None
self.hyperparameters = {
"type": "yolov5",
"rounds": 1,
"target_accuracy": 0.99,
"epochs": int(Context.get_parameters("EPOCHS", 500)),
"batch_size": int(Context.get_parameters("BATCH_SIZE", 16)),
"optimizer": "SGD",
"linear_lr": False,
# The machine learning model
"model_name": "yolov5",
"model_config": "./yolov5x.yaml",
"train_params": "./hyp.scratch.yaml",
"learning_rate": 0.002
}
class Estimator_server:
def __init__(self) -> None:
# initialize inference object with deviceID, om path, image height and width
self.model = None
self.trainer = None
self.algorithm = None
self.pretrained = None
self.saved = None
self.hyperparameters = {
"use_mindspore": True,
"type": "yolov5",
"rounds": 1,
"target_accuracy": 0.99,
"epochs": int(Context.get_parameters("EPOCHS", 500)),
"batch_size": int(Context.get_parameters("BATCH_SIZE", 16)),
"repeat_epoch": 200, #repeat_epoch=200(epoch)/aug_num ####new adding params
"per_batch_size": 8, ####new adding params
"group_size": 1, ###Be consistent with aug_num ####new adding params
"optimizer": "SGD",
"linear_lr": False,
# The machine learning model
"model_name": "yolov5",
"model_config": "./yolov5s.yaml",
"train_params": "./hyp.scratch.yaml"
"model_config": "./yolov5x.yaml",
"train_params": "./hyp.scratch.yaml",
"learning_rate": 0.002
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The classes Estimator_edge and Estimator_server share a lot of common attributes and structure in their __init__ methods. To improve maintainability and reduce code duplication, consider creating a base Estimator class that these two classes can inherit from.

path: /usr/local/Ascend/driver
- name: add-ons
hostPath:
path: /usr/local/Ascend/add-ons

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The indentation in this file is inconsistent. For example, path here is indented with 15 spaces, while other similar keys are indented differently. Consistent indentation is important for readability and to prevent potential parsing issues in YAML files. Please format the file consistently.

Comment on lines +30 to +48
kubectl create -f - <<EOF
apiVersion: sedna.io/v1alpha1
kind: Model
metadata:
name: "yolo-v5-model"
spec:
url: "/model/yolov5.pth"
format: "pth"
EOF

kubectl create -f - <<EOF
apiVersion: sedna.io/v1alpha1
kind: Model
metadata:
name: "yolo-v5-pretrained-model"
spec:
url: "/pretrained/yolov5.pth"
format: "pth"
EOF

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This script inconsistently uses kubectl create here and on line 40, while using kubectl $action in other places. For consistency and to allow the script to handle other actions like delete, it would be better to use kubectl $action for all resource manipulations.

Suggested change
kubectl create -f - <<EOF
apiVersion: sedna.io/v1alpha1
kind: Model
metadata:
name: "yolo-v5-model"
spec:
url: "/model/yolov5.pth"
format: "pth"
EOF
kubectl create -f - <<EOF
apiVersion: sedna.io/v1alpha1
kind: Model
metadata:
name: "yolo-v5-pretrained-model"
spec:
url: "/pretrained/yolov5.pth"
format: "pth"
EOF
kubectl $action -f - <<EOF
apiVersion: sedna.io/v1alpha1
kind: Model
metadata:
name: "yolo-v5-model"
spec:
url: "/model/yolov5.pth"
format: "pth"
EOF
kubectl $action -f - <<EOF
apiVersion: sedna.io/v1alpha1
kind: Model
metadata:
name: "yolo-v5-pretrained-model"
spec:
url: "/pretrained/yolov5.pth"
format: "pth"
EOF

Comment on lines +62 to +76
#- name: slog-conf
#hostPath:
# path: /var/log/npu/conf/slog/slog.conf
#- name: slog
#hostPath:
#path: /var/log/npu/slog
#- name: profiling
#hostPath:
#path: /var/log/npu/profiling
#- name: dump
#hostPath:
#path: /var/log/npu/dump
#- name: user-slog
#hostPath:
#path: /var/log/npu/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There are several large blocks of commented-out code in this example file. To improve clarity and avoid confusion, it's best to remove this commented-out code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants