Improve feature mode switch process #12188
Changes from all commits: 6ef5dd8, d35d166, 86265d2, 7509d11, 57bc905, e059786
```diff
@@ -10,3 +10,4 @@ tests/__pycache__/
 ctrmgr/__pycache__/
 venv
 tests/.coverage*
+.pytest_cache/
```
```diff
@@ -60,7 +60,7 @@
     CFG_SER_IP: "",
     CFG_SER_PORT: "6443",
     CFG_SER_DISABLE: "false",
-    CFG_SER_INSECURE: "false"
+    CFG_SER_INSECURE: "true"
     }

 dflt_st_ser = {
```
```diff
@@ -88,18 +88,20 @@
 JOIN_LATENCY = "join_latency_on_boot_seconds"
 JOIN_RETRY = "retry_join_interval_seconds"
 LABEL_RETRY = "retry_labels_update_seconds"
+TAG_IMAGE_LATEST = "tag_latest_image_on_wait_seconds"
 USE_K8S_PROXY = "use_k8s_as_http_proxy"


 remote_ctr_config = {
     JOIN_LATENCY: 10,
     JOIN_RETRY: 10,
     LABEL_RETRY: 2,
+    TAG_IMAGE_LATEST: 30,
     USE_K8S_PROXY: ""
     }


 def log_debug(m):
     msg = "{}: {}".format(inspect.stack()[1][3], m)
-    print(msg)
+    #print(msg)
     syslog.syslog(syslog.LOG_DEBUG, msg)
```

*renukamanavalan marked a conversation on this hunk as resolved. (Outdated)*
```diff
@@ -148,6 +150,8 @@ def init():
     with open(SONIC_CTR_CONFIG, "r") as s:
         d = json.load(s)
         remote_ctr_config.update(d)
+    if UNIT_TESTING:
+        remote_ctr_config[TAG_IMAGE_LATEST] = 0


 class MainServer:
```
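The `init()` change above zeroes the tag-latest wait under unit test so the timer fires immediately. A minimal sketch of that merge-then-override flow, with standalone stand-ins for the ctrmgrd globals (the key name is taken from the diff; everything else here is hypothetical):

```python
import json

# Default mirroring remote_ctr_config above; key name is from the diff.
TAG_IMAGE_LATEST = "tag_latest_image_on_wait_seconds"
remote_ctr_config = {TAG_IMAGE_LATEST: 30}

def init(config_text, unit_testing=False):
    # Merge the on-disk JSON config over the built-in defaults.
    d = json.loads(config_text)
    remote_ctr_config.update(d)
    # Under unit test, drop the wait to zero so timers fire right away.
    if unit_testing:
        remote_ctr_config[TAG_IMAGE_LATEST] = 0
    return remote_ctr_config

print(init('{"tag_latest_image_on_wait_seconds": 600}', unit_testing=True))
# → {'tag_latest_image_on_wait_seconds': 0}
```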
```diff
@@ -172,11 +176,11 @@ def register_db(self, db_name):
         self.db_connectors[db_name] = swsscommon.DBConnector(db_name, 0)


-    def register_timer(self, ts, handler):
+    def register_timer(self, ts, handler, args=()):
         """ Register timer based handler.
             The handler will be called on/after give timestamp, ts
         """
-        self.timer_handlers[ts].append(handler)
+        self.timer_handlers[ts].append((handler, args))


     def register_handler(self, db_name, table_name, handler):
```
```diff
@@ -235,7 +239,7 @@ def run(self):
                 lst = self.timer_handlers[k]
                 del self.timer_handlers[k]
                 for fn in lst:
-                    fn()
+                    fn[0](*fn[1])
             else:
                 timeout = (k - ct_ts).seconds
                 break
```
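The `register_timer`/`run` changes store each handler together with its arguments and invoke it as `fn[0](*fn[1])`. That pattern can be sketched in isolation; `TimerRegistry` is a hypothetical stand-in for `MainServer`, using plain numbers as timestamps:

```python
from collections import defaultdict

class TimerRegistry:
    """Hypothetical stand-in for MainServer's timer bookkeeping."""

    def __init__(self):
        self.timer_handlers = defaultdict(list)

    def register_timer(self, ts, handler, args=()):
        # Keep the callable and its args together, as in the diff,
        # so the run loop can later call fn[0](*fn[1]).
        self.timer_handlers[ts].append((handler, args))

    def fire_due(self, now):
        # Fire every handler whose timestamp has passed, oldest first.
        results = []
        for ts in sorted(list(self.timer_handlers)):
            if ts <= now:
                for fn in self.timer_handlers.pop(ts):
                    results.append(fn[0](*fn[1]))
        return results

reg = TimerRegistry()
reg.register_timer(1, lambda: "no-args")           # old-style call, args default to ()
reg.register_timer(2, lambda a, b: a + b, (3, 4))  # handler with arguments
print(reg.fire_due(5))  # → ['no-args', 7]
```

The `args=()` default keeps every pre-existing `register_timer(ts, handler)` call site working unchanged.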
```diff
@@ -426,6 +430,54 @@ def do_join(self, ip, port, insecure):
                 format(remote_ctr_config[JOIN_RETRY], self.start_time))


+def tag_latest_image(server, feat, docker_id, image_ver):
+    res = 1
+    if not UNIT_TESTING:
+        status = os.system("docker ps |grep {} >/dev/null".format(docker_id))
```
**Contributor:** Make sure the running time of the container is >= the expected minimum time. That said, systemd would have failed for frequent restarts. Think through any possible holes.

**Contributor (author):** Currently I set "tag_latest_image_on_wait_seconds" = 600s, meaning that when a k8s pod has been running for over 10 minutes (monitored via the pod's container id), I tag the container's image. Do you have any thoughts on this default value?

**Contributor:** Maybe not. Suggest:

**Contributor (author):** The situation you describe would be:
The current reactions are:
I have tested this case and some other cases; all good. By the way, about removing 'tag as local' from the reboot script: do you want me to do it in this PR or in the future? I would prefer the future, because it seems redundant but has no real impact, and it ensures the tag-as-local on reboot, which may be necessary for us.

**Contributor:** I prefer we do it in this PR, as this PR's major purpose is to tag local. One more question: after you tag as "local" and then do a service restart, who wins, systemd or kube? I prefer that systemd takes over. This would add the new version to the label, which should auto-block kube from deploying.

**Contributor (author):** Tag-as-local is in the reboot script in the sonic-utilities repo; the PR removing it from the reboot script is here. #Remove tag as local
Q: After you tag as "local" and do a service restart, who wins? By the way, I want to remove the instance_higher check, because we want to support fallback in the future: if the post-check fails, we could fall back to the previous version, and I think the version should be managed by the scheduler. What was your concern about this before? Is there anything that requires us to always use the higher version? #remove in this commit

**Contributor:** Two points:
There is no need for kube to manage the feature after local tagging. It is more stable if systemd takes over.

**Contributor (author):** Point 1: about the two PRs, the reason is that the reboot script lives in another repo (sonic-utilities), and it seems we can't raise one PR covering code in two repos. Point 2: about what happens after tagging, it is easy to let the feature go back to local after restart, but one of our visions is that k8s can collect the containers' real-time metrics; if features are taken over by systemd after tagging, we can't keep collecting those metrics.

**Contributor:** Point 1: agree. Let it be so for now.
*(hunk continued)*

```diff
+        if status:
+            syslog.syslog(syslog.LOG_ERR,
+                    "Feature {}:{} is not stable".format(feat, image_ver))
+        else:
+            image_item = os.popen("docker inspect {} |jq -r .[].Image".format(docker_id)).read().strip()
+            if image_item:
+                image_id = image_item.split(":")[1][:12]
+                image_info = os.popen("docker images |grep {}".format(image_id)).read().split()
+                if image_info:
+                    image_rep = image_info[0]
+                    res = os.system("docker tag {} {}:latest".format(image_id, image_rep))
+                    if res != 0:
+                        syslog.syslog(syslog.LOG_ERR,
+                                "Failed to tag {}:{} to latest".format(image_rep, image_ver))
+                    else:
+                        syslog.syslog(syslog.LOG_INFO,
+                                "Successfully tag {}:{} to latest".format(image_rep, image_ver))
+                        feat_status = os.popen("docker inspect {} |jq -r .[].State.Running".format(feat)).read().strip()
+                        if feat_status:
+                            if feat_status == 'true':
+                                os.system("docker stop {}".format(feat))
+                                syslog.syslog(syslog.LOG_ERR,
+                                        "{} should not run, stop it".format(feat))
+                            os.system("docker rm {}".format(feat))
+                            syslog.syslog(syslog.LOG_INFO,
+                                    "Delete previous {} container".format(feat))
+                else:
+                    syslog.syslog(syslog.LOG_ERR,
+                            "Failed to docker images |grep {} to get image repo".format(image_id))
+            else:
+                syslog.syslog(syslog.LOG_ERR,
+                        "Failed to inspect container:{} to get image id".format(docker_id))
+    else:
+        server.mod_db_entry(STATE_DB_NAME,
+                FEATURE_TABLE, feat, {"tag_latest": "true"})
+        res = 0
+    if res:
+        log_debug("failed to tag {}:{} to latest".format(feat, image_ver))
+    else:
+        log_debug("successfully tag {}:{} to latest".format(feat, image_ver))
+
+    return res


 #
 # Feature changes
 #
```
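`tag_latest_image` derives the 12-character short image id from the `docker inspect` output before tagging. That parsing step, isolated as a hypothetical helper (the input string mimics what `docker inspect <id> | jq -r .[].Image` prints):

```python
def short_image_id(inspect_image):
    """Hypothetical helper: extract the 12-char short id that
    `docker images` prints from a full "sha256:<digest>" string."""
    inspect_image = inspect_image.strip()
    if ":" not in inspect_image:
        return None  # mirrors the "Failed to inspect container" error path
    # Take the hex digest after "sha256:" and truncate to 12 chars.
    return inspect_image.split(":")[1][:12]

print(short_image_id("sha256:4f53cda18c2baa0c4e4b4f6a8d5e6c7a8b9d0e1f2a3b4c5d6e7f8091a2b3c4d\n"))
# → 4f53cda18c2b
```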
```diff
@@ -523,6 +575,19 @@ def on_state_update(self, key, op, data):
         self.st_data[key] = _update_entry(dflt_st_feat, data)
         remote_state = self.st_data[key][ST_FEAT_REMOTE_STATE]

+        if (old_remote_state != remote_state) and (remote_state == "running"):
+            # Tag latest
+            start_time = datetime.datetime.now() + datetime.timedelta(
+                    seconds=remote_ctr_config[TAG_IMAGE_LATEST])
+            self.server.register_timer(start_time, tag_latest_image, (
+                self.server,
+                key,
+                self.st_data[key][ST_FEAT_CTR_ID],
+                self.st_data[key][ST_FEAT_CTR_VER]))
+
+            log_debug("try to tag latest label after {} seconds @{}".format(
+                remote_ctr_config[TAG_IMAGE_LATEST], start_time))
+
         if (not init) and (
                 (old_remote_state == remote_state) or (remote_state != "pending")):
             # no change or nothing to do.
```
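The tag timer is armed only on the edge transition into "running", not on every update while the feature is already running. A sketch of that guard plus the deadline arithmetic, with the default wait from `remote_ctr_config` above (the helper names are hypothetical):

```python
import datetime

TAG_IMAGE_LATEST_SECONDS = 30  # default wait from remote_ctr_config above

def should_schedule_tag(old_remote_state, remote_state):
    # Fire only when the state *changes to* "running"; repeated
    # "running" updates must not re-arm the timer.
    return old_remote_state != remote_state and remote_state == "running"

def tag_deadline(now, wait=TAG_IMAGE_LATEST_SECONDS):
    # The timer is registered for now + wait, as in on_state_update.
    return now + datetime.timedelta(seconds=wait)

now = datetime.datetime(2022, 7, 1, 12, 0, 0)
print(should_schedule_tag("pending", "running"), tag_deadline(now))
# → True 2022-07-01 12:00:30
```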
```diff
@@ -84,7 +84,7 @@ def _run_command(cmd, timeout=5):

 def kube_read_labels():
     """ Read current labels on node and return as dict. """
-    KUBECTL_GET_CMD = "kubectl --kubeconfig {} get nodes {} --show-labels |tr -s ' ' | cut -f6 -d' '"
+    KUBECTL_GET_CMD = "kubectl --kubeconfig {} get nodes {} --show-labels --no-headers |tr -s ' ' | cut -f6 -d' '"

     labels = {}
     ret, out, _ = _run_command(KUBECTL_GET_CMD.format(
```
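The kubectl pipeline above cuts out the sixth column, LABELS, as a single `k1=v1,k2=v2,...` string; adding `--no-headers` keeps the header row out of that output. The downstream parsing is not part of this hunk; a hypothetical sketch of turning the LABELS column into a dict:

```python
def parse_label_column(column):
    """Hypothetical parser for kubectl's LABELS column ("k1=v1,k2=v2")."""
    labels = {}
    if column and column.strip() != "<none>":
        for item in column.strip().split(","):
            if "=" in item:
                k, v = item.split("=", 1)  # values may themselves contain '='
                labels[k] = v
    return labels

print(parse_label_column("sonic_version=20201231.01,deployed=true\n"))
# → {'sonic_version': '20201231.01', 'deployed': 'true'}
```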
```diff
@@ -332,12 +332,12 @@ def _do_reset(pending_join = False):


 def _do_join(server, port, insecure):
-    KUBEADM_JOIN_CMD = "kubeadm join --discovery-file {} --node-name {} --apiserver-advertise-address {}"
+    KUBEADM_JOIN_CMD = "kubeadm join --discovery-file {} --node-name {}"
     err = ""
     out = ""
     ret = 0
     try:
-        local_ipv6 = _get_local_ipv6()
+        #local_ipv6 = _get_local_ipv6()
+        #_download_file(server, port, insecure)
         _gen_cli_kubeconf(server, port, insecure)
         _do_reset(True)
```

**Collaborator** (on the commented-out lines): Delete dead code. #Closed
```diff
@@ -349,7 +349,7 @@ def _do_join(server, port, insecure):

         if ret == 0:
             (ret, out, err) = _run_command(KUBEADM_JOIN_CMD.format(
-                KUBE_ADMIN_CONF, get_device_name(), local_ipv6), timeout=60)
+                KUBE_ADMIN_CONF, get_device_name()), timeout=60)
             log_debug("ret = {}".format(ret))

     except IOError as e:
```
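With `--apiserver-advertise-address` dropped, the join template takes only two placeholders, so the `format()` call site no longer passes `local_ipv6`. A quick check of the rendered command (the kubeconfig path and device name here are hypothetical values, only to show the final string):

```python
KUBEADM_JOIN_CMD = "kubeadm join --discovery-file {} --node-name {}"

# Hypothetical values, just to show the rendered command string.
cmd = KUBEADM_JOIN_CMD.format("/etc/sonic/kube_admin.conf", "sonic-lab-dev-1")
print(cmd)
# → kubeadm join --discovery-file /etc/sonic/kube_admin.conf --node-name sonic-lab-dev-1
```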
**Collaborator:** Do not just comment code, remove them if they are dead. #Closed