Add workflow to detect stuck submodule update PRs #590

Open
hdwhdw wants to merge 3 commits into sonic-net:master from hdwhdw:check-submodule-update
@hdwhdw (Contributor) commented Feb 25, 2026

Why I did it

The mssonicbld bot automatically creates submodule update PRs in sonic-buildimage, but these PRs can sit stuck for days or weeks without anyone noticing (e.g. because of CI failures or merge conflicts). This delays the rollout of merged fixes to the build.

For example, sonic-net/sonic-buildimage#25285 has been open since Jan 31 (over a month), blocking multiple merged commits from reaching the build.

How I did it

Added a GitHub Actions workflow (.github/workflows/check-submodule-update.yml) that:

  • Runs on weekdays at 8AM UTC (and supports manual trigger)
  • Searches for open mssonicbld submodule update PRs in sonic-buildimage
  • If any PR has been open longer than a configurable threshold (default 96 hours), files an issue in sonic-gnmi with the submodule-stuck label
  • Avoids duplicate issues by checking for existing open issues referencing the same PR
  • Only runs in repositories owned by the sonic-net organization (skipped on forks)
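
The detection and de-duplication steps in the list above can be sketched in Python as follows. This is only an illustration, not the workflow's actual code: it assumes PRs and issues are plain dictionaries shaped like GitHub REST API results (`html_url`, `created_at`, `body`), and the function names `find_stuck_prs` and `needs_new_issue` are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_github_time(value):
    """Parse a GitHub-style UTC timestamp such as '2026-01-31T00:00:00Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def find_stuck_prs(prs, threshold_hours=96, now=None):
    """Return the PRs that have been open longer than threshold_hours."""
    now = now or datetime.now(timezone.utc)
    cutoff = timedelta(hours=threshold_hours)
    return [pr for pr in prs if now - parse_github_time(pr["created_at"]) > cutoff]


def needs_new_issue(pr, open_issues):
    """Return True if no open issue body already references this PR's URL.

    The negative lookahead keeps '.../pull/2528' from matching inside
    '.../pull/25285'.
    """
    pattern = re.compile(re.escape(pr["html_url"]) + r"(?!\d)")
    return not any(pattern.search(issue.get("body") or "") for issue in open_issues)
```

A caller would first filter with `find_stuck_prs`, then file an issue only for the PRs where `needs_new_issue` returns True. Matching on the PR URL is one possible way to detect an existing issue; the workflow may use a different key.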

Assignees and threshold are read from .github/submodule-watchers.json so the policy can be updated without modifying the workflow.
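
As a rough illustration of reading that policy file, the Python sketch below parses the JSON and falls back to the 96-hour default. The key names `assignees` and `threshold_hours` follow the commit message; the rest of the schema, the validation rule, and the function name `load_watchers` are assumptions.

```python
import json

# Default stated in the PR description.
DEFAULT_THRESHOLD_HOURS = 96


def load_watchers(text):
    """Parse the watcher policy JSON; missing keys fall back to defaults."""
    data = json.loads(text)
    assignees = list(data.get("assignees", []))
    threshold_hours = int(data.get("threshold_hours", DEFAULT_THRESHOLD_HOURS))
    if threshold_hours <= 0:
        raise ValueError("threshold_hours must be positive")
    return assignees, threshold_hours
```

The function takes the file contents rather than a path, so it can be called with the text of `.github/submodule-watchers.json` or with an inline string in a test.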

How to verify it

  1. Trigger the workflow manually via workflow_dispatch
  2. Verify it detects the currently stuck PRs (#25285, #25432) and files issues

Description for the changelog

Add scheduled workflow to detect and alert on stuck sonic-gnmi submodule update PRs in sonic-buildimage.

Copilot AI review requested due to automatic review settings February 25, 2026 17:25
@mssonicbld
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

This PR adds automation to detect and file issues for “stuck” sonic-buildimage submodule-update PRs, and also changes Redis client initialization behavior in sonic_data_client to skip databases whose address/socket cannot be resolved (instead of failing fast).

Changes:

  • Add a scheduled GitHub Actions workflow that searches for long-open submodule update PRs in sonic-buildimage and files an issue in sonic-gnmi.
  • Update Redis client initialization (TCP test path and default unix-socket path) to log warnings and continue when DB address/socket lookup fails.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| sonic_data_client/db_client.go | Changes init/TCP-client setup to “skip on error” when DB socket/address lookup fails. |
| .github/workflows/check-submodule-update.yml | Introduces scheduled workflow to detect and report stuck submodule-update PRs. |

Comment on lines +563 to +564

    log.Warningf("Skipping %s in namespace %s: %v", dbName, dbNamespace, err)
    continue

Copilot AI Feb 25, 2026

Changing GetDbTcpAddr failure handling from return err to continue can leave Target2RedisDb[namespace][dbName] unset. Much of db_client.go later assumes the redis client is non-nil (e.g., TableData2Msi reads it without an ok check) and will panic if a skipped DB is later used (including via V2R mappings that bypass the presence check in populateDbtablePath). Consider either keeping fail-fast behavior for required DBs, or consistently guarding all Target2RedisDb[...] lookups (returning a clear error) so skipping cannot lead to a nil dereference at runtime.

Suggested change

    - log.Warningf("Skipping %s in namespace %s: %v", dbName, dbNamespace, err)
    - continue
    + log.Warningf("Failed to get TCP address for %s in namespace %s: %v", dbName, dbNamespace, err)
    + return err

Copilot uses AI. Check for mistakes.
Comment on lines +592 to +595

      addr, err := sdcfg.GetDbSock(dbName, dbNamespace)
      if err != nil {
    -     log.Errorf("init error: %v", err)
    -     return
    +     log.Warningf("Skipping %s in namespace %s: %v", dbName, dbNamespace, err)
    +     continue

Copilot AI Feb 25, 2026

With the new continue behavior on GetDbSock errors, init() may partially populate Target2RedisDb and leave some DB clients missing. Since several call sites in this file index into Target2RedisDb without checking presence, this can turn a config/init problem into a later nil-pointer panic. Recommend either failing initialization for mandatory DBs or adding presence checks (and error returns) at all Target2RedisDb access points to make skipped DBs safe.

      issues: write

    jobs:
      check:

Copilot AI Feb 25, 2026

This workflow hard-codes actions that only make sense in the upstream repo (searching sonic-net/sonic-buildimage and creating issues in sonic-net/sonic-gnmi). Other workflows in this repo guard against running on forks (e.g., .github/workflows/semgrep.yml uses if: github.repository_owner == 'sonic-net'). Add a similar if: guard to this job to prevent failures/noisy runs when triggered from forks or non-sonic-net repos.

Suggested change

    -   check:
    +   check:
    +     if: github.repository_owner == 'sonic-net'

Add a scheduled GitHub Actions workflow that checks sonic-buildimage
for sonic-gnmi submodule update PRs that have been open longer than
96 hours. When found, it files an issue in sonic-gnmi with details
about the stuck PR to ensure visibility and prompt investigation.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@hdwhdw force-pushed the check-submodule-update branch from e1d151d to e6f1c7c on February 25, 2026 17:32
@mssonicbld
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@hdwhdw closed this Mar 3, 2026
@hdwhdw reopened this Mar 4, 2026
@mssonicbld
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Load assignees and threshold_hours from .github/submodule-watchers.json
so the policy can be updated without modifying the workflow itself.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Contributor

/azp run

@hdwhdw changed the title from "Check submodule update" to "Add workflow to detect stuck submodule update PRs" on Mar 5, 2026
@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
