@JamesMcDermott JamesMcDermott commented Dec 10, 2024

Implements #615; documentation update in docs PR #204

  • incoming ReadinessGates and Conditions are now saved in PackageRevision creation flows

    • this includes init, clone, and edit operations
      • API-based init, clone, and edit operations (Create) will merge in any Conditions and ReadinessGates passed in the request
        • any passed in the API request are added to the defaults/existing ones
        • where one passed in the API request matches a default/existing one, the passed-in one will OVERRIDE it
        • currently NO default Conditions/ReadinessGates are set - an artifact of a decision reconsidered in response to a review comment
          • the mechanism remains in place in case we later need to set some by default
  • a successful pipeline run (defined as 1 or more runs in the same operation of the generictaskhandler.go::applyResourceMutations() function) sets the PackagePipelinePassed condition:

    • to False before starting
    • to True on successful completion
      • the patch operations for this appear in Git commit history, but NOT in the PackageRevision's task list to avoid polluting it
    • this effectively prevents the PackageRevision from being proposed using porchctl rpkg propose while the pipeline is running
      • this validation is moved from porchctl into the API server to lock out API-based propose operations as well
        • the validation now also checks which Conditions are unmet and lists them to aid in resolving readiness situations
    • UNLESS the operation is an update that does not change package resources
      • in that case, the pipeline is not run and, to save on expensive Git pushes, the PackagePipelinePassed condition is NOT set to False
  • PackageVariant controller now sets its own PVOperationsComplete condition and readiness gate on PackageRevisions it manages

    • and manages this Condition independently to allow it to lock out propose operations while the controller is performing multiple operations
  • PackageVariant controller can now handle optimistic concurrency conflicts when making its final (per-reconcile) status update to a PackageVariant

  • refactors internal conversion of Kptfiles to YAML since readiness condition information is stored in the package Kptfile

    • unifies all cases to the same kyaml/yaml-based method (KptFile.ToYamlString() and ToYamlString(*fn.KubeObject))
    • this ensures consistency in the Kptfile YAML (indentation, field order etc.)
    • and reduces the chances of Git conflicts when setting and updating readiness conditions
  • adds more info to error message in case of Git conflict when applying a patch

  • renderPackageMutation mutation now has its own (arbitrary string) task type "render", to better distinguish it in task list and Git commit history

  • CLI E2E tests now give more information on failure

    • pointing out config.yaml file containing command that produced unexpected result and 0-based index of the failed command in that file
  • CLI E2E tests now fail on first test failure to allow easier identification of the failed test's output

  • CLI E2E tests now respect a CLEANUP_ON_FAIL environment variable - if run with CLEANUP_ON_FAIL='false' and a CLI test fails, the suite will not clean that test's namespace off the Kind cluster, allowing for easier inspection of the state at time of failure

  • some refactoring/renaming for readability

  • fixed typo in Makefile comment
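
For readers following the merge/override and propose-validation behaviour described above, here is a minimal sketch of how it could look. It is not the code in this PR: the Condition and ReadinessGate types are simplified stand-ins for the Porch API types, and the function names mergeConditions and unmetReadinessGates are hypothetical.

```go
package main

import "fmt"

// Condition and ReadinessGate are simplified stand-ins for the Porch API
// types (assumption: only the fields needed for this sketch are modelled).
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

type ReadinessGate struct {
	ConditionType string
}

// mergeConditions (hypothetical name) returns the existing conditions with the
// incoming ones merged in. An incoming condition replaces an existing one of
// the same Type; incoming conditions with a new Type are appended.
func mergeConditions(existing, incoming []Condition) []Condition {
	merged := append([]Condition(nil), existing...)
	index := make(map[string]int, len(merged))
	for i, c := range merged {
		index[c.Type] = i
	}
	for _, c := range incoming {
		if i, ok := index[c.Type]; ok {
			merged[i] = c // request value overrides default/existing value
			continue
		}
		index[c.Type] = len(merged)
		merged = append(merged, c)
	}
	return merged
}

// unmetReadinessGates (hypothetical name) lists the gate condition types whose
// condition is missing or has a status other than "True".
func unmetReadinessGates(gates []ReadinessGate, conds []Condition) []string {
	status := make(map[string]string, len(conds))
	for _, c := range conds {
		status[c.Type] = c.Status
	}
	var unmet []string
	for _, g := range gates {
		if status[g.ConditionType] != "True" {
			unmet = append(unmet, g.ConditionType)
		}
	}
	return unmet
}

func main() {
	existing := []Condition{{Type: "PackagePipelinePassed", Status: "False"}}
	incoming := []Condition{
		{Type: "PackagePipelinePassed", Status: "True"},
		{Type: "PVOperationsComplete", Status: "False"},
	}
	conds := mergeConditions(existing, incoming)
	gates := []ReadinessGate{
		{ConditionType: "PackagePipelinePassed"},
		{ConditionType: "PVOperationsComplete"},
	}
	// A propose request would be rejected while this list is non-empty.
	fmt.Println(unmetReadinessGates(gates, conds)) // prints [PVOperationsComplete]
}
```

Returning the list of unmet condition types, rather than a bare boolean, is what would let a rejected propose tell the user which conditions still need to become True.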

efiacor commented Dec 10, 2024

/test presubmit-nephio-go-test

@kispaljr kispaljr left a comment

I welcome this PR, and wholeheartedly agree with the general direction. :)
These are my initial comments, but could you give me also a couple more days to understand the logic a bit better?

return repoPkgRev, renderStatus, nil
}

func pushPipelineReadinessGate(ctx context.Context, repo repository.Repository, repoPr repository.PackageRevision) error {

@liamfallon liamfallon Feb 4, 2025

This makes the engine handling of drafts even more complex but there is no other way to do this with the current code structure. Let's refactor engine.go in Nephio R5.

@liamfallon liamfallon left a comment

/approve

Very comprehensive work @JamesMcDermott

nephio-prow bot commented Feb 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JamesMcDermott, liamfallon

@nephio-prow nephio-prow bot added the approved label Feb 4, 2025
JamesMcDermott added a commit to Nordix/nephio-docs that referenced this pull request Feb 10, 2025
- new document describing operation of readiness gates on PackageRevisions
  - in general (no existing document for it)
  - specifically the changes introduced in
    nephio-project/porch#156
    - including default readiness gates managed by Porch server for all
      PackageRevisions and by package variant controller for PackageRevisions
      controlled by a PackageVariant
- new diagram to illustrate flows

nephio-project/nephio#615
nephio-project/porch#156
@kispaljr

I read through the latest code changes. All my previous comments are now resolved.
I focused my review on the pkg/task package, and that seems excellent to me.

All in all, it looks good to me; from my side it can be merged.

@kispaljr kispaljr left a comment

Also, as we already discussed on Slack, it would make Porch's behavior more consistent if the readinessGates and conditions were also merged from the PackageRevision objects that are created with an edit task (similar to how you implemented it for clone and init), but that is not mandatory. This pull request can be merged without that.

efiacor commented Feb 25, 2025

Need to run it through a few e2e runs. Hit an issue on the free5gc suite. Will run again and collect data if it's a recurring issue.

- alleviates issue where config injection from PackageVariant
  resulted in both PackageVariant controller detecting a Kptfile
  change (and updating the PackageRevision with a new patch task)
  and the API server NOT detecting any Kptfile change, resulting
  in the server attempting to apply a nil patch mutation and
  massive panics

nephio-project/nephio#615
- adjustments to usage of Kubernetes IO annotation
  "internal.config.kubernetes.io/path" to keep it present
  consistently and avoid it showing up as a diff in update
  operations
- CLI E2E tests now respect a CLEANUP_ON_FAIL variable -
  if run with CLEANUP_ON_FAIL='false' and a CLI test fails,
  the suite will not clean that test's namespace off the Kind
  cluster, allowing for easier inspection of the state at
  time of failure

nephio-project/nephio#615
- in free5gc suite, one deployment package requires Kubernetes
  IO annotation "internal.config.kubernetes.io/index" before
  Nephio approval controller will take it in hand
  - adjusted usage to avoid removing that annotation as well
  - looks like we get away with clearing the legacy
    annotations, at least
- when updating package revision resources, skip closing
  pipeline readiness gate if no actual resource change
  (new resources are equal to old resources)
- add text to update where we commit/push to close pipeline
  readiness gate before render - better visibility in Git
  history

nephio-project/nephio#615
- pushing code only
- tests likely to fail; adjustments pending confirmation that
  this still works with free5gc suite
- PackagePipelinePassed readiness condition is now set to
  "True" in genericTaskHandler, at the last possible moment
  before returning to cadEngine
  - unified approach for consistency
  - allows us to be sure it covers all mutations in any given
    create/update operation as some operations/packages may
    include (for example) multiple render mutations
  - also means lock/unlock commits show up as distinct commits
    in Git history
    - but specifically excluded them from task lists
- edit and clone operations now include default readiness
  gates and conditions
- gave render mutation its own (arbitrary string) task type
  "render", to better distinguish it in task list and Git
  commit history

nephio-project/nephio#615
- final render mutation should now run only on actual changes
  to resources
  - and not run on empty updates, preserving state of task list

nephio-project/nephio#615
- further testing revealed annotations were a red herring
  - real issue lay in readiness-info patches not triggering
    Kpt package render operations

nephio-project/nephio#615
- refactored package revision variable names in edit mutation
  - had old and new the wrong way round
- removed PackagePipelinePassed from default conditions
  - left machinery in place for default conditions/readiness gates
    for future use
- now setting PackagePipelinePassed only at point of pipeline render
  - determined by:
    - presence of pipeline element in Kptfile, AND
    - presence of rendering task (type == "render" or image == "render")
      in package revision's task list

nephio-project/nephio#615
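
The two-part check described in the commit message above (pipeline element present in the Kptfile AND a rendering task present in the task list) could be sketched roughly as follows. This is an illustration only: the Kptfile, Pipeline, Function and Task types are simplified stand-ins, the function names are hypothetical, and treating an empty pipeline as "no pipeline" is an assumption of this sketch.

```go
package main

import "fmt"

// Simplified stand-ins for the Kptfile pipeline and Porch task types
// (assumption: only the fields used by the check are modelled).
type Function struct {
	Image string
}

type Pipeline struct {
	Mutators   []Function
	Validators []Function
}

type Kptfile struct {
	Pipeline *Pipeline
}

type Task struct {
	Type  string
	Image string
}

// hasPipeline reports whether the Kptfile declares at least one function.
// Assumption: a pipeline element with no functions counts as no pipeline.
func hasPipeline(kf Kptfile) bool {
	return kf.Pipeline != nil &&
		(len(kf.Pipeline.Mutators) > 0 || len(kf.Pipeline.Validators) > 0)
}

// hasRenderTask reports whether the task list contains a rendering task,
// identified by type == "render" or image == "render".
func hasRenderTask(tasks []Task) bool {
	for _, t := range tasks {
		if t.Type == "render" || t.Image == "render" {
			return true
		}
	}
	return false
}

// shouldSetPipelineCondition (hypothetical name) combines both checks.
func shouldSetPipelineCondition(kf Kptfile, tasks []Task) bool {
	return hasPipeline(kf) && hasRenderTask(tasks)
}

func main() {
	withPipeline := Kptfile{Pipeline: &Pipeline{
		Mutators: []Function{{Image: "example.invalid/set-labels"}}, // placeholder image name
	}}
	tasks := []Task{{Type: "clone"}, {Type: "render"}}

	fmt.Println(shouldSetPipelineCondition(withPipeline, tasks))           // true
	fmt.Println(shouldSetPipelineCondition(Kptfile{}, tasks))              // false
	fmt.Println(shouldSetPipelineCondition(withPipeline, []Task{{Type: "clone"}})) // false
}
```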
- was dependent on debugging to find and fix an issue in a test artefact
  - see nephio-project/catalog#110
- resolved performance degradation in packagevariant_controller
  - had added too many calls to re-read package revisions or resources
    - should now have it pared down to the minimum needed to work OK
- detection of pipeline now works as intended to set the "waiting
  for pipeline run" readiness condition only if the package has a
  pipeline that will do anything in the render

nephio-project/nephio#615
@JamesMcDermott JamesMcDermott force-pushed the pv-pipeline-readiness-gates branch from 62c015f to b52b571 Compare May 30, 2025 09:16

JamesMcDermott commented May 30, 2025

Other work has become higher-priority - putting this issue on hold. I've pushed the additional changes that have it working (at least for the free5gc tests).

Remaining items to do:

  • Debug/investigate:
    • in some cases when switching the package variant's readiness condition (PVOperationsComplete), porch-server doesn't give the Git patch the right custom task type/commit message
      • minor issue, but makes flow less comprehensible from downstream's Git history
      • need to debug - is there some place where a patch task/mutation is created where we don't check if it's setting the readiness gate?
        • or some issue in the Boolean logic in the UpdateOnlySetsReadinessConditions check function?
    • issue where Porch forgets package revisions
      • appears in free5gc tests, interfering with package variants creating revisions, not sure about other tests
      • seems to occur intermittently for different catalog repos on GitHub, not sure about others
      • deleting and recreating affected repo seems to bring back the revisions
        • but may result in package variants picking up erroneous changes and creating new package revisions
      • something in packagerev metadata handling? am I correct in thinking everything existential goes through packagerevs?
      • don't think it's my changes as I didn't do anything involving caches or metadata
    • issue when updating package variant in conjunction with long-running kpt pipeline
      • appears that:
        • when packagevariant_controller updates resources to patch in the new stuff
          • server renders the pipeline, hanging controller while it waits
            • renders multiple times because of the
        • meanwhile, more reconcile attempts... begin? are scheduled? Kubernetes stuff...
          • and all pile up behind the waiting one
            • with out-of-date resources? How?
          • and when the waiting one finishes they all go at once
            • which is fine as long as they don't have any differing data
            • and if one does, it pushes it into the package revision resources
              • resulting in more renders
              • and near-livelock?
                • busywork for porch-server, anyway
      • Will hopefully be alleviated by async work - less chance of an update getting stuck
        • and we can make packagevariant_controller check the revision's readiness conditions
        • if it has a pipeline condition set to False, kick the can down the road to the next reconcile
  • Fix:
    • golangci-lint issues - should be simple relative to the others
    • failing unit tests (make test) - should be mostly re-aligning test data, but a couple of panics
    • failing E2E tests (make test-e2e) - currently 1 failure; may be just a flake
    • failing E2E CLI tests (make test-e2e-cli) - align expected test responses for propose behaviour where it lists unmet readiness conditions
    • Sonar issues, if present - currently no data as the Sonar scan relies on running the (currently-broken) unit tests
    • Note: aim for minimal additional code changes! Test changes only wherever possible!
  • Add:
    • additional E2E tests to check readiness gates on packages with and without pipelines
  • Run:
    • all automated runs on the PR
      • and repeat locally to be sure!
    • a simple package variant against a local environment
      • and confirm Git history in repo with resulting downstream package revision contains all readiness-gate/condition switches
        • PVOperationsComplete coming from packagevariant_controller (included in initial clone and switched to true after clone and patch)
        • PackagePipelinePassed before and after running pipeline (render-type task/mutation/commit)
    • test-infra free5gc and oai suites
      • one last time to be absolutely sure not broken or degraded again
  • Final review from community
    • ensure most recent comments from @efiacor (here) and @kispaljr (offline) are addressed (and responded to)

@JamesMcDermott

/hold

nephio-prow bot commented Jun 4, 2025

@JamesMcDermott: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
presubmit-nephio-go-test | dff35bd | link | true | /test presubmit-nephio-go-test
