GCP-198: add PSC endpoint controller for customer-side management#7603
Conversation
Add a new controller in the control-plane-operator that manages PSC endpoints in the customer's GCP project. This controller: - Reserves static internal IP addresses for PSC endpoints - Creates ForwardingRules targeting the management-side Service Attachment - Handles cleanup of GCP resources on deletion with proper finalizers - Waits for WIF token availability before adding finalizers - Implements proper error handling with GCP-specific retry logic - Adds 30-second timeout to all GCP API calls to prevent hung reconcilers - Propagates GCPEndpointAvailable condition to HostedCluster status The controller runs in the hosted control plane namespace and uses Workload Identity Federation credentials to authenticate with the customer's GCP project. Signed-off-by: Cristiano Veiga <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)
WalkthroughAdds a new GCP Private Service Connect (PSC) endpoint controller with tests, wires it into the control-plane manager, and updates HostedCluster reconciliation to aggregate PSC-related status conditions; a duplicate test declaration was introduced in the patch. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
|
Skipping CI for Draft Pull Request. |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
@cristianoveiga: This pull request references GCP-198 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In
`@control-plane-operator/controllers/gcpprivateserviceconnect/psc_endpoint_controller.go`:
- Around line 354-362: The code uses
hcp.Spec.Platform.GCP.NetworkConfig.Network.Name directly when building the
ForwardingRule; add the same nil checks used for the subnet to validate
Platform, GCP, NetworkConfig and Network before accessing Network.Name (similar
to the existing ensureIPAddress/subnet checks), and if any are nil return an
appropriate error (or requeue) rather than dereferencing; locate the usage in
psc_endpoint_controller.go where endpoint is constructed and call
r.constructNetworkURL only after confirming
hcp.Spec.Platform.GCP.NetworkConfig.Network != nil (and parent fields present).
- Around line 252-256: The code reads
hcp.Spec.Platform.GCP.NetworkConfig.PrivateServiceConnectSubnet.Name into
pscSubnet without checking for nil; update the logic in the block that sets
pscSubnet (and related code in setFromHCP if applicable) to first verify that
hcp.Spec.Platform.GCP.NetworkConfig != nil and
hcp.Spec.Platform.GCP.NetworkConfig.PrivateServiceConnectSubnet != nil, and
return a clear error (or handle defaulting) if either is nil before accessing
.Name; reference the symbols hcp, setFromHCP, NetworkConfig,
PrivateServiceConnectSubnet, and pscSubnet to locate and fix the checks.
- Around line 138-166: When handling deletion of gcpPSC (the block checking
gcpPSC.DeletionTimestamp and pscEndpointFinalizer), ensure the GCP client
builder is initialized before calling r.gcpClientBuilder.getClient(ctx): call
r.initializeWithHCP(ctx, log) (or the method that seeds the builder from the
HostedControlPlane) at the start of the deletion branch, handle its error
(return or requeue if initialization fails), then call
r.gcpClientBuilder.getClient(ctx) and proceed to r.reconcileDelete(...) as
before; this prevents "client not initialized" errors and avoids orphaning GCP
resources if the controller restarts during deletion.
🧹 Nitpick comments (2)
control-plane-operator/controllers/gcpprivateserviceconnect/psc_endpoint_controller.go (2)
104-119: Controller setup works but has unconventional client assignment.Setting
r.Client = mgr.GetClient()at line 116 afterBuild()works but is atypical. Usually the client is injected when the reconciler is constructed inmain.go(as done with other controllers in this codebase). Consider aligning with the pattern used byGCPPrivateServiceObserverwhere the client is passed during struct initialization.
554-581: Unused parameters in InitCustomerGCPClient.The
customerProjectandcontrolPlaneOperatorGSAparameters are never used in the function body. The function relies solely onGOOGLE_APPLICATION_CREDENTIALS. Either remove these parameters or add a TODO explaining their intended future use.♻️ Option 1: Remove unused parameters
-func InitCustomerGCPClient(ctx context.Context, customerProject string, controlPlaneOperatorGSA string) (*compute.Service, error) { +func InitCustomerGCPClient(ctx context.Context) (*compute.Service, error) {Then update the caller at line 94:
- return InitCustomerGCPClient(ctx, b.customerProject, b.controlPlaneOperatorGSA) + return InitCustomerGCPClient(ctx)♻️ Option 2: Add TODO for future use
func InitCustomerGCPClient(ctx context.Context, customerProject string, controlPlaneOperatorGSA string) (*compute.Service, error) { + // TODO: customerProject and controlPlaneOperatorGSA will be used for explicit + // credential configuration when RolesRef support is added. Currently unused + // as we rely on GOOGLE_APPLICATION_CREDENTIALS. credentialsFile := os.Getenv("GOOGLE_APPLICATION_CREDENTIALS")
Handle the case where the controller restarts during deletion. The in-memory client builder state is lost on restart, so we need to reinitialize from HCP before attempting cleanup. If the HCP is already deleted, log a debug message and continue with cleanup attempt (finalizer will still be removed following AWS pattern). Signed-off-by: Cristiano Veiga <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@control-plane-operator/controllers/gcpprivateserviceconnect/psc_endpoint_controller.go`:
- Around line 97-103: The shared gcpClientBuilder stored on
GCPPrivateServiceConnectReconciler causes races across concurrent reconciles;
remove the reconciler-level gcpClientBuilder field and instead construct a fresh
gcpClientBuilder inside the reconcile flow (e.g., at the start of Reconcile or
right after getHostedControlPlane returns) so each reconciliation uses its own
builder tied to that HCP/project; update any uses of the
reconciler.gcpClientBuilder (and its initialization guard) to use the local
builder variable and ensure both normal and deletion paths use the per-request
builder.
|
/unassign @bryan-cox |
Reorder imports to satisfy gci linter: stdlib, third-party, then local project imports. Signed-off-by: Cristiano Veiga <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)
|
/retest |
|
/uncc @bryan-cox |
cblecker
left a comment
There was a problem hiding this comment.
Review Summary: GCP PSC Endpoint Controller
This PR adds a well-structured PSC endpoint controller following established patterns from the AWS PrivateLink implementation. The test coverage for helper functions is good, and the condition propagation to HostedCluster is cleanly implemented.
I've left inline comments comparing specific patterns to the AWS implementation in control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go. A few items warrant attention before merging.
Open Questions:
-
Mutex protection (line 68-73): With
MaxConcurrentReconciles: 10, is there a reason to omit mutex protection ongcpClientBuilder? AWS usessync.Mutexfor the equivalent struct. -
HCP watching (line 111-118): AWS watches HCP for endpoint access changes. Is this intentionally omitted for GCP, or should it be added?
-
Periodic requeue (line 236): AWS requeues every 5 minutes to catch out-of-band changes. Should GCP do the same for drift detection?
-
Endpoint adoption: AWS checks for existing VPC endpoints and adopts them (lines 617-646). Should GCP handle adoption of existing ForwardingRules/addresses?
-
TODO tracking (line 87): Is there a Jira for completing the
controlPlaneOperatorGSAextraction?
|
|
||
| func (b *gcpClientBuilder) initializeWithHCP(hcp *hyperv1.HostedControlPlane) { | ||
| if !b.initialized { | ||
| b.setFromHCP(hcp) | ||
| b.initialized = true | ||
| } |
There was a problem hiding this comment.
Race condition risk with concurrent reconciles
The gcpClientBuilder struct is accessed by multiple goroutines (MaxConcurrentReconciles: 10) but lacks mutex protection.
The AWS PrivateLink controller's equivalent clientBuilder uses a sync.Mutex:
// control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go:199-205
type clientBuilder struct {
mu sync.Mutex
initialized bool
// ...
}Consider adding mutex protection to prevent race conditions when multiple reconciles call initializeWithHCP() concurrently.
There was a problem hiding this comment.
The AWS implementation has some historical patterns that may not be optimal today. As noted in a previous review (PR #7207), the AWS controller used separate informers because the manager cache was previously scoped to the cluster - that's no longer the case since the cache is now scoped to the control plane namespace.
Similarly, the mutex in AWS's clientBuilder may be defensive for patterns that don't apply to our simpler GCP implementation:
- The CPO runs per-HCP namespace with a single GCPPrivateServiceConnect CR
- The GCP project and region are immutable fields
- Even concurrent reconciles for the same CR would set identical values
I would like to keep it simple without mutex for now. We can add it if the design evolves to require it.
| // Use service attachment name as base - it's unique and within GCP naming limits | ||
| return fmt.Sprintf("%s-ip", gcpPSC.Status.ServiceAttachmentName) | ||
| } | ||
|
|
There was a problem hiding this comment.
Potential invalid resource names if ServiceAttachmentName is empty
If gcpPSC.Status.ServiceAttachmentName is empty (before management-side populates it), this returns -endpoint and constructIPAddressName returns -ip, which would likely fail GCP API validation.
While isServiceAttachmentReady() should prevent this path, adding defensive validation here would be safer:
func (r *GCPPrivateServiceConnectReconciler) constructEndpointName(gcpPSC *hyperv1.GCPPrivateServiceConnect) string {
if gcpPSC.Status.ServiceAttachmentName == "" {
return "" // or panic, depending on how you want to handle this
}
return fmt.Sprintf("%s-endpoint", gcpPSC.Status.ServiceAttachmentName)
}There was a problem hiding this comment.
Thanks for the suggestion. I considered this, but I think we can keep it simple here:
- The management-side controller populates ServiceAttachmentURI and ServiceAttachmentName atomically in the same status update
- isServiceAttachmentReady() blocks the normal reconciliation flow until the URI and condition are set
- During deletion, if ServiceAttachmentName is somehow empty, the CR was never properly initialized - indicating a larger issue that should be fixed at the source
Adding defensive validation here would mask a root cause bug in the management-side controller. If this ever fails, the GCP API error (-endpoint invalid name) would clearly point to the issue for debugging.
Let me know if you feel strongly about adding it - happy to reconsider.
| }). | ||
| Build(r) | ||
| if err != nil { | ||
| return fmt.Errorf("failed setting up with a controller manager: %w", err) | ||
| } | ||
| r.Client = mgr.GetClient() | ||
|
|
||
| return nil |
There was a problem hiding this comment.
Question: Is HCP watching intentionally omitted?
AWS watches the HostedControlPlane for endpoint access changes:
// control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go:319
Watches(&hyperv1.HostedControlPlane{}, handler.Funcs{UpdateFunc: r.enqueueOnAccessChange(mgr)})This allows re-reconciling when certain HCP fields change. Is this not needed for GCP, or should it be added for parity?
There was a problem hiding this comment.
Intentionally omitted for now. The HCP watch in AWS is used to immediately react when EndpointAccess changes (to create/remove ExternalName services for private DNS routing).
I think it makes more sense add the HCP watch along with the DNS handling logic in the next PR (for https://issues.redhat.com/browse/GCP-199. I added a note on the Jira ticket about that)
Remove the controlPlaneOperatorGSA field from gcpClientBuilder and the corresponding parameter from InitCustomerGCPClient as it was never used. Signed-off-by: Cristiano Veiga <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)
Add periodic reconciliation every 5 minutes to detect out-of-band changes to GCP resources. This matches the AWS private link controller pattern and ensures the controller can identify and report on manual or external modifications to PSC endpoints. Signed-off-by: Cristiano Veiga <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)
Check the error return from logr.FromContext to maintain consistency with the AWS private link controller pattern. Signed-off-by: Cristiano Veiga <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
`@control-plane-operator/controllers/gcpprivateserviceconnect/psc_endpoint_controller.go`:
- Around line 307-310: The current checks only requeue when op.Status ==
"RUNNING", letting PENDING operations fall through and cause premature GETs;
update the three locations that inspect the operation (the blocks that log "IP
reservation in progress" and requeue using ctrl.Result{RequeueAfter: ...}) to
instead treat any non-DONE status as in-progress by replacing the equality check
with a non-DONE check (e.g., if op.Status != "DONE") so both "PENDING" and
"RUNNING" requeue; apply this change to each spot where the local op variable
and that log/requeue pattern appear in psc_endpoint_controller.go.
- Around line 179-188: The finalizer addition unconditionally waits for
isWIFTokenAccessible() which blocks when service-account-key credentials are
used; modify the logic so the WIF token gate runs only when WIF credentials are
expected. Update the block that adds pscEndpointFinalizer for gcpPSC to first
check whether WIF is in use (introduce a small predicate like isWIFEnabled() or
reuse credential-detection from InitCustomerGCPClient), and only call
isWIFTokenAccessible() and requeue when that predicate is true; otherwise add
the finalizer immediately and call r.Update(ctx, gcpPSC). Ensure you reference
and update the same control flow around controllerutil.ContainsFinalizer,
pscEndpointFinalizer, and isWIFTokenAccessible() so the check is conditional on
chosen credential type.
Update InitCustomerGCPClient comments to explicitly state that this controller requires Workload Identity Federation (WIF) credentials and does not support service account keys. This aligns with the finalizer gate that waits for WIF token availability. Signed-off-by: Cristiano Veiga <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@control-plane-operator/controllers/gcpprivateserviceconnect/psc_endpoint_controller.go`:
- Around line 458-461: The deletion path currently only treats op.Status ==
"RUNNING" as in-progress which misses other non-completed states like PENDING;
change the check in the deletion handling (where op and op.Status are inspected
and the log "PSC endpoint deletion in progress" is emitted) to treat any
non-DONE status as not completed (e.g., op != nil && op.Status != "DONE") so the
controller waits for actual completion before proceeding and still returns
false, nil while waiting.
GCP operations can be in PENDING, RUNNING, or DONE states. Previously the code only checked for RUNNING which could miss PENDING operations. Now we check for any non-DONE status to properly wait for operation completion before proceeding. Also added status field to log messages for better debugging. Signed-off-by: Cristiano Veiga <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
`@control-plane-operator/controllers/gcpprivateserviceconnect/psc_endpoint_controller.go`:
- Around line 404-436: In updateStatusFromEndpoint, don't mark available based
solely on non-empty endpoint fields; instead compare endpoint.Target and
endpoint.IPAddress against the expected values stored on the CR (use
gcpPSC.Spec.Target (or gcpPSC.Spec.ServiceAttachment) and
gcpPSC.Spec.IP/ExpectedIP as the canonical source) and only set
GCPEndpointAvailable = True when they exactly match; if they differ set
ConditionFalse with a message that includes expected vs actual values (and keep
the existing Status().Patch and requeue behavior); if the CR doesn't specify
expected values, fall back to the current non-empty check.
- Around line 438-485: The reconcileDelete function must check GCP operation
errors and wait for async IP deletion: after ForwardingRules.Delete in
reconcileDelete, when op != nil and op.Status == "DONE" inspect op.Error and if
non-nil return an error; if op.Status != "DONE" return false to retry. For the
Addresses.Delete call, capture the returned operation (e.g., ipOp), then poll
the regional operation via
customerGCPClient.RegionOperations.Get(customerProject, region, ipOp.Name) with
a timeout (gcpAPITimeout) until its Status == "DONE" (return false while
pending) and then check that operation's Error field and return an error if
present. Use the existing symbols constructEndpointName, constructIPAddressName,
reconcileDelete, gcpAPITimeout, isNotFoundError to locate and update the code
paths.
Follow reviewer suggestion to properly handle GCP async deletion: - Check op.Status != "DONE" to detect in-progress operations - Check op.Error for operation failures (DONE doesn't guarantee success) - Verify resource is actually gone via GET after successful operation - Handle "already deleted" case (404 on delete) gracefully Both ForwardingRule and Address deletions now: 1. Delete resource, get operation 2. If 404: already deleted, continue 3. If op.Status != "DONE": return false to retry 4. If op.Error != nil: return error with details 5. GET to verify resource is gone 6. If still exists: return false to retry Signed-off-by: Cristiano Veiga <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@control-plane-operator/controllers/gcpprivateserviceconnect/psc_endpoint_controller.go`:
- Around line 307-311: After detecting op.Status == "DONE" in the IP reservation
branch, also check op.Error like the deletion path does: if op.Error is non-nil
(or contains entries), log the error details and return an appropriate
error/result instead of proceeding; update the block around the existing
op.Status check (the branch that currently logs "IP reservation in progress"
using op.Name/op.Status) to perform the same op.Error inspection and handling
used by the deletion flow so failed reservations are not treated as successful.
Add op.Error checks after op.Status == "DONE" for creation operations to match the deletion path. GCP operations can complete with DONE status while still having errors. Affected paths: - IP reservation (Addresses.Insert) - PSC endpoint creation (ForwardingRules.Insert) Signed-off-by: Cristiano Veiga <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@control-plane-operator/controllers/gcpprivateserviceconnect/psc_endpoint_controller.go`:
- Around line 232-247: The isServiceAttachmentReady function only checks
ServiceAttachmentURI but must also validate ServiceAttachmentName because helper
functions constructEndpointName and constructIPAddressName rely on it; update
GCPPrivateServiceConnectReconciler.isServiceAttachmentReady to return false if
either gcpPSC.Status.ServiceAttachmentURI or gcpPSC.Status.ServiceAttachmentName
is empty, then continue the existing condition checks for
GCPServiceAttachmentAvailable.
Add defensive check for ServiceAttachmentName in addition to ServiceAttachmentURI. Both fields are required: - URI: used as the PSC endpoint target - Name: used by constructEndpointName and constructIPAddressName Without this check, if Name is empty we would create GCP resources with malformed names like "-endpoint" and "-ip". Signed-off-by: Cristiano Veiga <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)
|
/lgtm |
|
/retest-required |
1 similar comment
|
/retest-required |
|
/verified later @cristianoveiga |
|
@cristianoveiga: This PR has been marked to be verified later by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cristianoveiga, csrwng The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@cristianoveiga: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What this PR does / why we need it:
Adds a new PSC endpoint controller in the control-plane-operator that manages Private Service Connect (PSC) endpoints in the customer's GCP project. This is the customer-side counterpart to the management-side PSC controller that creates Service Attachments.
Key features:
GCPEndpointAvailableandGCPServiceAttachmentAvailableconditions to HostedCluster statusWhich issue(s) this PR fixes:
Fixes https://issues.redhat.com/browse/GCP-198
Special notes for your reviewer:
hypershift-operator/controllers/platform/gcp/privateserviceconnect_controller.go)gcp-hcp-pilotbranchChecklist: