Overhaul RGD static validation to eliminate emulation #756

a-hilaly · 2025-10-27T02:20:11Z

Note to reviewers: This PR was crafted commit by commit to make the
refactoring easier to follow. Review each commit individually to
understand how the implementation evolved from emulation to native CEL
type checking.

When we built kro, we needed to validate that RGDs are correct and catch
as many errors as possible before users saw them when submitting
instances. Things like ensuring a string field doesnt get assigned an
integer expression, or that field paths actually exist on the resources
being referenced. We implemented an emulation based approach: for every
RGD at build time, generate fake data for every resource matching their
OpenAPI schemas, then evaluate every CEL expression against this fake
data. We needed this for three purposes: 1/ validate that resources
reference each other correctly and those links make sense. 2/ ensure
expressions resolved to something and validate that the result matched the
expected type. 3/ generate the instance status schema fields from the
expression output types. This was essentially a dry run of expression
evaluation at build time.

The emulation approach had fundamental limitations. Many schema
constraints were too complex to handle. Things like optional types, regex
patterns, and complex validation rules were either ignored or simplified.
The emulator couldn't generate valid data for all cases. More
fundamentally, we were executing expressions to observe their output types
rather than analyzing the expressions statically. Type coercion rules,
null handling, optional type behavior, and field access semantics became
approximations. Edge cases like optional_type(T) assignability to T,
dynamic type handling, and wrapper type conversions required custom logic
that diverged too far. Generating fake instances for every resource
consumed substantial memory, particularly for deeply nested schemas,
and all this data was discarded immediately after validation.

This series of commits replaces emulation with CELs built in type
checker. We leverage CEL's type checking and inference capabilities that
provide best-effort static analysis without needing to emulate data.
OpenAPI schemas are converted to CEL types and injected into the type
environment. CEL's compiler validates expressions during AST analysis,
checking field access and type compatibility against the actual schemas.
The type checker understands resource structures, validates field paths
exist, and ensures expressions return the expected types, all statically
without executing anything.

Type compatibility is checked using CEL's assignability rules, which
handle the nuances we previously approximated: optional types, null
handling, type coercion, and wrapper types all follow CEL's actual
semantics. Status schema generation now uses CEL's type inference directly
from checked ASTs. The compiler tells us what type an expression returns,
and we convert that CEL type back to an OpenAPI schema by walking through
the type structure. The same approach validates readyWhen and includeWhen
expressions, ensuring they return boolean types without executing anything.
No fake data generation, no expression execution, no emulator. We get
better correctness by using CEL's actual semantics and better performance
by eliminating all the overhead.

k8s-ci-robot · 2025-10-27T02:20:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: a-hilaly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [a-hilaly]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ellistarn

Reviewing as an exercise to get myself familiar with the code :)

test/integration/suites/core/crd_test.go

ellistarn · 2025-10-27T04:49:15Z

pkg/graph/schema/conversion_cel_type_test.go

+			name:    "convert double",
+			celType: cel.DoubleType,
+			want:    &extv1.JSONSchemaProps{Type: "number"},


Curious if you've run into trouble with floating point precision causing flip flops in graph evaluation.

Haven't seen it in practice yet, but yeah, could definitely cause issues. Any float comparisons expressions in resource templates, readyWhen and includeWhen could cause issues... worth documenting this to steering users towards using more of string/bool/int expression rather than floats, but it's inevitable i guess.

but yeah i can imagine cases where nodes are oscillating between ready/non ready state or included/excluded caused floating points precision issues.

Thoughts on just blocking floats?

We can block floats in simple schema, nothing we can do now for CRDs that already have a float fields. Maybe we can warn where users are manipulating float fields?

pkg/graph/schema/conversion_schema.go

pkg/graph/builder.go

Add `TypedEnvironment()` and `UntypedEnvironment()` APIs to support compile-time field validation using OpenAPI schemas. This prepares kro code base to use CEL native type checking instead of the emulator for validating RGD correctness.

Remove the resource emulator and all emulation-based validation from the graph builder. Expression extraction and dependency analysis remain functional. Validation is temporarily disabled and will be replaced with CEL type checking in the next commit.

…ources` Add RGD "compile" time validation of CEL expressions using the CEL typed environment. Expressions are now parsed and type checked against OpenAPI schemas, with output type verification against expected field types.

Replace open field status schema with type inference using CELs type system. Status field types are now derived by type-checking expressions against resource schemas and converting the output types to OpenAPI schemas.

Add validation for fields with multiple CEL expressions (string interpolation). All expressions in a multi-expression field must return strings, and the concatenated result is validated against the field's expected type.

CEL was treating "schema" as a type definition instead of a variable, causing field selection errors. Fixed by separating type and variable namespaces (__type_schema for types, schema for variables) and using the OpenAPI library's type conversion instead of manual string-based type matching. Also prevents circular dependencies by excluding status from the schema exposed to CEL expressions, since status is computed from resources, not an input to them.

Replace string-based type validation with CEL's native type system. Parser extracts expected types as `*cel.Type` using `openapi.SchemaDeclType()` for OpenAPI to CEL conversion. Validation uses `IsAssignableType` for proper type checking. Replace `ExpectedTypes []string` and `ExpectedSchema *spec.Schema` fields with single `ExpectedType *cel.Type` in FieldDescriptor. Add `WouldMatchIfUnwrapped()` helper to detect optional type mismatches and provide actionable `.orValue()` suggestions when expressions return `optional_type(T)` but fields expect `T`. Handle `dyn` type as special case matching any expected type. Preserve static validation for non-CEL scalar values during parsing.

Add compile-time validation for condition expressions with proper scoping. includeWhen expressions can only reference "schema" and readyWhen expressions can only reference their own resource. Both must return `bool` or `optional_type(bool)`.

Reorder validation to build the dependency graph before CEL type checking. This ensures undeclared resource references are caught with clear "container not found" errors rather than confusing CEL type errors about undefined variables.

cmd/kro/go.mod

pkg/cel/environment.go

tjamet · 2025-10-28T09:25:54Z

pkg/graph/builder.go

+				}
+			}
+			// All expressions are strings - result type is string
+			statusTypeMap[fieldDescriptor.Path] = cel.StringType


This is when we have a status like

status: someField: ${ expressionA }${ expressionB }

correct?

Could we find/test any case where type(expressionA)type(expressionB) is not a string and returns something valid at the CEL level?

2 side notes/nit (just thinking out loud)

I wonder if we wouldn't gain readability flexibility building directly the extva.JSONSchemaProps . Eventually, we could also declare statusTypeMap := make(map[string][]*cel.Type) so we can handle options like ``

+1 here, the CEL type of status.someField could be checked against the typed schema type for its assigneability. I think there are cases when expression results could be integers or nested objects. otherwise its a missing limitation doc I think

This is intentional. In status field schema generation, for "single expression fields" any type allowed:

status: replicas: ${deployment.status.replicas} # int metadata: ${deployment.metadata} # object

However with "multi expression fields" we do have restrictions:

status: allowed: "Count: ${string(a.spec.replicas)}, Ready: ${string(a.status.ready)}" #allowed notAllowed: "${deployment.metadata}-hellol-world-${deployment.status.replicas}"

Concatenating arbitrary types (objects, bools, ints) is complex and error prone. I'm just starting restrictive, we can relax later if there are good use cases and it's technically feasible. For now, users need explicit string() calls for multi exprssion fields.

Some historical context: Initially kro didn't support multi expression fields at all. We added it because users wanted simple string templating instead of complex CEL string manipulation and concatenation expressions. For example IAM policies:

status: policy: | { "Effect": "Allow", "Resource": "arn:aws:s3:::${bucket.name}/*", "Principal": "${serviceAccount.metadata.name}" }

vs:

status: policy: | ${json.marshal({ # json.Marshall doesn't exist yet AFAIK "Effect": "Allow", "Resource": "arn:aws:s3:::" + bucket.name + "/*", "Principal": serviceAccount.metadata.name })}

I have to admit I completely missed that we are in the else branch which only deals with concatenated expressions.

A single expression, strictly typed

Multiple expressions concatenated, always strings

The argument right now is that status fields only accept strings, whereas in theory they can be object types or integers or others when working with the first case. for the second case, they must stay strings (as you outlined). I subscribe to that 100% and I think we should leave this as is

The more I think about it, the less I see an instance of ${ expressionA }${ expressionB } where expressionA or expressionB are not strings and the expression makes sense.

edit:

Actually, yes, there is a case, when crafting a json as you shown for example

status: someField: | { "generation": ${resource.metadata.generation} }

Then, generation is an int and gets into a string.
How hard would it be to force a .String)() call?

jakobmoellerdev · 2025-10-29T07:50:40Z

cmd/kro/commands/generate/instance.go

+	crd := rgdGraph.Instance.GetCRD()
+	gvk := rgdGraph.Instance.GetCRD().GroupVersionKind()
+
+	s, err := schema.ConvertJSONSchemaPropsToSpecSchema(crd.Spec.Versions[0].Schema.OpenAPIV3Schema)


I think this schema conversion has to happen exactly once for any given crd version so maybe its worth caching it. worth a todo?

Agreed there are caching opportunities here. I think we should aim to address caching (schema conversions, CEL envs etc..) in a separate PR to keep this one focused on the emulation -> type checking migration. There are also runtime optimizations we can tackle separately.

pkg/cel/environment.go

pkg/graph/builder.go

jakobmoellerdev · 2025-10-29T07:57:23Z

pkg/graph/builder.go

+				}
+			}
+			// All expressions are strings - result type is string
+			statusTypeMap[fieldDescriptor.Path] = cel.StringType


+1 here, the CEL type of status.someField could be checked against the typed schema type for its assigneability. I think there are cases when expression results could be integers or nested objects. otherwise its a missing limitation doc I think

jakobmoellerdev · 2025-10-29T07:59:19Z

pkg/graph/builder.go

+// It type-checks that expressions reference valid fields and return the expected types
+// based on the OpenAPI schemas.
+func validateTemplateExpressions(env *cel.Env, resource *Resource) error {
+	for _, templateVariable := range resource.variables {


I think this could be concurrent for every single expression

Agreed. I purposefully avoided these kind of optimizations in this PR. i want to get to logic right first, optimize later on in a seperate PR.

Side thought: curious on how much we'll win by doing concurrency here, maybe useful for very large rgds.

jakobmoellerdev · 2025-10-29T08:01:48Z

pkg/graph/builder.go

+// Returns the checked AST on success, or the raw CEL error on failure.
+// Callers should wrap the error with appropriate context.
+func parseAndCheckCELExpression(env *cel.Env, expression string) (*cel.Ast, error) {
+	parsedAST, issues := env.Parse(expression)


this AST check could be cached later, maybe worth a todo

jakobmoellerdev · 2025-10-29T08:02:49Z

pkg/graph/builder.go

+func getSchemaWithoutStatus(crd *extv1.CustomResourceDefinition) (*spec.Schema, error) {
+	crdCopy := crd.DeepCopy()
+
+	if len(crdCopy.Spec.Versions) != 1 || crdCopy.Spec.Versions[0].Schema == nil {


this should probably get a TODO when we start managing upgrades to CRDs. Whats our stance here?

pkg/graph/parser/parser.go

pkg/graph/schema/conversion_cel_type.go

pkg/graph/schema/schema.go

Restructure validation to report errors per-node rather than globally, making failures easier to locate and fix. Consolidate all expression validation for each resource (`templates`, `includeWhen`, `readyWhen`) to provide clear, concise error context without redundant prefixes. Extract common parsing logic to reduce verbosity in error paths.

Inject 'ObjectMeta' schema into instance schema during CEL validation so expressions can reference standard Kubernetes fields like 'metadata.name|namespace|labels|annotations'

Fix bug where `strings.Trim(expr, "${}")` treated `"${}"` as a cutset, removing any `$`, `{`, or `}` characters from both ends of expressions. This corrupted expressions ending with braces like `"condition ? value : {}"`, commonly used in ternary operators with empty map literals.

…dule Move the emulator package from `pkg/graph` to `cmd/kro` as it's only used by the CLI's instance generation command. Create a separate `go.mod` for `cmd/kro` to isolate CLI dependencies. This is a temporary change while we work on the CLI module structure.

jakobmoellerdev

/lgtm
/hold

just waiting for a second reviewers ack, great work

a-hilaly · 2025-11-04T07:33:34Z

/cc @barney-s @matthchr

jakobmoellerdev · 2025-11-05T07:54:11Z

/unhold

to ~~Infinity~~ CEL and beyond

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 27, 2025

k8s-ci-robot requested review from bridgetkromhout and jakobmoellerdev October 27, 2025 02:20

k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 27, 2025

a-hilaly added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 27, 2025

a-hilaly force-pushed the cel-la-vie branch 2 times, most recently from 72df551 to 9fe7eaa Compare October 27, 2025 03:08

ellistarn reviewed Oct 27, 2025

View reviewed changes

a-hilaly added 9 commits October 27, 2025 23:24

feat: infer status schema from CEL type checking

65b285e

Replace open field status schema with type inference using CELs type system. Status field types are now derived by type-checking expressions against resource schemas and converting the output types to OpenAPI schemas.

feat: validate multi-expression string concatenation

94daf91

Add validation for fields with multiple CEL expressions (string interpolation). All expressions in a multi-expression field must return strings, and the concatenated result is validated against the field's expected type.

a-hilaly force-pushed the cel-la-vie branch from 9fe7eaa to edbce61 Compare October 28, 2025 07:14

tjamet reviewed Oct 28, 2025

View reviewed changes

cmd/kro/go.mod Show resolved Hide resolved

tjamet reviewed Oct 28, 2025

View reviewed changes

pkg/cel/environment.go Outdated Show resolved Hide resolved

tjamet reviewed Oct 28, 2025

View reviewed changes

a-hilaly mentioned this pull request Oct 28, 2025

Regression: readyWhen condition is ignored when orchestrating dependencies #744

Closed

jakobmoellerdev mentioned this pull request Oct 28, 2025

High memory usage after applying RGDs in cluster with large CRDs #724

Closed

jakobmoellerdev reviewed Oct 29, 2025

View reviewed changes

matthchr added this to the 0.6 milestone Oct 29, 2025

a-hilaly mentioned this pull request Oct 30, 2025

Performance: add caching for repeated operations in controller #761

Open

a-hilaly force-pushed the cel-la-vie branch from edbce61 to b69549c Compare October 30, 2025 08:25

a-hilaly added 3 commits October 30, 2025 01:26

feat: Inject ObjectMeta schema into instance for moar type checking

92d096b

Inject 'ObjectMeta' schema into instance schema during CEL validation so expressions can reference standard Kubernetes fields like 'metadata.name|namespace|labels|annotations'

a-hilaly force-pushed the cel-la-vie branch from b69549c to eddbdea Compare October 30, 2025 08:26

jakobmoellerdev reviewed Nov 1, 2025

View reviewed changes

k8s-ci-robot assigned jakobmoellerdev Nov 1, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2025

k8s-ci-robot requested review from barney-s and matthchr November 4, 2025 07:33

a-hilaly added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Nov 5, 2025

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 5, 2025

k8s-ci-robot merged commit 14a86b9 into kubernetes-sigs:main Nov 5, 2025
8 checks passed

a-hilaly mentioned this pull request Nov 5, 2025

Adding support for customTypes in CEL environment #450

Closed

Overhaul RGD static validation to eliminate emulation #756

Overhaul RGD static validation to eliminate emulation #756

Conversation

a-hilaly commented Oct 27, 2025

Uh oh!

k8s-ci-robot commented Oct 27, 2025

Uh oh!

ellistarn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tjamet Oct 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jakobmoellerdev left a comment

Choose a reason for hiding this comment

Uh oh!

a-hilaly commented Nov 4, 2025

Uh oh!

jakobmoellerdev commented Nov 5, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

tjamet Oct 30, 2025 •

edited

Loading