Skip to content

Conversation

@a-hilaly
Copy link
Member

Note to reviewers: This PR was crafted commit by commit to make the
refactoring easier to follow. Review each commit individually to
understand how the implementation evolved from emulation to native CEL
type checking.

When we built kro, we needed to validate that RGDs are correct and catch
as many errors as possible before users saw them when submitting
instances. Things like ensuring a string field doesnt get assigned an
integer expression, or that field paths actually exist on the resources
being referenced. We implemented an emulation based approach: for every
RGD at build time, generate fake data for every resource matching their
OpenAPI schemas, then evaluate every CEL expression against this fake
data. We needed this for three purposes: 1/ validate that resources
reference each other correctly and those links make sense. 2/ ensure
expressions resolved to something and validate that the result matched the
expected type. 3/ generate the instance status schema fields from the
expression output types. This was essentially a dry run of expression
evaluation at build time.

The emulation approach had fundamental limitations. Many schema
constraints were too complex to handle. Things like optional types, regex
patterns, and complex validation rules were either ignored or simplified.
The emulator couldn't generate valid data for all cases. More
fundamentally, we were executing expressions to observe their output types
rather than analyzing the expressions statically. Type coercion rules,
null handling, optional type behavior, and field access semantics became
approximations. Edge cases like optional_type(T) assignability to T,
dynamic type handling, and wrapper type conversions required custom logic
that diverged too far. Generating fake instances for every resource
consumed substantial memory, particularly for deeply nested schemas,
and all this data was discarded immediately after validation.

This series of commits replaces emulation with CELs built in type
checker. We leverage CEL's type checking and inference capabilities that
provide best-effort static analysis without needing to emulate data.
OpenAPI schemas are converted to CEL types and injected into the type
environment. CEL's compiler validates expressions during AST analysis,
checking field access and type compatibility against the actual schemas.
The type checker understands resource structures, validates field paths
exist, and ensures expressions return the expected types, all statically
without executing anything.

Type compatibility is checked using CEL's assignability rules, which
handle the nuances we previously approximated: optional types, null
handling, type coercion, and wrapper types all follow CEL's actual
semantics. Status schema generation now uses CEL's type inference directly
from checked ASTs. The compiler tells us what type an expression returns,
and we convert that CEL type back to an OpenAPI schema by walking through
the type structure. The same approach validates readyWhen and includeWhen
expressions, ensuring they return boolean types without executing anything.
No fake data generation, no expression execution, no emulator. We get
better correctness by using CEL's actual semantics and better performance
by eliminating all the overhead.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: a-hilaly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 27, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 27, 2025
@a-hilaly a-hilaly added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 27, 2025
@a-hilaly a-hilaly force-pushed the cel-la-vie branch 2 times, most recently from 72df551 to 9fe7eaa Compare October 27, 2025 03:08
Copy link
Contributor

@ellistarn ellistarn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing as an exercise to get myself familiar with the code :)

Comment on lines +52 to +54
name: "convert double",
celType: cel.DoubleType,
want: &extv1.JSONSchemaProps{Type: "number"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if you've run into trouble with floating point precision causing flip flops in graph evaluation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't seen it in practice yet, but yeah, could definitely cause issues. Any float comparisons expressions in resource templates, readyWhen and includeWhen could cause issues... worth documenting this to steering users towards using more of string/bool/int expression rather than floats, but it's inevitable i guess.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but yeah i can imagine cases where nodes are oscillating between ready/non ready state or included/excluded caused floating points precision issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on just blocking floats?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can block floats in simple schema, nothing we can do now for CRDs that already have a float fields. Maybe we can warn where users are manipulating float fields?

Add `TypedEnvironment()` and `UntypedEnvironment()` APIs to support compile-time
field validation using OpenAPI schemas. This prepares kro code base to use CEL
native type checking instead of the emulator for validating RGD correctness.
Remove the resource emulator and all emulation-based validation from
the graph builder. Expression extraction and dependency analysis remain
functional. Validation is temporarily disabled and will be replaced with
CEL type checking in the next commit.
…ources`

Add RGD "compile" time validation of CEL expressions using the CEL typed
environment. Expressions are now parsed and type checked against OpenAPI
schemas, with output type verification against expected field types.
Replace open field status schema with type inference using CELs type system.
Status field types are now derived by type-checking expressions against resource
schemas and converting the output types to OpenAPI schemas.
Add validation for fields with multiple CEL expressions (string interpolation).
All expressions in a multi-expression field must return strings, and the
concatenated result is validated against the field's expected type.
CEL was treating "schema" as a type definition instead of a variable, causing
field selection errors. Fixed by separating type and variable namespaces
(__type_schema for types, schema for variables) and using the OpenAPI library's
type conversion instead of manual string-based type matching. Also prevents
circular dependencies by excluding status from the schema exposed to CEL
expressions, since status is computed from resources, not an input to them.
Replace string-based type validation with CEL's native type system. Parser
extracts expected types as `*cel.Type` using `openapi.SchemaDeclType()` for
OpenAPI to CEL conversion. Validation uses `IsAssignableType` for proper type
checking. Replace `ExpectedTypes []string` and `ExpectedSchema *spec.Schema`
fields with single `ExpectedType *cel.Type` in FieldDescriptor.

Add `WouldMatchIfUnwrapped()` helper to detect optional type mismatches and
provide actionable `.orValue()` suggestions when expressions return
`optional_type(T)` but fields expect `T`. Handle `dyn` type as special case
matching any expected type. Preserve static validation for non-CEL scalar
values during parsing.
Add compile-time validation for condition expressions with proper scoping.
includeWhen expressions can only reference "schema" and readyWhen expressions
can only reference their own resource. Both must return `bool` or
`optional_type(bool)`.
Reorder validation to build the dependency graph before CEL type checking.
This ensures undeclared resource references are caught with clear "container not found"
errors rather than confusing CEL type errors about undefined variables.
}
}
// All expressions are strings - result type is string
statusTypeMap[fieldDescriptor.Path] = cel.StringType
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is when we have a status like

status:
    someField: ${ expressionA }${ expressionB }

correct?

Could we find/test any case where type(expressionA)type(expressionB) is not a string and returns something valid at the CEL level?

2 side notes/nit (just thinking out loud)

I wonder if we wouldn't gain readability flexibility building directly the extva.JSONSchemaProps . Eventually, we could also declare statusTypeMap := make(map[string][]*cel.Type) so we can handle options like ``

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 here, the CEL type of status.someField could be checked against the typed schema type for its assigneability. I think there are cases when expression results could be integers or nested objects. otherwise its a missing limitation doc I think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional. In status field schema generation, for "single expression fields" any type allowed:

status:
  replicas: ${deployment.status.replicas}  # int
  metadata: ${deployment.metadata}  # object

However with "multi expression fields" we do have restrictions:

status:
  allowed: "Count: ${string(a.spec.replicas)}, Ready: ${string(a.status.ready)}" #allowed
  notAllowed: "${deployment.metadata}-hellol-world-${deployment.status.replicas}"

Concatenating arbitrary types (objects, bools, ints) is complex and error prone. I'm just starting restrictive, we can relax later if there are good use cases and it's technically feasible. For now, users need explicit string() calls for multi exprssion fields.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some historical context: Initially kro didn't support multi expression fields at all. We added it because users wanted simple string templating instead of complex CEL string manipulation and concatenation expressions. For example IAM policies:

status:
  policy: |
    {
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::${bucket.name}/*",
      "Principal": "${serviceAccount.metadata.name}"
    }

vs:

status:
  policy: | ${json.marshal({ # json.Marshall doesn't exist yet AFAIK
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::" + bucket.name + "/*",
      "Principal": serviceAccount.metadata.name
    })}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to admit I completely missed that we are in the else branch which only deals with concatenated expressions.

  1. A single expression, strictly typed
  2. Multiple expressions concatenated, always strings

The argument right now is that status fields only accept strings, whereas in theory they can be object types or integers or others when working with the first case. for the second case, they must stay strings (as you outlined). I subscribe to that 100% and I think we should leave this as is

Copy link

@tjamet tjamet Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about it, the less I see an instance of ${ expressionA }${ expressionB } where expressionA or expressionB are not strings and the expression makes sense.

edit:

Actually, yes, there is a case, when crafting a json as you shown for example

status:
   someField: |
      {
          "generation": ${resource.metadata.generation}
      }

Then, generation is an int and gets into a string.
How hard would it be to force a .String)() call?

crd := rgdGraph.Instance.GetCRD()
gvk := rgdGraph.Instance.GetCRD().GroupVersionKind()

s, err := schema.ConvertJSONSchemaPropsToSpecSchema(crd.Spec.Versions[0].Schema.OpenAPIV3Schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this schema conversion has to happen exactly once for any given crd version so maybe its worth caching it. worth a todo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed there are caching opportunities here. I think we should aim to address caching (schema conversions, CEL envs etc..) in a separate PR to keep this one focused on the emulation -> type checking migration. There are also runtime optimizations we can tackle separately.

}
}
// All expressions are strings - result type is string
statusTypeMap[fieldDescriptor.Path] = cel.StringType
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 here, the CEL type of status.someField could be checked against the typed schema type for its assigneability. I think there are cases when expression results could be integers or nested objects. otherwise its a missing limitation doc I think

// It type-checks that expressions reference valid fields and return the expected types
// based on the OpenAPI schemas.
func validateTemplateExpressions(env *cel.Env, resource *Resource) error {
for _, templateVariable := range resource.variables {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be concurrent for every single expression

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I purposefully avoided these kind of optimizations in this PR. i want to get to logic right first, optimize later on in a seperate PR.

Side thought: curious on how much we'll win by doing concurrency here, maybe useful for very large rgds.

// Returns the checked AST on success, or the raw CEL error on failure.
// Callers should wrap the error with appropriate context.
func parseAndCheckCELExpression(env *cel.Env, expression string) (*cel.Ast, error) {
parsedAST, issues := env.Parse(expression)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this AST check could be cached later, maybe worth a todo

func getSchemaWithoutStatus(crd *extv1.CustomResourceDefinition) (*spec.Schema, error) {
crdCopy := crd.DeepCopy()

if len(crdCopy.Spec.Versions) != 1 || crdCopy.Spec.Versions[0].Schema == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should probably get a TODO when we start managing upgrades to CRDs. Whats our stance here?

Restructure validation to report errors per-node rather than globally, making
failures easier to locate and fix. Consolidate all expression validation for
each resource (`templates`, `includeWhen`, `readyWhen`) to provide clear, concise
error context without redundant prefixes.

Extract common parsing logic to reduce verbosity in error paths.
Inject 'ObjectMeta' schema into instance schema during CEL validation so expressions
can reference standard Kubernetes fields like 'metadata.name|namespace|labels|annotations'
Fix bug where `strings.Trim(expr, "${}")` treated `"${}"` as a cutset, removing
any `$`, `{`, or `}` characters from both ends of expressions. This corrupted
expressions ending with braces like `"condition ? value : {}"`, commonly used
in ternary operators with empty map literals.
…dule

Move the emulator package from `pkg/graph` to `cmd/kro` as it's only used by
the CLI's instance generation command. Create a separate `go.mod` for
`cmd/kro` to isolate CLI dependencies. This is a temporary change while we
work on the CLI module structure.
Copy link
Member

@jakobmoellerdev jakobmoellerdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/hold

just waiting for a second reviewers ack, great work

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2025
@a-hilaly
Copy link
Member Author

a-hilaly commented Nov 4, 2025

/cc @barney-s @matthchr

@a-hilaly a-hilaly added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Nov 5, 2025
@jakobmoellerdev
Copy link
Member

/unhold

to Infinity CEL and beyond

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 5, 2025
@k8s-ci-robot k8s-ci-robot merged commit 14a86b9 into kubernetes-sigs:main Nov 5, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants