Overhaul RGD static validation to eliminate emulation #756
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: a-hilaly. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
72df551 to 9fe7eaa — Compare
ellistarn left a comment
Reviewing as an exercise to get myself familiar with the code :)
| name: "convert double", | ||
| celType: cel.DoubleType, | ||
| want: &extv1.JSONSchemaProps{Type: "number"}, |
Curious if you've run into trouble with floating point precision causing flip flops in graph evaluation.
Haven't seen it in practice yet, but yeah, it could definitely cause issues. Any float comparison expressions in resource templates, readyWhen, and includeWhen could cause problems... worth documenting this to steer users towards string/bool/int expressions rather than floats, but it's somewhat inevitable, I guess.
But yeah, I can imagine cases where nodes oscillate between ready/not-ready or included/excluded states because of floating point precision issues.
Thoughts on just blocking floats?
We can block floats in the simple schema; there's nothing we can do now for CRDs that already have float fields. Maybe we can warn where users are manipulating float fields?
Add `TypedEnvironment()` and `UntypedEnvironment()` APIs to support compile-time field validation using OpenAPI schemas. This prepares the kro code base to use CEL's native type checking instead of the emulator for validating RGD correctness.
Remove the resource emulator and all emulation-based validation from the graph builder. Expression extraction and dependency analysis remain functional. Validation is temporarily disabled and will be replaced with CEL type checking in the next commit.
…ources` Add RGD "compile" time validation of CEL expressions using the CEL typed environment. Expressions are now parsed and type checked against OpenAPI schemas, with output type verification against expected field types.
Replace the open-field status schema with type inference using CEL's type system. Status field types are now derived by type-checking expressions against resource schemas and converting the output types to OpenAPI schemas.
Add validation for fields with multiple CEL expressions (string interpolation). All expressions in a multi-expression field must return strings, and the concatenated result is validated against the field's expected type.
CEL was treating "schema" as a type definition instead of a variable, causing field selection errors. Fixed by separating type and variable namespaces (__type_schema for types, schema for variables) and using the OpenAPI library's type conversion instead of manual string-based type matching. Also prevents circular dependencies by excluding status from the schema exposed to CEL expressions, since status is computed from resources, not an input to them.
Replace string-based type validation with CEL's native type system. Parser extracts expected types as `*cel.Type` using `openapi.SchemaDeclType()` for OpenAPI to CEL conversion. Validation uses `IsAssignableType` for proper type checking. Replace `ExpectedTypes []string` and `ExpectedSchema *spec.Schema` fields with single `ExpectedType *cel.Type` in FieldDescriptor. Add `WouldMatchIfUnwrapped()` helper to detect optional type mismatches and provide actionable `.orValue()` suggestions when expressions return `optional_type(T)` but fields expect `T`. Handle `dyn` type as special case matching any expected type. Preserve static validation for non-CEL scalar values during parsing.
Add compile-time validation for condition expressions with proper scoping. includeWhen expressions can only reference "schema" and readyWhen expressions can only reference their own resource. Both must return `bool` or `optional_type(bool)`.
Reorder validation to build the dependency graph before CEL type checking. This ensures undeclared resource references are caught with clear "container not found" errors rather than confusing CEL type errors about undefined variables.
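To make the direction of these commits concrete, here is a minimal, hypothetical Go sketch of the typed-environment idea: resource variables are declared with CEL types (the PR derives them from OpenAPI schemas, e.g. via `openapi.SchemaDeclType()` as mentioned above; here a generic map type stands in for brevity), the expression is compiled, and the inferred output type is checked for assignability against the expected field type.

```go
package main

import (
	"fmt"

	"github.com/google/cel-go/cel"
)

func main() {
	// Hypothetical typed environment: "schema" and "deployment" stand in for
	// the instance schema and a resource; the real code derives precise CEL
	// types from the OpenAPI schemas instead of using a generic map type.
	env, err := cel.NewEnv(
		cel.Variable("schema", cel.MapType(cel.StringType, cel.DynType)),
		cel.Variable("deployment", cel.MapType(cel.StringType, cel.DynType)),
	)
	if err != nil {
		panic(err)
	}

	// Compile parses and type-checks statically; nothing is evaluated and no
	// fake data is generated.
	ast, issues := env.Compile(`deployment.status.availableReplicas >= 1`)
	if issues != nil && issues.Err() != nil {
		panic(issues.Err())
	}

	// Validation then asks whether the expected field type (here bool, as for
	// a readyWhen expression) accepts the inferred output type.
	expected := cel.BoolType
	if !expected.IsAssignableType(ast.OutputType()) {
		fmt.Printf("type mismatch: expected %v, got %v\n", expected, ast.OutputType())
		return
	}
	fmt.Printf("expression type-checks, output type: %v\n", ast.OutputType())
}
```

With the OpenAPI-derived types in place, an invalid field path fails at compile time instead of silently resolving to dyn, which is what the per-resource validation relies on.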
		}
	}
	// All expressions are strings - result type is string
	statusTypeMap[fieldDescriptor.Path] = cel.StringType
This is for when we have a status like:

status:
  someField: ${ expressionA }${ expressionB }

correct?
Could we find/test any case where type(expressionA) or type(expressionB) is not a string and the expression is still valid at the CEL level?
Two side notes/nits (just thinking out loud):
I wonder if we wouldn't gain readability/flexibility by building the extv1.JSONSchemaProps directly. Eventually, we could also declare statusTypeMap := make(map[string][]*cel.Type) so we can handle options like ``
+1 here, the CEL type of status.someField could be checked against the typed schema type for assignability. I think there are cases where expression results could be integers or nested objects; otherwise it's a missing limitation in the docs, I think.
This is intentional. In status field schema generation, any type is allowed for "single expression fields":

status:
  replicas: ${deployment.status.replicas}  # int
  metadata: ${deployment.metadata}         # object

However, with "multi expression fields" we do have restrictions:

status:
  allowed: "Count: ${string(a.spec.replicas)}, Ready: ${string(a.status.ready)}"  # allowed
  notAllowed: "${deployment.metadata}-hello-world-${deployment.status.replicas}"  # not allowed

Concatenating arbitrary types (objects, bools, ints) is complex and error-prone. I'm just starting restrictive; we can relax later if there are good use cases and it's technically feasible. For now, users need explicit string() calls for multi-expression fields.
Some historical context: initially kro didn't support multi-expression fields at all. We added them because users wanted simple string templating instead of complex CEL string manipulation and concatenation expressions. For example, IAM policies:

status:
  policy: |
    {
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::${bucket.name}/*",
      "Principal": "${serviceAccount.metadata.name}"
    }

vs:

status:
  policy: |
    ${json.marshal({  # json.marshal doesn't exist yet AFAIK
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::" + bucket.name + "/*",
      "Principal": serviceAccount.metadata.name
    })}
I have to admit I completely missed that we are in the else branch, which only deals with concatenated expressions. So there are two cases:
- A single expression, strictly typed
- Multiple expressions concatenated, always strings
The argument right now is that status fields only accept strings, whereas in theory they can be object types, integers, or others in the first case. For the second case, they must stay strings (as you outlined). I subscribe to that 100% and I think we should leave this as is.
The more I think about it, the less I can see an instance of ${ expressionA }${ expressionB } where expressionA or expressionB is not a string and the expression still makes sense.
edit:
Actually, yes, there is a case: when crafting JSON, as you showed, for example

status:
  someField: |
    {
      "generation": ${resource.metadata.generation}
    }

Here, generation is an int and gets interpolated into a string.
How hard would it be to force a string() call?
crd := rgdGraph.Instance.GetCRD()
gvk := rgdGraph.Instance.GetCRD().GroupVersionKind()

s, err := schema.ConvertJSONSchemaPropsToSpecSchema(crd.Spec.Versions[0].Schema.OpenAPIV3Schema)
I think this schema conversion only needs to happen once for any given CRD version, so maybe it's worth caching. Worth a TODO?
Agreed, there are caching opportunities here. I think we should address caching (schema conversions, CEL envs, etc.) in a separate PR to keep this one focused on the emulation -> type checking migration. There are also runtime optimizations we can tackle separately.
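For whenever that follow-up PR happens, a minimal sketch of the idea, assuming a cache keyed by the CRD's GroupVersionKind (names, package layout, and the injected convert function are hypothetical):

```go
package validation // hypothetical package, for illustration only

import (
	"sync"

	extv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/kube-openapi/pkg/validation/spec"
)

// convertedSchemas memoizes CRD schema conversions; sync.Map keeps it safe
// if graph builds ever run concurrently.
var convertedSchemas sync.Map // map[string]*spec.Schema

// cachedSpecSchema converts a CRD's OpenAPI schema at most once per GVK.
// The convert argument stands in for schema.ConvertJSONSchemaPropsToSpecSchema.
func cachedSpecSchema(
	gvk schema.GroupVersionKind,
	props *extv1.JSONSchemaProps,
	convert func(*extv1.JSONSchemaProps) (*spec.Schema, error),
) (*spec.Schema, error) {
	if s, ok := convertedSchemas.Load(gvk.String()); ok {
		return s.(*spec.Schema), nil
	}
	s, err := convert(props)
	if err != nil {
		return nil, err
	}
	convertedSchemas.Store(gvk.String(), s)
	return s, nil
}
```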
// It type-checks that expressions reference valid fields and return the expected types
// based on the OpenAPI schemas.
func validateTemplateExpressions(env *cel.Env, resource *Resource) error {
	for _, templateVariable := range resource.variables {
I think this could be made concurrent for every single expression.
Agreed. I purposefully avoided these kinds of optimizations in this PR; I want to get the logic right first and optimize later in a separate PR.
Side thought: curious how much we'd gain from concurrency here; it's maybe useful for very large RGDs.
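If that optimization lands, a rough sketch of the fan-out, assuming the per-expression checks are independent and the shared `*cel.Env` is safe for concurrent use (hypothetical helper, using errgroup):

```go
package validation // hypothetical package, for illustration only

import (
	"github.com/google/cel-go/cel"
	"golang.org/x/sync/errgroup"
)

// checkExpressionsConcurrently is a hypothetical variant of the per-expression
// loop: each expression is compiled (parsed + type-checked) in its own goroutine.
func checkExpressionsConcurrently(env *cel.Env, expressions []string) error {
	var g errgroup.Group
	for _, expr := range expressions {
		expr := expr // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			if _, issues := env.Compile(expr); issues != nil && issues.Err() != nil {
				return issues.Err()
			}
			return nil
		})
	}
	// Wait returns the first non-nil error reported by any expression.
	return g.Wait()
}
```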
// Returns the checked AST on success, or the raw CEL error on failure.
// Callers should wrap the error with appropriate context.
func parseAndCheckCELExpression(env *cel.Env, expression string) (*cel.Ast, error) {
	parsedAST, issues := env.Parse(expression)
This AST check could be cached later; maybe worth a TODO.
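For context, the parse-then-check pattern with cel-go looks roughly like this (a sketch of the pattern, not the exact function in this PR), which is also the shape a future cache would sit around:

```go
package validation // hypothetical package, for illustration only

import "github.com/google/cel-go/cel"

// parseAndCheck sketches the two-phase flow: Parse builds the AST, Check
// type-checks it against the environment's declarations. A future cache
// could key the checked AST by (environment, expression).
func parseAndCheck(env *cel.Env, expression string) (*cel.Ast, error) {
	parsedAST, issues := env.Parse(expression)
	if issues != nil && issues.Err() != nil {
		return nil, issues.Err()
	}
	checkedAST, issues := env.Check(parsedAST)
	if issues != nil && issues.Err() != nil {
		return nil, issues.Err()
	}
	return checkedAST, nil
}
```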
func getSchemaWithoutStatus(crd *extv1.CustomResourceDefinition) (*spec.Schema, error) {
	crdCopy := crd.DeepCopy()

	if len(crdCopy.Spec.Versions) != 1 || crdCopy.Spec.Versions[0].Schema == nil {
This should probably get a TODO for when we start managing upgrades to CRDs. What's our stance here?
Restructure validation to report errors per-node rather than globally, making failures easier to locate and fix. Consolidate all expression validation for each resource (`templates`, `includeWhen`, `readyWhen`) to provide clear, concise error context without redundant prefixes. Extract common parsing logic to reduce verbosity in error paths.
Inject 'ObjectMeta' schema into instance schema during CEL validation so expressions can reference standard Kubernetes fields like 'metadata.name|namespace|labels|annotations'
Fix bug where `strings.Trim(expr, "${}")` treated `"${}"` as a cutset, removing
any `$`, `{`, or `}` characters from both ends of expressions. This corrupted
expressions ending with braces like `"condition ? value : {}"`, commonly used
in ternary operators with empty map literals.
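A small, self-contained illustration of the bug and the fix (the expression here is just an example):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	expr := "${has(schema.spec.tags) ? schema.spec.tags : {}}"

	// Buggy: Trim treats "${}" as a *set* of characters, so it also strips
	// the braces belonging to the empty map literal at the end.
	fmt.Printf("%q\n", strings.Trim(expr, "${}"))
	// "has(schema.spec.tags) ? schema.spec.tags : "  <- "{}" lost

	// Fixed: strip the exact "${" prefix and "}" suffix instead.
	fixed := strings.TrimSuffix(strings.TrimPrefix(expr, "${"), "}")
	fmt.Printf("%q\n", fixed)
	// "has(schema.spec.tags) ? schema.spec.tags : {}"
}
```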
…dule Move the emulator package from `pkg/graph` to `cmd/kro` as it's only used by the CLI's instance generation command. Create a separate `go.mod` for `cmd/kro` to isolate CLI dependencies. This is a temporary change while we work on the CLI module structure.
jakobmoellerdev left a comment
/lgtm
/hold
Just waiting for a second reviewer's ack. Great work!
/unhold to
Note to reviewers: This PR was crafted commit by commit to make the
refactoring easier to follow. Review each commit individually to
understand how the implementation evolved from emulation to native CEL
type checking.
When we built kro, we needed to validate that RGDs are correct and catch as many errors as possible before users saw them when submitting instances. Things like ensuring a string field doesn't get assigned an integer expression, or that field paths actually exist on the resources being referenced. We implemented an emulation-based approach: for every RGD at build time, generate fake data for every resource matching their OpenAPI schemas, then evaluate every CEL expression against this fake data. We needed this for three purposes: 1/ validate that resources reference each other correctly and those links make sense. 2/ ensure expressions resolved to something and validate that the result matched the expected type. 3/ generate the instance status schema fields from the expression output types. This was essentially a dry run of expression evaluation at build time.
The emulation approach had fundamental limitations. Many schema constraints were too complex to handle: optional types, regex patterns, and complex validation rules were either ignored or simplified, and the emulator couldn't generate valid data for all cases. More fundamentally, we were executing expressions to observe their output types rather than analyzing the expressions statically. Type coercion rules, null handling, optional type behavior, and field access semantics became approximations. Edge cases like `optional_type(T)` assignability to `T`, dynamic type handling, and wrapper type conversions required custom logic that diverged too far from CEL's actual semantics. Generating fake instances for every resource consumed substantial memory, particularly for deeply nested schemas, and all this data was discarded immediately after validation.
This series of commits replaces emulation with CEL's built-in type checker. We leverage CEL's type checking and inference capabilities, which provide best-effort static analysis without needing to emulate data. OpenAPI schemas are converted to CEL types and injected into the type environment. CEL's compiler validates expressions during AST analysis, checking field access and type compatibility against the actual schemas. The type checker understands resource structures, validates that field paths exist, and ensures expressions return the expected types, all statically, without executing anything.
Type compatibility is checked using CEL's assignability rules, which handle the nuances we previously approximated: optional types, null handling, type coercion, and wrapper types all follow CEL's actual semantics. Status schema generation now uses CEL's type inference directly from checked ASTs. The compiler tells us what type an expression returns, and we convert that CEL type back to an OpenAPI schema by walking through the type structure. The same approach validates `readyWhen` and `includeWhen` expressions, ensuring they return boolean types, again without executing anything.
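The CEL-type-to-OpenAPI direction can be pictured roughly as the sketch below, which covers only scalar output types and matches the "convert double" test case quoted earlier; the actual implementation also walks lists, maps, and object types, and leans on library conversions rather than this manual name matching.

```go
package validation // hypothetical package, for illustration only

import (
	"github.com/google/cel-go/cel"
	extv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
)

// celTypeToSchema sketches how an inferred CEL output type maps to an OpenAPI
// schema for the instance status. Only scalar kinds are shown here.
func celTypeToSchema(t *cel.Type) *extv1.JSONSchemaProps {
	switch t.TypeName() {
	case "string":
		return &extv1.JSONSchemaProps{Type: "string"}
	case "bool":
		return &extv1.JSONSchemaProps{Type: "boolean"}
	case "int", "uint":
		return &extv1.JSONSchemaProps{Type: "integer"}
	case "double":
		return &extv1.JSONSchemaProps{Type: "number"}
	default:
		// Lists, maps, objects, and dyn need a structural walk; as a fallback
		// the schema can preserve unknown fields.
		preserve := true
		return &extv1.JSONSchemaProps{XPreserveUnknownFields: &preserve}
	}
}
```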
No fake data generation, no expression execution, no emulator. We get
better correctness by using CEL's actual semantics and better performance
by eliminating all the overhead.