
Conversation

@mzihlmann (Collaborator) commented on Oct 10, 2025

Fixes #334

Description

For single-stage builds we perform a cache lookahead. This means that with a 100% cache hit rate we don't even need to download and unpack the files in order to create the image. The idea here is to extend this lookahead functionality to work across stages too.

Currently in multistage builds every stage is built (from cache), even if the dependent stages would themselves resolve to a 100% cache hit rate. This means that even with a 100% cache hit rate, we at least need to download and unpack all the files for all multistage ancestors. For long FROM chains this is partially mitigated by squashing the stages together, so that we effectively build a single stage. With COPY --from, however, this is not possible, as we cannot squash across forks and merges. If we opt into --cache-copy-layers we don't even use the files from the freshly built ancestor image at all and instead load the layers directly from cache. In that case the downloading and unpacking was completely in vain, and we can skip it entirely as an optimization.
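To make the cross-stage idea concrete, here is a minimal sketch in Go: before downloading and unpacking a stage's filesystem, check that every command of that stage and of every stage that consumes it resolves from cache with a key that can be computed without touching any files. All names here (stagePlan, command, cacheProbe, LookaheadKey, canSkipUnpack) are hypothetical and only illustrate the idea; they are not kaniko's actual API.

```go
package lookahead

// Hypothetical sketch of a cross-stage cache lookahead; none of these types
// exist in kaniko, they only illustrate the idea described above.

// command is a single instruction whose cache key may or may not be
// computable a priori, i.e. without downloading or unpacking any files.
type command interface {
	// LookaheadKey returns the a-priori cache key, or false if the key
	// depends on file contents and cannot be known in advance.
	LookaheadKey() (string, bool)
}

// cacheProbe answers whether a given cache key already has a layer.
type cacheProbe interface {
	Has(key string) bool
}

// stagePlan is one build stage plus the stages that consume it via
// FROM or COPY --from.
type stagePlan struct {
	commands []command
	usedBy   []*stagePlan
}

// canSkipUnpack reports whether this stage and everything downstream of it
// resolve entirely from cache, in which case downloading and unpacking the
// stage's filesystem would be wasted work and can be skipped.
func canSkipUnpack(s *stagePlan, cache cacheProbe) bool {
	for _, cmd := range s.commands {
		key, ok := cmd.LookaheadKey()
		if !ok || !cache.Has(key) {
			return false
		}
	}
	for _, dep := range s.usedBy {
		if !canSkipUnpack(dep, cache) {
			return false
		}
	}
	return true
}
```

The recursion over usedBy is what makes the lookahead multistage: a stage may only be skipped if every stage depending on it is fully cached as well.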

The difficulty here is that our cache key currently depends on the file contents. This is the safe option: even if an upstream image or a multistage ancestor changes, as long as the file contents stay the same, we get a cache hit. The reverse is true too: we don't need to detect upstream changes, as we will notice them through the changed files. However, our lookahead needs to know this a priori, so it only works if the files are guaranteed to be the same. This is for example the case if you reference images by their sha256 digest; then it is guaranteed that the files will be identical after download. The same logic applies if you provide a checksum to COPY/ADD, which is not yet implemented in kaniko but would be a nice incentive to do so.
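As a sketch of how the "guaranteed to be the same" case could be detected, the snippet below checks whether a base image reference is pinned by digest using the go-containerregistry name package (a real kaniko dependency); isDigestPinned itself is a hypothetical helper, not existing kaniko code.

```go
package lookahead

import "github.com/google/go-containerregistry/pkg/name"

// isDigestPinned reports whether an image reference is pinned by digest,
// e.g. "ubuntu@sha256:...". For such references the pulled filesystem is
// guaranteed to be identical across builds, so a lookahead key derived from
// the reference alone is stable. Hypothetical helper for illustration only.
func isDigestPinned(image string) bool {
	ref, err := name.ParseReference(image)
	if err != nil {
		return false
	}
	_, ok := ref.(name.Digest)
	return ok
}
```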

We can know a priori whether a lookahead key will be stable, and we can fall back to file hashing when it is not. The only downside is that we would lose a cache hit if the a-priori key changed but the file contents did not. I'm not yet sure whether we can remedy this case, as it would basically require two references to the same cache layer.
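One conceivable remedy, sketched below purely as an assumption about how the cache could be organised (kaniko does not do this today): register the same layer under both its a-priori lookahead key and its content-derived key, and consult the cheap key first.

```go
package lookahead

// layerCache is a hypothetical cache that keeps two references to the same
// layer: one under the a-priori lookahead key, one under the content key.
type layerCache struct {
	byKey map[string]string // cache key -> layer digest
}

func newLayerCache() *layerCache {
	return &layerCache{byKey: map[string]string{}}
}

// Put registers the layer digest under every key that should resolve to it.
func (c *layerCache) Put(layerDigest string, keys ...string) {
	for _, k := range keys {
		c.byKey[k] = layerDigest
	}
}

// Get tries the cheap lookahead key first and falls back to the
// content-derived key, so a changed a-priori key with unchanged file
// contents would still produce a cache hit.
func (c *layerCache) Get(lookaheadKey, contentKey string) (string, bool) {
	if d, ok := c.byKey[lookaheadKey]; ok {
		return d, true
	}
	d, ok := c.byKey[contentKey]
	return d, ok
}
```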

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes unit tests
  • Adds integration tests if needed.

See the contribution guide for more details.

Reviewer Notes

  • The code flow looks good.
  • Unit tests and/or integration tests added.

Release Notes

Describe any changes here so maintainer can include it in the release notes, or delete this block.

Examples of user facing changes:
- kaniko adds a new flag `--registry-repo` to override registry

@mzihlmann changed the title from "implement cache lookahead" to "multistage cache lookahead" on Oct 10, 2025
@mzihlmann (Collaborator, Author) commented on Oct 11, 2025

hmmm... for FROM with remote images, i.e. FROM ubuntu, it's actually not too hard to implement. For chained FROMs, as in FROM base, I'm struggling a bit. The problem is the banana-jungle problem that you usually get with bloated interfaces: the stageBuilder needs to know everything, not only how to build the image but also how to push it eventually, etc. To keep the parameter set limited, these options are not exposed in the interface but inferred from the baseImage on disk. That was fair as long as everything was done in one go, but it becomes problematic now that we want to split it up into a lookahead preparation step and an actual build step.

I can work with stageIdxToCacheKey and only do my optimizations if I have a hit there, but I can't create a stageBuilder without having the image downloaded locally, as it infers all the meta-information for how to push the image from there.

@mzihlmann force-pushed the cache-lookahead branch 7 times, most recently from 277158d to a85bcb6 on October 25, 2025
@mzihlmann changed the title from "multistage cache lookahead" to "mz334: multistage cache lookahead" on Oct 25, 2025
@mzihlmann force-pushed the cache-lookahead branch 4 times, most recently from 9c5cfec to 6e616c0 on November 8, 2025
@mzihlmann (Collaborator, Author) commented:
YES! Multistage cache lookahead appears to work correctly. It's still a bit ugly, but now I know where to continue with the refactoring to make this change feel good.


```go
func squash(a, b *stageBuilder) *stageBuilder {
	acmds := filterOnBuild(a.cmds)
	return &stageBuilder{
```
@mzihlmann commented on Nov 8, 2025:
We should split stageBuilder into two parts. It currently has two responsibilities: it cares both about how to build and about what to build. I think the how can stay in stageBuilder, but the what should be a separate struct entirely. This new struct can then be passed into the build() function, so the same stageBuilder entity can crunch through all stages. Here in squash it becomes obvious: only the what parts are affected by the squashing; for the how parts, squashing is meaningless.
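A rough sketch of what that split could look like; every name below (stageWork, cacheProbe, imageConfig, ...) is hypothetical and only illustrates the what/how separation, it is not the current kaniko code.

```go
package executor

// Hypothetical sketch of the proposed split, not kaniko's current code.

type imageConfig struct{ Env []string }

type cacheProbe interface{ Has(key string) bool }

// command is a single Dockerfile instruction.
type command interface {
	Execute(cfg *imageConfig, cache cacheProbe) error
}

// stageWork is the "what": the commands of one stage and the config they
// produce. Squashing two stages only needs to merge two stageWork values.
type stageWork struct {
	cmds   []command
	config imageConfig
}

// stageBuilder keeps the "how": cache, registry and push settings that are
// identical for every stage, so a single instance can crunch through all
// stages.
type stageBuilder struct {
	cache       cacheProbe
	destination string
}

// build runs one unit of "what" with the shared "how".
func (sb *stageBuilder) build(w *stageWork) error {
	for _, c := range w.cmds {
		if err := c.Execute(&w.config, sb.cache); err != nil {
			return err
		}
	}
	return nil
}

// squash merges only the "what"; the "how" is untouched, which is exactly
// why squashing is meaningless for it.
func squash(a, b stageWork) stageWork {
	return stageWork{
		cmds:   append(append([]command{}, a.cmds...), b.cmds...),
		config: b.config,
	}
}
```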

```go
	}

	// Apply optimizations to the instructions.
	if err := sb.optimize(*compositeKey, sb.cf.Config); err != nil {
```
@mzihlmann commented on Nov 8, 2025:
We already do cache optimization here, because we want to run skip & squash over the cached commands. I left it as part of the build() function anyway, as I'm not yet sure whether it has any side effects on the compositeKey.

@mzihlmann (Collaborator, Author) commented on Nov 8, 2025

For a simple build like this

```dockerfile
FROM busybox AS base
RUN touch /blubb

FROM base AS final
COPY --from=base /blubb /blubb
RUN ls -lah /blubb
```

I get promising results already:

  • initial run (no caching): 5.485s
  • regular (per-stage) caching: 2.370s
  • multi-stage cache-lookahead: 0.537s

Of course this is a trivial example; in a real-world use case the expected savings will be far higher. But it is promising that the saving is already visible in such a trivial example.
