-
Notifications
You must be signed in to change notification settings - Fork 6.5k
feat: Add sparse checkout support for Git repositories to improve monorepo performance #25137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This commit adds test cases demonstrating the intended behavior for sparse checkout functionality in git monorepos. Tests cover: - Basic sparse checkout for single path (skipped until implemented) - Multiple sparse paths (skipped until implemented) - Backward compatibility when sparse checkout is disabled (passing) - Cross-platform file URL handling (Windows/Unix) - Wildcard patterns (skipped for future work) The passing test validates current behavior. Skipped tests document intended sparse checkout behavior and will be enabled as the feature is implemented. Relates to argoproj#11198 Signed-off-by: nwarinner <[email protected]>
- Add sparsePaths field to nativeGitClient - Implement WithSparse ClientOpts function for cone mode - Update tests to use new WithSparse API - Tests demonstrate intended behavior (currently skipped) This follows the design from issue argoproj#11198 discussion: - Per-application sparse checkout config - Cone mode for simplicity and performance - Native git CLI (go-git lacks sparse support) Next steps: implement Init() logic for sparse-checkout Relates to argoproj#11198 Signed-off-by: nwarinner <[email protected]>
Core implementation: - configureSparseCheckout() sets up cone mode before checkout - Uses git config and .git/info/sparse-checkout file - Checkout() applies sparse patterns to working tree - Works with both single and multiple directory paths All tests passing: - BasicPaths: only specified dirs are checked out - MultiplePaths: multiple sparse paths work correctly - Disabled: backward compatibility maintained This solves the 8GB monorepo clone issue by only pulling specified directories. Example: app in apps/frontend only clones that directory instead of the entire repo. Relates to argoproj#11198 Signed-off-by: nwarinner <[email protected]>
Adds sparseCheckoutPaths field to the Application CRD, allowing users
to specify which directories to include in sparse git checkout.
Usage example:
`yaml
spec:
source:
repoURL: https://github.com/org/monorepo
path: apps/frontend
directory:
sparseCheckoutPaths:
- apps/frontend
- shared/common
`
This CRD change enables per-application sparse checkout configuration,
solving the large monorepo performance problem at the API level.
- Field is optional (backward compatible)
- Uses protobuf tag for proper serialization
- Includes comprehensive documentation
- Validated with build tests
Next: Wire through repo-server to git client
Relates to argoproj#11198
Signed-off-by: nwarinner <[email protected]>
Connects the sparseCheckoutPaths CRD field to the git client implementation.
Changes:
- Read sparseCheckoutPaths from ApplicationSource.Directory
- Pass paths to git client via WithSparse() option
- Update resolveReferencedSources to accept variadic opts
- Add logging when sparse checkout is enabled
Now users can specify sparse paths in their Application YAML and
Argo CD will automatically use sparse checkout:
\\\yaml
spec:
source:
directory:
sparseCheckoutPaths:
- apps/frontend
- shared/common
\\\
This completes the end-to-end integration. The feature is now fully functional.
Relates to argoproj#11198
Signed-off-by: nwarinner <[email protected]>
…mance gains Combines sparse checkout with partial clone (--filter=blob:none) to dramatically reduce both clone time and disk usage for monorepo scenarios. Performance (Argo CD repo, single 5-file directory): - Full clone: 15.28s, 184.52 MB - Sparse + partial: 5.32s, 53.63 MB - Improvement: 3x faster, 71% less disk This automatically enables partial clone when sparse checkout paths are configured, avoiding the download of blob objects outside sparse paths while preserving commit history for features like ChangedFiles() and revision metadata. Critical for repo-server managing hundreds of apps from large monorepos. Scales linearly with number of apps from same repo. Relates to argoproj#11198 Signed-off-by: nwarinner <[email protected]>
🔴 Preview Environment stopped on BunnyshellSee: Environment Details | Pipeline Logs Available commands (reply to this comment):
|
- Use new octal literal style (0o755, 0o644) - Add t.Helper() to test helper function - Fix gofumpt formatting (blank line spacing) - Follows gocritic, gofumpt, and thelper linter rules Signed-off-by: nwarinner <[email protected]>
9920f7a to
2e4e037
Compare
Signed-off-by: nwarinner <[email protected]>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #25137 +/- ##
=========================================
Coverage ? 62.28%
=========================================
Files ? 351
Lines ? 49264
Branches ? 0
=========================================
Hits ? 30684
Misses ? 15653
Partials ? 2927 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! At first blush this looks pretty inline with the implementation I was working on. I'll review more tomorrow.
Did you handle caching keys and what is the approach?
I think manifest generation calls
looks like the I think that with two sources: JSON marshaling should produce: |
|
Oooh i see what you mean; the manifest cache keys will be ok but the repo cache path keys are normalized.. I have not worked on that. Let me see if can take a swing at it 🤞 |
the repository clone cache was keyed only on the normalized repository URL. This could cause applications using the same repository but different sparse checkout paths to share the same working directory, leading to sparse checkout configuration conflicts where the .git/info/sparse-checkout file would be overwritten. This commit fixes the issue by: 1. Adding GetSparseCheckoutKey() helper that generates a deterministic cache key component from sparse checkout paths. Paths are sorted alphabetically to ensure consistency regardless of input order. 2. Modifying newClient() to include sparse checkout paths in the repository cache key using format: "repo-url|sparse:path1,path2" 3. Updating runRepoOperation() to extract sparse paths from the application source and pass them through to client creation. 4. Adding tests for cache key generation and sparse checkout behavior. Applications without sparse checkout continue using the same cache keys (no suffix), maintaining backwards compatibility. Fixes cache interference between applications with different sparse checkout configurations while preserving manifest cache behavior which already correctly includes sparse paths. Signed-off-by: nwarinner <[email protected]>
7fb4a17 to
fd6dbf0
Compare
|
I was actually looking at this during the weekend. Since you beat me to the chase I do have a few things to point out. |
| if remotes, err := repo.Remotes(); err == nil && len(remotes) > 0 && len(remotes[0].Config().URLs) > 0 { | ||
| s.gitRepoPaths.Add(git.NormalizeGitURL(remotes[0].Config().URLs[0]), fullPath) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to consider the cache key here as well when it comes to sparse checkouts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oooohhhh.. the Init() doesn't include sparse paths when restoring repo paths on startup. I added sparse path handling in newClientWithSparseCheckoutPaths() for normal operations, but didn't think about the startup case.
im not actually sure what the best way to fix this is. should Init() just skip restoring sparse paths, or is there a way to persist the sparse path info with the directories?
| func (s *Service) ListApps(ctx context.Context, q *apiclient.ListAppsRequest) (*apiclient.AppList, error) { | ||
| gitClient, commitSHA, err := s.newClientResolveRevision(q.Repo, q.Revision) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("error setting up git client and resolving given revision: %w", err) | ||
| } | ||
| if apps, err := s.cache.ListApps(q.Repo.Repo, commitSHA); err == nil { | ||
| log.Infof("cache hit: %s/%s", q.Repo.Repo, q.Revision) | ||
| return &apiclient.AppList{Apps: apps}, nil | ||
| } | ||
|
|
||
| s.metricsServer.IncPendingRepoRequest(q.Repo.Repo) | ||
| defer s.metricsServer.DecPendingRepoRequest(q.Repo.Repo) | ||
|
|
||
| closer, err := s.repoLock.Lock(gitClient.Root(), commitSHA, true, func() (goio.Closer, error) { | ||
| return s.checkoutRevision(gitClient, commitSHA, s.initConstants.SubmoduleEnabled) | ||
| }) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("error acquiring repository lock: %w", err) | ||
| } | ||
|
|
||
| defer utilio.Close(closer) | ||
| apps, err := discovery.Discover(ctx, gitClient.Root(), gitClient.Root(), q.EnabledSourceTypes, s.initConstants.CMPTarExcludedGlobs, []string{}) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("error discovering applications: %w", err) | ||
| } | ||
| err = s.cache.SetApps(q.Repo.Repo, commitSHA, apps) | ||
| if err != nil { | ||
| log.Warnf("cache set error %s/%s: %v", q.Repo.Repo, commitSHA, err) | ||
| } | ||
| res := apiclient.AppList{Apps: apps} | ||
| return &res, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what to do when it comes to listing apps, but we need to consider what happens here if we have multiple checked out paths
| func (s *Service) GetGitFiles(_ context.Context, request *apiclient.GitFilesRequest) (*apiclient.GitFilesResponse, error) { | ||
| repo := request.GetRepo() | ||
| revision := request.GetRevision() | ||
| gitPath := request.GetPath() | ||
| noRevisionCache := request.GetNoRevisionCache() | ||
| enableNewGitFileGlobbing := request.GetNewGitFileGlobbingEnabled() | ||
| if gitPath == "" { | ||
| gitPath = "." | ||
| } | ||
|
|
||
| if repo == nil { | ||
| return nil, status.Error(codes.InvalidArgument, "must pass a valid repo") | ||
| } | ||
|
|
||
| gitClient, revision, err := s.newClientResolveRevision(repo, revision, git.WithCache(s.cache, !noRevisionCache)) | ||
| if err != nil { | ||
| return nil, status.Errorf(codes.Internal, "unable to resolve git revision %s: %v", revision, err) | ||
| } | ||
|
|
||
| if err := verifyCommitSignature(request.VerifyCommit, gitClient, revision, repo); err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| // check the cache and return the results if present | ||
| if cachedFiles, err := s.cache.GetGitFiles(repo.Repo, revision, gitPath); err == nil { | ||
| log.Debugf("cache hit for repo: %s revision: %s pattern: %s", repo.Repo, revision, gitPath) | ||
| return &apiclient.GitFilesResponse{ | ||
| Map: cachedFiles, | ||
| }, nil | ||
| } | ||
|
|
||
| s.metricsServer.IncPendingRepoRequest(repo.Repo) | ||
| defer s.metricsServer.DecPendingRepoRequest(repo.Repo) | ||
|
|
||
| // cache miss, generate the results | ||
| closer, err := s.repoLock.Lock(gitClient.Root(), revision, true, func() (goio.Closer, error) { | ||
| return s.checkoutRevision(gitClient, revision, request.GetSubmoduleEnabled()) | ||
| }) | ||
| if err != nil { | ||
| return nil, status.Errorf(codes.Internal, "unable to checkout git repo %s with revision %s pattern %s: %v", repo.Repo, revision, gitPath, err) | ||
| } | ||
| defer utilio.Close(closer) | ||
|
|
||
| gitFiles, err := gitClient.LsFiles(gitPath, enableNewGitFileGlobbing) | ||
| if err != nil { | ||
| return nil, status.Errorf(codes.Internal, "unable to list files. repo %s with revision %s pattern %s: %v", repo.Repo, revision, gitPath, err) | ||
| } | ||
| log.Debugf("listed %d git files from %s under %s", len(gitFiles), repo.Repo, gitPath) | ||
|
|
||
| res := make(map[string][]byte) | ||
| for _, filePath := range gitFiles { | ||
| fileContents, err := os.ReadFile(filepath.Join(gitClient.Root(), filePath)) | ||
| if err != nil { | ||
| return nil, status.Errorf(codes.Internal, "unable to read files. repo %s with revision %s pattern %s: %v", repo.Repo, revision, gitPath, err) | ||
| } | ||
| res[filePath] = fileContents | ||
| } | ||
|
|
||
| err = s.cache.SetGitFiles(repo.Repo, revision, gitPath, res) | ||
| if err != nil { | ||
| log.Warnf("error caching git files for repo %s with revision %s pattern %s: %v", repo.Repo, revision, gitPath, err) | ||
| } | ||
|
|
||
| return &apiclient.GitFilesResponse{ | ||
| Map: res, | ||
| }, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called by the applicationset controller (used by the git generator). Would be nice to consider if it should be sparse checked out or not.
| func (s *Service) GetGitDirectories(_ context.Context, request *apiclient.GitDirectoriesRequest) (*apiclient.GitDirectoriesResponse, error) { | ||
| repo := request.GetRepo() | ||
| revision := request.GetRevision() | ||
| noRevisionCache := request.GetNoRevisionCache() | ||
| if repo == nil { | ||
| return nil, status.Error(codes.InvalidArgument, "must pass a valid repo") | ||
| } | ||
|
|
||
| gitClient, revision, err := s.newClientResolveRevision(repo, revision, git.WithCache(s.cache, !noRevisionCache)) | ||
| if err != nil { | ||
| return nil, status.Errorf(codes.Internal, "unable to resolve git revision %s: %v", revision, err) | ||
| } | ||
|
|
||
| if err := verifyCommitSignature(request.VerifyCommit, gitClient, revision, repo); err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| // check the cache and return the results if present | ||
| if cachedPaths, err := s.cache.GetGitDirectories(repo.Repo, revision); err == nil { | ||
| log.Debugf("cache hit for repo: %s revision: %s", repo.Repo, revision) | ||
| return &apiclient.GitDirectoriesResponse{ | ||
| Paths: cachedPaths, | ||
| }, nil | ||
| } | ||
|
|
||
| s.metricsServer.IncPendingRepoRequest(repo.Repo) | ||
| defer s.metricsServer.DecPendingRepoRequest(repo.Repo) | ||
|
|
||
| // cache miss, generate the results | ||
| closer, err := s.repoLock.Lock(gitClient.Root(), revision, true, func() (goio.Closer, error) { | ||
| return s.checkoutRevision(gitClient, revision, request.GetSubmoduleEnabled()) | ||
| }) | ||
| if err != nil { | ||
| return nil, status.Errorf(codes.Internal, "unable to checkout git repo %s with revision %s: %v", repo.Repo, revision, err) | ||
| } | ||
| defer utilio.Close(closer) | ||
|
|
||
| repoRoot := gitClient.Root() | ||
| var paths []string | ||
| if err := filepath.WalkDir(repoRoot, func(path string, entry fs.DirEntry, fnErr error) error { | ||
| if fnErr != nil { | ||
| return fmt.Errorf("error walking the file tree: %w", fnErr) | ||
| } | ||
| if !entry.IsDir() { // Skip files: directories only | ||
| return nil | ||
| } | ||
|
|
||
| if !s.initConstants.IncludeHiddenDirectories && strings.HasPrefix(entry.Name(), ".") { | ||
| return filepath.SkipDir // Skip hidden directory | ||
| } | ||
|
|
||
| relativePath, err := filepath.Rel(repoRoot, path) | ||
| if err != nil { | ||
| return fmt.Errorf("error constructing relative repo path: %w", err) | ||
| } | ||
|
|
||
| if relativePath == "." { // Exclude '.' from results | ||
| return nil | ||
| } | ||
|
|
||
| paths = append(paths, relativePath) | ||
|
|
||
| return nil | ||
| }); err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| log.Debugf("found %d git paths from %s", len(paths), repo.Repo) | ||
| err = s.cache.SetGitDirectories(repo.Repo, revision, paths) | ||
| if err != nil { | ||
| log.Warnf("error caching git directories for repo %s with revision %s: %v", repo.Repo, revision, err) | ||
| } | ||
|
|
||
| return &apiclient.GitDirectoriesResponse{ | ||
| Paths: paths, | ||
| }, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing applies here as for GetGitFiles
Co-authored-by: Papapetrou Patroklos <[email protected]> Signed-off-by: N W <[email protected]>
…eckout cache keys - Add constants for repeated test path strings - Use inline require.NoError for cleaner error checking - Add getCommitSHA helper function to reduce code duplication - Hash sparse checkout paths for manageable cache keys - Normalize and trim sparse paths for consistency - Update tests to work with hashed cache key format Addresses review comments from ppapapetrou76 and blakepettersson
Resolve conflict in fetch function by combining: - New depth parameter from master for shallow clone support - Sparse checkout --filter=blob:none logic for partial clones Both features now work together.
The cache key for ListApps wasnt including sparse checkout paths, so theres a chance a user might get cache hits across different sparse configs and see the wrong apps. Added sparseCheckoutPaths to the request proto and included it in the cache key alongside the commit SHA. Also regenerated proto files and manifests - looks like that's the pattern for proto changes in this repo? Signed-off-by: NW <[email protected]> Signed-off-by: nwarinner <[email protected]>
Summary
This PR introduces sparse checkout support for Git repositories in Argo CD, dramatically improving performance for large monorepos by allowing users to clone only specific directories they need.
Motivation
Large monorepos can cause significant performance issues in Argo CD:
Sparse checkout solves this by cloning only the specified directories, reducing both clone time and disk usage by 90%+ in typical monorepo scenarios.
Changes
Core Implementation
util/git/client.go)WithSparse(paths []string)option for git client configurationInit()andCheckout()to configure and apply sparse checkout when paths are providedCRD Extension
sparseCheckoutPathsfield toApplicationSourceDirectoryin CRD schemaRepo-Server Integration
reposerver/repository/repository.goPartial Clone Support
--filter=blob:none) when sparse checkout is configuredChangedFiles()and revision metadataRepository Cache Key Isolation
.git/info/sparse-checkoutconfiguration conflicts between apps using the same reponormalized-repo-url|sparse:path1,path2(sorted alphabetically for consistency)Testing
util/git/sparse_checkout_test.goutil/git/cache_key_test.goExample Usage