Skip to content

Conversation

@nwarinner
Copy link

@nwarinner nwarinner commented Nov 2, 2025

Summary

This PR introduces sparse checkout support for Git repositories in Argo CD, dramatically improving performance for large monorepos by allowing users to clone only specific directories they need.

Motivation

Large monorepos can cause significant performance issues in Argo CD:

  • Clone operations can take 5+ minutes for repos over 10GB
  • Disk usage grows linearly with repo size
  • Sync times increase proportionally

Sparse checkout solves this by cloning only the specified directories, reducing both clone time and disk usage by 90%+ in typical monorepo scenarios.

Changes

Core Implementation

  • Added cone-mode sparse checkout support to the git client (util/git/client.go)
  • Introduced WithSparse(paths []string) option for git client configuration
  • Modified Init() and Checkout() to configure and apply sparse checkout when paths are provided

CRD Extension

  • Added sparseCheckoutPaths field to ApplicationSourceDirectory in CRD schema
  • Field is optional (empty array or nil means full checkout, preserving backward compatibility)
  • Users can specify directory paths relative to repo root

Repo-Server Integration

  • Wired sparse checkout from CRD through to git client in reposerver/repository/repository.go
  • Added logic to detect sparse checkout paths from ApplicationSource and apply them via git client options
  • Propagated sparse options through referenced source resolution

Partial Clone Support

  • Automatically enables partial clone (--filter=blob:none) when sparse checkout is configured
  • Avoids downloading blob objects for files outside sparse paths
  • Preserves commit history for features like ChangedFiles() and revision metadata

Repository Cache Key Isolation

  • Fixed: Repository clone cache now includes sparse checkout paths in the cache key
  • Applications with different sparse checkout paths get isolated working directories
  • Prevents .git/info/sparse-checkout configuration conflicts between apps using the same repo
  • Format: normalized-repo-url|sparse:path1,path2 (sorted alphabetically for consistency)
  • Backward compatible: apps without sparse checkout continue using original cache keys

Testing

  • Some test coverage in util/git/sparse_checkout_test.go
  • Cache key generation tests in util/git/cache_key_test.go
  • Tests verify sparse checkout configuration, file visibility, and cache isolation
  • Benchmark tests demonstrate real-world performance improvements
  • All existing tests pass, confirming backward compatibility

Example Usage

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  source:
    repoURL: https://github.com/large-org/monorepo
    path: apps/frontend
    directory:
      sparseCheckoutPaths:
        - apps/frontend
        - shared/common

This commit adds test cases demonstrating the intended behavior
for sparse checkout functionality in git monorepos.

Tests cover:
- Basic sparse checkout for single path (skipped until implemented)
- Multiple sparse paths (skipped until implemented)
- Backward compatibility when sparse checkout is disabled (passing)
- Cross-platform file URL handling (Windows/Unix)
- Wildcard patterns (skipped for future work)

The passing test validates current behavior. Skipped tests
document intended sparse checkout behavior and will be enabled
as the feature is implemented.

Relates to argoproj#11198

Signed-off-by: nwarinner <[email protected]>
- Add sparsePaths field to nativeGitClient
- Implement WithSparse ClientOpts function for cone mode
- Update tests to use new WithSparse API
- Tests demonstrate intended behavior (currently skipped)

This follows the design from issue argoproj#11198 discussion:
- Per-application sparse checkout config
- Cone mode for simplicity and performance
- Native git CLI (go-git lacks sparse support)

Next steps: implement Init() logic for sparse-checkout

Relates to argoproj#11198

Signed-off-by: nwarinner <[email protected]>
Core implementation:
- configureSparseCheckout() sets up cone mode before checkout
- Uses git config and .git/info/sparse-checkout file
- Checkout() applies sparse patterns to working tree
- Works with both single and multiple directory paths

All tests passing:
- BasicPaths: only specified dirs are checked out
- MultiplePaths: multiple sparse paths work correctly
- Disabled: backward compatibility maintained

This solves the 8GB monorepo clone issue by only pulling
specified directories. Example: app in apps/frontend only
clones that directory instead of the entire repo.

Relates to argoproj#11198

Signed-off-by: nwarinner <[email protected]>
Adds sparseCheckoutPaths field to the Application CRD, allowing users
to specify which directories to include in sparse git checkout.

Usage example:
`yaml
spec:
  source:
    repoURL: https://github.com/org/monorepo
    path: apps/frontend
    directory:
      sparseCheckoutPaths:
        - apps/frontend
        - shared/common
`

This CRD change enables per-application sparse checkout configuration,
solving the large monorepo performance problem at the API level.

- Field is optional (backward compatible)
- Uses protobuf tag for proper serialization
- Includes comprehensive documentation
- Validated with build tests

Next: Wire through repo-server to git client

Relates to argoproj#11198

Signed-off-by: nwarinner <[email protected]>
Connects the sparseCheckoutPaths CRD field to the git client implementation.

Changes:
- Read sparseCheckoutPaths from ApplicationSource.Directory
- Pass paths to git client via WithSparse() option
- Update resolveReferencedSources to accept variadic opts
- Add logging when sparse checkout is enabled

Now users can specify sparse paths in their Application YAML and
Argo CD will automatically use sparse checkout:

\\\yaml
spec:
  source:
    directory:
      sparseCheckoutPaths:
        - apps/frontend
        - shared/common
\\\

This completes the end-to-end integration. The feature is now fully functional.

Relates to argoproj#11198

Signed-off-by: nwarinner <[email protected]>
…mance gains

Combines sparse checkout with partial clone (--filter=blob:none) to
dramatically reduce both clone time and disk usage for monorepo scenarios.

Performance (Argo CD repo, single 5-file directory):
- Full clone: 15.28s, 184.52 MB
- Sparse + partial: 5.32s, 53.63 MB
- Improvement: 3x faster, 71% less disk

This automatically enables partial clone when sparse checkout paths
are configured, avoiding the download of blob objects outside sparse
paths while preserving commit history for features like ChangedFiles()
and revision metadata.

Critical for repo-server managing hundreds of apps from large monorepos.
Scales linearly with number of apps from same repo.

Relates to argoproj#11198

Signed-off-by: nwarinner <[email protected]>
@nwarinner nwarinner requested a review from a team as a code owner November 2, 2025 06:24
@bunnyshell
Copy link

bunnyshell bot commented Nov 2, 2025

🔴 Preview Environment stopped on Bunnyshell

See: Environment Details | Pipeline Logs

Available commands (reply to this comment):

  • 🔵 /bns:start to start the environment
  • 🚀 /bns:deploy to redeploy the environment
  • /bns:delete to remove the environment

- Use new octal literal style (0o755, 0o644)
- Add t.Helper() to test helper function
- Fix gofumpt formatting (blank line spacing)
- Follows gocritic, gofumpt, and thelper linter rules

Signed-off-by: nwarinner <[email protected]>
@nwarinner nwarinner force-pushed the feature/sparse-checkout-tdd branch from 9920f7a to 2e4e037 Compare November 2, 2025 07:07
@codecov
Copy link

codecov bot commented Nov 2, 2025

Codecov Report

❌ Patch coverage is 67.74194% with 20 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@b91c191). Learn more about missing BASE report.

Files with missing lines Patch % Lines
util/git/client.go 65.85% 7 Missing and 7 partials ⚠️
reposerver/repository/repository.go 71.42% 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #25137   +/-   ##
=========================================
  Coverage          ?   62.28%           
=========================================
  Files             ?      351           
  Lines             ?    49264           
  Branches          ?        0           
=========================================
  Hits              ?    30684           
  Misses            ?    15653           
  Partials          ?     2927           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@todaywasawesome todaywasawesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! At first blush this looks pretty inline with the implementation I was working on. I'll review more tomorrow.

Did you handle caching keys and what is the approach?

@nwarinner
Copy link
Author

nwarinner commented Nov 2, 2025

@todaywasawesome

Great work! At first blush this looks pretty inline with the implementation I was working on. I'll review more tomorrow.

Did you handle caching keys and what is the approach?

I think manifest generation calls manifestCacheKey in cache.go:326
manifestCacheKey(revision, appSrc, srcRefs, namespace, trackingMethod, appLabelKey, appName, clusterInfo, refSourceCommitSHAs, installationID)

manifestCacheKey() includes appSourceKey(appSrc, srcRefs, refSourceCommitSHAs) at line 332

looks like the appSourceKey and appSourceKeyJSON functions might already include SparseCheckoutPaths because the entire ApplicationSource (which contains Directory.SparseCheckoutPaths) gets marshaled into the cache key via appSourceKeyJSON(). which grabs the ENTIRE applicationSource struct as JSON? I might need a second pair of eyes on that tho. I thought (maybe incorrectly) that the sparse checkout paths would result in differrent cache keys and maybe prevent cache pollution.

I think that with two sources:
• Source A: {path: "apps", directory: {recurse: true}}
• Source B: {path: "apps", directory: {recurse: true, sparseCheckoutPaths: ["apps/frontend"]}}

JSON marshaling should produce:
• Source A JSON: {"path":"apps","directory":{"recurse":true}}
• Source B JSON: {"path":"apps","directory":{"recurse":true,"sparseCheckoutPaths":["apps/frontend"]}}

@nwarinner
Copy link
Author

@todaywasawesome ,

Oooh i see what you mean; the manifest cache keys will be ok but the repo cache path keys are normalized.. I have not worked on that. Let me see if can take a swing at it 🤞

the repository clone cache was keyed only on the normalized
repository URL. This could cause applications using the same repository but
different sparse checkout paths to share the same working directory,
leading to sparse checkout configuration conflicts where the
.git/info/sparse-checkout file would be overwritten.

This commit fixes the issue by:

1. Adding GetSparseCheckoutKey() helper that generates a deterministic
   cache key component from sparse checkout paths. Paths are sorted
   alphabetically to ensure consistency regardless of input order.

2. Modifying newClient() to include sparse checkout paths in the
   repository cache key using format: "repo-url|sparse:path1,path2"

3. Updating runRepoOperation() to extract sparse paths from the
   application source and pass them through to client creation.

4. Adding tests for cache key generation and sparse
   checkout behavior.

Applications without sparse checkout continue using the same cache keys
(no suffix), maintaining backwards compatibility.

Fixes cache interference between applications with different sparse
checkout configurations while preserving manifest cache behavior which
already correctly includes sparse paths.

Signed-off-by: nwarinner <[email protected]>
@nwarinner nwarinner force-pushed the feature/sparse-checkout-tdd branch 2 times, most recently from 7fb4a17 to fd6dbf0 Compare November 2, 2025 23:23
@blakepettersson
Copy link
Member

I was actually looking at this during the weekend. Since you beat me to the chase I do have a few things to point out.

Comment on lines 180 to 182
if remotes, err := repo.Remotes(); err == nil && len(remotes) > 0 && len(remotes[0].Config().URLs) > 0 {
s.gitRepoPaths.Add(git.NormalizeGitURL(remotes[0].Config().URLs[0]), fullPath)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to consider the cache key here as well when it comes to sparse checkouts

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oooohhhh.. the Init() doesn't include sparse paths when restoring repo paths on startup. I added sparse path handling in newClientWithSparseCheckoutPaths() for normal operations, but didn't think about the startup case.

im not actually sure what the best way to fix this is. should Init() just skip restoring sparse paths, or is there a way to persist the sparse path info with the directories?

Comment on lines 236 to 267
func (s *Service) ListApps(ctx context.Context, q *apiclient.ListAppsRequest) (*apiclient.AppList, error) {
gitClient, commitSHA, err := s.newClientResolveRevision(q.Repo, q.Revision)
if err != nil {
return nil, fmt.Errorf("error setting up git client and resolving given revision: %w", err)
}
if apps, err := s.cache.ListApps(q.Repo.Repo, commitSHA); err == nil {
log.Infof("cache hit: %s/%s", q.Repo.Repo, q.Revision)
return &apiclient.AppList{Apps: apps}, nil
}

s.metricsServer.IncPendingRepoRequest(q.Repo.Repo)
defer s.metricsServer.DecPendingRepoRequest(q.Repo.Repo)

closer, err := s.repoLock.Lock(gitClient.Root(), commitSHA, true, func() (goio.Closer, error) {
return s.checkoutRevision(gitClient, commitSHA, s.initConstants.SubmoduleEnabled)
})
if err != nil {
return nil, fmt.Errorf("error acquiring repository lock: %w", err)
}

defer utilio.Close(closer)
apps, err := discovery.Discover(ctx, gitClient.Root(), gitClient.Root(), q.EnabledSourceTypes, s.initConstants.CMPTarExcludedGlobs, []string{})
if err != nil {
return nil, fmt.Errorf("error discovering applications: %w", err)
}
err = s.cache.SetApps(q.Repo.Repo, commitSHA, apps)
if err != nil {
log.Warnf("cache set error %s/%s: %v", q.Repo.Repo, commitSHA, err)
}
res := apiclient.AppList{Apps: apps}
return &res, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what to do when it comes to listing apps, but we need to consider what happens here if we have multiple checked out paths

Comment on lines 2881 to 2947
func (s *Service) GetGitFiles(_ context.Context, request *apiclient.GitFilesRequest) (*apiclient.GitFilesResponse, error) {
repo := request.GetRepo()
revision := request.GetRevision()
gitPath := request.GetPath()
noRevisionCache := request.GetNoRevisionCache()
enableNewGitFileGlobbing := request.GetNewGitFileGlobbingEnabled()
if gitPath == "" {
gitPath = "."
}

if repo == nil {
return nil, status.Error(codes.InvalidArgument, "must pass a valid repo")
}

gitClient, revision, err := s.newClientResolveRevision(repo, revision, git.WithCache(s.cache, !noRevisionCache))
if err != nil {
return nil, status.Errorf(codes.Internal, "unable to resolve git revision %s: %v", revision, err)
}

if err := verifyCommitSignature(request.VerifyCommit, gitClient, revision, repo); err != nil {
return nil, err
}

// check the cache and return the results if present
if cachedFiles, err := s.cache.GetGitFiles(repo.Repo, revision, gitPath); err == nil {
log.Debugf("cache hit for repo: %s revision: %s pattern: %s", repo.Repo, revision, gitPath)
return &apiclient.GitFilesResponse{
Map: cachedFiles,
}, nil
}

s.metricsServer.IncPendingRepoRequest(repo.Repo)
defer s.metricsServer.DecPendingRepoRequest(repo.Repo)

// cache miss, generate the results
closer, err := s.repoLock.Lock(gitClient.Root(), revision, true, func() (goio.Closer, error) {
return s.checkoutRevision(gitClient, revision, request.GetSubmoduleEnabled())
})
if err != nil {
return nil, status.Errorf(codes.Internal, "unable to checkout git repo %s with revision %s pattern %s: %v", repo.Repo, revision, gitPath, err)
}
defer utilio.Close(closer)

gitFiles, err := gitClient.LsFiles(gitPath, enableNewGitFileGlobbing)
if err != nil {
return nil, status.Errorf(codes.Internal, "unable to list files. repo %s with revision %s pattern %s: %v", repo.Repo, revision, gitPath, err)
}
log.Debugf("listed %d git files from %s under %s", len(gitFiles), repo.Repo, gitPath)

res := make(map[string][]byte)
for _, filePath := range gitFiles {
fileContents, err := os.ReadFile(filepath.Join(gitClient.Root(), filePath))
if err != nil {
return nil, status.Errorf(codes.Internal, "unable to read files. repo %s with revision %s pattern %s: %v", repo.Repo, revision, gitPath, err)
}
res[filePath] = fileContents
}

err = s.cache.SetGitFiles(repo.Repo, revision, gitPath, res)
if err != nil {
log.Warnf("error caching git files for repo %s with revision %s pattern %s: %v", repo.Repo, revision, gitPath, err)
}

return &apiclient.GitFilesResponse{
Map: res,
}, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called by the applicationset controller (used by the git generator). Would be nice to consider if it should be sparse checked out or not.

Comment on lines 2969 to 3045
func (s *Service) GetGitDirectories(_ context.Context, request *apiclient.GitDirectoriesRequest) (*apiclient.GitDirectoriesResponse, error) {
repo := request.GetRepo()
revision := request.GetRevision()
noRevisionCache := request.GetNoRevisionCache()
if repo == nil {
return nil, status.Error(codes.InvalidArgument, "must pass a valid repo")
}

gitClient, revision, err := s.newClientResolveRevision(repo, revision, git.WithCache(s.cache, !noRevisionCache))
if err != nil {
return nil, status.Errorf(codes.Internal, "unable to resolve git revision %s: %v", revision, err)
}

if err := verifyCommitSignature(request.VerifyCommit, gitClient, revision, repo); err != nil {
return nil, err
}

// check the cache and return the results if present
if cachedPaths, err := s.cache.GetGitDirectories(repo.Repo, revision); err == nil {
log.Debugf("cache hit for repo: %s revision: %s", repo.Repo, revision)
return &apiclient.GitDirectoriesResponse{
Paths: cachedPaths,
}, nil
}

s.metricsServer.IncPendingRepoRequest(repo.Repo)
defer s.metricsServer.DecPendingRepoRequest(repo.Repo)

// cache miss, generate the results
closer, err := s.repoLock.Lock(gitClient.Root(), revision, true, func() (goio.Closer, error) {
return s.checkoutRevision(gitClient, revision, request.GetSubmoduleEnabled())
})
if err != nil {
return nil, status.Errorf(codes.Internal, "unable to checkout git repo %s with revision %s: %v", repo.Repo, revision, err)
}
defer utilio.Close(closer)

repoRoot := gitClient.Root()
var paths []string
if err := filepath.WalkDir(repoRoot, func(path string, entry fs.DirEntry, fnErr error) error {
if fnErr != nil {
return fmt.Errorf("error walking the file tree: %w", fnErr)
}
if !entry.IsDir() { // Skip files: directories only
return nil
}

if !s.initConstants.IncludeHiddenDirectories && strings.HasPrefix(entry.Name(), ".") {
return filepath.SkipDir // Skip hidden directory
}

relativePath, err := filepath.Rel(repoRoot, path)
if err != nil {
return fmt.Errorf("error constructing relative repo path: %w", err)
}

if relativePath == "." { // Exclude '.' from results
return nil
}

paths = append(paths, relativePath)

return nil
}); err != nil {
return nil, err
}

log.Debugf("found %d git paths from %s", len(paths), repo.Repo)
err = s.cache.SetGitDirectories(repo.Repo, revision, paths)
if err != nil {
log.Warnf("error caching git directories for repo %s with revision %s: %v", repo.Repo, revision, err)
}

return &apiclient.GitDirectoriesResponse{
Paths: paths,
}, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing applies here as for GetGitFiles

nwarinner and others added 4 commits November 5, 2025 16:04
Co-authored-by: Papapetrou Patroklos <[email protected]>
Signed-off-by: N W <[email protected]>
…eckout cache keys

- Add constants for repeated test path strings
- Use inline require.NoError for cleaner error checking
- Add getCommitSHA helper function to reduce code duplication
- Hash sparse checkout paths for manageable cache keys
- Normalize and trim sparse paths for consistency
- Update tests to work with hashed cache key format

Addresses review comments from ppapapetrou76 and blakepettersson
Resolve conflict in fetch function by combining:
- New depth parameter from master for shallow clone support
- Sparse checkout --filter=blob:none logic for partial clones

Both features now work together.
The cache key for ListApps wasnt including sparse checkout paths,
so theres a chance a user might get cache hits across different sparse configs and see the wrong apps.

Added sparseCheckoutPaths to the request proto and included it in the
cache key alongside the commit SHA.

Also regenerated proto files and manifests - looks like that's the pattern
for proto changes in this repo?

Signed-off-by: NW <[email protected]>
Signed-off-by: nwarinner <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants