Skip to content

feat(fs): use git commit hash as cache key for clean repositories#8278

Merged
knqyf263 merged 14 commits into
aquasecurity:mainfrom
knqyf263:feat/use_git_commit
Jan 27, 2025
Merged

feat(fs): use git commit hash as cache key for clean repositories#8278
knqyf263 merged 14 commits into
aquasecurity:mainfrom
knqyf263:feat/use_git_commit

Conversation

@knqyf263
Copy link
Copy Markdown
Collaborator

@knqyf263 knqyf263 commented Jan 22, 2025

Overview

Improve cache efficiency for Git repositories by using commit hash as cache key when the repository is in a clean state.

Description

When scanning a Git repository that is in a clean state, this change uses the commit hash as the cache key instead of calculating it from UUID. This allows for better cache reuse and prevents unnecessary cache deletion.

Currently, the file system scanner uses UUID to calculate the cache key, which effectively does not work as a cache key and is deleted after scanning with DeleteBlobs. However, if the directory is a Git repository, we can use the commit hash as a cache key, eliminating the need to delete the cache after scanning.

Key changes:

  • Add Git repository detection and status check
    • For Git repositories, check if the repository is dirty
      • Use commit hash as cache key for clean repositories
      • Use UUID for dirty repositories (existing behavior)
    • Skip cache deletion for clean repositories
  • Add comprehensive tests for both clean and dirty repository states

Related Issues

Checklist

  • I have read the guidelines for contributing to this repository.
  • I have followed the conventions in the PR title.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have updated the documentation with the relevant information (if needed).

When scanning a git repository that is in a clean state, use the commit hash as the cache key instead of calculating it from blob info. This allows for better cache reuse and prevents unnecessary cache deletion.
@knqyf263 knqyf263 self-assigned this Jan 22, 2025
- Introduced a new fixture in `internal/gittest/testdata/fixture.go` to clone a Git repository for unit tests.
- Updated the `magefiles/magefile.go` to include the new fixture in the unit test dependencies.
- Added the test repository path to `.gitignore` to prevent it from being tracked.
- Added NewServerWithRepository function to create a git server with an existing repository, including cloning and fetching all branches and tags.
- Updated NewTestServer to check for the existence of the test repository and provide a clear error message if not found.
- Modified the cloning process in fixture.go to include all branches and tags.
- Removed unused test files.
Signed-off-by: knqyf263 <knqyf263@gmail.com>
Add debug logging when using the latest git commit hash for calculating the artifact cache key, providing more visibility into the cache key generation process.
Signed-off-by: knqyf263 <knqyf263@gmail.com>
Signed-off-by: knqyf263 <knqyf263@gmail.com>
@knqyf263 knqyf263 marked this pull request as ready for review January 24, 2025 08:51
Copy link
Copy Markdown
Contributor

@DmitriyLewen DmitriyLewen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

But i think we need to update docs:

  • fs may save/extract cache for clean repositories
  • by default we don't save cache for repositories (--cache-backend memory is default for fs/repo mode)
  • it works by default for client/server mode
  • cache only works if the repository directory is the target of Trivy scan (but maybe that's redundant)

art.logger.Debug("Using the latest commit hash for calculating cache key", log.String("commit_hash", hash))
art.commitHash = hash
} else {
art.logger.Debug("Random cache key will be used", log.Err(err))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we don't need to show this log unless it's a git repository directory.
Because we will show the following log for each fs scan

[fs] Random cache key will be used      err="failed to open git repository: repository does not exist"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, it might make sense to add path here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated 3802124

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there is a bug 😄 I'll fix it.

Signed-off-by: knqyf263 <knqyf263@gmail.com>
Signed-off-by: knqyf263 <knqyf263@gmail.com>
Signed-off-by: knqyf263 <knqyf263@gmail.com>
Signed-off-by: knqyf263 <knqyf263@gmail.com>
Copy link
Copy Markdown
Contributor

@DmitriyLewen DmitriyLewen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Can we merge this PR?

@knqyf263 knqyf263 added this pull request to the merge queue Jan 27, 2025
Merged via the queue into aquasecurity:main with commit b5062f3 Jan 27, 2025
@knqyf263 knqyf263 deleted the feat/use_git_commit branch January 27, 2025 09:13
RingoDev referenced this pull request in RingoDev/trivy Feb 26, 2025
…278)

Signed-off-by: knqyf263 <knqyf263@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(fs): improve cache efficiency using git commit hash

3 participants