DAAC-521: GitHub REST single-repo template and docs #14

Open: DavidDwyer87 wants to merge 27 commits into main from DAAC-521-Github-rest

DavidDwyer87 commented Mar 26, 2026

Summary

Adds github/github-repo-v1.json and github/github-v1.asciidoc — a REST connector template and documentation for crawling a single GitHub repository.

The template targets /repos/{owner}/{repo} and indexes 15 entity types as separate Solr documents: REPOSITORY, ISSUE, PULL_REQUEST, BRANCH, COMMIT, COMMIT_DIFF, TAG, MILESTONE, COLLABORATOR, RELEASE, COMMENT, COMMIT_COMMENT, CONTENT, FOLDER, BLOB.

To crawl multiple repositories, create one datasource per repository.

Design decisions

  • COMMIT per-branch crawling: COMMIT is a child of BRANCH, not REPOSITORY. COMMIT sets parentIdKey=name, so the sha=${LW_PARENT_DATA_KEY} query parameter receives the branch name and commits are crawled on every branch. A commit reachable from multiple branches produces one Solr document per branch. This is a known delta versus the classic connector, which deduplicates by SHA.

  • COMMIT_DIFF: a child of COMMIT. It re-fetches each commit via /commits/{sha} and uses dataPath=files to index one Solr document per changed file, with per-file diff fields (filename, status, patch, additions, deletions).

  • File content via Contents API: a three-stage chain of CONTENT (root listing) → FOLDER (recursive traversal via recursiveRequest=true) → BLOB (binary download). CONTENT and FOLDER set skipIndexation=true and serve discovery only. BLOB sets binaryResponse=true and sends Accept: application/vnd.github.raw+json so the response can go through Tika-based text extraction.
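
The three-stage chain in the last bullet can be pictured with a small sketch. This is only an illustration in Python, not how the connector is implemented: the template drives this through configuration. `FAKE_CONTENTS`, `list_contents`, and `walk_blobs` are hypothetical names, and the in-memory listing stands in for real Contents API responses, of which only the `type` and `path` fields are used here.

```python
from typing import Dict, Iterator, List

# Hypothetical stand-in for GET /repos/{owner}/{repo}/contents/{path}:
# maps a directory path to the listing the API would return for it.
FAKE_CONTENTS: Dict[str, List[dict]] = {
    "": [
        {"type": "file", "path": "README.md"},
        {"type": "dir", "path": "src"},
    ],
    "src": [
        {"type": "file", "path": "src/main.py"},
        {"type": "dir", "path": "src/util"},
    ],
    "src/util": [
        {"type": "file", "path": "src/util/io.py"},
    ],
}


def list_contents(path: str) -> List[dict]:
    """Return the directory listing for `path` (simulated, no network)."""
    return FAKE_CONTENTS.get(path, [])


def walk_blobs(path: str = "") -> Iterator[str]:
    """Yield file paths depth-first; directories are discovery-only."""
    for entry in list_contents(path):
        if entry["type"] == "dir":
            # FOLDER stage: recurse, but emit nothing for the folder itself
            yield from walk_blobs(entry["path"])
        elif entry["type"] == "file":
            # BLOB stage: this is the entry that would be downloaded and indexed
            yield entry["path"]


blob_paths = list(walk_blobs())
print(blob_paths)
```

Only file entries are yielded, which mirrors CONTENT and FOLDER being discovery-only while BLOB carries the indexed content.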

Test Results (github-rest vs github-classic)

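| Entity | github-rest | github-classic | Match |
|---|---|---|---|
| repository | 1 | 1 | Yes |
| pull_request | 15 | 15 | Yes |
| branch | 7 | 7 | Yes |
| commit | 49 | 46 | ~+3 (shared commits across branches) |
| commit_diff | 90 | 80 | ~+10 (from extra per-branch commits) |
| collaborator | 23 | 23 | Yes |
| tag | 1 | 1 | Yes |
| blob | 11 | 4 | REST=file content, Classic=tree nodes |
| issue_comment | 4 | 4 | Yes |
| Total | 201 | 190 | |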
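
A comparison like the one above can be re-run mechanically. The sketch below is a minimal Python illustration using the counts transcribed from this table; the `counts` dictionary and the `deltas` helper are hypothetical names, not part of the template. Summing the listed rows gives 201 for github-rest and 181 for github-classic, so the stated classic total of 190 may include entities not itemized here and is worth re-checking.

```python
# Counts transcribed from the table: entity -> (github-rest, github-classic)
counts = {
    "repository": (1, 1),
    "pull_request": (15, 15),
    "branch": (7, 7),
    "commit": (49, 46),
    "commit_diff": (90, 80),
    "collaborator": (23, 23),
    "tag": (1, 1),
    "blob": (11, 4),
    "issue_comment": (4, 4),
}


def deltas(table):
    """Return {entity: rest - classic} for rows whose counts differ."""
    return {
        name: rest - classic
        for name, (rest, classic) in table.items()
        if rest != classic
    }


diff = deltas(counts)
rest_total = sum(rest for rest, _ in counts.values())
classic_total = sum(classic for _, classic in counts.values())

print(diff)
print(rest_total, classic_total)
```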

Known Feature Parity Gaps (vs GitHub Classic V1 Connector)

| Feature | Status | Details |
|---|---|---|
| Per-branch commit deduplication | Known delta | Commits reachable from multiple branches are indexed per-branch; classic deduplicates by SHA |
| ETag incremental crawling | Not supported | Requires persisting ETags per resource across crawls. Deferred. |
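
To make the first gap concrete, here is a minimal Python sketch of how per-branch crawling multiplies commit documents and how SHA-based deduplication collapses them. The branch names, SHAs, and the `dedupe_by_sha` helper are hypothetical; this illustrates the behavior and is not code from either connector.

```python
from typing import Dict, List

# Hypothetical per-branch commit listings (branch name -> commit SHAs),
# standing in for GET /repos/{owner}/{repo}/commits?sha=<branch>.
commits_by_branch: Dict[str, List[str]] = {
    "main": ["a1", "b2", "c3"],
    "feature-x": ["a1", "b2", "d4"],
}

# Per-branch crawl: one document per (branch, sha) pair
per_branch_docs = [
    {"branch": branch, "sha": sha}
    for branch, shas in commits_by_branch.items()
    for sha in shas
]


def dedupe_by_sha(docs: List[dict]) -> List[dict]:
    """Keep the first document seen for each SHA, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc["sha"] not in seen:
            seen.add(doc["sha"])
            unique.append(doc)
    return unique


unique_docs = dedupe_by_sha(per_branch_docs)
print(len(per_branch_docs), len(unique_docs))
```

In this toy data the two shared commits are counted twice under per-branch crawling, which is the same effect behind the extra commit and commit_diff documents in the test results.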

Related docs:

Test plan

  • Validate github-repo-v1.json is valid JSON
  • Deploy repo template to Fusion and test crawl against a single repo
  • Verify all 14 entity types are indexed (including COMMIT_DIFF, CONTENT/FOLDER/BLOB)
  • Verify COMMIT per-branch crawling produces commits across all branches
  • Verify COMMIT_DIFF extracts per-file change details with dataPath=files
  • Verify Contents API crawling produces no 404 errors
  • Compare document counts against classic connector
  • Review asciidoc for accuracy and completeness
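
The first item in the test plan can be checked with a few lines of Python. This is a minimal sketch using only the standard library; `is_valid_json` is a hypothetical helper name, and the commented file path is the one named in this PR.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Usage against the template file (path taken from this PR):
# with open("github/github-repo-v1.json", encoding="utf-8") as fh:
#     print(is_valid_json(fh.read()))

print(is_valid_json('{"a": 1}'))
print(is_valid_json('{"a": 1,}'))  # trailing comma is not valid JSON
```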

mcondo and others added 12 commits February 10, 2025 11:48
- Update v1 API sunset date from March 31, 2025 to August 5, 2026
- Fix typo: "obtains" → "obtain"
- Normalize COMMENT_FOOTER_BLOG limit from 25 to 50 for consistency
- Add missing trailing newlines to both files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace github-v1.json with github-repo-v1.json for single-repo crawling.
Update asciidoc documentation to focus on single-repo template only -
users should create one datasource per repository for multi-repo crawling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add two new child request blocks for repository labels and issue events,
bringing the total entity types from 13 to 15. Update documentation with
new endpoint rows, permissions, and entity type references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DavidDwyer87 marked this pull request as ready for review March 26, 2026 19:14
DavidDwyer87 and others added 5 commits March 26, 2026 18:27
Revert confluence directory to match main — these changes belong on a separate branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hardcode owner/repo in all child endpoints of the repo template,
add BLOB child request under CONTENT to download file content via
/git/blobs/{sha} with binaryResponse=true for Tika text extraction.
parentIdKey=full_name is retained on all children (connector requires it).
CONTENT sets parentIdKey=sha to pass blob SHA to BLOB child.
Document expected 404 errors for directory entries in git tree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…template

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…plate

Add COMMIT_DIFF child request (parent: COMMIT) that fetches single-commit
detail endpoint to get file-level diff info (files array and stats object).
Update AsciiDoc documentation across all sections: entity list, permissions
table, rate limiting count (13→14), pagination, variables, endpoints table,
parent-child hierarchy notes, and response parsing references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
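
The COMMIT_DIFF commit above relies on the single-commit response carrying a files array and a stats object. As a rough Python illustration of turning that array into one record per changed file, consider the sketch below. The sample payload is trimmed and invented, `per_file_docs` is a hypothetical helper, and `.get` is used on the assumption that some fields, such as `patch`, may be absent for certain files.

```python
# Trimmed, invented example of a single-commit detail response
commit_detail = {
    "sha": "abc123",
    "stats": {"additions": 5, "deletions": 2, "total": 7},
    "files": [
        {"filename": "README.md", "status": "modified",
         "additions": 3, "deletions": 2, "patch": "@@ ..."},
        {"filename": "src/new.py", "status": "added",
         "additions": 2, "deletions": 0, "patch": "@@ ..."},
    ],
}

# Per-file diff fields named in the PR description
DIFF_FIELDS = ("filename", "status", "patch", "additions", "deletions")


def per_file_docs(commit: dict) -> list:
    """Build one record per entry of the commit's `files` array."""
    docs = []
    for changed in commit.get("files", []):
        doc = {"commit_sha": commit["sha"]}
        for field in DIFF_FIELDS:
            # Missing fields become None instead of raising KeyError
            doc[field] = changed.get(field)
        docs.append(doc)
    return docs


docs = per_file_docs(commit_detail)
print(len(docs), [d["filename"] for d in docs])
```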
Remove github-classic comparison sections and internal references that
don't belong in recipe docs. Fix BLOB request to use raw Accept header
instead of base64. Remove numFetchThreads/retryCount/maxDelayTime from
JSON properties level. Update pagination, rate limiting, and parsing
descriptions for accuracy.
The /contents endpoint only returns root-level items and is not recursive.
Restored /git/trees/HEAD?recursive=1 which returns the full file tree in
a single call. Also restored WARNING and NOTE about directory 404 errors
when BLOB requests encounter tree entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…crawling

Switch from /git/trees/HEAD + /git/blobs/{sha} to /contents/{path} using a
recursive CONTENT → FOLDER → BLOB pattern (similar to Alfresco connector).
This eliminates 404 errors from blob requests hitting directory SHAs and
matches the classic GitHub connector approach. Updated both the JSON template
and AsciiDoc documentation.
…er-branch crawling

- Removed ISSUE and PR_REVIEW_COMMENT entity types from template and docs
- Changed COMMIT parent from REPOSITORY to BRANCH with sha=${LW_PARENT_DATA_KEY}
- Updated COMMIT_DIFF: dataId to blob_url, dataPath to files
- Updated asciidoc docs: entity list, endpoints table, pagination, variables,
  notes, known limitations, response parsing, and terminology sections
- Entity count reduced from 16 to 14
DavidDwyer87 and others added 2 commits April 17, 2026 12:04
…_COMMENT

Renamed ISSUE_COMMENT objectType to COMMENT and removed the separate
PR_REVIEW_COMMENT entity since the /issues/comments endpoint already
captures both issue and PR comments. Simplified documentation throughout.