DAAC-521: GitHub REST single-repo template and docs #14
Open
DavidDwyer87 wants to merge 27 commits into main
Conversation
- Update v1 API sunset date from March 31, 2025 to August 5, 2026
- Fix typo: "obtains" → "obtain"
- Normalize COMMENT_FOOTER_BLOG limit from 25 to 50 for consistency
- Add missing trailing newlines to both files
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace github-v1.json with github-repo-v1.json for single-repo crawling. Update asciidoc documentation to focus on single-repo template only - users should create one datasource per repository for multi-repo crawling. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add two new child request blocks for repository labels and issue events, bringing the total entity types from 13 to 15. Update documentation with new endpoint rows, permissions, and entity type references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Revert confluence directory to match main — these changes belong on a separate branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hardcode owner/repo in all child endpoints of the repo template,
add BLOB child request under CONTENT to download file content via
/git/blobs/{sha} with binaryResponse=true for Tika text extraction.
parentIdKey=full_name is retained on all children (connector requires it).
CONTENT sets parentIdKey=sha to pass blob SHA to BLOB child.
Document expected 404 errors for directory entries in git tree.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…template Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…plate Add COMMIT_DIFF child request (parent: COMMIT) that calls the single-commit detail endpoint to get file-level diff info (the files array and the stats object). Update AsciiDoc documentation across all sections: entity list, permissions table, rate limiting count (13→14), pagination, variables, endpoints table, parent-child hierarchy notes, and response parsing references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mcondo reviewed Apr 2, 2026
Remove github-classic comparison sections and internal references that don't belong in recipe docs. Fix BLOB request to use raw Accept header instead of base64. Remove numFetchThreads/retryCount/maxDelayTime from JSON properties level. Update pagination, rate limiting, and parsing descriptions for accuracy.
mcondo reviewed Apr 8, 2026
The /contents endpoint only returns root-level items and is not recursive. Restored /git/trees/HEAD?recursive=1 which returns the full file tree in a single call. Also restored WARNING and NOTE about directory 404 errors when BLOB requests encounter tree entries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mcondo reviewed Apr 13, 2026
…crawling
Switch from /git/trees/HEAD + /git/blobs/{sha} to /contents/{path} using a
recursive CONTENT → FOLDER → BLOB pattern (similar to Alfresco connector).
This eliminates 404 errors from blob requests hitting directory SHAs and
matches the classic GitHub connector approach. Updated both the JSON template
and AsciiDoc documentation.
…er-branch crawling
- Removed ISSUE and PR_REVIEW_COMMENT entity types from template and docs
- Changed COMMIT parent from REPOSITORY to BRANCH with sha=${LW_PARENT_DATA_KEY}
- Updated COMMIT_DIFF: dataId to blob_url, dataPath to files
- Updated asciidoc docs: entity list, endpoints table, pagination, variables,
notes, known limitations, response parsing, and terminology sections
- Entity count reduced from 16 to 14
mcondo reviewed Apr 15, 2026
mcondo reviewed Apr 15, 2026
mcondo reviewed Apr 15, 2026
…_COMMENT Renamed ISSUE_COMMENT objectType to COMMENT and removed the separate PR_REVIEW_COMMENT entity since the /issues/comments endpoint already captures both issue and PR comments. Simplified documentation throughout.
mcondo approved these changes Apr 20, 2026
…html_url Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Adds github/github-repo-v1.json and github/github-v1.asciidoc: a REST connector template and documentation for crawling a single GitHub repository.

The template targets /repos/{owner}/{repo} and indexes 15 entity types as separate Solr documents: REPOSITORY, ISSUE, PULL_REQUEST, BRANCH, COMMIT, COMMIT_DIFF, TAG, MILESTONE, COLLABORATOR, RELEASE, COMMENT, COMMIT_COMMENT, CONTENT, FOLDER, BLOB.

To crawl multiple repositories, create one datasource per repository.
Design decisions
COMMIT per-branch crawling — COMMIT is a child of BRANCH (not REPOSITORY). The sha=${LW_PARENT_DATA_KEY} query parameter receives the branch name via COMMIT's parentIdKey=name, so commits are crawled across all branches. A commit reachable from multiple branches produces one Solr document per branch (known delta vs classic connector which deduplicates by SHA).
COMMIT_DIFF — child of COMMIT, re-fetches each commit via /commits/{sha} and uses dataPath=files to index one Solr document per changed file with per-file diff fields (filename, status, patch, additions, deletions).
File content via Contents API — three-stage chain: CONTENT (root listing) → FOLDER (recursive traversal via recursiveRequest=true) → BLOB (binary download). CONTENT and FOLDER use skipIndexation=true for discovery only. BLOB uses binaryResponse=true with Accept: application/vnd.github.raw+json for Tika-based text extraction.
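For reviewers unfamiliar with the CONTENT → FOLDER → BLOB chain, the traversal can be sketched in plain Python. This is only an illustration of the recursion, not how the connector is implemented: the template expresses it declaratively through recursiveRequest=true. The names `walk_contents`, `fake_list_dir`, and `FAKE_REPO` are hypothetical, and the in-memory listing stands in for responses from /contents/{path}, where each entry carries a `type` of `file` or `dir` and a `path`.

```python
from typing import Callable, Dict, Iterator, List

Entry = Dict[str, str]


def walk_contents(list_dir: Callable[[str], List[Entry]], path: str = "") -> Iterator[str]:
    """Yield the path of every file reachable from `path`, depth-first."""
    for entry in list_dir(path):
        if entry["type"] == "dir":
            # FOLDER stage: list the directory and recurse into it.
            yield from walk_contents(list_dir, entry["path"])
        elif entry["type"] == "file":
            # BLOB stage: only real files are handed on for download,
            # so no request is ever issued against a directory entry.
            yield entry["path"]


# Hypothetical in-memory stand-in for directory listings (assumed shape).
FAKE_REPO: Dict[str, List[Entry]] = {
    "": [
        {"type": "file", "path": "README.md"},
        {"type": "dir", "path": "src"},
    ],
    "src": [
        {"type": "file", "path": "src/main.py"},
        {"type": "dir", "path": "src/util"},
    ],
    "src/util": [
        {"type": "file", "path": "src/util/io.py"},
    ],
}


def fake_list_dir(path: str) -> List[Entry]:
    """Return the listing for one directory; the root is the empty string."""
    return FAKE_REPO[path]


if __name__ == "__main__":
    for file_path in walk_contents(fake_list_dir):
        print(file_path)
```

Because directories are filtered out before the download step, this shape avoids the 404 errors that the earlier /git/trees plus /git/blobs approach produced for directory entries.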
Test Results (github-rest vs github-classic)
Known Feature Parity Gaps (vs GitHub Classic V1 Connector)
Related docs:
Test plan
github-repo-v1.json is valid JSON
dataPath=files