feat(saved-jobs): add saved/bookmarked jobs scraping with pagination and progress #167
Open
IfThingsThenStuff wants to merge 4 commits into stickerdaniel:main
Conversation
…orting

- Fix wait_for_function positional arg bug (arg= keyword required)
- Switch pagination from the broken "Next" button to numbered page buttons (button[aria-label="Page N"]), which reliably trigger content updates
- Replace arbitrary asyncio.sleep() calls with DOM-based waiting via wait_for_function to detect new job links
- Embed a job IDs summary in the section text so LLMs always surface them
- Add an on_progress callback for per-page progress reporting

Co-Authored-By: Claude Opus 4.6 <[email protected]>
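For context, the DOM-based wait amounts to polling the page for job links whose IDs were not present before the click; in the real code that predicate runs in the browser via Playwright's wait_for_function, with the previous IDs passed through the arg= keyword. A pure-Python sketch of the same logic, with hypothetical names:

```python
import time

def wait_for_new_ids(get_ids, prev_ids, timeout=5.0, interval=0.05):
    """Poll get_ids() until it yields an ID not in prev_ids, or time out.

    Pure-Python stand-in for the browser-side predicate: returns True as
    soon as a previously unseen ID appears, False on timeout (so the
    caller can stop paginating gracefully with partial results).
    """
    prev = set(prev_ids)  # set membership is O(1), like JS Set.has()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if any(i not in prev for i in get_ids()):
            return True
        time.sleep(interval)
    return False
```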
Detect total pages from pagination buttons on the page instead of using max_pages (10), so progress reports reflect reality (1/2, 2/2 instead of 1/10, 2/10).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
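The page-total logic described here reduces to a small pure function; a sketch under the assumption that the caller has already counted the button[aria-label="Page N"] elements (the helper name is hypothetical):

```python
def plan_total_pages(button_count, max_pages=10):
    """Derive the page total used for progress reporting.

    Total pages come from the number of numbered pagination buttons
    found on the page (falling back to 1 when there are none), capped
    by max_pages so reports read 1/2, 2/2 rather than 1/10, 2/10.
    """
    detected = button_count if button_count > 0 else 1
    return min(detected, max_pages)
```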
…kups, and add tests

Address review findings:

- Cap total_pages with max_pages to fix misleading progress percentages
- Add _NAV_DELAY between page clicks for rate-limit safety
- Convert JS prevIds.includes() to Set.has() for O(1) lookups
- Guard division by zero in _report
- Fix docstring inaccuracies
- Add 5 targeted tests covering progress callbacks, timeout graceful stop, the max_pages cap, and session expired error handling

Co-Authored-By: Claude Opus 4.6 <[email protected]>
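Two of these findings, the O(1) set-based dedup and the division-by-zero guard in _report, can be illustrated with a minimal Python sketch (helper names are hypothetical, not the PR's actual identifiers):

```python
def merge_new_ids(all_ids, page_ids):
    """Append only unseen IDs to all_ids, preserving first-seen order.

    Membership checks go through a set, which is O(1) per lookup,
    versus O(n) for list.includes()-style scans. Returns the fresh IDs.
    """
    seen = set(all_ids)
    fresh = [i for i in page_ids if i not in seen]
    all_ids.extend(fresh)
    return fresh

def report_fraction(page, total):
    """Progress as a fraction, guarding against a zero page total."""
    return page / total if total else 0.0
```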
Address Greptile review:

- Use Set for O(1) dedup in _EXTRACT_JOB_IDS_JS
- Expose the max_pages parameter on the get_saved_jobs MCP tool
- Document the new tool in AGENTS.md, README.md, and docs/docker-hub.md

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Author

Hey there @stickerdaniel - hope you're doing well. Is there anything I can do to help get this merged, sir? Thanks in advance. Let me know.
Thanks for your work here - useful tool, and I appreciate your efforts. I wanted the ability to read out my saved jobs, so I added it. It handles multiple pages.
Let me know if this is aligned with what you would like to include, and whether any changes are needed.
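For readers skimming the diff, the core extraction step, pulling job IDs out of saved-job link hrefs of the form /jobs/view/&lt;id&gt;/, might look roughly like this (a sketch, not the PR's actual code):

```python
import re

# Saved-job links carry the job ID in the path: /jobs/view/<numeric id>/
_JOB_HREF = re.compile(r"/jobs/view/(\d+)")

def extract_job_ids(hrefs):
    """Pull unique job IDs out of link hrefs, keeping first-seen order."""
    seen, ids = set(), []
    for href in hrefs:
        m = _JOB_HREF.search(href)
        if m and m.group(1) not in seen:
            seen.add(m.group(1))
            ids.append(m.group(1))
    return ids
```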
Summary
- Add scrape_saved_jobs to LinkedInExtractor — scrapes the LinkedIn jobs tracker page, extracts job IDs from link hrefs, and paginates through results using numbered page buttons
- Add a get_saved_jobs MCP tool with progress reporting via an on_progress callback
- Cap total_pages with max_pages for accurate progress percentages
- Use a Set for O(1) job ID deduplication in the DOM polling function

Test plan
- test_scrape_saved_jobs_single_page — single page with progress callback
- test_scrape_saved_jobs_paginates — multi-page with progress and ID collection
- test_scrape_saved_jobs_timeout_stops_gracefully — timeout returns partial results
- test_scrape_saved_jobs_stops_at_max_pages_despite_more_buttons — respects the max_pages cap
- test_scrape_saved_jobs_empty — empty results
- test_get_saved_jobs — tool-level success path
- test_get_saved_jobs_error — session expired error handling

Greptile Summary
Adds a get_saved_jobs MCP tool to scrape saved/bookmarked jobs from LinkedIn's job tracker with pagination and progress reporting.

Key Changes:
- Extracts job IDs from link hrefs (/jobs/view/&lt;id&gt;/)
- Uses a Set for O(1) job ID lookups in both the JavaScript extraction and the Python filtering
- Reports progress through an on_progress callback with accurate page counts capped by max_pages

Implementation Quality:
- max_pages parameter (default 10) for user control
- Returns job IDs as both a job_ids list and formatted text

Previous Review Items Addressed:
- Set-based dedup in _EXTRACT_JOB_IDS_JS (lines 389-390)
- Exposed max_pages parameter in the tool signature (line 75)

Confidence Score: 5/5
Important Files Changed
- scrape_saved_jobs method with robust pagination logic, Set-based O(1) deduplication, proper error handling, and progress callbacks
- get_saved_jobs MCP tool with an exposed max_pages parameter, progress reporting, and consistent error handling
- Tests covering the get_saved_jobs success path and error handling

Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start([Start]) --> Navigate[Navigate to jobs-tracker]
    Navigate --> ExtractPage1[Extract page 1 text and IDs]
    ExtractPage1 --> CountButtons[Count pagination buttons]
    CountButtons --> CalcTotal[Calculate total_pages cap]
    CalcTotal --> ReportP1[Report progress page 1]
    ReportP1 --> CheckMore{More pages?}
    CheckMore -->|Yes| CheckButton{Button exists?}
    CheckButton -->|No| Append[Append ID summary]
    CheckButton -->|Yes| ClickButton[Click page button]
    ClickButton --> WaitDelay[Wait nav delay]
    WaitDelay --> WaitNewIDs{Wait for new IDs}
    WaitNewIDs -->|Timeout| Append
    WaitNewIDs -->|Success| Scroll[Scroll to bottom]
    Scroll --> ExtractText[Extract page text]
    ExtractText --> ExtractIDs[Extract job IDs]
    ExtractIDs --> FilterDups[Filter duplicates]
    FilterDups --> CheckNewIDs{New IDs?}
    CheckNewIDs -->|No| Append
    CheckNewIDs -->|Yes| AddIDs[Add to all_job_ids]
    AddIDs --> ReportProgress[Report progress]
    ReportProgress --> CheckMore
    CheckMore -->|No| Append
    Append --> BuildSections[Build sections dict]
    BuildSections --> Return([Return result])
```

Last reviewed commit: 5e68717
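The flowchart's control flow, minus the browser interaction, can be simulated in a few lines of Python. Here `pages` is a list of per-page job-ID lists standing in for the Playwright navigation; all names are illustrative, not the PR's real API:

```python
def scrape_saved_jobs_sim(pages, max_pages=10, on_progress=None):
    """Simulate the pagination loop: cap the total, dedup per page,
    report progress, and stop early when a page yields no new IDs."""
    total = min(len(pages) or 1, max_pages)
    all_ids = []
    for num, page_ids in enumerate(pages[:total], start=1):
        seen = set(all_ids)
        fresh = [i for i in page_ids if i not in seen]
        if num > 1 and not fresh:  # no new IDs: stop paginating
            break
        all_ids.extend(fresh)
        if on_progress:
            on_progress(num, total)
    return all_ids
```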