feat: enhanced job search with structured listings and pagination #171
Open
NoahStarkenburg wants to merge 2 commits into stickerdaniel:main
Owner
Hey, thanks for the PR and for filing the issue.
Owner
I won't merge this fix as-is though, because the structured card parsing (walking the DOM to extract title, company, location, work_type, etc. per card) goes against the core design of this project. I deliberately use innerText extraction so this MCP doesn't break every time LinkedIn changes their markup.
Owner
I do want to keep the job ID extraction from hrefs, the sidebar scrolling, and the pagination, but the pagination offset should be dynamic instead of a hardcoded start=25.
- Extract job IDs and titles from link hrefs on search results pages
- Add multi-page pagination (max_pages 1-100, default 3) with dynamic offset
- Smart sidebar scrolling that walks the DOM to find scrollable job list container
- Deduplicate job IDs across pages, stop early on empty results
- Add 6 new tests covering pagination, dedup, early stopping, and clamping
Problem
search_jobs returns raw innerText from the page, which only captures visible text. Job IDs exist solely in link href attributes (/jobs/view/12345/), so they are never returned. This makes it impossible to chain search_jobs → get_job_details. Additionally, only one page of results is loaded (~7-10 jobs), and scrolling targets the main window instead of the sidebar container where job cards are rendered.
Closes #195
Solution
Job ID extraction from hrefs
Added _extract_job_listings(), which finds job links via querySelectorAll('a[href*="/jobs/view/"]'), extracts the ID from the href, and gets the title from the link's innerText. No DOM walking or markup-dependent card parsing, so it stays consistent with the project's innerText-based design.
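As an illustration only, the href-to-ID step can be sketched in plain Python. This is not the PR's code: `extract_job_listings`, the `(href, text)` input shape, and the within-page duplicate skip are assumptions made for the example, and it only handles the numeric `/jobs/view/12345/` form shown above.

```python
import re
from urllib.parse import urlparse

# Matches the numeric ID in paths such as /jobs/view/12345/ (the form shown above).
_JOB_ID_RE = re.compile(r"/jobs/view/(\d+)")


def extract_job_listings(links):
    """Hypothetical helper: turn (href, link_text) pairs into {job_id, title} dicts."""
    listings = []
    seen = set()
    for href, text in links:
        # Only look at the path so query strings cannot produce a false match.
        match = _JOB_ID_RE.search(urlparse(href).path)
        if match is None:
            continue  # not a job link
        job_id = match.group(1)
        if job_id in seen:
            continue  # assumption: a card may link to the same job more than once
        seen.add(job_id)
        listings.append({"job_id": job_id, "title": text.strip()})
    return listings
```

The returned job_id values are what a caller would pass on to get_job_details.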
Multi-page pagination with dynamic offset
search_jobs now accepts a max_pages parameter (1-100, default 3). The pagination offset advances dynamically by the actual number of listings found per page instead of a hardcoded value. Job IDs are deduplicated across pages, and pagination stops early if a page returns no new results.
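A minimal sketch of that loop, assuming a hypothetical `fetch_page(start)` callable that returns the listings for one results page. The names `paginate` and `clamp_max_pages` are made up for the example, and the delay between page loads is left out.

```python
def clamp_max_pages(max_pages, low=1, high=100):
    """Keep max_pages inside the documented 1-100 range."""
    return max(low, min(high, max_pages))


def paginate(fetch_page, max_pages=3):
    """Hypothetical loop: fetch_page(start) returns a list of {job_id, title} dicts."""
    max_pages = clamp_max_pages(max_pages)
    seen = set()
    results = []
    start = 0
    pages_visited = 0
    for _ in range(max_pages):
        listings = fetch_page(start)
        pages_visited += 1
        new_on_page = 0
        for item in listings:
            if item["job_id"] not in seen:
                seen.add(item["job_id"])
                results.append(item)
                new_on_page += 1
        if new_on_page == 0:
            break  # nothing new on this page: stop early
        # Dynamic offset: advance by what the page actually returned, not a fixed 25.
        start += len(listings)
    return {"job_listings": results, "pages_visited": pages_visited}
```

Advancing by the count actually returned keeps the next `start` aligned with what was loaded, so a short page does not cause results to be skipped.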
Sidebar scrolling fix
LinkedIn renders job cards in a scrollable sidebar div, not the main page. _scroll_job_list() walks up the DOM from the first job link to find the actual scrollable ancestor and scrolls that instead. This loads ~30% more results per page (tested: 10 vs 7).
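The ancestor walk can be modelled without a browser. The sketch below is a toy model, not the PR's implementation: `Node`, its fields, and the "overflow is auto/scroll and content is taller than the box" test are assumptions chosen to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Toy stand-in for a DOM element (hypothetical, for illustration only)."""
    name: str
    scroll_height: int
    client_height: int
    overflow_y: str = "visible"
    parent: Optional["Node"] = None


def find_scrollable_ancestor(node):
    """Walk up from node and return the first ancestor that can actually scroll."""
    current = node.parent
    while current is not None:
        can_scroll = current.overflow_y in ("auto", "scroll")
        has_overflow = current.scroll_height > current.client_height
        if can_scroll and has_overflow:
            return current
        current = current.parent
    return None  # no scrollable ancestor: a caller could fall back to the window
```

Starting from the first job link means the search does not depend on the sidebar's class names, only on its scrolling behaviour.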
Changes
Test plan
Greptile Summary
This PR enhances search_jobs with structured job listings (returning {job_id, title} per result), multi-page pagination via a max_pages parameter, sidebar-aware scrolling, and cross-page deduplication. The implementation enables the search_jobs → get_job_details workflow that was previously impossible.
Strengths:
- _extract_job_listings queries a[href*="/jobs/view/"] to extract job IDs and titles without fragile DOM walking, which is a clean, resilient approach.
- _scroll_job_list walks up the DOM from the first job link to find and scroll the actual scrollable sidebar ancestor, yielding ~30% more results per page.
Minor issues:
- extractor.py and job.py claim "~10 results per page" but the pagination offset defaults to 25, creating contradictory expectations.
- The _scroll_job_list default parameter max_scrolls=25 is never exercised: the only call site overrides it with 20, making the default dead code.
Confidence Score: 4/5
Sequence Diagram
sequenceDiagram
    participant LLM as LLM / MCP Client
    participant Tool as search_jobs (job.py)
    participant Extractor as LinkedInExtractor
    participant Page as Playwright Page
    participant LI as LinkedIn
    LLM->>Tool: search_jobs(keywords, location, max_pages)
    Tool->>Extractor: search_jobs(keywords, location, max_pages)
    loop For each page (up to max_pages)
        Extractor->>Extractor: build URL (?start=N)
        Extractor->>Page: _extract_job_page(url)
        Page->>LI: goto(url)
        LI-->>Page: DOM loaded
        Page->>Page: _scroll_job_list() - scroll sidebar ancestors
        Page->>Page: evaluate(main.innerText) - raw text
        Page->>Page: _extract_job_listings() - querySelectorAll a[href*="/jobs/view/"]
        Page-->>Extractor: (text, listings)
        Extractor->>Extractor: deduplicate listings by job_id
        alt new_on_page == 0
            Extractor->>Extractor: early stop
        else more pages remain
            Extractor->>Extractor: sleep(_NAV_DELAY)
        end
    end
    Extractor-->>Tool: {url, sections, job_listings, pages_visited, sections_requested}
    Tool-->>LLM: result dict
Last reviewed commit: cee8fd1