feat: enhanced job search with structured listings and pagination #171

Open
NoahStarkenburg wants to merge 2 commits into stickerdaniel:main from
NoahStarkenburg:feat/enhanced-job-search

@NoahStarkenburg NoahStarkenburg commented Mar 2, 2026

Problem
search_jobs returns raw innerText from the page, which only captures visible text. Job IDs exist solely in link href attributes (/jobs/view/12345/), so they are never returned. This makes it impossible to chain search_jobs →
get_job_details. Additionally, only one page of results is loaded (~7-10 jobs), and scrolling targets the main window instead of the sidebar container where job cards are rendered.

Closes #195

Solution

Job ID extraction from hrefs

Added _extract_job_listings() which finds job links via querySelectorAll('a[href*="/jobs/view/"]'), extracts the ID from the href, and gets the title from the link's innerText. No DOM walking or markup-dependent card parsing - stays
consistent with the project's innerText-based design.
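
The href-based extraction described above can be sketched in plain Python. This is an illustration only, not the PR's code: `extract_job_id` and `listings_from_links` are hypothetical helper names, and the input is assumed to be `(href, inner_text)` pairs already collected from the page.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Assumption: job links contain /jobs/view/<digits>, as in the example
# href /jobs/view/12345/ from the problem statement.
_JOB_ID_RE = re.compile(r"/jobs/view/(\d+)")


def extract_job_id(href: str) -> Optional[str]:
    """Return the numeric job ID found in a job link href, or None."""
    match = _JOB_ID_RE.search(href)
    return match.group(1) if match else None


def listings_from_links(links: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Build {job_id, title} dicts from (href, inner_text) pairs.

    Links whose href carries no job ID are skipped.
    """
    listings = []
    for href, text in links:
        job_id = extract_job_id(href)
        if job_id is None:
            continue
        listings.append({"job_id": job_id, "title": text.strip()})
    return listings
```

Because only the href and the link text are read, a change to the card markup around the link should not affect this step.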

Multi-page pagination with dynamic offset

search_jobs now accepts a max_pages parameter (1-100, default 3). The pagination offset advances dynamically by the actual number of listings found per page instead of a hardcoded value. Job IDs are deduplicated across pages, and
pagination stops early if a page returns no new results.
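
A minimal sketch of the pagination loop, under stated assumptions: `paginate_jobs` and `fetch_page` are hypothetical names, `fetch_page(offset)` stands in for loading one results page, and the offset is advanced by the number of listings found on the page (one reading of "actual number of listings found per page").

```python
from typing import Callable, Dict, List


def paginate_jobs(
    fetch_page: Callable[[int], List[Dict[str, str]]],
    max_pages: int = 3,
) -> Dict[str, list]:
    """Collect unique job listings across pages with a dynamic offset."""
    max_pages = max(1, min(max_pages, 100))  # clamp to the 1-100 range
    seen = set()
    results: List[Dict[str, str]] = []
    offsets: List[int] = []
    offset = 0
    for _ in range(max_pages):
        offsets.append(offset)
        listings = fetch_page(offset)
        new_on_page = 0
        for job in listings:
            if job["job_id"] in seen:
                continue  # already returned on an earlier page
            seen.add(job["job_id"])
            results.append(job)
            new_on_page += 1
        if new_on_page == 0:
            break  # early stop: this page added nothing new
        offset += len(listings)  # advance by what the page actually held
    return {"job_listings": results, "offsets": offsets}


# Usage with fake pages keyed by offset (illustrative data only).
fake_pages = {
    0: [{"job_id": "a", "title": "A"}, {"job_id": "b", "title": "B"},
        {"job_id": "c", "title": "C"}],
    3: [{"job_id": "c", "title": "C"}, {"job_id": "d", "title": "D"}],
    5: [{"job_id": "c", "title": "C"}, {"job_id": "d", "title": "D"}],
}
result = paginate_jobs(lambda off: fake_pages.get(off, []), max_pages=10)
```

Here the third page contains only already-seen IDs, so the loop stops after visiting offsets 0, 3, and 5.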

Sidebar scrolling fix

LinkedIn renders job cards in a scrollable sidebar div, not the main page. _scroll_job_list() walks up the DOM from the first job link to find the actual scrollable ancestor and scrolls that instead. Loads ~30% more results per page
(tested: 10 vs 7).
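
The ancestor walk can be modelled in plain Python. The PR presumably runs this logic in the browser; the `Node` class below is a hypothetical stand-in for a DOM element, and the scrollability test (content taller than the box, with `overflow-y` set to `auto` or `scroll`) is an assumption rather than the PR's confirmed criterion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element."""
    name: str
    scroll_height: int
    client_height: int
    overflow_y: str = "visible"
    parent: Optional["Node"] = None


def find_scrollable_ancestor(node: Node) -> Optional[Node]:
    """Walk up from node and return the nearest scrollable ancestor."""
    current = node.parent
    while current is not None:
        overflows = current.scroll_height > current.client_height
        if overflows and current.overflow_y in ("auto", "scroll"):
            return current
        current = current.parent
    return None


# Usage: a job link nested inside a scrollable sidebar.
body = Node("body", scroll_height=900, client_height=900)
sidebar = Node("sidebar", scroll_height=3000, client_height=800,
               overflow_y="auto", parent=body)
card = Node("li", scroll_height=120, client_height=120, parent=sidebar)
link = Node("a", scroll_height=20, client_height=20, parent=card)
target = find_scrollable_ancestor(link)
```

Starting from the link rather than from a fixed selector means the sketch does not depend on the sidebar's class names.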

Changes

  • linkedin_mcp_server/scraping/extractor.py - new methods: _extract_job_listings, _scroll_job_list, _extract_job_page; modified: search_jobs
  • linkedin_mcp_server/tools/job.py - exposed max_pages parameter on the search_jobs MCP tool
  • tests/test_scraping.py - 6 new tests covering pagination, dedup, early stopping, and clamping

Test plan

  • All existing tests pass, plus 6 new tests
  • Live tested job search with 1, 3, 10, 20, and 25 pages (207 unique jobs on 25 pages)
  • Verified get_job_details works when chained with IDs from search results
  • Verified deduplication across pages and early stopping
  • A/B tested sidebar scrolling vs default scroll_to_bottom (10 vs 7 results per page)
  • Dynamic pagination offset confirmed working (offsets: 0, 10, 21, 32... based on actual results)

Greptile Summary

This PR enhances search_jobs with structured job listings (returning {job_id, title} per result), multi-page pagination via a max_pages parameter, sidebar-aware scrolling, and cross-page deduplication. The implementation enables the search_jobs → get_job_details workflow that was previously impossible.

Strengths:

  • _extract_job_listings queries a[href*="/jobs/view/"] to extract job IDs and titles without fragile DOM walking — a clean, resilient approach.
  • _scroll_job_list walks up the DOM from the first job link to find and scroll the actual scrollable sidebar ancestor, yielding ~30% more results per page.
  • Multi-page pagination with deduplication works as implemented and is live-tested.

Minor issues:

  • Docstrings in both extractor.py and job.py claim "~10 results per page" but the pagination offset defaults to 25, creating contradictory expectations.
  • The _scroll_job_list default parameter max_scrolls=25 is never exercised — the only call site overrides it with 20, making the default dead code.

Confidence Score: 4/5

  • Safe to merge. Minor documentation/parameter inconsistencies do not affect functionality.
  • The core feature (multi-page job search with ID extraction and deduplication) is solid and live-tested across 25 pages. The two findings are cosmetic issues: docstring inconsistency about page size and a dead default parameter. These don't impact functionality or behavior.
  • No files require special attention. Update docstrings in extractor.py and job.py (lines 479-481 and 88-90) to say ~25 results per page, and align the default max_scrolls parameter to 20.

Sequence Diagram

sequenceDiagram
    participant LLM as LLM / MCP Client
    participant Tool as search_jobs (job.py)
    participant Extractor as LinkedInExtractor
    participant Page as Playwright Page
    participant LI as LinkedIn

    LLM->>Tool: search_jobs(keywords, location, max_pages)
    Tool->>Extractor: search_jobs(keywords, location, max_pages)

    loop For each page (up to max_pages)
        Extractor->>Extractor: build URL (?start=N)
        Extractor->>Page: _extract_job_page(url)
        Page->>LI: goto(url)
        LI-->>Page: DOM loaded
        Page->>Page: _scroll_job_list() — scroll sidebar ancestors
        Page->>Page: evaluate(main.innerText) — raw text
        Page->>Page: _extract_job_listings() — querySelectorAll a[href*="/jobs/view/"]
        Page-->>Extractor: (text, listings)
        Extractor->>Extractor: deduplicate listings by job_id
        alt new_on_page == 0
            Extractor->>Extractor: early stop
        else more pages remain
            Extractor->>Extractor: sleep(_NAV_DELAY)
        end
    end

    Extractor-->>Tool: {url, sections, job_listings, pages_visited, sections_requested}
    Tool-->>LLM: result dict

Last reviewed commit: cee8fd1

Greptile also left 2 inline comments on this PR.

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 12 comments

@NoahStarkenburg NoahStarkenburg force-pushed the feat/enhanced-job-search branch 4 times, most recently from bb75fb3 to 0634d21 Compare March 2, 2026 13:57
@NoahStarkenburg NoahStarkenburg force-pushed the feat/enhanced-job-search branch 2 times, most recently from 0b3d29a to d529b1f Compare March 2, 2026 15:40
@NoahStarkenburg NoahStarkenburg force-pushed the feat/enhanced-job-search branch from d529b1f to db9501b Compare March 2, 2026 15:48
@stickerdaniel
Owner

stickerdaniel commented Mar 5, 2026

Hey, thanks for the PR and for filing the issue

I won't merge this fix as-is though, because the structured card parsing (walking the DOM to extract title, company, location, work_type, etc. per card) goes against the core design of this project. I deliberately use innerText extraction so this MCP doesn't break every time LinkedIn changes its markup.

I do want to keep the job ID extraction from hrefs, sidebar scrolling, and pagination, but the pagination offset should be dynamic instead of a hardcoded start=25.

- Extract job IDs and titles from link hrefs on search results pages
- Add multi-page pagination (max_pages 1-100, default 3) with dynamic offset
- Smart sidebar scrolling that walks the DOM to find scrollable job list container
- Deduplicate job IDs across pages, stop early on empty results
- Add 6 new tests covering pagination, dedup, early stopping, and clamping
@NoahStarkenburg NoahStarkenburg force-pushed the feat/enhanced-job-search branch from db9501b to 4160c34 Compare March 5, 2026 13:52

Development

Successfully merging this pull request may close these issues.

[BUG] Search jobs does not return any job Ids

2 participants