# AGENTS.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

**Development Commands:**

- Format: `uv run ruff format .`
- Type check: `uv run ty check` (using ty, not mypy)
- Tests: `uv run pytest` (with coverage: `uv run pytest --cov`)
- Single test: `uv run pytest tests/test_tools.py::test_name -v`
- Parallel tests: `uv run pytest -n auto` (uses pytest-xdist)
- Tests use `asyncio_mode = auto` — async test functions are collected automatically without `@pytest.mark.asyncio`
- Pre-commit hooks: `uv run pre-commit install` then `uv run pre-commit run --all-files`

**Docker Commands:**

## Architecture Overview

This is a **LinkedIn MCP (Model Context Protocol) Server** that enables AI assistants to interact with LinkedIn through web scraping. Built with FastMCP and Patchright (anti-detection Playwright fork).

### Startup Flow

Two-phase startup pattern:

1. **Authentication Phase** (`authentication.py`) - Validates LinkedIn browser profile exists at `~/.linkedin-mcp/profile/`
2. **Server Runtime Phase** (`server.py`) - Runs FastMCP server with tool registration

Entry point is `cli_main.py`, which handles CLI args (`--login`, `--logout`, `--status`) before reaching phase 2. Transport modes: `stdio` (default; standard I/O for CLI MCP clients) or `streamable-http` (HTTP server mode for web-based MCP clients).

### Tool Registration Pattern

Tools are registered in `server.py` via `create_mcp_server()`:

```
register_person_tools(mcp) → get_person_profile, search_people
register_company_tools(mcp) → get_company_profile, get_company_posts
register_job_tools(mcp) → get_job_details, search_jobs
register_posts_tools(mcp) → get_my_recent_posts, get_post_comments, get_post_content, find_unreplied_comments
close_session → registered inline
```

Each tool follows the same pattern: `ensure_authenticated()` → parse sections → `get_or_create_browser()` → `LinkedInExtractor(browser.page)` → scrape → return result.

All scraping tools return: `{url, sections: {name: raw_text}, pages_visited, sections_requested}`
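That shared pattern can be sketched as follows. This is a simplified, hedged illustration: the stubs stand in for the real `ensure_authenticated`, browser driver, and `LinkedInExtractor`, and the `/details/{name}/` URL shape is an illustrative assumption, not the project's actual routing.

```python
import asyncio

async def ensure_authenticated() -> None:
    """Stub: the real check lives in drivers/browser.py."""

class StubExtractor:
    """Stub for LinkedInExtractor; the real one navigates and scrapes innerText."""
    async def extract(self, url: str) -> str:
        return f"raw innerText from {url}"

async def get_person_profile(linkedin_url: str, sections: str) -> dict:
    await ensure_authenticated()                        # 1. auth gate
    requested = [s.strip() for s in sections.split(",") if s.strip()]  # 2. parse
    extractor = StubExtractor()                         # 3. browser + extractor
    scraped: dict[str, str] = {}
    pages_visited: list[str] = []
    for name in requested:                              # one section = one navigation
        url = f"{linkedin_url.rstrip('/')}/details/{name}/"
        scraped[name] = await extractor.extract(url)    # 4. scrape raw text
        pages_visited.append(url)
    return {                                            # 5. shared return shape
        "url": linkedin_url,
        "sections": scraped,
        "pages_visited": pages_visited,
        "sections_requested": requested,
    }

result = asyncio.run(
    get_person_profile("https://www.linkedin.com/in/example", "experience, education")
)
```

The return dict mirrors the shape above, so clients can rely on the same keys across person, company, and job tools.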

### Two-Level Browser Architecture

**Level 1 — Core** (`core/browser.py`): `BrowserManager` wraps Patchright's `chromium.launch_persistent_context()`. Manages playwright, context, page instances. Handles cookie import/export for the cross-platform cookie bridge (macOS profile → Docker Linux via `cookies.json`).

**Level 2 — Driver** (`drivers/browser.py`): Module-level singleton. `get_or_create_browser()` returns existing or creates new. `close_browser()` exports cookies then cleans up. `ensure_authenticated()` uses a 120s TTL cache to avoid redundant DOM login checks.

### Scraping Engine (`scraping/`)

- **`fields.py`** — `PersonScrapingFields` and `CompanyScrapingFields` are `Flag` enums. **One flag = one page navigation.** Never combine multiple URLs behind a single flag. Section names are parsed from comma-separated strings via `parse_person_sections()` / `parse_company_sections()`.
- **`extractor.py`** — `LinkedInExtractor` implements the navigate-scroll-innerText pattern:
1. Navigate to URL, wait for DOM load
2. Dismiss modals (`handle_modal_close`)
3. Scroll to load lazy content (max 5 scrolls)
4. Extract `main.innerText` (or `body.innerText` fallback)
5. Strip sidebar/footer noise via regex (`strip_linkedin_noise`)
6. On soft rate limit (only chrome returned, no content): retry once after 5s backoff
7. Humanized delay (1.5–4s) between section navigations
- **`posts.py`** — Specialized scraping for user posts, post comments, and unreplied comment detection. Uses notifications page when possible for unreplied comments, falls back to scanning recent posts.
- **`cache.py`** — `ScrapingCache` (module-level singleton, 300s TTL). Keyed by URL, skips navigation on cache hit.
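The one-flag-one-navigation rule can be sketched with a `Flag` enum. Member names follow the sections listed elsewhere in this file; the real enums and parser in `scraping/fields.py` may differ in detail.

```python
from enum import Flag, auto

class PersonScrapingFields(Flag):
    # Each member maps to exactly one page navigation.
    EXPERIENCE = auto()
    EDUCATION = auto()
    INTERESTS = auto()
    HONORS = auto()
    LANGUAGES = auto()
    CONTACT_INFO = auto()

def parse_person_sections(sections: str) -> PersonScrapingFields:
    """Parse a comma-separated string like "experience,education" into flags."""
    combined = PersonScrapingFields(0)
    for raw in sections.split(","):
        name = raw.strip().upper()
        if name:
            combined |= PersonScrapingFields[name]
    return combined
```

Because flags compose with `|`, a tool can iterate the requested members and visit one URL per flag, which keeps page-visit counts predictable.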

### Rate Limit Handling

`RateLimitState` in `core/utils.py` uses exponential backoff: 30s → 60s → 120s → 300s cap. Success gradually decays the counter (not instant reset). Detection checks URL redirects, CAPTCHA markers, and body text heuristics via `detect_rate_limit()`.
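The backoff schedule can be sketched as pure arithmetic. This is a simplification: the real `RateLimitState` also tracks timing, and the 240s intermediate step below is an assumption implied by doubling under the 300s cap.

```python
class RateLimitState:
    BASE_SECONDS = 30
    CAP_SECONDS = 300

    def __init__(self) -> None:
        self.failures = 0

    def cooldown_seconds(self) -> int:
        """Exponential backoff: 30, 60, 120, ... capped at 300."""
        if self.failures == 0:
            return 0
        return min(self.BASE_SECONDS * 2 ** (self.failures - 1), self.CAP_SECONDS)

    def record_failure(self) -> None:
        self.failures += 1

    def record_success(self) -> None:
        # Decay one step instead of resetting, so one lucky request after
        # a burst of rate limits does not drop the guard instantly.
        self.failures = max(0, self.failures - 1)

state = RateLimitState()
for _ in range(3):
    state.record_failure()
```

The gradual decay is the key design choice: it keeps the server cautious for a while after LinkedIn has pushed back.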

### Configuration Flow (`config/`)

Three-layer precedence: defaults (`schema.py` dataclasses) → env vars (`load_from_env()`) → CLI args (`load_from_args()`). `AppConfig` contains `BrowserConfig` and `ServerConfig`. Accessed via `get_config()` singleton.
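A sketch of the precedence chain follows. Field names, the env var name, and the args shape are illustrative assumptions; the real dataclasses live in `config/schema.py`.

```python
import os
from dataclasses import dataclass, field

@dataclass
class ServerConfig:
    transport: str = "stdio"  # layer 1: schema default

@dataclass
class BrowserConfig:
    headless: bool = True

@dataclass
class AppConfig:
    server: ServerConfig = field(default_factory=ServerConfig)
    browser: BrowserConfig = field(default_factory=BrowserConfig)

def load_from_env(config: AppConfig) -> AppConfig:
    # Layer 2: environment variables override defaults.
    transport = os.environ.get("LINKEDIN_MCP_TRANSPORT")
    if transport:
        config.server.transport = transport
    return config

def load_from_args(config: AppConfig, args: dict) -> AppConfig:
    # Layer 3: CLI arguments override everything below.
    if args.get("transport"):
        config.server.transport = args["transport"]
    return config

config = load_from_args(load_from_env(AppConfig()), {"transport": "streamable-http"})
```

Later layers only touch fields they explicitly set, so an unset env var or CLI flag leaves the lower layer's value intact.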

### Available MCP Tools

| Tool | Description |
|------|-------------|
| `get_person_profile` | Get profile with explicit `sections` selection (experience, education, interests, honors, languages, contact_info) |
| `get_company_profile` | Get company info with explicit `sections` selection (posts, jobs) |
| `get_company_posts` | Get recent posts from company feed |
| `get_my_recent_posts` | List recent posts from the logged-in user (post_url, post_id, text_preview, created_at) |
| `get_post_comments` | Get top-level comments for a post (post_url or post_id) |
| `get_post_content` | Get the text content of a specific post (post_url or post_id) |
| `find_unreplied_comments` | Find comments on your posts without your reply (since_days, max_posts) |
| `get_notifications` | Get recent notifications from your LinkedIn notifications page (comments, reactions, connections, mentions, jobs, etc.) |
| `get_job_details` | Get job posting details |
| `search_jobs` | Search jobs by keywords and location |
| `search_people` | Search for people by keywords and location |
| `close_session` | Close browser session and clean up resources |


### Core Subpackage (`core/`)

- `exceptions.py` - Exception hierarchy (AuthenticationError, RateLimitError, etc.)
- `browser.py` - `BrowserManager` with persistent context and cookie import/export
- `auth.py` - `is_logged_in()`, `wait_for_manual_login()`, `warm_up_browser()`
- `utils.py` - `detect_rate_limit()`, `scroll_to_bottom()`, `handle_modal_close()`

## Testing

Tests live in `tests/` across ~13 test modules. Key fixtures in `conftest.py` (most `autouse=True`):

- **`reset_singletons`** — Resets browser driver, config, scraping cache, and rate limit state before and after each test.
- **`isolate_profile_dir`** — Monkeypatches `DEFAULT_PROFILE_DIR` and `get_profile_dir()` across all modules to redirect to `tmp_path`. Prevents tests from touching the real `~/.linkedin-mcp/profile/`.
- **`profile_dir`** (not autouse) — Creates a fake profile directory with a placeholder `Default/Cookies` file so `profile_exists()` returns True.
- **`mock_context`** — Mock FastMCP `Context` with `AsyncMock` for `report_progress`.

When writing new tests: async functions are collected automatically (no `@pytest.mark.asyncio` needed). Use `profile_dir` fixture when the test needs an existing profile. Browser-related tests should mock at the driver level (`drivers/browser.py`) rather than the core level.

## Development Notes

- **Python Version:** Requires Python 3.12+
- **Package Manager:** Uses `uv` for fast dependency resolution
- **Browser:** Patchright (anti-detection Playwright fork) with Chromium
- **Key Dependencies:** `fastmcp` (MCP framework), `patchright` (browser automation)
- **Logging:** Configurable levels, JSON format for non-interactive mode
- **Error Handling:** Comprehensive exception handling for LinkedIn rate limits, captchas, etc.

- **Browser profile:** Persistent at `~/.linkedin-mcp/profile/`, run `--login` to create

## Commit Message Guidelines

- Follow conventional commits: `type(scope): subject`
- Types: feat, fix, docs, style, refactor, test, chore, perf, ci
- Keep subject <50 chars, imperative mood

## Important Development Notes

### Development Workflow

- Never sign a PR or commit with Claude Code
# README.md

| Tool | Description | Status |
|------|-------------|--------|
| `get_person_profile` | Get profile info with explicit section selection (experience, education, interests, honors, languages, contact_info) | Working |
| `get_company_profile` | Extract company information with explicit section selection (posts, jobs) | Working |
| `get_company_posts` | Get recent posts from a company's LinkedIn feed | Working |
| `get_my_recent_posts` | List recent posts from the logged-in user's feed (post_url, post_id, text_preview, created_at) | Working |
| `get_post_comments` | Get top-level comments for a post (by post_url or post_id) | Working |
| `get_post_content` | Get the full text content of a specific post (by post_url or post_id) | Working |
| `find_unreplied_comments` | Find comments on your posts that you have not replied to (uses notifications when possible) | Working |
| `get_notifications` | Get recent notifications (comments, reactions, connections, mentions, jobs, etc.) | Working |
| `search_jobs` | Search for jobs with keywords and location filters | Working |
| `search_people` | Search for people by keywords and location | Working |
| `get_job_details` | Get detailed information about a specific job posting | Working |
> [!IMPORTANT]
> **Breaking change:** LinkedIn recently tightened its anti-scraping measures. The newest version uses [Patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright-python) with persistent browser profiles instead of Playwright with session files. Old `session.json` files and `LINKEDIN_COOKIE` env vars are no longer supported. Run `--login` again to create a new profile and cookie file that can be mounted in Docker. (02/2026)

**Testing posts and comments tools (Claude Desktop):** Run the server via Docker as usual (e.g. `docker run --rm -i -v ~/.linkedin-mcp:/home/pwuser/.linkedin-mcp stickerdaniel/linkedin-mcp-server:latest`). Then ask Claude to: (1) call `get_my_recent_posts` and pick a `post_url`; (2) call `get_post_comments` with that URL; (3) call `find_unreplied_comments(since_days=7, max_posts=20)` and check that returned links open correctly.

## Architecture and flows

The server uses a **two-phase startup**: first it ensures a browser profile exists for authentication, then it starts the FastMCP server with the chosen transport. The browser is a **singleton**—created on first tool use—and all scraping tools share the same extractor (navigate → scroll → innerText → strip noise).

### Startup flow

```mermaid
flowchart TD
main[main]
main --> hasLogout{--logout?}
hasLogout -->|yes| clearProfile[clear_profile_and_exit]
hasLogout -->|no| hasLogin{--login?}
hasLogin -->|yes| getProfile[get_profile_and_exit]
hasLogin -->|no| hasStatus{--status?}
hasStatus -->|yes| profileInfo[profile_info_and_exit]
hasStatus -->|no| ensureAuth[ensure_authentication_ready]
ensureAuth --> profileExists{Profile exists?}
profileExists -->|no, interactive| runSetup[run_interactive_setup]
profileExists -->|no, non-interactive| error[CredentialsNotFoundError]
profileExists -->|yes| phase2[Phase 2: Server runtime]
phase2 --> chooseTransport[Choose transport]
chooseTransport --> createServer[create_mcp_server]
createServer --> mcpRun[mcp.run]
```

### Tool execution flow

When an MCP client calls a scraping tool (e.g. `get_person_profile`, `get_company_profile`, `get_job_details`):

```mermaid
flowchart LR
client[Client calls tool]
client --> ensureAuth[ensure_authenticated]
ensureAuth --> getBrowser[get_or_create_browser]
getBrowser --> extractor[LinkedInExtractor]
extractor --> scrape[Navigate, scroll, innerText]
scrape --> result[Return url, sections, pages_visited]
```

- **ensure_authenticated** uses the same singleton browser; on first use it launches Patchright, opens `linkedin.com/feed`, and checks login (or applies the cookie bridge from `cookies.json` when running in Docker).
- **LinkedInExtractor** visits one URL per requested section (e.g. experience, education), scrolls to load lazy content, extracts innerText, strips sidebar/footer noise, and returns structured sections for the LLM to parse.
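The noise-stripping step can be illustrated with a toy version. The real `strip_linkedin_noise` patterns are not shown in this excerpt; the regexes below are illustrative assumptions about what LinkedIn chrome looks like.

```python
import re

_NOISE_PATTERNS = [
    # Top navigation labels that leak into innerText.
    re.compile(r"^\s*(Home|My Network|Jobs|Messaging|Notifications)\s*$", re.M),
    # Footer block starting at the "About Accessibility ..." links.
    re.compile(r"About\s+Accessibility\s+Help Center.*", re.S),
]

def strip_linkedin_noise(text: str) -> str:
    for pattern in _NOISE_PATTERNS:
        text = pattern.sub("", text)
    # Collapse the blank lines left behind by removed chrome.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

Stripping happens after extraction, so the LLM receives mostly profile content rather than navigation and footer text.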

<br/>
<br/>

# docs/docker-hub.md

A Model Context Protocol (MCP) server that connects AI assistants to LinkedIn.

- **Job Search**: Search for jobs with keywords and location filters
- **People Search**: Search for people by keywords and location
- **Company Posts**: Get recent posts from a company's LinkedIn feed
- **My Recent Posts**: List recent posts from the logged-in user's feed
- **Post Comments**: Get top-level comments for any post (by URL or post ID)
- **Unreplied Comments**: Find comments on your posts that you have not replied to (notifications or scan)
- **Notifications**: Get recent notifications (comments, reactions, connections, mentions, endorsements, jobs, etc.)

## Quick Start

# linkedin_mcp_server/core/__init__.py

```python
# ... (module docstring and .auth/.browser imports elided in diff)
from .exceptions import (
    # ...
    RateLimitError,
    ScrapingError,
)
from .utils import (
    detect_rate_limit,
    handle_modal_close,
    humanized_delay,
    rate_limit_state,
    scroll_to_bottom,
    wait_for_cooldown,
)

__all__ = [
    "AuthenticationError",
    # ...
    "ScrapingError",
    "detect_rate_limit",
    "handle_modal_close",
    "humanized_delay",
    "is_logged_in",
    "rate_limit_state",
    "scroll_to_bottom",
    "wait_for_cooldown",
    "wait_for_manual_login",
    "warm_up_browser",
]
```
# linkedin_mcp_server/core/browser.py

```python
    async def export_cookies(self, cookie_path: str | Path | None = None) -> bool:
        # ... (body elided in diff)
            logger.exception("Failed to export cookies")
            return False

    _REQUIRED_COOKIE_NAMES = frozenset({"li_at", "li_rm"})

    async def import_cookies(self, cookie_path: str | Path | None = None) -> bool:
        """Import all LinkedIn cookies from a portable JSON file.

        Clears all existing browser cookies before importing to avoid
        undecryptable cookie conflicts in the persistent store.
        All LinkedIn cookies are imported to preserve session state
        (JSESSIONID, bcookie, lidc, etc.) alongside auth tokens.
        """
        if not self._context:
            logger.warning("Cannot import cookies: no browser context")
            # ... (elided in diff: early return, then reading the cookie
            # file at `path` into `all_cookies`)

        cookies = [
            self._normalize_cookie_domain(c)
            for c in all_cookies
            if "linkedin.com" in c.get("domain", "")
        ]

        # Verify that required auth cookies are present
        cookie_names = {c.get("name") for c in cookies}
        if not self._REQUIRED_COOKIE_NAMES & cookie_names:
            logger.warning("No auth cookies (li_at/li_rm) found in %s", path)
            return False
```
> **Reviewer comment (Contributor) on lines +218 to 222:**
>
> **No unit tests for the new `import_cookies` behavior**
>
> The core fix of this PR — filtering cookies by domain instead of by name — has no dedicated unit tests. The only reference to `import_cookies` in the test suite (`tests/test_browser_driver.py:40`) mocks it out entirely (`browser.import_cookies = AsyncMock(return_value=False)`), so the new domain-filter logic and the updated required-cookie check are not exercised at all.
>
> Consider adding tests to `tests/test_browser_driver.py` (or a new `tests/test_core_browser.py`) that cover at minimum:
>
> - A cookie file containing only LinkedIn-domain cookies (including `li_at`) → should return `True` and import all of them.
> - A cookie file containing mixed-domain cookies → only LinkedIn cookies should be imported.
> - A cookie file with LinkedIn cookies but neither `li_at` nor `li_rm` → should return `False`.
> - An empty or missing cookie file → should return `False`.
>
> Without these, a regression to the old name-based filtering (or a typo in the domain string) would go undetected.

```python
        # Clear undecryptable cookies from the persistent store first.
        await self._context.clear_cookies()
        await self._context.add_cookies(cookies)  # type: ignore[arg-type]
        logger.info(
            "Imported %d LinkedIn cookies from %s: %s",
            len(cookies),
            path,
            ", ".join(c["name"] for c in cookies),
        )
```