
fix(auth): import all LinkedIn cookies in cross-platform bridge#217

Open
andrema2 wants to merge 14 commits into stickerdaniel:main from andrema2:feature/216-import-all-linkedin-cookies

Conversation

andrema2 commented Mar 12, 2026

Summary

  • Cookie bridge now imports all LinkedIn cookies instead of only li_at and li_rm
  • Session cookies (JSESSIONID, bcookie, lidc, bscookie, liap, etc.) are now preserved during cross-platform bridge
  • Still validates that required auth cookies (li_at/li_rm) are present before importing

Closes #216

Test plan

  • Run --login on macOS to create fresh profile and cookies.json
  • Start MCP server (Docker or uvx) and verify scraping returns content
  • Verify cookies.json contains all LinkedIn cookies after export
  • Verify cross-platform bridge imports all cookies (check logs for count)
  • Run uv run pytest — 163 tests passing

🤖 Generated with Claude Code

Greptile Summary

This PR delivers the stated cookie-bridge fix — import_cookies now filters by domain ("linkedin.com" in domain) instead of by cookie name, importing all LinkedIn session cookies (e.g. JSESSIONID, bcookie, lidc) while still validating that at least one auth token (li_at/li_rm) is present before committing. It also ships a significant batch of new features: posts/comments/notifications scraping tools, a TTL scraping cache, session-level rate-limit state with exponential backoff, an auth-check TTL cache in the driver, and humanized navigation delays. New unit tests specifically address the import_cookies behavior requested in the previous review.

Key changes:

  • core/browser.py: import_cookies domain filter replaces name-based filter; _AUTH_COOKIE_NAMES renamed to _REQUIRED_COOKIE_NAMES for clarity
  • scraping/posts.py: New 824-line module for get_my_recent_posts, get_post_comments, get_post_content, get_notifications, find_unreplied_comments
  • scraping/cache.py: New in-memory TTL cache (ScrapingCache) with module-level singleton
  • core/utils.py: New RateLimitState (exponential backoff), humanized_delay(), wait_for_cooldown(); detect_rate_limit now records state on each trigger
  • drivers/browser.py: 120s auth-check cache (_auth_valid_until), scraping_cache.clear() on browser close
  • scraping/extractor.py: Integrates cache, humanized delays, and rate-limit state tracking
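The rate-limit state described above can be sketched as follows. This is a hedged illustration of exponential backoff with the 30s base and 300s cap mentioned in the review; the class name matches the review, but the method names and exact policy are assumptions:

```python
# Sketch of session-level rate-limit state with exponential backoff.
# 30s base doubling to a 300s cap, per the review; details are assumed.
import time


class RateLimitState:
    def __init__(self, base: float = 30.0, cap: float = 300.0):
        self.base = base
        self.cap = cap
        self._strikes = 0
        self._cooldown_until = 0.0

    def record_trigger(self) -> None:
        """Called when a rate limit is detected; doubles the cooldown."""
        delay = min(self.base * (2 ** self._strikes), self.cap)
        self._strikes += 1
        self._cooldown_until = time.monotonic() + delay

    def record_success(self) -> None:
        """A successful navigation resets the backoff."""
        self._strikes = 0
        self._cooldown_until = 0.0

    def cooldown_remaining(self) -> float:
        return max(0.0, self._cooldown_until - time.monotonic())
```

A wait_for_cooldown() helper would simply sleep for cooldown_remaining() before navigating.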

Issues found:

  • In _unreplied_via_notifications the JS expression (a.closest('li') || a.closest('div')).innerText can throw TypeError if the anchor has no li or div ancestor, silently triggering the expensive post-scanning fallback
  • TestGetMyRecentPosts mocks page.evaluate to return a plain list (the legacy code path) rather than the { items, scrollHeight } dict the current JS produces, leaving the scroll-loop and height-comparison logic untested

Confidence Score: 4/5

  • Safe to merge; the core cookie-bridge fix is correct and well-tested, with one low-severity JS null-guard issue and a test coverage gap in the new posts scraper.
  • The targeted fix is sound and directly addresses the regression from #216 (Cookie bridge imports only auth cookies, losing session state). New features are well-structured and follow existing patterns. The two issues found are: (1) a JS null-dereference edge case that degrades gracefully by falling back to the post-scanning path rather than crashing, and (2) a test coverage gap where the primary dict-return path of get_my_recent_posts is not exercised. Neither blocks merging, but both are worth addressing.
  • Follow-ups: linkedin_mcp_server/scraping/posts.py (JS null guard in _unreplied_via_notifications) and tests/test_posts_scraping.py (dict-path coverage for get_my_recent_posts)

Important Files Changed

Filename Overview
linkedin_mcp_server/core/browser.py Core fix: import_cookies now filters by domain ("linkedin.com" in domain) instead of by name, and validates that at least one required auth cookie (li_at/li_rm) is present before importing. Logic is correct and well-tested in test_core_browser.py.
linkedin_mcp_server/scraping/posts.py Large new file (824 lines) adding posts/comments/notifications scraping. Contains a potential null-dereference in the _unreplied_via_notifications JS evaluation that could silently trigger the expensive fallback path on structural DOM edge cases.
linkedin_mcp_server/core/utils.py Adds RateLimitState with exponential backoff (30s → 300s cap), humanized_delay(), and wait_for_cooldown(). Implementation is clean; detect_rate_limit now records state on each detection. Module-level singleton reset is handled in conftest.py.
linkedin_mcp_server/drivers/browser.py Adds 120s auth-check TTL cache (_auth_valid_until) to avoid redundant DOM queries on every tool call. Cache is correctly invalidated in close_browser() and reset_browser_for_testing(). scraping_cache.clear() is also called on close.
linkedin_mcp_server/scraping/cache.py New in-memory TTL cache backed by a dict of (value, expires_at) tuples. Clean implementation with get, put, invalidate, and clear. Module-level singleton scraping_cache is well-tested in test_cache.py.
tests/test_core_browser.py New test file directly addressing the previous review's request for import_cookies unit tests. Covers all four key scenarios: LinkedIn-only cookies, mixed-domain filtering, missing auth cookies → False, empty/missing file → False, and domain normalization.
tests/test_posts_scraping.py Good coverage for _normalize_post_url, get_post_comments, find_unreplied_comments, and get_notifications. However, get_my_recent_posts tests mock evaluate to return a plain list (legacy path) rather than the dict the JS actually produces, leaving the scroll-loop and dict-handling code paths untested.
linkedin_mcp_server/tools/posts.py New MCP tool registrations for get_my_recent_posts, get_post_comments, get_post_content, get_notifications, find_unreplied_comments. Consistent with existing tool patterns; all have readOnlyHint=True and use handle_tool_error for error handling.
linkedin_mcp_server/scraping/extractor.py Integrates scraping_cache, humanized_delay, rate_limit_state, and wait_for_cooldown. Fixed static _NAV_DELAY replaced with randomized delay. Both extract_page and _extract_overlay now cache successful results and call rate_limit_state.record_success() after navigation.

Sequence Diagram

sequenceDiagram
    participant Client as MCP Client
    participant Tool as tools/posts.py
    participant Driver as drivers/browser.py
    participant Scraper as scraping/posts.py
    participant Cache as scraping/cache.py
    participant Page as Patchright Page
    participant RLS as RateLimitState

    Client->>Tool: get_my_recent_posts(limit)
    Tool->>Driver: ensure_authenticated()
    Note over Driver: Return early if _auth_valid_until not expired
    Driver->>Driver: validate_session() [DOM check]
    Driver-->>Tool: ok
    Tool->>Driver: get_or_create_browser()
    Driver-->>Tool: browser
    Tool->>Scraper: get_my_recent_posts(page, limit)
    Scraper->>RLS: wait_for_cooldown()
    Scraper->>Page: goto(_MY_POSTS_URL)
    Scraper->>RLS: record_success()
    loop scroll & collect until limit or stable height
        Scraper->>Page: evaluate(JS, limit)
        Page-->>Scraper: {items, scrollHeight}
        Scraper->>Page: scroll_to_bottom()
    end
    Scraper-->>Tool: posts[]
    Tool-->>Client: {posts: [...]}

    Client->>Tool: get_post_comments(post_url)
    Tool->>Scraper: get_post_comments(page, url)
    Scraper->>Cache: get(cache_key)
    alt cache hit
        Cache-->>Scraper: cached comments[]
    else cache miss
        Scraper->>RLS: wait_for_cooldown()
        Scraper->>Page: goto(post_url)
        Scraper->>RLS: record_success()
        Scraper->>Page: evaluate(JS comments extractor)
        Page-->>Scraper: comments[]
        Scraper->>Cache: put(cache_key, comments)
    end
    Scraper-->>Tool: comments[]
    Tool-->>Client: {comments: [...]}

    Client->>Tool: find_unreplied_comments(since_days, max_posts)
    Tool->>Scraper: find_unreplied_comments(page, ...)
    Scraper->>Scraper: _unreplied_via_notifications()
    alt notifications have comment items
        Scraper-->>Tool: unreplied from notifications
    else notifications empty (loaded OK)
        Scraper-->>Tool: []
    else notifications failed (None)
        Scraper->>Scraper: get_my_recent_posts() fallback
        loop each post (max 5 navigations)
            Scraper->>Scraper: get_post_comments()
        end
        Scraper-->>Tool: unreplied from post scan
    end
    Tool-->>Client: {unreplied_comments: [...]}
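The "return early if _auth_valid_until not expired" step in the diagram can be sketched as a small TTL gate. This is an illustration only; `AuthCache` and `validate_session` are stand-ins for the real driver internals, with the 120s TTL taken from the PR description:

```python
# Sketch of the 120s auth-check TTL: skip the DOM-level session check
# while _auth_valid_until is in the future. Names are stand-ins.
import time

AUTH_CHECK_TTL = 120.0  # seconds, per the PR description


class AuthCache:
    def __init__(self):
        self._auth_valid_until = 0.0

    async def ensure_authenticated(self, validate_session) -> bool:
        now = time.monotonic()
        if now < self._auth_valid_until:
            return True  # cached: no DOM query this tool call
        ok = await validate_session()
        if ok:
            self._auth_valid_until = now + AUTH_CHECK_TTL
        return ok

    def reset(self) -> None:
        """Called on browser close so a stale session is never trusted."""
        self._auth_valid_until = 0.0
```

Invalidating on close_browser() and reset_browser_for_testing(), as the file table notes, corresponds to calling reset() here.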

Comments Outside Diff (3)

  1. scripts/test_my_recent_posts.py, line 1910-1913 (link)

    Portuguese strings in an otherwise English codebase

    This script uses Portuguese for all user-facing output (e.g. "Erro de autenticação. Faça login uma vez:", "Buscando até … posts", "Encontrados:", "Nenhum post encontrado", "JSON completo:"). The rest of the project — docs, comments, log messages, and other scripts — is entirely in English. This inconsistency makes the script harder to use for non-Portuguese-speaking contributors and conflicts with the project's language convention.

    Please translate these strings to English for consistency.

  2. linkedin_mcp_server/scraping/posts.py, line 1443 (link)

    Potential null-dereference in JS evaluation

    a.closest('li') || a.closest('div') can return null if the anchor element has no <li> or <div> ancestor (e.g., an anchor directly under <article> or <section>). Calling .innerText on null throws a TypeError, which causes page.evaluate() to reject. The outer except Exception handler then returns None, silently falling back to the expensive full post-scanning path even when the notifications page loaded correctly.

    Add a null guard before accessing .innerText:
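    The review's suggested guard widens the ancestor lookup and falls back to the anchor itself, so `.innerText` is never read off null:

    ```
                const container = a.closest('li') || a.closest('div') || a.closest('article') || a;
                const text = (container.innerText || '').trim();
    ```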

  3. tests/test_posts_scraping.py, line 2349-2376 (link)

    Tests exercise legacy code path, not the current JS return format

    The get_my_recent_posts JS now returns { items: [...], scrollHeight: ... } (a dict), but these tests mock page.evaluate to return a plain list. This causes the tests to exercise the legacy else fallback branch in the Python code instead of the primary isinstance(raw, dict) path.

    As a result, the scroll-loop termination logic (height comparison, multi-call behavior) and the dict-path deduplication are not tested at all. A regression in the dict-handling code would silently go undetected.

    Consider adding a complementary test that mocks evaluate to return the dict format the JS actually produces:

    ```python
    async def test_returns_posts_from_evaluate_dict_format(
        self, mock_scroll, mock_modal, mock_rate_limit, mock_page
    ):
        mock_page.evaluate = AsyncMock(
            return_value={
                "items": [
                    {
                        "post_url": "https://www.linkedin.com/feed/update/urn:li:activity:1/",
                        "post_id": "urn:li:activity:1",
                        "text_preview": "First post",
                        "created_at": None,
                    }
                ],
                "scrollHeight": 1000,
            }
        )
        result = await get_my_recent_posts(mock_page, limit=10)
        assert len(result) == 1
        assert result[0]["post_url"] == "https://www.linkedin.com/feed/update/urn:li:activity:1/"
    ```
Last reviewed commit: 3532114

andrema2 and others added 13 commits March 3, 2026 10:35
- get_my_recent_posts: incremental scroll with scrollHeight stable stop, seen_urns dedupe
- Add _expand_comments_section helper (Load more / Ver mais)
- get_post_comments: use data-urn urn:li:comment, top-level only, expand before extract
- _get_current_user_name: avatar alt, Me/Eu menu, fallback nav link
- Notifications filter: PT/EN terms (comentário, comentou, resposta, respondeu)
- Tests: legacy evaluate format, find_unreplied since_days/max_scrolls assert

Made-with: Cursor
- Add humanized delays with jitter (1.5-4.0s random) replacing fixed 2.0s
- Add in-memory TTL cache (5min) to avoid re-scraping same pages
- Add session-level rate limit awareness with exponential backoff
- Optimize find_unreplied_comments: cap fallback at 5 navigations
- Improve notifications path to return early on successful empty result
- Cache auth checks for 120s to reduce redundant DOM queries
- Cache post comments to avoid re-fetching in same session

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add new MCP tool to read the text content of a specific LinkedIn post
given its URL, URN, or activity ID. Reuses LinkedInExtractor.extract_page()
for navigation, caching, rate limits, and noise stripping.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add MCP tool to scrape LinkedIn notifications page, returning
structured items with type classification (comment, reaction,
connection, mention, endorsement, job, post, birthday, etc.)

Closes stickerdaniel#211

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Implement since_days date filtering in get_my_recent_posts
- Add get_notifications, get_post_content, get_profile_recent_posts to scraping/__init__.py exports
- Use link-based dedup key for notifications to reduce false positives
- Include current_user_name in comment cache key to prevent stale reply detection

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The cookie bridge only imported li_at and li_rm, discarding session
cookies (JSESSIONID, bcookie, lidc, etc.) that LinkedIn requires for
valid requests. This caused empty responses after bridge activation.

Closes stickerdaniel#216

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Comment on lines +218 to 222
```python
# Verify that required auth cookies are present
cookie_names = {c.get("name") for c in cookies}
if not self._REQUIRED_COOKIE_NAMES & cookie_names:
    logger.warning("No auth cookies (li_at/li_rm) found in %s", path)
    return False
```

No unit tests for the new import_cookies behavior

The core fix of this PR — filtering cookies by domain instead of by name — has no dedicated unit tests. The only reference to import_cookies in the test suite (tests/test_browser_driver.py:40) mocks it out entirely (browser.import_cookies = AsyncMock(return_value=False)), so the new domain-filter logic and the updated required-cookie check are not exercised at all.

Consider adding tests to tests/test_browser_driver.py (or a new tests/test_core_browser.py) that cover at minimum:

  • A cookie file containing only LinkedIn-domain cookies (including li_at) → should return True and import all of them.
  • A cookie file containing mixed-domain cookies → only LinkedIn cookies should be imported.
  • A cookie file with LinkedIn cookies but neither li_at nor li_rm → should return False.
  • An empty or missing cookie file → should return False.

Without these, a regression to the old name-based filtering (or a typo in the domain string) would go undetected.


Add dedicated tests for BrowserManager.import_cookies covering domain
filtering, auth-cookie validation, empty/missing files, and domain
normalization. Remove unused _FEED_URL constant from posts.py.

Co-Authored-By: Claude Opus 4.6 <[email protected]>


Development

Successfully merging this pull request may close these issues.

Cookie bridge imports only auth cookies, losing session state

1 participant