@gvanrossum
Collaborator

(AI-generated description)

Summary

Enhances email import to properly handle inline replies (where the sender responds inline to quoted text) and tracks whether each text chunk is original content or quoted from someone else.

Changes

New Features

  • Inline reply detection: Recognizes emails where the sender responds inline to > quoted text, not just top-posted replies
  • Chunk source attribution: New chunk_sources field on EmailMessage that parallels text_chunks:
    • None = original content from the email sender
    • str = quoted content (the string is the quoted person's name, or " " if unknown)
  • Quoted person extraction: Parses "On Mon, Dec 10, 2020 at 10:30 AM John Doe wrote:" headers to extract the quoted person's name

Implementation

  • New parse_email_chunks() function returns list[tuple[str, str | None]], pairing each chunk's full text with its source attribution
  • Preserves quoted content unabbreviated (previously it was discarded or summarized)
  • Recognizes the signature marker ("-- ", two dashes followed by a space) and excludes signatures from parsed content

Why This Matters

Higher-level ingestion code can now decide how to index quoted text so it doesn't get incorrectly attributed to the email's sender. This enables more accurate knowledge extraction from email threads.

Testing

  • Added comprehensive tests for is_inline_reply() and parse_email_chunks()
  • All existing tests continue to pass

@gvanrossum changed the title from "Improve email import: handle inline replies and track quoted content attribution" to "Improve email ingestion: handle inline replies and track quoted content attribution" on Dec 4, 2025