Merged
Changes from 4 commits
14 changes: 8 additions & 6 deletions email_reply_parser/__init__.py
@@ -47,7 +47,7 @@ class EmailMessage(object):
     def __init__(self, text):
         self.fragments = []
         self.fragment = None
-        self.text = text.replace('\r\n', '\n')
+        self.text = '\n'.join(text.splitlines())
         self.found_visible = False

     def read(self):
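
The change to `self.text` above swaps a targeted CRLF replacement for `str.splitlines()`. The following is a minimal sketch of how the two differ, using a made-up sample string (an assumption for illustration, not one of the PR's fixtures):

```python
# Hypothetical sample mixing a CRLF, a bare CR, and a trailing LF.
sample = "first\r\nsecond\rthird\n"

# Old behaviour: only CRLF pairs are rewritten, so the bare CR survives.
old_text = sample.replace("\r\n", "\n")

# New behaviour: splitlines() breaks on every line boundary it recognises
# (including a bare CR), and join() does not re-add a trailing newline.
new_text = "\n".join(sample.splitlines())

print(repr(old_text))
print(repr(new_text))
```

Note that `str.splitlines()` also splits on less common boundaries such as form feed and the Unicode line separator, so the new normalisation is broader than the old one.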
@@ -94,18 +94,20 @@ def _scan_line(self, line):

             line - a row of text from an email message
         """
-        is_quote_header = self.QUOTE_HDR_REGEX.match(line) is not None
-        is_quoted = self.QUOTED_REGEX.match(line) is not None
-        is_header = is_quote_header or self.HEADER_REGEX.match(line) is not None
+        stripped_line = line.strip()

What do you think about changing HEADER_REGEX from r'^\*?(From|Sent|To|Subject):\*? .+' to r'^\s*\*?(From|Sent|To|Subject):\*? .+' instead?
The other expressions don't seem to care about leading whitespace.
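
To make the suggestion concrete, here is a small sketch comparing the two patterns quoted in the comment above. The input lines are hypothetical, invented for illustration:

```python
import re

# Both patterns are copied from the review comment.
CURRENT_HEADER_REGEX = re.compile(r'^\*?(From|Sent|To|Subject):\*? .+')
PROPOSED_HEADER_REGEX = re.compile(r'^\s*\*?(From|Sent|To|Subject):\*? .+')

# Hypothetical header lines: one flush left, one with leading spaces.
flush = "From: someone@example.com"
indented = "   From: someone@example.com"


def is_header(pattern, line):
    """Return True when the pattern matches at the start of the line."""
    return pattern.match(line) is not None


results = {
    "current/flush": is_header(CURRENT_HEADER_REGEX, flush),
    "current/indented": is_header(CURRENT_HEADER_REGEX, indented),
    "proposed/flush": is_header(PROPOSED_HEADER_REGEX, flush),
    "proposed/indented": is_header(PROPOSED_HEADER_REGEX, indented),
}
print(results)
```

Only the current pattern applied to the indented line fails to match. Stripping the line before matching with the current pattern accepts the same input.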

Member Author

I saw that Yahoo! started behaving weirdly: they add extra spaces everywhere, and they add what appear to be random |s where the original message should be (are they trying to recreate table cells using |s?). For example:

 |

|
|
| New Message from Alexandru on Sailo |

 |

 |

|
|
| Ahoy Alexandru, |

 |

 |

|
|
|  Alexandru has sent you a message regarding a trip aboard the X-Yachts Xp 38. |

 |

 |

I figured it was safer, for parsing purposes, to ignore leading and trailing spaces on every line.

@illia-v, Apr 15, 2020

This may behave incorrectly when someone adds leading spaces intentionally, for example in list items or indented code.

I agree there is little chance that the spaces will matter in our use case, but for a general-purpose library it's not good.

Member Author

I agree an email may contain whitespace that's no accident. However, you're gonna have a hard time distinguishing between user intent and email servers' vagaries.

I updated the code to reflect the change observed in Yahoo!'s behavior. You're right that I shouldn't make assumptions about the other lines.
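
One way to reconcile the two concerns in this thread is to match patterns against a stripped copy of each line while storing the original line untouched. The sketch below illustrates that idea only. `QUOTE_HDR` is an assumed stand-in (the library's real `QUOTE_HDR_REGEX` is not shown in this diff), and `classify` is a hypothetical helper, not part of the parser:

```python
import re

# Assumed stand-in for the library's quote-header pattern.
QUOTE_HDR = re.compile(r'^On\s.+wrote:$')


def classify(line):
    """Return (is_quote_header, stored_line) for one raw line.

    The stripped copy is used only for pattern matching; the line that
    would be stored in the fragment keeps its original whitespace.
    """
    stripped = line.strip()
    return QUOTE_HDR.match(stripped) is not None, line


# Hypothetical raw lines; the amount of padding is invented.
raw_lines = [
    "  On Friday, April 3, 2020, 06:05:24 PM EDT, Vicki Davis wrote:  ",
    "    indented list item",
]
classified = [classify(raw) for raw in raw_lines]
for is_quote_header, stored in classified:
    print(is_quote_header, repr(stored))
```

With this approach the padded header is still recognised, and intentional indentation in ordinary lines survives in the stored content.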


-        if self.fragment and len(line.strip()) == 0:
+        is_quote_header = self.QUOTE_HDR_REGEX.match(stripped_line) is not None
+        is_quoted = self.QUOTED_REGEX.match(stripped_line) is not None
+        is_header = is_quote_header or self.HEADER_REGEX.match(stripped_line) is not None
+
+        if self.fragment and len(stripped_line) == 0:
             if self.SIG_REGEX.match(self.fragment.lines[-1].strip()):
                 self.fragment.signature = True
                 self._finish_fragment()

         if self.fragment \
                 and ((self.fragment.headers == is_header and self.fragment.quoted == is_quoted) or
-                     (self.fragment.quoted and (is_quote_header or len(line.strip()) == 0))):
+                     (self.fragment.quoted and (is_quote_header or len(stripped_line) == 0))):

             self.fragment.lines.append(line)
         else:
7 changes: 7 additions & 0 deletions test/emails/email_1_10.txt
@@ -0,0 +1,7 @@
Base tax cost environment side. May house most director treatment call heavy.
Forward professional woman institution happen. Tell girl hope to. Wrong perhaps apply anything expert main indeed.

On Monday, April 13, 2020, 06:49:16 PM GMT+3, Paige Lee wrote:

Thank experience bag memory hundred understand of. Environmental lose probably majority peace behind. When produce ask tough.
Institution thought system class nice instead speak.
9 changes: 9 additions & 0 deletions test/emails/email_1_9.txt
@@ -0,0 +1,9 @@
Resource popular local capital doctor. Wish with think north shoulder stand catch. Decade many production food view only green.

Believe concern floor treatment admit keep maintain put.
On Friday, April 3, 2020, 06:05:24 PM EDT, Vicki Davis wrote:


Example myself effect understand miss idea. Tonight work home policy arm time report.

Against rest concern each hotel. Person care policy sea. Attack realize suggest save all everything scientist.
72 changes: 69 additions & 3 deletions test/test_email_reply_parser.py
@@ -90,6 +90,72 @@ def test_complex_body_with_one_fragment(self):

         self.assertEqual(1, len(message.fragments))

+    def test_whitespace_before_header(self):
+        '''Header has whitespace at the beginning of the line.
+
+        Seen in Yahoo! Mail (April 2020) with rich text reply.
+        '''
+
+        message = self.get_email('email_1_9')
+
+        self.assertEqual(
+            3,
+            len(message.fragments)
+        )
+
+        self.assertEqual(
+            [False, False, False],
+            [f.quoted for f in message.fragments]
+        )
+
+        self.assertEqual(
+            [False, False, False],
+            [f.signature for f in message.fragments]
+        )
+
+        self.assertEqual(
+            [False, True, False],
+            [f.headers for f in message.fragments]
+        )
+
+        self.assertEqual(
+            [False, True, True],
+            [f.hidden for f in message.fragments]
+        )
+
+    def test_quote_not_quoted(self):
+        '''Original email is not quoted at all.
+
+        Seen in Yahoo! Mail (April 2020) with plain text reply.
+        '''
+
+        message = self.get_email('email_1_10')
+
+        self.assertEqual(
+            3,
+            len(message.fragments)
+        )
+
+        self.assertEqual(
+            [False, False, False],
+            [f.quoted for f in message.fragments]
+        )
+
+        self.assertEqual(
+            [False, False, False],
+            [f.signature for f in message.fragments]
+        )
+
+        self.assertEqual(
+            [False, True, False],
+            [f.headers for f in message.fragments]
+        )
+
+        self.assertEqual(
+            [False, True, True],
+            [f.hidden for f in message.fragments]
+        )
+
     def test_verify_reads_signature_correct(self):
         message = self.get_email('correct_sig')
         self.assertEqual(2, len(message.fragments))
@@ -166,17 +232,17 @@ def test_multiple_on(self):
         self.assertTrue(re.match('^On 9 Jan 2014', message.fragments[1].content))

         self.assertEqual(
-            [False, True, False],
+            [False, True],
             [fragment.quoted for fragment in message.fragments]
         )

         self.assertEqual(
-            [False, False, False],
+            [False, False],
             [fragment.signature for fragment in message.fragments]
         )

         self.assertEqual(
-            [False, True, True],
+            [False, True],
             [fragment.hidden for fragment in message.fragments]
         )
