Skip to content

Conversation

@alexei
Copy link
Member

@alexei alexei commented Apr 13, 2020

No description provided.

@alexei alexei requested a review from illia-v April 14, 2020 12:13
@alexei alexei self-assigned this Apr 14, 2020
@alexei alexei added the bug Something isn't working label Apr 14, 2020
@alexei
Copy link
Member Author

alexei commented Apr 14, 2020

This PR fixes two bugs:

  1. Newlines were not handled properly
  2. Yahoo! prepends the header with a couple of extra whitespace characters

@alexei alexei marked this pull request as draft April 14, 2020 14:34
is_quote_header = self.QUOTE_HDR_REGEX.match(line) is not None
is_quoted = self.QUOTED_REGEX.match(line) is not None
is_header = is_quote_header or self.HEADER_REGEX.match(line) is not None
stripped_line = line.strip()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about changing HEADER_REGEX from r'^\*?(From|Sent|To|Subject):\*? .+' to r'^\s*\*?(From|Sent|To|Subject):\*? .+' instead?
Other expressions seem not to care about leading whitespaces

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw that Yahoo! started behaving weirdly - they add extra spaces everywhere, then they add what appear to be random |s where the original message should be (are they trying to recreate the table cells using |s?). See for example:

 |

|
|
| New Message from Alexandru on Sailo |

 |

 |

|
|
| Ahoy Alexandru, |

 |

 |

|
|
|  Alexandru has sent you a message regarding a trip aboard the X-Yachts Xp 38. |

 |

 |

I figured it was safer, for parsing purposes, if we ignored trailing spaces on every line.

Copy link

@illia-v illia-v Apr 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may result in working incorrectly when someone adds leading spaces intentionally.
For example, list items or code with indents.

I agree that there is a little chance that the spaces will be important in our use case, but for a general-purpose library it's not good

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree an email may contain whitespace that's no accident. However, you're gonna have a hard time distinguishing between user intent and email servers' vagaries.

I updated the code to reflect the change that's been observed in Yahoo! behavior. Indeed, it's fair that I don't make assumptions about the other lines.

@alexei alexei marked this pull request as ready for review April 16, 2020 08:30
@alexei alexei merged commit 8d99681 into master Apr 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants