
Handle embedded quote in mmcif #619

Merged
padix-key merged 5 commits into biotite-dev:main from 0ut0fcontrol:handle_embedded_quote
Jul 14, 2024

Contributor

@0ut0fcontrol 0ut0fcontrol commented Jun 30, 2024

Fixes #570.

Use three regex patterns to match the fields in one line, so that embedded quotes in mmCIF files are handled:

  single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"
  double_quote_pattern = r'("(?:"(?! )|[^"])*")(?:\s|$)'
  unquoted_pattern = r"([^\s]+)"
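
For illustration, here is a minimal sketch of how the three patterns could be combined to split one line into fields. The alternation order (quoted patterns first, unquoted last) and the helper name `split_fields` are assumptions made for this example, not necessarily how the PR wires the patterns into the parser.

```python
import re

single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"
double_quote_pattern = r'("(?:"(?! )|[^"])*")(?:\s|$)'
unquoted_pattern = r"([^\s]+)"

# Assumed combination: try the quoted alternatives before the unquoted one.
combined_pattern = re.compile(
    "|".join([single_quote_pattern, double_quote_pattern, unquoted_pattern])
)


def split_fields(line):
    """Hypothetical helper: split one line into its fields."""
    fields = []
    for match in combined_pattern.finditer(line):
        # Exactly one of the three capture groups is set for each match.
        token = next(group for group in match.groups() if group is not None)
        # Strip the enclosing quotes, keeping any embedded ones.
        if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
            token = token[1:-1]
        fields.append(token)
    return fields


fields = split_fields("1 'a dog's life' \"5'-end\" polymer")
```

In this sketch, a quote that is directly followed by another character stays part of the field, while a quote followed by a space (or the end of the line) closes it.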

GPT-4's explanation of single_quote_pattern:

This regex single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)" is engineered to identify and extract substrings enclosed in single quotes from a larger text, with a particular sensitivity to handle internal apostrophes correctly. Let's dissect this expression to understand how it functions:

  1. ': This matches the opening single quote ' of the target substring.

  2. (?: ... ): This is a non-capturing group, which means it groups the contained pattern parts without storing the matched substring. This is used here mainly for grouping purposes without needing backreferences.

  3. '(?! ): A single quote ' followed by the negative lookahead (?! ), so the quote is matched only if it is not immediately followed by a space. This allows the regex to match apostrophes within words (like in contractions such as don't) without treating them as the end of the quoted substring.

  4. |: The logical OR operator presents an alternative within the non-capturing group. It separates the negative lookahead for internal apostrophes from the next part of the pattern.

  5. [^']: This is a negated character class that matches any character except a single quote '. This part of the expression ensures that the regex consumes all characters within the quotes until it encounters the next single quote, which might signify the end of the quoted substring.

  6. *: This quantifier applies to the non-capturing group, allowing the contained pattern to repeat any number of times — including zero times — thus enabling the regex to match quoted substrings of any length.

  7. ': Matches the closing single quote of the substring.

  8. (?:\s|$): Another non-capturing group that operates as a condition for what follows the closing quote. It matches either:

    • \s: A whitespace character, ensuring that the quoted substring is followed by whitespace, or
    • $: The end of a line or string, allowing for the quoted substring to appear at the end of the text.
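
The breakdown above can be checked with a small sketch that applies only the single-quote pattern to the start of a string. The helper name `first_quoted` and the sample strings are made up for this example.

```python
import re

single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"


def first_quoted(text):
    """Hypothetical helper: return the quoted field at the start of text, or None."""
    match = re.match(single_quote_pattern, text)
    return match.group(1) if match else None


# The apostrophe in "don't" is followed by a letter, so it does not close the field.
embedded = first_quoted("'don't stop' next")
# The closing quote may also sit at the very end of the string.
at_end = first_quoted("'end of line'")
# Without a closing quote there is no match at all.
unterminated = first_quoted("'unterminated")
```

Under these inputs the first two calls return the full quoted field including its enclosing quotes, and the third returns None.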

The Key Points:

  • The pattern is designed to efficiently target substrings enclosed in single quotes within a larger string or document.
  • It smartly handles situations where an apostrophe is part of the enclosed text (like in contractions) without mistakenly recognizing it as the end of the quoted section.
  • By requiring the quoted substring to be followed by a space or the end of the text, it imposes a sensible boundary condition to identify discrete quoted substrings within a flow of text.

This regex could be particularly useful in text parsing applications where accurately distinguishing between quoted strings and regular text is crucial, such as in natural language processing tasks, data extraction, or in developing syntax highlighters for code editors.

@0ut0fcontrol
Contributor Author

@padix-key
I'm sorry, I've been too busy with work and haven't had much time to delve into regex.
Regex can be quite a headache.

Could you take a look at this solution?
I'm not sure if the test covers all scenarios, and I'm thinking of adding more tests.
Do you have any suggestions?

@0ut0fcontrol force-pushed the handle_embedded_quote branch from a09e8ab to 23f4e2f on June 30, 2024 09:36
@0ut0fcontrol marked this pull request as ready for review on June 30, 2024 09:54
Member

@padix-key padix-key left a comment

Hi, thank you for preparing the fix! I have put a few suggestions into the review.

@0ut0fcontrol
Contributor Author

Thank you for your review. I will provide feedback as soon as possible. I will have time in the evening or during the weekend.

@padix-key
Member

Thanks for the benchmarks. I will look into your comments tomorrow.

@padix-key
Member

It seems like your approach is the most efficient one (at least I could not come up with a better one). So only two discussions remain.

@0ut0fcontrol
Contributor Author

Thank you for your review, I will finish this PR ASAP.

@0ut0fcontrol
Contributor Author

@padix-key
All discussions have been resolved. This PR is ready for review.

Member

@padix-key padix-key left a comment

Looks good to me! Thanks again for delving into regex and the thorough benchmarks.

Development

Successfully merging this pull request may close these issues.

Failed to deserialize category 'entity' with ValueError: No closing quotation
