Conversation
jnwei
left a comment
There was a problem hiding this comment.
Thank you for submitting a fix for this issue @gnikolenyi ! I understand that this fix might be a smaller part of a set of larger fixes planned with the later release. Thank you for adding this fix separately first.
To ensure correctness of the check_sequence logic, it would be extremely helpful to have a few test cases. Simple examples like the one provided by @ECalfeeAdaptive in #72 are perfect for testing this function as it is easy to see the contents of the examples, and easy to check what the expected output should be.
Please let me know if I can provide any assistance with adding the the tests / documentation.
| @@ -167,13 +167,32 @@ def check_sequence( | |||
| bool: | |||
There was a problem hiding this comment.
Could we update the return description to reflect the 3 values that are now being returned?
Also, it doesn't appear that we use the other return values of query_aln and hit_aln. Perhaps it would be easier to add these return values later if/when they are needed?
| # Template cache construction | ||
| def check_sequence( | ||
| query_seq: str, | ||
| query: TemplateHit, |
There was a problem hiding this comment.
Not strictly related to this PR, but could we update docstring for TemplateHit? Some of the fields seem out of date:
openfold-3/openfold3/core/data/io/sequence/template.py
Lines 69 to 85 in 0b9df93
| @@ -152,8 +152,8 @@ def check_sequence( | |||
| """Applies sequence filters to template hits following AF3 SI Section 2.4. | |||
There was a problem hiding this comment.
Could we add a quick description of the template filters from this section. AFAICT from the code, these filters are:
- Fails if
coverage < min_alignthreshold. - Fails if
coverage >= max_subseq AND covered == identical.
Digging into the second statement, the anticipated outputs are:
- covered == identical -- this suggests that the template hit is identical to the query hit, because the non-gaps are located in the same places in the query / template hit. This hit would fail the filter
- covered != identical -- this suggests that some of the gap tokens in the matching sequence are not in the same locations, and thus the sequence is not a perfect match. This hit would pass the filter.
If the above understanding is correct, I am not sure if this would resolve the issue raised in the test example given in #72 . In that case, we have a hit which has 100% coverage, but has a different sequence value. In that test case, I believe the function would still fail the checks in this function.
| @@ -143,7 +143,7 @@ def parse_representatives( | |||
|
|
|||
| # Template cache construction | |||
| def check_sequence( | |||
There was a problem hiding this comment.
Could we consider making this function name more descriptive. Perhaps something like "check_seqence_similarity_within_range"?
Fixes #72