
Add recombinase assembly algorithm for attB/attP -- Generalized integrase issue #435 #496

Closed
areebamomin wants to merge 1 commit into pydna-group:master from areebamomin:issue_435

Conversation

@areebamomin
Contributor

This PR implements a recombinase-based assembly algorithm for pydna by adding a new function, make_recombinase_algorithm, to src/pydna/assembly2.py. The function identifies homologous recombination regions by extracting the lowercase core shared between the attB and attP recognition sites. It returns match tuples in the format expected by Assembly, so that it behaves consistently with the other supported assembly strategies. A corresponding test suite (tests/test_recombinase_overlap.py) verifies homology detection, edge cases, multiple matches, and full integration with the Assembly class. All tests pass with both python run_test.py and pytest, and all doctests in assembly2.py run without errors.

Hopefully this closes or makes some progress on #435!

Thank you for letting me have a go at learning more about the program. Hopefully I can build on this toward a successful contribution!

@manulera
Collaborator

Dear @areebamomin, thanks for your contribution. It's already looking great, but I have a few comments and suggestions:

Need to have

  • Could you move the functions to a separate module? I think it's important enough to have its own module. You can call it recombinase.py, then the test file could be renamed to test_module_recombinase.py. You can then also add a short description to the module as a docstring.
  • Use regex for _recombinase_homology_offset_and_length, it will simplify the implementation, and provide further validation (e.g. it will fail if there are numbers or symbols).
  • Handle circular sequences and degenerate sequences.
    • Your current implementation would not work if the sites span the origin of a circular sequence.
    • It would be nice to accept degenerate sequences as sites (e.g. CHWVTWTGTACAAAAAANNNG, see gateway.py)
    • Check the function dseqrecord_finditer, which handles both of these cases. There are a few examples of usage in the repository.
  • Add tests that handle the circular case and use a degenerate sequence.
  • Remove attB / attP nomenclature. attL and attR sites can also recombine with each other, and could be handled with the same functionality. You can simply call them site1 and site2. I think this makes sense, but let me know if I am missing something.
  • In recombinase_overlap, make limit None by default and in the docstring explain that it's not used, that it's just a convention for all algorithm functions.
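
As an illustration of the regex suggestion above, here is a minimal sketch, not the code from this PR. It assumes a site is written as an uppercase left arm, a single lowercase core, and an uppercase right arm; the name homology_offset_and_length is hypothetical and only stands in for _recombinase_homology_offset_and_length.

```python
import re

# Hypothetical pattern (an assumption, not taken from the PR):
# uppercase arm, one lowercase core, uppercase arm.
_SITE_PATTERN = re.compile(r"([A-Z]*)([a-z]+)([A-Z]*)")


def homology_offset_and_length(site: str) -> tuple[int, int]:
    """Return (offset, length) of the lowercase core within `site`."""
    match = _SITE_PATTERN.fullmatch(site)
    if match is None:
        # fullmatch rejects digits, symbols, and more than one lowercase block
        raise ValueError(
            f"Site must be UPPERCASE + lowercase core + UPPERCASE, got {site!r}"
        )
    return match.start(2), len(match.group(2))


print(homology_offset_and_length("CCCaaTTT"))  # (3, 2)
```

Because fullmatch must consume the whole string, the validation comes for free: a site such as "CC1aaTT" raises ValueError instead of returning a misleading offset.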

Nice to have

These are just some ideas in case you want to invest more time in this. They would be quite helpful!

  • Make the recombinase into a class, and the algorithm could be a method in it.
  • Add a function to annotate the sites, like annotate_loxP_sites.
  • Add methods to assembly2, equivalent to cre_lox_integration and cre_lox_excision called recombinase_integration and recombinase_excision, and test those with sequences.
  • Generalise recombinase functionality. Gateway and Cre Lox are also recombinases, so they could leverage the same functionality rather than having a separate implementation for their algorithms.
  • In general, but also related to cre-lox and Gateway, we could think of handling the reverse reaction as well. In principle all these recombination reactions can go either way. If you are interested we can discuss what the best way of implementing this would be.

@areebamomin
Contributor Author

Hi @manulera

Thank you for looking it over and the feedback! I will work on this in the following weeks.

@manulera
Collaborator

Degenerate nucleotide codes represent a position that could be occupied by more than one nucleotide in a consensus sequence. You have them listed here: https://people.bath.ac.uk/jm2219/biology/degenerate.htm

You don't have to handle how to find these degenerate sequences in your new code. You can use the function dseqrecord_finditer included in the library. Below is a minimal usage example:

from pydna.sequence_regex import dseqrecord_finditer, compute_regex_site
from pydna.dseqrecord import Dseqrecord

seq = Dseqrecord('CTaaaACGTaaaAC')

# Turn degenerate sequence into regex pattern (case insensitive)
regex_pattern = compute_regex_site('ACNT')

print('regex pattern', regex_pattern)
# Find it in the sequence
result = dseqrecord_finditer(regex_pattern, seq)

print([r for r in result])

# Handles circular sequences, note that it
# returns 12,16 as the span for the circular-spanning motif
seq2 = Dseqrecord('CTaaaACGTaaaAC', circular=True)
result2 = dseqrecord_finditer(regex_pattern, seq2)
print([r for r in result2])
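
To make the two ideas above concrete (degenerate codes as regex character classes, and motifs that span the origin of a circular sequence), here is a standard-library-only sketch. It is an illustration under stated assumptions, not how compute_regex_site or dseqrecord_finditer are implemented in pydna; the names IUPAC, degenerate_to_regex and find_spans are hypothetical, and only the forward strand of a plain string is searched.

```python
import re

# IUPAC degenerate nucleotide codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def degenerate_to_regex(site: str) -> str:
    """Translate a degenerate site into a regex (KeyError on unknown codes)."""
    return "".join(IUPAC[base] for base in site.upper())


def find_spans(site: str, seq: str, circular: bool = False) -> list[tuple[int, int]]:
    """Return (start, end) spans of `site` in `seq`, forward strand only."""
    if not site:
        raise ValueError("site must not be empty")
    # The lookahead lets overlapping matches be reported as well.
    pattern = re.compile(f"(?=({degenerate_to_regex(site)}))", re.IGNORECASE)
    # For a circular sequence, append the first len(site) - 1 bases so that
    # matches crossing the origin become visible; their end exceeds len(seq).
    searched = seq + seq[: len(site) - 1] if circular else seq
    return [(m.start(1), m.end(1)) for m in pattern.finditer(searched)]


seq = "CTaaaACGTaaaAC"
print(find_spans("ACNT", seq))                 # [(5, 9)]
print(find_spans("ACNT", seq, circular=True))  # [(5, 9), (12, 16)]
```

The origin-spanning motif is reported as (12, 16) on the 14-base sequence, the same span convention described in the comment of the example above.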

@BjornFJohansson
Collaborator

@areebamomin @manulera

Any updates on this?

@areebamomin
Contributor Author

@BjornFJohansson @manulera

I was traveling for the last month / holidays but hoping to work more on this in the upcoming couple of weeks!

@BjornFJohansson
Collaborator

@areebamomin Good to hear. Mind that pydna has undergone some fundamental internal changes: the Dseq class now relies on a single string instead of two. See the latest release, v5.5.5.

@manulera
Collaborator

Hi @areebamomin pinging you here, do you think you will have time to finish this?

@areebamomin
Contributor Author

Hi @manulera just emailed you!
