-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reducing picking.py
to be simpler, more focused, and more performant.
#75
Conversation
I'm still actively working on this - mostly writing tests and fixing up documentation. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #75 +/- ##
==========================================
- Coverage 96.74% 96.63% -0.11%
==========================================
Files 26 26
Lines 1751 1696 -55
Branches 333 330 -3
==========================================
- Hits 1694 1639 -55
Misses 31 31
Partials 26 26 ☔ View full report in Codecov by Sentry. |
if ntthal.duplex_tm(lp.bases, rp.bases) > max_heterodimer_tm: | ||
continue | ||
|
||
penalty = score( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question Naive question - why is duplex_tm()
a method on NtThermoAligner
, but score()
and calculate_long_seq_tm()
are standalone functions? Are the latter accessible through primer3
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplex_tm()
is a complicated thing that does a thermodynamic alignment of two sequences and predicts their Tm, which is why we delegate to ntthal instead of trying to replicate that.
score()
is mimicking something in primer3, but is not easily accessible independent of designing primer pairs in primer3.
calculate_long_seq_tm()
is so simple it's not worth trying to redirect to primer3.
for lp in left_primers: | ||
for rp in right_primers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
itertools.product
would help you un-nest this a level:
https://docs.python.org/3/library/itertools.html#itertools.product
35ef948
to
eb5dc15
Compare
WalkthroughThe pull request introduces significant updates to the Additionally, the Assessment against linked issues
Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (3)
docs/overview.md (1)
53-53
: Fix broken reference link.The reference link for
[prymer.api.picking.score]
is missing its definition. This needs to be added to maintain documentation integrity.Add the following at the bottom of the file:
+[prymer.api.picking.score]: https://prymer.readthedocs.io/en/latest/api/picking.html#prymer.api.picking.score
🧰 Tools
🪛 Markdownlint
53-53: Missing link or image reference definition: "prymer.api.picking.score"
Reference links and images should use a label that is defined(MD052, reference-links-images)
tests/api/test_picking.py (2)
255-259
: Refactor repeated assertion logic into a helper functionThe assertion logic within your test functions is repeated in multiple places (lines 255-259, 286-291, and 317-323). Refactoring this code into a helper function can improve maintainability and reduce duplication.
You could define a helper function like:
def assert_primer_pair_properties(pair: PrimerPair) -> None: assert pair.span.length == len(pair.bases) assert pair.amplicon.start == pair.left_primer.span.start assert pair.amplicon.end == pair.right_primer.span.end assert pair.bases == REF_BASES[pair.amplicon.start - 1 : pair.amplicon.end]And then update your test functions:
for pair in pairs: - assert pair.span.length == len(pair.bases) - assert pair.amplicon.start == pair.left_primer.span.start - assert pair.amplicon.end == pair.right_primer.span.end - assert pair.bases == REF_BASES[pair.amplicon.start - 1 : pair.amplicon.end] + assert_primer_pair_properties(pair)Also applies to: 286-291, 317-323
244-245
: Use constants for primer sequence indices for clarityIn
test_build_primer_pairs_single_pair
, the indices used to sliceREF_BASES
(e.g.,[200:220]
,[280:300]
) are hard-coded. Defining these indices as constants can improve readability and ease future modifications.For example:
LEFT_PRIMER_START = 200 LEFT_PRIMER_END = 220 RIGHT_PRIMER_START = 280 RIGHT_PRIMER_END = 300 left_primers = [ p( REF_BASES[LEFT_PRIMER_START:LEFT_PRIMER_END], tm=60.1, pos=LEFT_PRIMER_START + 1, pen=1.0, ) ] right_primers = [ p( reverse_complement(REF_BASES[RIGHT_PRIMER_START:RIGHT_PRIMER_END]), tm=61.8, pos=RIGHT_PRIMER_START + 1, pen=1.1, ) ]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
docs/overview.md
(1 hunks)prymer/api/__init__.py
(0 hunks)prymer/api/picking.py
(4 hunks)tests/api/test_picking.py
(1 hunks)
💤 Files with no reviewable changes (1)
- prymer/api/init.py
🧰 Additional context used
🪛 Markdownlint
docs/overview.md
53-53: Missing link or image reference definition: "prymer.api.picking.score"
Reference links and images should use a label that is defined
(MD052, reference-links-images)
🪛 Gitleaks
tests/api/test_picking.py
50-50: Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
(aws-access-token)
51-51: Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
(aws-access-token)
53-53: Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
(aws-access-token)
55-55: Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
(aws-access-token)
🔇 Additional comments (2)
docs/overview.md (1)
50-52
: Documentation accurately reflects the new functionality.
The documentation clearly explains the primer pair building process and introduces the new max_heterodimer_tm
parameter. This aligns well with the PR's objective of simplifying the picking functionality while maintaining clear documentation of the core features.
prymer/api/picking.py (1)
136-141
:
Verify sequence extraction logic for potential off-by-one errors
When fetching sequences using fasta.fetch()
, the coordinates are 0-based and half-open intervals. Subtracting 1 from region_start
may lead to incorrect sequence extraction. Please verify that the sequence extraction and slicing correctly correspond to the desired amplicon regions without introducing off-by-one errors.
To ensure accurate sequence retrieval, confirm the coordinate system of fasta.fetch()
and adjust region_start
and region_end
accordingly. Consider the following when adjusting the code:
fasta.fetch(reference, start, end)
retrieves bases fromstart
toend - 1
.- Ensure that the slicing of
bases
inamp_bases
aligns with the primer positions.
prymer/api/picking.py
Outdated
print(f"Target={target}; lefts={left_primers}.") | ||
print(f"Target={target}; rights={right_primers}.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Remove debug print statements from production code
The print
statements on lines 123-124 may not be suitable for production code as they can clutter the console output and potentially expose sensitive information. Consider removing them or using a logging framework with appropriate log levels.
Apply this diff to remove the debug statements:
- print(f"Target={target}; lefts={left_primers}.")
- print(f"Target={target}; rights={right_primers}.")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
print(f"Target={target}; lefts={left_primers}.") | |
print(f"Target={target}; rights={right_primers}.") |
if bases is None: | ||
length = rp.span.end - lp.span.start + 1 | ||
needed = length - lp.span.length - rp.span.length | ||
bases = lp.bases + ("A" * needed) + reverse_complement(rp.bases) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prevent potential negative sequence length in function pp
In the pp
function, the variable needed
is calculated as length - lp.span.length - rp.span.length
. If needed
becomes negative, multiplying a string by a negative number will raise a ValueError
. Consider adding a check to handle cases where needed
is negative.
Apply this diff to handle the potential issue:
if bases is None:
length = rp.span.end - lp.span.start + 1
needed = length - lp.span.length - rp.span.length
+ if needed < 0:
+ raise ValueError(f"Negative sequence length calculated in function `pp`: needed={needed}")
bases = lp.bases + ("A" * needed) + reverse_complement(rp.bases)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if bases is None: | |
length = rp.span.end - lp.span.start + 1 | |
needed = length - lp.span.length - rp.span.length | |
bases = lp.bases + ("A" * needed) + reverse_complement(rp.bases) | |
if bases is None: | |
length = rp.span.end - lp.span.start + 1 | |
needed = length - lp.span.length - rp.span.length | |
if needed < 0: | |
raise ValueError(f"Negative sequence length calculated in function `pp`: needed={needed}") | |
bases = lp.bases + ("A" * needed) + reverse_complement(rp.bases) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any objections to the refactoring.
I do continue to struggle to understand the motivation for reimplementing primer pair picking and scoring in a less performant language, rather than using primer3
to design and filter pairs where possible.
I understand that primer3
does not support all the possible flavors of filtering we sometimes have to implement (e.g. I don't think its design algorithm supports filtering based on heterodimerization tm), but I think that it's generally simpler and less prone to inconsistency to rely on it for the features it does provide.
Only primer pairs that meet the requirements for amplicon sizes and tms will be returned. | ||
In addition, if `max_heterodimer_tm` is provided then primer pairs with a heterodimer tm | ||
exceeding the maximum will also be discarded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question Why is this necessary when primer3
will apply these constraints to designed primer pairs if they are specified? e.g. specifying PRIMER_PRODUCT_MIN_TM
and PRIMER_PRODUCT_MAX_TM
will constrain the pairs returned by primer3
to the provided range
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is used when designing left and right primers independently - so primer3 is not emitting pairs and not cannot check these constraints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
prymer/api/picking.py (1)
134-134
: Consider improving FastaFile path handlingThe f-string for the FastaFile path is unnecessary and could be simplified.
- with FastaFile(f"{fasta_path}") as fasta: + with FastaFile(fasta_path) as fasta:tests/api/test_picking.py (2)
297-332
: Consider adding edge cases for heterodimer filtering testsWhile the current tests cover the basic functionality, consider adding these edge cases:
- Test with
max_heterodimer_tm=0
to verify complete filtering- Test with very high
max_heterodimer_tm
to verify no filtering
48-59
: Consider moving test sequences to fixturesThe hardcoded DNA sequences could be moved to fixtures to improve maintainability and reusability across tests. This would also make it easier to modify test data when needed.
Example refactor:
@pytest.fixture def test_sequences() -> dict[str, str]: return { "ref_bases": """ CCTGCTGGTTTTATTTCTGCTACTT... """.strip().replace("\n", ""), "gc_rich": "CGCGCGCGCGCGCGCGCGCG", "at_rich": "CTGCTGCTGCTGCTGCTGCT" }🧰 Tools
🪛 Gitleaks
50-50: Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
(aws-access-token)
51-51: Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
(aws-access-token)
53-53: Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
(aws-access-token)
55-55: Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
(aws-access-token)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
prymer/api/picking.py
(4 hunks)tests/api/test_picking.py
(1 hunks)
🧰 Additional context used
🪛 Gitleaks
tests/api/test_picking.py
50-50: Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
(aws-access-token)
51-51: Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
(aws-access-token)
53-53: Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
(aws-access-token)
55-55: Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
(aws-access-token)
🔇 Additional comments (10)
prymer/api/picking.py (8)
4-7
: LGTM: Clear and focused module documentation
The updated docstring accurately reflects the module's simplified functionality after removing deprecated functions.
Line range hint 34-90
: LGTM: Well-structured penalty calculation with clear parameter dependencies
The refactored score
function improves clarity by:
- Making dependencies explicit through individual parameters
- Maintaining clear documentation of penalty calculations
- Using type hints effectively
102-102
: LGTM: Memory-efficient Iterator return type
The change to return an Iterator instead of a list improves memory efficiency when dealing with large sets of primer pairs.
127-131
: LGTM: Good input validation
The added validation ensures consistency between primers and target reference names, preventing potential runtime errors.
124-124
: Fix potential iterator consumption issue
The any()
function will consume elements from iterators. Since this is a duplicate of a previous comment that hasn't been addressed, I'll skip generating a new suggestion.
141-142
: Consider using itertools.product for nested iteration
The nested loops could be simplified using itertools.product
. This is a duplicate of a previous comment that hasn't been addressed, so I'll skip generating a new suggestion.
146-149
: LGTM: Efficient amplicon sequence extraction
The code efficiently extracts the amplicon sequence using relative positions from the fetched region.
155-158
: LGTM: Optional heterodimer temperature check
The addition of an optional heterodimer temperature check provides flexibility while maintaining performance when the check isn't needed.
tests/api/test_picking.py (2)
86-102
: Previous review comment about negative sequence length is still applicable
The potential issue with negative sequence length calculation in the pp
function hasn't been addressed yet.
106-209
: Well-structured and comprehensive test suite for scoring functionality
The test coverage is thorough, covering various edge cases and realistic scenarios. The use of pytest.approx for floating-point comparisons and clear test organization makes the tests robust and maintainable.
Closes #38