-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add cigar utility method get_alignment_offsets #188
Conversation
yfarjoun
commented
Sep 19, 2024
- add cigar utility method get_alignment_offsets to find the offsets of the aligned segment in the read
- tests included
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, once the ruff/mypy CI stuff passes
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #188 +/- ##
==========================================
+ Coverage 89.40% 89.48% +0.07%
==========================================
Files 18 18
Lines 2105 2120 +15
Branches 467 471 +4
==========================================
+ Hits 1882 1897 +15
Misses 146 146
Partials 77 77 ☔ View full report in Codecov by Sentry. |
not sure what's going on the mkdocs. I had to comment out some lines to get it to compile.... if someone knows how to fix this please let me know... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be a good addition to the package, thanks for adding it!
I think the documentation could be a bit clearer, and it looks like there's a bug when you have adjacent hard/soft clips
36be5bb
to
f371b6c
Compare
Thanks for the detailed review @msto , much appreciated! mkdocs is giving me grief and I was unable to make it play nice.... if someone knows why it barfs when I change the indentation or remove the final "comment" line... please let me know. this is the best (passing) state I could put it into.... |
fgpyo/sam/__init__.py
Outdated
@@ -368,7 +368,7 @@ class CigarOp(enum.Enum): | |||
code (int): The `~pysam` cigar operator code. | |||
character (int): The single character cigar operator. | |||
consumes_query (bool): True if this operator consumes query bases, False otherwise. | |||
consumes_target (bool): True if this operator consumes target bases, False otherwise. | |||
consumes_reference (bool): True if this operator consumes reference bases, False otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an unrelated fix. if deemed imporatant I could extract it to a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yfarjoun This uses target
intentionally - not all alignments are between a read and a reference - what if you align two reads vs. each other, or two contigs vs. each other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfenne While not all alignments are between a read and a reference, the CIGAR specification uses the terms "query" and "reference". I agree with Yossi, I find it clearer to have our documentation consistent with spec.
Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the
alignment to step along the query sequence and the reference sequence respectively
![Screenshot 2024-09-19 at 11 23 12 AM](https://private-user-images.githubusercontent.com/55061/369064086-fb9e38d3-6b56-46a2-aea7-aefa9d988adf.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg4MDU2MTgsIm5iZiI6MTczODgwNTMxOCwicGF0aCI6Ii81NTA2MS8zNjkwNjQwODYtZmI5ZTM4ZDMtNmI1Ni00NmEyLWFlYTctYWVmYTlkOTg4YWRmLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA2VDAxMjgzOFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTM5YjI2NTc0Mjk4MjFjMjBkZDc0MDJkOTRhOGFmMjA4N2NhNDdlMWI4ZTk2MGI0MDkyOGZiN2NjOWYzMjJkMmUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.GuZWO64IeaFNEdbGfuJsHU1hzQeIjJHALl02YYluyn8)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I was correcting a broken link.... if this is intentional, I'll bring it back and let's haved the discussion in a different PR (if at all)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% on different PR. Though I'll also argue there for maintaining "target".
Little known fact is that CIGAR actually predates SAM/BAM and Google believes it was first documented as part of exonerate which uses "query" and "target" 😛
fgpyo/sam/__init__.py
Outdated
@@ -368,7 +368,7 @@ class CigarOp(enum.Enum): | |||
code (int): The `~pysam` cigar operator code. | |||
character (int): The single character cigar operator. | |||
consumes_query (bool): True if this operator consumes query bases, False otherwise. | |||
consumes_target (bool): True if this operator consumes target bases, False otherwise. | |||
consumes_reference (bool): True if this operator consumes reference bases, False otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yfarjoun This uses target
intentionally - not all alignments are between a read and a reference - what if you align two reads vs. each other, or two contigs vs. each other?
fgpyo/sam/__init__.py
Outdated
If the Cigar contains no alignment operators that consume sequence bases, or | ||
only clipping operators, the start and end offsets will be the same value | ||
(indicating an empty region). This shared value will be the offset to the first | ||
base consumed by a non-clipping operator or the length of the read sequence if | ||
there is no such base. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that either a) this function should return Optional[...]
and return None when invoked on a cigar without alignment operations, or raise an error in that situation. There is no aligned start/end if there are no alignment operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd argue in favor of Optional
vs raising an error, but would be comfortable either way
6ab8245
to
b07c3e4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yfarjoun: @msto and I caught up in person and would request the following changes to this PR:
- Could you please remove the
reverse
argument, and in the docstring mention that if you have a CIGAR from a negative strand read you might want tocigar.reversed().query_alignment_offsets()
? - Can we please stick to just a
Tuple[int, int]
for the return type? - Instead of returning
Optional
.. could it just raise an exception at the end if it was called on a pathological CIGAR for which we can't generate this? That seems exceptional since it would be an empty cigar or a cigar that is all-clipping.
fgpyo/sam/__init__.py
Outdated
"""Gets the 0-based, end-exclusive positions of the first and last aligned base in the | ||
query. The resulting range will contain the range of positions in the SEQ string for | ||
the bases that are aligned. If no bases are aligned, the return value will be None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Gets the 0-based, end-exclusive positions of the first and last aligned base in the | |
query. The resulting range will contain the range of positions in the SEQ string for | |
the bases that are aligned. If no bases are aligned, the return value will be None. | |
""" | |
Gets the 0-based, end-exclusive positions of the first and last aligned base in the query. | |
The resulting range will contain the range of positions in the SEQ string for | |
the bases that are aligned. If no bases are aligned, the return value will be None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested change is to adhere to Google style guidelines of a single summary sentence separated from the longer description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Thanks for the comments. done. |
WalkthroughThe changes introduced in the pull request include the addition of a new method, Additionally, the test suite for the 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
…of the aligned segment in the read - tests included
Co-authored-by: Matt Stone <[email protected]>
Co-authored-by: Matt Stone <[email protected]>
e9c0b25
to
e2ba773
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (3)
fgpyo/sam/__init__.py (2)
550-566
: Improve docstring formatting and content.The docstring has some formatting issues and could be more precise. Please consider the following improvements:
- Adjust the indentation to align with the opening quotes.
- Be more specific about the return value, mentioning it's a tuple of two integers.
- Clarify the exception description.
Here's a suggested revision:
def query_alignment_offsets(self) -> tuple[int, int]: """Gets the 0-based, end-exclusive positions of the first and last aligned base in the query. The resulting range will contain the positions in the SEQ string for the bases that are aligned. If counting from the end of the query is desired, use `cigar.reversed().query_alignment_offsets()`. Returns: A tuple (start, end) containing the start and end positions of the aligned part of the query. These offsets are 0-based and open-ended, with respect to the beginning of the query sequence. Raises: ValueError: If the CIGAR string indicates no aligned bases. """Also, consider using Python 3.9+ type hinting for the return value:
tuple[int, int]
.
567-589
: LGTM! Implementation looks correct with a minor suggestion.The implementation of
query_alignment_offsets
looks correct. It properly handles clipping operators and calculates the alignment offsets accurately. The method also raises aValueError
when no bases are aligned, as specified in the docstring.Minor suggestion: Consider adding a type hint for the
element
variable in the for loop for better code readability:for element in self.elements:to:
for element in self.elements: # type: CigarElementThis change is optional but can improve code clarity.
tests/fgpyo/sam/test_cigar.py (1)
62-62
: Consistent use ofTuple
from thetyping
moduleIn the function definition,
maybe_range
is annotated asOptional[tuple]
. For consistency and to leverage the benefits of type hints, consider usingOptional[Tuple]
from thetyping
module, which you've already imported.Apply this diff to improve type annotation consistency:
def test_get_alignments(cigar_string: str, maybe_range: Optional[tuple]) -> None: +def test_get_alignments(cigar_string: str, maybe_range: Optional[Tuple]) -> None:
fgpyo/sam/__init__.py
Outdated
@@ -547,6 +547,46 @@ def length_on_target(self) -> int: | |||
"""Returns the length of the alignment on the target sequence.""" | |||
return sum([elem.length_on_target for elem in self.elements]) | |||
|
|||
def query_alignment_offsets(self) -> tuple: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider renaming the method for clarity and consistency.
Based on past review comments and Python naming conventions, consider renaming the method to alignment_offsets
and removing the query_
prefix. This follows the principle that getter methods typically don't use the get_
prefix in Python.
def alignment_offsets(self, reverse: bool = False) -> tuple[int, int]:
If you decide to keep the current name, ensure it's consistent with other method names in the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
query here doesn't mean "get"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yfarjoun, thank you for the clarification. I understand now that "query" refers to the query sequence in this context.
(*•.•)
✏️ Learnings added
Learnt from: yfarjoun
PR: fulcrumgenomics/fgpyo#188
File: fgpyo/sam/__init__.py:550-550
Timestamp: 2024-10-21T13:26:33.869Z
Learning: In this codebase, "query" in method names refers to the query sequence, not the verb "get".
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
tests/fgpyo/sam/test_cigar.py
Outdated
@pytest.mark.parametrize( | ||
("cigar_string", "maybe_range"), | ||
{ | ||
("10M", (0, 10)), | ||
("10M10I", (0, 20)), | ||
("10X10I", (0, 20)), | ||
("10X10D", (0, 10)), | ||
("10=10D", (0, 10)), | ||
("10S10M", (0, 10)), | ||
("10H10M", (0, 10)), | ||
("10H10S10M", (0, 10)), | ||
("10H10S10M5S", (5, 15)), | ||
("10H10S10M5S10H", (5, 15)), | ||
("10H", None), | ||
("10S", None), | ||
("10S10H", None), | ||
("5H10S10H", None), | ||
}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use square brackets instead of curly braces in @pytest.mark.parametrize
Similar to the previous function, the parameter set here is enclosed in curly braces {}
. Replace them with square brackets []
to correctly create a list of tuples for test parameters.
Apply this diff to correct the parameterization:
@pytest.mark.parametrize(
("cigar_string", "maybe_range"),
- {
+ [
("10M", (0, 10)),
("10M10I", (0, 20)),
("10X10I", (0, 20)),
("10X10D", (0, 10)),
("10=10D", (0, 10)),
("10S10M", (0, 10)),
("10H10M", (0, 10)),
("10H10S10M", (0, 10)),
("10H10S10M5S", (5, 15)),
("10H10S10M5S10H", (5, 15)),
("10H", None),
("10S", None),
("10S10H", None),
("5H10S10H", None),
- },
+ ],
)
def test_get_alignments_reversed(cigar_string: str, maybe_range: Optional[Tuple]) -> None:
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@pytest.mark.parametrize( | |
("cigar_string", "maybe_range"), | |
{ | |
("10M", (0, 10)), | |
("10M10I", (0, 20)), | |
("10X10I", (0, 20)), | |
("10X10D", (0, 10)), | |
("10=10D", (0, 10)), | |
("10S10M", (0, 10)), | |
("10H10M", (0, 10)), | |
("10H10S10M", (0, 10)), | |
("10H10S10M5S", (5, 15)), | |
("10H10S10M5S10H", (5, 15)), | |
("10H", None), | |
("10S", None), | |
("10S10H", None), | |
("5H10S10H", None), | |
}, | |
) | |
@pytest.mark.parametrize( | |
("cigar_string", "maybe_range"), | |
[ | |
("10M", (0, 10)), | |
("10M10I", (0, 20)), | |
("10X10I", (0, 20)), | |
("10X10D", (0, 10)), | |
("10=10D", (0, 10)), | |
("10S10M", (0, 10)), | |
("10H10M", (0, 10)), | |
("10H10S10M", (0, 10)), | |
("10H10S10M5S", (5, 15)), | |
("10H10S10M5S10H", (5, 15)), | |
("10H", None), | |
("10S", None), | |
("10S10H", None), | |
("5H10S10H", None), | |
], | |
) |
tests/fgpyo/sam/test_cigar.py
Outdated
@pytest.mark.parametrize( | ||
("cigar_string", "maybe_range"), | ||
{ | ||
("10M", (0, 10)), | ||
("10M10I", (0, 20)), | ||
("10X10I", (0, 20)), | ||
("10X10D", (0, 10)), | ||
("10=10D", (0, 10)), | ||
("10S10M", (10, 20)), | ||
("10H10M", (0, 10)), | ||
("10H10S10M", (10, 20)), | ||
("10H10S10M5S", (10, 20)), | ||
("10H10S10M5S10H", (10, 20)), | ||
("10H", None), | ||
("10S", None), | ||
("10S10H", None), | ||
("5H10S10H", None), | ||
("76D", None), | ||
("76I", (0, 76)), | ||
("10P76S", None), | ||
("50S1000N50S", None), | ||
}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use square brackets instead of curly braces in @pytest.mark.parametrize
In the @pytest.mark.parametrize
decorator, the parameter set is enclosed in curly braces {}
, which creates a set. Sets are unordered and do not allow duplicate elements, which can lead to unpredictable behavior or errors in tests. Instead, you should use square brackets []
to create a list of tuples.
Apply this diff to correct the parameterization:
@pytest.mark.parametrize(
("cigar_string", "maybe_range"),
- {
+ [
("10M", (0, 10)),
("10M10I", (0, 20)),
("10X10I", (0, 20)),
("10X10D", (0, 10)),
("10=10D", (0, 10)),
("10S10M", (10, 20)),
("10H10M", (0, 10)),
("10H10S10M", (10, 20)),
("10H10S10M5S", (10, 20)),
("10H10S10M5S10H", (10, 20)),
("10H", None),
("10S", None),
("10S10H", None),
("5H10S10H", None),
("76D", None),
("76I", (0, 76)),
("10P76S", None),
("50S1000N50S", None),
- },
+ ],
)
def test_get_alignments(cigar_string: str, maybe_range: Optional[Tuple]) -> None:
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@pytest.mark.parametrize( | |
("cigar_string", "maybe_range"), | |
{ | |
("10M", (0, 10)), | |
("10M10I", (0, 20)), | |
("10X10I", (0, 20)), | |
("10X10D", (0, 10)), | |
("10=10D", (0, 10)), | |
("10S10M", (10, 20)), | |
("10H10M", (0, 10)), | |
("10H10S10M", (10, 20)), | |
("10H10S10M5S", (10, 20)), | |
("10H10S10M5S10H", (10, 20)), | |
("10H", None), | |
("10S", None), | |
("10S10H", None), | |
("5H10S10H", None), | |
("76D", None), | |
("76I", (0, 76)), | |
("10P76S", None), | |
("50S1000N50S", None), | |
}, | |
) | |
@pytest.mark.parametrize( | |
("cigar_string", "maybe_range"), | |
[ | |
("10M", (0, 10)), | |
("10M10I", (0, 20)), | |
("10X10I", (0, 20)), | |
("10X10D", (0, 10)), | |
("10=10D", (0, 10)), | |
("10S10M", (10, 20)), | |
("10H10M", (0, 10)), | |
("10H10S10M", (10, 20)), | |
("10H10S10M5S", (10, 20)), | |
("10H10S10M5S10H", (10, 20)), | |
("10H", None), | |
("10S", None), | |
("10S10H", None), | |
("5H10S10H", None), | |
("76D", None), | |
("76I", (0, 76)), | |
("10P76S", None), | |
("50S1000N50S", None), | |
], | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
tests/fgpyo/sam/test_cigar.py (1)
1-2
: Combine imports from thetyping
moduleYou can consolidate the imports from the
typing
module into a single line for brevity and clarity.Apply this diff to combine the import statements:
-from typing import Optional -from typing import Tuple +from typing import Optional, Tuple
tests/fgpyo/sam/test_cigar.py
Outdated
@pytest.mark.parametrize( | ||
("cigar_string", "maybe_range"), | ||
[ | ||
("10M", (0, 10)), | ||
("10M10I", (0, 20)), | ||
("10X10I", (0, 20)), | ||
("10X10D", (0, 10)), | ||
("10=10D", (0, 10)), | ||
("10S10M", (10, 20)), | ||
("10H10M", (0, 10)), | ||
("10H10S10M", (10, 20)), | ||
("10H10S10M5S", (10, 20)), | ||
("10H10S10M5S10H", (10, 20)), | ||
("10H", None), | ||
("10S", None), | ||
("10S10H", None), | ||
("5H10S10H", None), | ||
("76D", None), | ||
("76I", (0, 76)), | ||
("10P76S", None), | ||
("50S1000N50S", None), | ||
], | ||
) | ||
def test_get_alignments(cigar_string: str, maybe_range: Optional[tuple]) -> None: | ||
cig = Cigar.from_cigarstring(cigar_string) | ||
if not maybe_range: | ||
with pytest.raises(ValueError): | ||
cig.query_alignment_offsets() | ||
else: | ||
ret = cig.query_alignment_offsets() | ||
assert ret == maybe_range | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Refactor tests to reduce code duplication
The test functions test_get_alignments
and test_get_alignments_reversed
share similar logic, differing mainly in whether they test the reversed CIGAR strings. Consider combining them into a single parameterized test function with an additional parameter to indicate reversal. This will improve maintainability and reduce redundancy.
Here's how you might refactor the tests:
@pytest.mark.parametrize(
("cigar_string", "maybe_range", "reverse"),
[
# Forward CIGAR strings
("10M", (0, 10), False),
("10M10I", (0, 20), False),
("10X10I", (0, 20), False),
("10X10D", (0, 10), False),
("10=10D", (0, 10), False),
("10S10M", (10, 20), False),
("10H10M", (0, 10), False),
("10H10S10M", (10, 20), False),
("10H10S10M5S", (10, 20), False),
("10H10S10M5S10H", (10, 20), False),
("10H", None, False),
("10S", None, False),
("10S10H", None, False),
("5H10S10H", None, False),
("76D", None, False),
("76I", (0, 76), False),
("10P76S", None, False),
("50S1000N50S", None, False),
# Reversed CIGAR strings
("10M", (0, 10), True),
("10M10I", (0, 20), True),
("10X10I", (0, 20), True),
("10X10D", (0, 10), True),
("10=10D", (0, 10), True),
("10S10M", (0, 10), True),
("10H10M", (0, 10), True),
("10H10S10M", (0, 10), True),
("10H10S10M5S", (5, 15), True),
("10H10S10M5S10H", (5, 15), True),
("10H", None, True),
("10S", None, True),
("10S10H", None, True),
("5H10S10H", None, True),
],
)
def test_query_alignment_offsets(cigar_string: str, maybe_range: Optional[Tuple[int, int]], reverse: bool) -> None:
cig = Cigar.from_cigarstring(cigar_string)
if reverse:
cig = cig.reversed()
if maybe_range is None:
with pytest.raises(ValueError):
cig.query_alignment_offsets()
else:
ret = cig.query_alignment_offsets()
assert ret == maybe_range
Also applies to: 72-99
fgpyo/sam/__init__.py
Outdated
@@ -547,6 +547,46 @@ def length_on_target(self) -> int: | |||
"""Returns the length of the alignment on the target sequence.""" | |||
return sum([elem.length_on_target for elem in self.elements]) | |||
|
|||
def query_alignment_offsets(self) -> tuple: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue Provide a complete type hint for the return value.
def query_alignment_offsets(self) -> tuple: | |
def query_alignment_offsets(self) -> tuple[int, int]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when I do this I get an error:
fgpyo/sam/__init__.py:550: error: "tuple" is not subscriptable, use "typing.Tuple" instead [misc]
Found 1 error in 1 file (checked 52 source files)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I switched to Tuple[int, int]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fgpyo/sam/__init__.py
Outdated
query. | ||
|
||
The resulting range will contain the range of positions in the SEQ string for | ||
the bases that are aligned. If no bases are aligned, the return value will be None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue I think this part of the docstring is stale - the current behavior is to throw an exception if no bases are aligned.
the bases that are aligned. If no bases are aligned, the return value will be None. | |
the bases that are aligned. |
fgpyo/sam/__init__.py
Outdated
Returns: | ||
A tuple (start, stop) containing the start and stop positions | ||
of the aligned part of the query. These offsets are 0-based and open-ended, with | ||
respect to the beginning of the query. If no bases are aligned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
respect to the beginning of the query. If no bases are aligned. | |
respect to the beginning of the query. |
fgpyo/sam/__init__.py
Outdated
of the aligned part of the query. These offsets are 0-based and open-ended, with | ||
respect to the beginning of the query. If no bases are aligned. | ||
|
||
Throws: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue The three special sections of a docstring are Args:
, Returns:
(or Yields:
), and Raises:
.
Throws: | |
Raises: |
Raises:
List all exceptions that are relevant to the interface followed by a description. Use a similar exception name + colon + space or newline and hanging indent style as described in Args:. You should not document exceptions that get raised if the API specified in the docstring is violated (because this would paradoxically make behavior under violation of the API part of the API).
https://google.github.io/styleguide/pyguide.html#383-functions-and-methods
tests/fgpyo/sam/test_cigar.py
Outdated
def test_get_alignments(cigar_string: str, maybe_range: Optional[tuple]) -> None: | ||
cig = Cigar.from_cigarstring(cigar_string) | ||
if not maybe_range: | ||
with pytest.raises(ValueError): | ||
cig.query_alignment_offsets() | ||
else: | ||
ret = cig.query_alignment_offsets() | ||
assert ret == maybe_range |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion Why not have two separate test functions - one for the cases that return a tuple, and one for the cases that raise an error?
That simplifies a bit by getting rid of the Optional[tuple]
, which I think is an artifact of the old implementation that returned an Optional
instead of raising an exception. And you don't need to pair None
with every bad example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
tests/fgpyo/sam/test_cigar.py (3)
36-55
: LGTM! Consider adding a descriptive comment for each test case.The implementation of
test_query_alignment_offsets
is correct and covers various CIGAR string scenarios. To improve readability and maintainability, consider adding a brief comment for each test case explaining the CIGAR string's composition and why the expected range is correct.Here's an example of how you could add comments:
@pytest.mark.parametrize( ("cigar_string", "expected_range"), [ ("10M", (0, 10)), # 10 matches, straightforward case ("10M10I", (0, 20)), # 10 matches followed by 10 insertions ("10X10I", (0, 20)), # 10 mismatches followed by 10 insertions ("10X10D", (0, 10)), # 10 mismatches followed by 10 deletions (not counted in query) ("10=10D", (0, 10)), # 10 matches followed by 10 deletions (not counted in query) ("10S10M", (10, 20)), # 10 soft-clipped bases followed by 10 matches ("10H10M", (0, 10)), # 10 hard-clipped bases (not in query) followed by 10 matches ("10H10S10M", (10, 20)), # 10 hard-clipped, 10 soft-clipped, 10 matches ("10H10S10M5S", (10, 20)), # 10 hard-clipped, 10 soft-clipped, 10 matches, 5 soft-clipped ("10H10S10M5S10H", (10, 20)), # Same as above with 10 hard-clipped at end ("76I", (0, 76)), # 76 insertions, all counted in query ], )
58-76
: LGTM! Consider separating the tests for original and reversed CIGARs.The implementation of
test_query_alignment_offsets_failures
correctly tests for ValueError in cases where no bases are aligned. To improve clarity and make the tests more focused, consider separating the tests for original and reversed CIGAR strings into two different test functions.Here's a suggested refactor:
@pytest.mark.parametrize( "cigar_string", [ "10H", "10S", "10S10H", "5H10S10H", "76D", "10P76S", "50S1000N50S", ], ) def test_query_alignment_offsets_failures(cigar_string: str) -> None: cig = Cigar.from_cigarstring(cigar_string) with pytest.raises(ValueError): cig.query_alignment_offsets() @pytest.mark.parametrize( "cigar_string", [ "10H", "10S", "10S10H", "5H10S10H", "76D", "10P76S", "50S1000N50S", ], ) def test_query_alignment_offsets_failures_reversed(cigar_string: str) -> None: cig = Cigar.from_cigarstring(cigar_string) with pytest.raises(ValueError): cig.reversed().query_alignment_offsets()This separation makes it clearer that we're testing both the original and reversed CIGAR strings, and it allows for easier maintenance if the behavior for reversed CIGARs changes in the future.
79-98
: LGTM! Consider adding comments and aligning with the non-reversed test.The implementation of
test_query_alignment_offsets_reversed
is correct and covers various reversed CIGAR string scenarios. To improve consistency and readability, consider adding comments for each test case (as suggested for the non-reversed test) and aligning the test cases with those intest_query_alignment_offsets
.Here's an example of how you could improve the test:
@pytest.mark.parametrize( ("cigar_string", "expected_range"), [ ("10M", (0, 10)), # 10 matches, straightforward case ("10M10I", (0, 20)), # 10 matches followed by 10 insertions ("10X10I", (0, 20)), # 10 mismatches followed by 10 insertions ("10X10D", (0, 10)), # 10 mismatches followed by 10 deletions (not counted in query) ("10=10D", (0, 10)), # 10 matches followed by 10 deletions (not counted in query) ("10S10M", (0, 10)), # 10 soft-clipped bases followed by 10 matches (reversed) ("10H10M", (0, 10)), # 10 hard-clipped bases (not in query) followed by 10 matches ("10H10S10M", (0, 10)), # 10 hard-clipped, 10 soft-clipped, 10 matches (reversed) ("10H10S10M5S", (5, 15)), # 10 hard-clipped, 10 soft-clipped, 10 matches, 5 soft-clipped (reversed) ("10H10S10M5S10H", (5, 15)), # Same as above with 10 hard-clipped at end ("76I", (0, 76)), # 76 insertions, all counted in query ], ) def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: tuple) -> None: cig = Cigar.from_cigarstring(cigar_string) ret = cig.reversed().query_alignment_offsets() assert ret == expected_rangeThis aligns the test cases with the non-reversed test, making it easier to compare and understand the differences in behavior between regular and reversed CIGARs.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- fgpyo/sam/init.py (1 hunks)
- tests/fgpyo/sam/test_cigar.py (1 hunks)
🧰 Additional context used
🔇 Additional comments (1)
fgpyo/sam/__init__.py (1)
550-588
: LGTM! Well-implemented method with clear logic and good error handling.The
query_alignment_offsets
method is well-implemented, with clear logic for calculating alignment offsets. The docstring is comprehensive, and the code is well-commented. Good job on raising aValueError
when no bases are aligned.Consider simplifying the return statement:
- ret = (start_offset, end_offset) - if start_offset == end_offset: - raise ValueError(f"Cigar {self.__str__()} has no aligned bases") - return ret + if start_offset == end_offset: + raise ValueError(f"Cigar {self} has no aligned bases") + return start_offset, end_offsetThis change reduces a variable assignment and uses the implicit
__str__
call.To ensure robustness, consider adding a test case for a CIGAR string with only clipping operations (e.g., "10S20H"). Run the following script to check if such a test exists:
If no results are found, consider adding such a test case to verify the
ValueError
is raised correctly.✅ Verification successful
Test Cases Verified
Found test cases for clipping-only CIGAR operations in
tests/fgpyo/sam/test_cigar.py
, ensuring proper handling of edge cases.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for test cases with only clipping operations in CIGAR rg -i 'def test.*query_alignment_offsets' tests/fgpyo/sam/test_cigar.py -A 10 | rg -i '([0-9]+[SH])+"\)'Length of output: 140
fgpyo/sam/__init__.py
Outdated
ret = (start_offset, end_offset) | ||
if start_offset == end_offset: | ||
raise ValueError(f"Cigar {self.__str__()} has no aligned bases") | ||
return ret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just?
ret = (start_offset, end_offset) | |
if start_offset == end_offset: | |
raise ValueError(f"Cigar {self.__str__()} has no aligned bases") | |
return ret | |
if not alignment_began: | |
raise ValueError(f"Cigar {self.__str__()} has no aligned bases") | |
return start_offset, end_offset |
tests/fgpyo/sam/test_cigar.py
Outdated
("76I", (0, 76)), | ||
], | ||
) | ||
def test_query_alignment_offsets(cigar_string: str, expected_range: tuple) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix the type hints here, as well. Please try to avoid generic collection type hints without a parametrized type.
def test_query_alignment_offsets(cigar_string: str, expected_range: tuple) -> None: | |
def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None: |
tests/fgpyo/sam/test_cigar.py
Outdated
("10H10S10M5S10H", (5, 15)), | ||
], | ||
) | ||
def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: tuple) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: tuple) -> None: | |
def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: Tuple[int, int]) -> None: |
("76I", (0, 76)), | ||
], | ||
) | ||
def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None: | |
def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None: | |
"""query_alignment_offsets() should return the expected start and stop positions.""" |
) | ||
def test_query_alignment_offsets_reversed( | ||
cigar_string: str, expected_range: Tuple[int, int] | ||
) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) -> None: | |
) -> None: | |
"""query_alignment_offsets() should return the expected start and stop positions on a reversed CIGAR.""" |
("50S1000N50S"), | ||
], | ||
) | ||
def test_query_alignment_offsets_failures(cigar_string: str) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_query_alignment_offsets_failures(cigar_string: str) -> None: | |
def test_query_alignment_offsets_failures(cigar_string: str) -> None: | |
"""query_alignment_offsets() should raise a ValueError if the CIGAR has no aligned positions.""" |