feat: add cigar utility method get_alignment_offsets #188

yfarjoun · 2024-09-19T00:35:25Z

add cigar utility method get_alignment_offsets to find the offsets of the aligned segment in the read
tests included

TimD1

Looks good to me, once the ruff/mypy CI stuff passes

fgpyo/sam/__init__.py

codecov · 2024-09-19T01:06:47Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.48%. Comparing base (8c3c152) to head (b90b605).
Report is 14 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #188      +/-   ##
==========================================
+ Coverage   89.40%   89.48%   +0.07%     
==========================================
  Files          18       18              
  Lines        2105     2120      +15     
  Branches      467      471       +4     
==========================================
+ Hits         1882     1897      +15     
  Misses        146      146              
  Partials       77       77

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

yfarjoun · 2024-09-19T01:07:01Z

not sure what's going on the mkdocs. I had to comment out some lines to get it to compile.... if someone knows how to fix this please let me know...

msto

This will be a good addition to the package, thanks for adding it!

I think the documentation could be a bit clearer, and it looks like there's a bug when you have adjacent hard/soft clips

fgpyo/sam/__init__.py

tests/fgpyo/sam/test_cigar.py

yfarjoun · 2024-09-19T02:29:47Z

Thanks for the detailed review @msto , much appreciated!

mkdocs is giving me grief and I was unable to make it play nice.... if someone knows why it barfs when I change the indentation or remove the final "comment" line... please let me know. this is the best (passing) state I could put it into....

yfarjoun · 2024-09-19T02:30:44Z

fgpyo/sam/__init__.py

@@ -368,7 +368,7 @@ class CigarOp(enum.Enum):
        code (int): The `~pysam` cigar operator code.
        character (int): The single character cigar operator.
        consumes_query (bool): True if this operator consumes query bases, False otherwise.
-        consumes_target (bool): True if this operator consumes target bases, False otherwise.
+        consumes_reference (bool): True if this operator consumes reference bases, False otherwise.


This is an unrelated fix. if deemed imporatant I could extract it to a separate PR.

@yfarjoun This uses target intentionally - not all alignments are between a read and a reference - what if you align two reads vs. each other, or two contigs vs. each other?

@tfenne While not all alignments are between a read and a reference, the CIGAR specification uses the terms "query" and "reference". I agree with Yossi, I find it clearer to have our documentation consistent with spec.

Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the
alignment to step along the query sequence and the reference sequence respectively

I thought I was correcting a broken link.... if this is intentional, I'll bring it back and let's haved the discussion in a different PR (if at all)

100% on different PR. Though I'll also argue there for maintaining "target".

Little known fact is that CIGAR actually predates SAM/BAM and Google believes it was first documented as part of exonerate which uses "query" and "target" 😛

tfenne · 2024-09-19T11:27:11Z

fgpyo/sam/__init__.py

@@ -368,7 +368,7 @@ class CigarOp(enum.Enum):
        code (int): The `~pysam` cigar operator code.
        character (int): The single character cigar operator.
        consumes_query (bool): True if this operator consumes query bases, False otherwise.
-        consumes_target (bool): True if this operator consumes target bases, False otherwise.
+        consumes_reference (bool): True if this operator consumes reference bases, False otherwise.


@yfarjoun This uses target intentionally - not all alignments are between a read and a reference - what if you align two reads vs. each other, or two contigs vs. each other?

fgpyo/sam/__init__.py

tfenne · 2024-09-19T11:36:11Z

fgpyo/sam/__init__.py

+                If the Cigar contains no alignment operators that consume sequence bases, or
+                only clipping operators, the start and end offsets will be the same value
+                (indicating an empty region). This shared value will be the offset to the first
+                base consumed by a non-clipping operator or the length of the read sequence if
+                there is no such base.


I think that either a) this function should return Optional[...] and return None when invoked on a cigar without alignment operations, or raise an error in that situation. There is no aligned start/end if there are no alignment operations.

I'd argue in favor of Optional vs raising an error, but would be comfortable either way

fgpyo/sam/__init__.py

tests/fgpyo/sam/test_cigar.py

fgpyo/sam/__init__.py

tfenne

@yfarjoun: @msto and I caught up in person and would request the following changes to this PR:

Could you please remove the reverse argument, and in the docstring mention that if you have a CIGAR from a negative strand read you might want to cigar.reversed().query_alignment_offsets()?
Can we please stick to just a Tuple[int, int] for the return type?
Instead of returning Optional .. could it just raise an exception at the end if it was called on a pathological CIGAR for which we can't generate this? That seems exceptional since it would be an empty cigar or a cigar that is all-clipping.

tfenne · 2024-10-17T23:56:32Z

fgpyo/sam/__init__.py

+        """Gets the 0-based, end-exclusive positions of the first and last aligned base in the
+        query. The resulting range will contain the range of positions in the SEQ string for
+        the bases that are aligned. If no bases are aligned, the return value will be None.


Suggested change

"""Gets the 0-based, end-exclusive positions of the first and last aligned base in the

query. The resulting range will contain the range of positions in the SEQ string for

the bases that are aligned. If no bases are aligned, the return value will be None.

"""

Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.

The resulting range will contain the range of positions in the SEQ string for

the bases that are aligned. If no bases are aligned, the return value will be None.

Suggested change is to adhere to Google style guidelines of a single summary sentence separated from the longer description.

yfarjoun · 2024-10-18T21:01:25Z

Thanks for the comments. done.

coderabbitai · 2024-10-18T21:01:37Z

Walkthrough

The changes introduced in the pull request include the addition of a new method, query_alignment_offsets, to the Cigar class located in fgpyo/sam/__init__.py. This method calculates the 0-based, end-exclusive positions of the first and last aligned bases in a query by iterating through the elements of the Cigar representation. It adjusts the start and end offsets based on the nature of the elements, specifically whether they are clipping or part of the alignment. If no bases are aligned, the method raises a ValueError. The implementation includes detailed docstrings for clarity.

Additionally, the test suite for the Cigar class has been expanded in tests/fgpyo/sam/test_cigar.py. Three new test functions have been added: test_query_alignment_offsets, test_query_alignment_offsets_failures, and test_query_alignment_offsets_reversed. These tests utilize parameterization to validate the functionality of the query_alignment_offsets method against various CIGAR strings, ensuring correct behavior and appropriate exception handling. Existing tests remain unchanged, maintaining the integrity of the overall test suite.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 367497d and b90b605.

📒 Files selected for processing (1)

tests/fgpyo/sam/test_cigar.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

tests/fgpyo/sam/test_cigar.py

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

…of the aligned segment in the read - tests included

Co-authored-by: Matt Stone <[email protected]>

…hods)

coderabbitai

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (3)

fgpyo/sam/__init__.py (2)
550-566: Improve docstring formatting and content.

The docstring has some formatting issues and could be more precise. Please consider the following improvements:

Adjust the indentation to align with the opening quotes.

Be more specific about the return value, mentioning it's a tuple of two integers.

Clarify the exception description.

Here's a suggested revision:
def query_alignment_offsets(self) -> tuple[int, int]:
    """Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.

    The resulting range will contain the positions in the SEQ string for the bases that are
    aligned. If counting from the end of the query is desired, use
    `cigar.reversed().query_alignment_offsets()`.

    Returns:
        A tuple (start, end) containing the start and end positions of the aligned part of the
        query. These offsets are 0-based and open-ended, with respect to the beginning of the
        query sequence.

    Raises:
        ValueError: If the CIGAR string indicates no aligned bases.
    """
Also, consider using Python 3.9+ type hinting for the return value: tuple[int, int].

567-589: LGTM! Implementation looks correct with a minor suggestion.

The implementation of query_alignment_offsets looks correct. It properly handles clipping operators and calculates the alignment offsets accurately. The method also raises a ValueError when no bases are aligned, as specified in the docstring.

Minor suggestion: Consider adding a type hint for the element variable in the for loop for better code readability:
for element in self.elements:
to:
for element in self.elements:  # type: CigarElement
This change is optional but can improve code clarity.
tests/fgpyo/sam/test_cigar.py (1)
62-62: Consistent use of Tuple from the typing module

In the function definition, maybe_range is annotated as Optional[tuple]. For consistency and to leverage the benefits of type hints, consider using Optional[Tuple] from the typing module, which you've already imported.

Apply this diff to improve type annotation consistency:
 def test_get_alignments(cigar_string: str, maybe_range: Optional[tuple]) -> None:
+def test_get_alignments(cigar_string: str, maybe_range: Optional[Tuple]) -> None:

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 8c3c152 and e2ba773.

📒 Files selected for processing (2)

fgpyo/sam/init.py (1 hunks)
tests/fgpyo/sam/test_cigar.py (2 hunks)

🧰 Additional context used

coderabbitai · 2024-10-18T21:07:07Z

fgpyo/sam/__init__.py

@@ -547,6 +547,46 @@ def length_on_target(self) -> int:
        """Returns the length of the alignment on the target sequence."""
        return sum([elem.length_on_target for elem in self.elements])

+    def query_alignment_offsets(self) -> tuple:


🛠️ Refactor suggestion

Consider renaming the method for clarity and consistency.

Based on past review comments and Python naming conventions, consider renaming the method to alignment_offsets and removing the query_ prefix. This follows the principle that getter methods typically don't use the get_ prefix in Python.

def alignment_offsets(self, reverse: bool = False) -> tuple[int, int]:

If you decide to keep the current name, ensure it's consistent with other method names in the class.

query here doesn't mean "get"

@yfarjoun, thank you for the clarification. I understand now that "query" refers to the query sequence in this context.

(*•.•)

✏️ Learnings added

Learnt from: yfarjoun PR: fulcrumgenomics/fgpyo#188 File: fgpyo/sam/__init__.py:550-550 Timestamp: 2024-10-21T13:26:33.869Z Learning: In this codebase, "query" in method names refers to the query sequence, not the verb "get".

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

coderabbitai · 2024-10-18T21:07:07Z

tests/fgpyo/sam/test_cigar.py

+@pytest.mark.parametrize(
+    ("cigar_string", "maybe_range"),
+    {
+        ("10M", (0, 10)),
+        ("10M10I", (0, 20)),
+        ("10X10I", (0, 20)),
+        ("10X10D", (0, 10)),
+        ("10=10D", (0, 10)),
+        ("10S10M", (0, 10)),
+        ("10H10M", (0, 10)),
+        ("10H10S10M", (0, 10)),
+        ("10H10S10M5S", (5, 15)),
+        ("10H10S10M5S10H", (5, 15)),
+        ("10H", None),
+        ("10S", None),
+        ("10S10H", None),
+        ("5H10S10H", None),
+    },
+)


⚠️ Potential issue

Use square brackets instead of curly braces in @pytest.mark.parametrize

Similar to the previous function, the parameter set here is enclosed in curly braces {}. Replace them with square brackets [] to correctly create a list of tuples for test parameters.

Apply this diff to correct the parameterization:

@pytest.mark.parametrize( ("cigar_string", "maybe_range"), - { + [ ("10M", (0, 10)), ("10M10I", (0, 20)), ("10X10I", (0, 20)), ("10X10D", (0, 10)), ("10=10D", (0, 10)), ("10S10M", (0, 10)), ("10H10M", (0, 10)), ("10H10S10M", (0, 10)), ("10H10S10M5S", (5, 15)), ("10H10S10M5S10H", (5, 15)), ("10H", None), ("10S", None), ("10S10H", None), ("5H10S10H", None), - }, + ], ) def test_get_alignments_reversed(cigar_string: str, maybe_range: Optional[Tuple]) -> None:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

@pytest.mark.parametrize(

("cigar_string", "maybe_range"),

{

("10M", (0, 10)),

("10M10I", (0, 20)),

("10X10I", (0, 20)),

("10X10D", (0, 10)),

("10=10D", (0, 10)),

("10S10M", (0, 10)),

("10H10M", (0, 10)),

("10H10S10M", (0, 10)),

("10H10S10M5S", (5, 15)),

("10H10S10M5S10H", (5, 15)),

("10H", None),

("10S", None),

("10S10H", None),

("5H10S10H", None),

},

)

@pytest.mark.parametrize(

("cigar_string", "maybe_range"),

[

("10M", (0, 10)),

("10M10I", (0, 20)),

("10X10I", (0, 20)),

("10X10D", (0, 10)),

("10=10D", (0, 10)),

("10S10M", (0, 10)),

("10H10M", (0, 10)),

("10H10S10M", (0, 10)),

("10H10S10M5S", (5, 15)),

("10H10S10M5S10H", (5, 15)),

("10H", None),

("10S", None),

("10S10H", None),

("5H10S10H", None),

],

)

coderabbitai · 2024-10-18T21:07:07Z

tests/fgpyo/sam/test_cigar.py

+@pytest.mark.parametrize(
+    ("cigar_string", "maybe_range"),
+    {
+        ("10M", (0, 10)),
+        ("10M10I", (0, 20)),
+        ("10X10I", (0, 20)),
+        ("10X10D", (0, 10)),
+        ("10=10D", (0, 10)),
+        ("10S10M", (10, 20)),
+        ("10H10M", (0, 10)),
+        ("10H10S10M", (10, 20)),
+        ("10H10S10M5S", (10, 20)),
+        ("10H10S10M5S10H", (10, 20)),
+        ("10H", None),
+        ("10S", None),
+        ("10S10H", None),
+        ("5H10S10H", None),
+        ("76D", None),
+        ("76I", (0, 76)),
+        ("10P76S", None),
+        ("50S1000N50S", None),
+    },
+)


⚠️ Potential issue

Use square brackets instead of curly braces in @pytest.mark.parametrize

In the @pytest.mark.parametrize decorator, the parameter set is enclosed in curly braces {}, which creates a set. Sets are unordered and do not allow duplicate elements, which can lead to unpredictable behavior or errors in tests. Instead, you should use square brackets [] to create a list of tuples.

Apply this diff to correct the parameterization:

@pytest.mark.parametrize( ("cigar_string", "maybe_range"), - { + [ ("10M", (0, 10)), ("10M10I", (0, 20)), ("10X10I", (0, 20)), ("10X10D", (0, 10)), ("10=10D", (0, 10)), ("10S10M", (10, 20)), ("10H10M", (0, 10)), ("10H10S10M", (10, 20)), ("10H10S10M5S", (10, 20)), ("10H10S10M5S10H", (10, 20)), ("10H", None), ("10S", None), ("10S10H", None), ("5H10S10H", None), ("76D", None), ("76I", (0, 76)), ("10P76S", None), ("50S1000N50S", None), - }, + ], ) def test_get_alignments(cigar_string: str, maybe_range: Optional[Tuple]) -> None:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

@pytest.mark.parametrize(

("cigar_string", "maybe_range"),

{

("10M", (0, 10)),

("10M10I", (0, 20)),

("10X10I", (0, 20)),

("10X10D", (0, 10)),

("10=10D", (0, 10)),

("10S10M", (10, 20)),

("10H10M", (0, 10)),

("10H10S10M", (10, 20)),

("10H10S10M5S", (10, 20)),

("10H10S10M5S10H", (10, 20)),

("10H", None),

("10S", None),

("10S10H", None),

("5H10S10H", None),

("76D", None),

("76I", (0, 76)),

("10P76S", None),

("50S1000N50S", None),

},

)

@pytest.mark.parametrize(

("cigar_string", "maybe_range"),

[

("10M", (0, 10)),

("10M10I", (0, 20)),

("10X10I", (0, 20)),

("10X10D", (0, 10)),

("10=10D", (0, 10)),

("10S10M", (10, 20)),

("10H10M", (0, 10)),

("10H10S10M", (10, 20)),

("10H10S10M5S", (10, 20)),

("10H10S10M5S10H", (10, 20)),

("10H", None),

("10S", None),

("10S10H", None),

("5H10S10H", None),

("76D", None),

("76I", (0, 76)),

("10P76S", None),

("50S1000N50S", None),

],

)

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)

tests/fgpyo/sam/test_cigar.py (1)
1-2: Combine imports from the typing module

You can consolidate the imports from the typing module into a single line for brevity and clarity.

Apply this diff to combine the import statements:
-from typing import Optional
-from typing import Tuple
+from typing import Optional, Tuple

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e2ba773 and c420c5e.

📒 Files selected for processing (1)

tests/fgpyo/sam/test_cigar.py (2 hunks)

🧰 Additional context used

coderabbitai · 2024-10-21T13:30:01Z

tests/fgpyo/sam/test_cigar.py

+@pytest.mark.parametrize(
+    ("cigar_string", "maybe_range"),
+    [
+        ("10M", (0, 10)),
+        ("10M10I", (0, 20)),
+        ("10X10I", (0, 20)),
+        ("10X10D", (0, 10)),
+        ("10=10D", (0, 10)),
+        ("10S10M", (10, 20)),
+        ("10H10M", (0, 10)),
+        ("10H10S10M", (10, 20)),
+        ("10H10S10M5S", (10, 20)),
+        ("10H10S10M5S10H", (10, 20)),
+        ("10H", None),
+        ("10S", None),
+        ("10S10H", None),
+        ("5H10S10H", None),
+        ("76D", None),
+        ("76I", (0, 76)),
+        ("10P76S", None),
+        ("50S1000N50S", None),
+    ],
+)
+def test_get_alignments(cigar_string: str, maybe_range: Optional[tuple]) -> None:
+    cig = Cigar.from_cigarstring(cigar_string)
+    if not maybe_range:
+        with pytest.raises(ValueError):
+            cig.query_alignment_offsets()
+    else:
+        ret = cig.query_alignment_offsets()
+        assert ret == maybe_range
+


🛠️ Refactor suggestion

Refactor tests to reduce code duplication

The test functions test_get_alignments and test_get_alignments_reversed share similar logic, differing mainly in whether they test the reversed CIGAR strings. Consider combining them into a single parameterized test function with an additional parameter to indicate reversal. This will improve maintainability and reduce redundancy.

Here's how you might refactor the tests:

@pytest.mark.parametrize( ("cigar_string", "maybe_range", "reverse"), [ # Forward CIGAR strings ("10M", (0, 10), False), ("10M10I", (0, 20), False), ("10X10I", (0, 20), False), ("10X10D", (0, 10), False), ("10=10D", (0, 10), False), ("10S10M", (10, 20), False), ("10H10M", (0, 10), False), ("10H10S10M", (10, 20), False), ("10H10S10M5S", (10, 20), False), ("10H10S10M5S10H", (10, 20), False), ("10H", None, False), ("10S", None, False), ("10S10H", None, False), ("5H10S10H", None, False), ("76D", None, False), ("76I", (0, 76), False), ("10P76S", None, False), ("50S1000N50S", None, False), # Reversed CIGAR strings ("10M", (0, 10), True), ("10M10I", (0, 20), True), ("10X10I", (0, 20), True), ("10X10D", (0, 10), True), ("10=10D", (0, 10), True), ("10S10M", (0, 10), True), ("10H10M", (0, 10), True), ("10H10S10M", (0, 10), True), ("10H10S10M5S", (5, 15), True), ("10H10S10M5S10H", (5, 15), True), ("10H", None, True), ("10S", None, True), ("10S10H", None, True), ("5H10S10H", None, True), ], ) def test_query_alignment_offsets(cigar_string: str, maybe_range: Optional[Tuple[int, int]], reverse: bool) -> None: cig = Cigar.from_cigarstring(cigar_string) if reverse: cig = cig.reversed() if maybe_range is None: with pytest.raises(ValueError): cig.query_alignment_offsets() else: ret = cig.query_alignment_offsets() assert ret == maybe_range

Also applies to: 72-99

msto · 2024-10-21T13:56:28Z

fgpyo/sam/__init__.py

@@ -547,6 +547,46 @@ def length_on_target(self) -> int:
        """Returns the length of the alignment on the target sequence."""
        return sum([elem.length_on_target for elem in self.elements])

+    def query_alignment_offsets(self) -> tuple:


issue Provide a complete type hint for the return value.

Suggested change

def query_alignment_offsets(self) -> tuple:

def query_alignment_offsets(self) -> tuple[int, int]:

when I do this I get an error:

fgpyo/sam/__init__.py:550: error: "tuple" is not subscriptable, use "typing.Tuple" instead [misc] Found 1 error in 1 file (checked 52 source files)

so I switched to Tuple[int, int]

That's fine. Generic type hints for the standard library collections were only added in 3.9¹, and fgpyo still supports 3.8. (See #187)

The main point was to document the shape and type of the tuple's contents.

Footnotes

https://peps.python.org/pep-0585/ ↩

msto · 2024-10-21T13:57:49Z

fgpyo/sam/__init__.py

+        query.
+
+        The resulting range will contain the range of positions in the SEQ string for
+        the bases that are aligned. If no bases are aligned, the return value will be None.


issue I think this part of the docstring is stale - the current behavior is to throw an exception if no bases are aligned.

Suggested change

the bases that are aligned. If no bases are aligned, the return value will be None.

the bases that are aligned.

msto · 2024-10-21T13:58:07Z

fgpyo/sam/__init__.py

+        Returns:
+            A tuple (start, stop) containing the start and stop positions
+                of the aligned part of the query. These offsets are 0-based and open-ended, with
+                respect to the beginning of the query. If no bases are aligned.


Suggested change

respect to the beginning of the query. If no bases are aligned.

respect to the beginning of the query.

msto · 2024-10-21T13:59:30Z

fgpyo/sam/__init__.py

+                of the aligned part of the query. These offsets are 0-based and open-ended, with
+                respect to the beginning of the query. If no bases are aligned.
+
+        Throws:


issue The three special sections of a docstring are Args:, Returns: (or Yields:), and Raises:.

Suggested change

Throws:

Raises:

Raises:
List all exceptions that are relevant to the interface followed by a description. Use a similar exception name + colon + space or newline and hanging indent style as described in Args:. You should not document exceptions that get raised if the API specified in the docstring is violated (because this would paradoxically make behavior under violation of the API part of the API).

https://google.github.io/styleguide/pyguide.html#383-functions-and-methods

msto · 2024-10-21T14:01:55Z

tests/fgpyo/sam/test_cigar.py

+def test_get_alignments(cigar_string: str, maybe_range: Optional[tuple]) -> None:
+    cig = Cigar.from_cigarstring(cigar_string)
+    if not maybe_range:
+        with pytest.raises(ValueError):
+            cig.query_alignment_offsets()
+    else:
+        ret = cig.query_alignment_offsets()
+        assert ret == maybe_range


suggestion Why not have two separate test functions - one for the cases that return a tuple, and one for the cases that raise an error?

That simplifies a bit by getting rid of the Optional[tuple], which I think is an artifact of the old implementation that returned an Optional instead of raising an exception. And you don't need to pair None with every bad example.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)

tests/fgpyo/sam/test_cigar.py (3)

36-55: LGTM! Consider adding a descriptive comment for each test case.

The implementation of test_query_alignment_offsets is correct and covers various CIGAR string scenarios. To improve readability and maintainability, consider adding a brief comment for each test case explaining the CIGAR string's composition and why the expected range is correct.

Here's an example of how you could add comments:

@pytest.mark.parametrize(
    ("cigar_string", "expected_range"),
    [
        ("10M", (0, 10)),  # 10 matches, straightforward case
        ("10M10I", (0, 20)),  # 10 matches followed by 10 insertions
        ("10X10I", (0, 20)),  # 10 mismatches followed by 10 insertions
        ("10X10D", (0, 10)),  # 10 mismatches followed by 10 deletions (not counted in query)
        ("10=10D", (0, 10)),  # 10 matches followed by 10 deletions (not counted in query)
        ("10S10M", (10, 20)),  # 10 soft-clipped bases followed by 10 matches
        ("10H10M", (0, 10)),  # 10 hard-clipped bases (not in query) followed by 10 matches
        ("10H10S10M", (10, 20)),  # 10 hard-clipped, 10 soft-clipped, 10 matches
        ("10H10S10M5S", (10, 20)),  # 10 hard-clipped, 10 soft-clipped, 10 matches, 5 soft-clipped
        ("10H10S10M5S10H", (10, 20)),  # Same as above with 10 hard-clipped at end
        ("76I", (0, 76)),  # 76 insertions, all counted in query
    ],
)

58-76: LGTM! Consider separating the tests for original and reversed CIGARs.

The implementation of test_query_alignment_offsets_failures correctly tests for ValueError in cases where no bases are aligned. To improve clarity and make the tests more focused, consider separating the tests for original and reversed CIGAR strings into two different test functions.

Here's a suggested refactor:

@pytest.mark.parametrize(
    "cigar_string",
    [
        "10H",
        "10S",
        "10S10H",
        "5H10S10H",
        "76D",
        "10P76S",
        "50S1000N50S",
    ],
)
def test_query_alignment_offsets_failures(cigar_string: str) -> None:
    cig = Cigar.from_cigarstring(cigar_string)
    with pytest.raises(ValueError):
        cig.query_alignment_offsets()

@pytest.mark.parametrize(
    "cigar_string",
    [
        "10H",
        "10S",
        "10S10H",
        "5H10S10H",
        "76D",
        "10P76S",
        "50S1000N50S",
    ],
)
def test_query_alignment_offsets_failures_reversed(cigar_string: str) -> None:
    cig = Cigar.from_cigarstring(cigar_string)
    with pytest.raises(ValueError):
        cig.reversed().query_alignment_offsets()

This separation makes it clearer that we're testing both the original and reversed CIGAR strings, and it allows for easier maintenance if the behavior for reversed CIGARs changes in the future.

79-98: LGTM! Consider adding comments and aligning with the non-reversed test.

The implementation of test_query_alignment_offsets_reversed is correct and covers various reversed CIGAR string scenarios. To improve consistency and readability, consider adding comments for each test case (as suggested for the non-reversed test) and aligning the test cases with those in test_query_alignment_offsets.

Here's an example of how you could improve the test:

@pytest.mark.parametrize(
    ("cigar_string", "expected_range"),
    [
        ("10M", (0, 10)),  # 10 matches, straightforward case
        ("10M10I", (0, 20)),  # 10 matches followed by 10 insertions
        ("10X10I", (0, 20)),  # 10 mismatches followed by 10 insertions
        ("10X10D", (0, 10)),  # 10 mismatches followed by 10 deletions (not counted in query)
        ("10=10D", (0, 10)),  # 10 matches followed by 10 deletions (not counted in query)
        ("10S10M", (0, 10)),  # 10 soft-clipped bases followed by 10 matches (reversed)
        ("10H10M", (0, 10)),  # 10 hard-clipped bases (not in query) followed by 10 matches
        ("10H10S10M", (0, 10)),  # 10 hard-clipped, 10 soft-clipped, 10 matches (reversed)
        ("10H10S10M5S", (5, 15)),  # 10 hard-clipped, 10 soft-clipped, 10 matches, 5 soft-clipped (reversed)
        ("10H10S10M5S10H", (5, 15)),  # Same as above with 10 hard-clipped at end
        ("76I", (0, 76)),  # 76 insertions, all counted in query
    ],
)
def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: tuple) -> None:
    cig = Cigar.from_cigarstring(cigar_string)
    ret = cig.reversed().query_alignment_offsets()
    assert ret == expected_range

This aligns the test cases with the non-reversed test, making it easier to compare and understand the differences in behavior between regular and reversed CIGARs.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between c420c5e and 3449432.

📒 Files selected for processing (2)

fgpyo/sam/init.py (1 hunks)
tests/fgpyo/sam/test_cigar.py (1 hunks)

🧰 Additional context used

🔇 Additional comments (1)

fgpyo/sam/__init__.py (1)
550-588: LGTM! Well-implemented method with clear logic and good error handling.

The query_alignment_offsets method is well-implemented, with clear logic for calculating alignment offsets. The docstring is comprehensive, and the code is well-commented. Good job on raising a ValueError when no bases are aligned.

Consider simplifying the return statement:
-        ret = (start_offset, end_offset)
-        if start_offset == end_offset:
-            raise ValueError(f"Cigar {self.__str__()} has no aligned bases")
-        return ret
+        if start_offset == end_offset:
+            raise ValueError(f"Cigar {self} has no aligned bases")
+        return start_offset, end_offset
This change reduces a variable assignment and uses the implicit __str__ call.

To ensure robustness, consider adding a test case for a CIGAR string with only clipping operations (e.g., "10S20H"). Run the following script to check if such a test exists:

If no results are found, consider adding such a test case to verify the ValueError is raised correctly.

✅ Verification successful

Test Cases Verified

Found test cases for clipping-only CIGAR operations in tests/fgpyo/sam/test_cigar.py, ensuring proper handling of edge cases.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check for test cases with only clipping operations in CIGAR
rg -i 'def test.*query_alignment_offsets' tests/fgpyo/sam/test_cigar.py -A 10 | rg -i '([0-9]+[SH])+"\)'
Length of output: 140

tfenne · 2024-10-21T14:43:02Z

fgpyo/sam/__init__.py

+        ret = (start_offset, end_offset)
+        if start_offset == end_offset:
+            raise ValueError(f"Cigar {self.__str__()} has no aligned bases")
+        return ret


Why not just?

Suggested change

ret = (start_offset, end_offset)

if start_offset == end_offset:

raise ValueError(f"Cigar {self.__str__()} has no aligned bases")

return ret

if not alignment_began:

raise ValueError(f"Cigar {self.__str__()} has no aligned bases")

return start_offset, end_offset

msto · 2024-10-21T14:47:25Z

tests/fgpyo/sam/test_cigar.py

+        ("76I", (0, 76)),
+    ],
+)
+def test_query_alignment_offsets(cigar_string: str, expected_range: tuple) -> None:


Please fix the type hints here, as well. Please try to avoid generic collection type hints without a parametrized type.

Suggested change

def test_query_alignment_offsets(cigar_string: str, expected_range: tuple) -> None:

def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None:

msto · 2024-10-21T14:47:40Z

tests/fgpyo/sam/test_cigar.py

+        ("10H10S10M5S10H", (5, 15)),
+    ],
+)
+def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: tuple) -> None:


ditto

Suggested change

def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: tuple) -> None:

def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: Tuple[int, int]) -> None:

fgpyo/sam/__init__.py

too much

msto · 2024-10-21T15:46:57Z

tests/fgpyo/sam/test_cigar.py

+        ("76I", (0, 76)),
+    ],
+)
+def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None:


Suggested change

def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None:

def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None:

"""query_alignment_offsets() should return the expected start and stop positions."""

msto · 2024-10-21T15:47:29Z

tests/fgpyo/sam/test_cigar.py

+)
+def test_query_alignment_offsets_reversed(
+    cigar_string: str, expected_range: Tuple[int, int]
+) -> None:


Suggested change

) -> None:

) -> None:

"""query_alignment_offsets() should return the expected start and stop positions on a reversed CIGAR."""

msto · 2024-10-21T15:47:47Z

tests/fgpyo/sam/test_cigar.py

+        ("50S1000N50S"),
+    ],
+)
+def test_query_alignment_offsets_failures(cigar_string: str) -> None:


Suggested change

def test_query_alignment_offsets_failures(cigar_string: str) -> None:

def test_query_alignment_offsets_failures(cigar_string: str) -> None:

"""query_alignment_offsets() should raise a ValueError if the CIGAR has no aligned positions."""

yfarjoun requested review from nh13 and tfenne as code owners September 19, 2024 00:35

TimD1 approved these changes Sep 19, 2024

View reviewed changes

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved

msto requested changes Sep 19, 2024

View reviewed changes

yfarjoun force-pushed the yf_add_cigar_utility_methods branch from 36be5bb to f371b6c Compare September 19, 2024 02:24

yfarjoun requested a review from msto September 19, 2024 02:24

yfarjoun commented Sep 19, 2024

View reviewed changes

tfenne requested changes Sep 19, 2024

View reviewed changes

yfarjoun changed the title ~~- add cigar utility method get_alignment_offsets~~ Feat: add cigar utility method get_alignment_offsets Sep 19, 2024

yfarjoun requested a review from tfenne September 19, 2024 16:32

yfarjoun force-pushed the yf_add_cigar_utility_methods branch from 6ab8245 to b07c3e4 Compare September 19, 2024 16:43

msto requested changes Sep 19, 2024

View reviewed changes

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved

fgpyo/sam/__init__.py Show resolved Hide resolved

msto changed the title ~~Feat: add cigar utility method get_alignment_offsets~~ feat: add cigar utility method get_alignment_offsets Sep 19, 2024

msto reviewed Oct 3, 2024

View reviewed changes

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved

tfenne requested changes Oct 17, 2024

View reviewed changes

tfenne assigned yfarjoun Oct 17, 2024

yfarjoun and others added 9 commits October 18, 2024 17:02

- add cigar utility method get_alignment_offsets to find the offsets …

3388c59

…of the aligned segment in the read - tests included

Update fgpyo/sam/__init__.py

41cc962

Co-authored-by: Matt Stone <[email protected]>

Update tests/fgpyo/sam/test_cigar.py

e682dcf

Co-authored-by: Matt Stone <[email protected]>

- responded to review comments.

33452b9

responding to review comments

8549205

docs

c9bf15e

rename

f1edfc1

reformat again

af6a3a0

added "RangeOfBases class to hold the results" (with some utility met…

8111455

…hods)

yfarjoun added 3 commits October 18, 2024 17:02

types...

593ed57

responding to review comments.

6a548cd

documentation

e2ba773

yfarjoun force-pushed the yf_add_cigar_utility_methods branch from e9c0b25 to e2ba773 Compare October 18, 2024 21:02

coderabbitai bot requested changes Oct 18, 2024

View reviewed changes

update thanks to code-rabbit ai

c420c5e

coderabbitai bot previously requested changes Oct 21, 2024

View reviewed changes

msto requested changes Oct 21, 2024

View reviewed changes

respond to review comments

3449432

coderabbitai bot reviewed Oct 21, 2024

View reviewed changes

tfenne approved these changes Oct 21, 2024

View reviewed changes

msto requested changes Oct 21, 2024

View reviewed changes

yfarjoun added 3 commits October 21, 2024 10:54

Tuple[int, int] in tests as well

9ab330a

incorporate a comment

01e940a

ruff

253bde2

msto reviewed Oct 21, 2024

View reviewed changes

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved

msto reviewed Oct 21, 2024

View reviewed changes

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved

pretty little things

367497d

msto approved these changes Oct 21, 2024

View reviewed changes

document tests

b90b605

yfarjoun merged commit 12c2d31 into main Oct 24, 2024
9 checks passed

yfarjoun deleted the yf_add_cigar_utility_methods branch October 24, 2024 15:58

	def query_alignment_offsets(self) -> tuple:
	def query_alignment_offsets(self) -> tuple[int, int]:

	the bases that are aligned. If no bases are aligned, the return value will be None.
	the bases that are aligned.

	respect to the beginning of the query. If no bases are aligned.
	respect to the beginning of the query.

	def test_query_alignment_offsets(cigar_string: str, expected_range: tuple) -> None:
	def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None:

	def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: tuple) -> None:
	def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: Tuple[int, int]) -> None:

	def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None:
	def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None:
	"""query_alignment_offsets() should return the expected start and stop positions."""

	) -> None:
	) -> None:
	"""query_alignment_offsets() should return the expected start and stop positions on a reversed CIGAR."""

	def test_query_alignment_offsets_failures(cigar_string: str) -> None:
	def test_query_alignment_offsets_failures(cigar_string: str) -> None:
	"""query_alignment_offsets() should raise a ValueError if the CIGAR has no aligned positions."""

feat: add cigar utility method get_alignment_offsets #188

feat: add cigar utility method get_alignment_offsets #188

Conversation

yfarjoun commented Sep 19, 2024

TimD1 left a comment

Choose a reason for hiding this comment

codecov bot commented Sep 19, 2024 • edited Loading

Codecov Report

yfarjoun commented Sep 19, 2024

msto left a comment

Choose a reason for hiding this comment

yfarjoun commented Sep 19, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tfenne left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yfarjoun commented Oct 18, 2024

coderabbitai bot commented Oct 18, 2024 • edited Loading

Walkthrough

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Oct 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot Oct 21, 2024

Choose a reason for hiding this comment

coderabbitai bot Oct 18, 2024

Choose a reason for hiding this comment

coderabbitai bot Oct 18, 2024

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Oct 21, 2024

Choose a reason for hiding this comment

msto Oct 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Footnotes

Choose a reason for hiding this comment

Choose a reason for hiding this comment

msto Oct 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Sep 19, 2024 •

edited

Loading

coderabbitai bot commented Oct 18, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

coderabbitai bot Oct 18, 2024 •

edited

Loading

msto Oct 21, 2024 •

edited

Loading

msto Oct 21, 2024 •

edited

Loading