Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add cigar utility method get_alignment_offsets #188

Merged
merged 19 commits into from
Oct 24, 2024

Conversation

yfarjoun
Copy link
Contributor

  • add cigar utility method get_alignment_offsets to find the offsets of the aligned segment in the read
  • tests included

Copy link
Contributor

@TimD1 TimD1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, once the ruff/mypy CI stuff passes

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
Copy link

codecov bot commented Sep 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.48%. Comparing base (8c3c152) to head (b90b605).
Report is 14 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #188      +/-   ##
==========================================
+ Coverage   89.40%   89.48%   +0.07%     
==========================================
  Files          18       18              
  Lines        2105     2120      +15     
  Branches      467      471       +4     
==========================================
+ Hits         1882     1897      +15     
  Misses        146      146              
  Partials       77       77              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yfarjoun
Copy link
Contributor Author

not sure what's going on the mkdocs. I had to comment out some lines to get it to compile.... if someone knows how to fix this please let me know...

Copy link
Contributor

@msto msto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a good addition to the package, thanks for adding it!

I think the documentation could be a bit clearer, and it looks like there's a bug when you have adjacent hard/soft clips

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
tests/fgpyo/sam/test_cigar.py Outdated Show resolved Hide resolved
tests/fgpyo/sam/test_cigar.py Outdated Show resolved Hide resolved
tests/fgpyo/sam/test_cigar.py Outdated Show resolved Hide resolved
tests/fgpyo/sam/test_cigar.py Outdated Show resolved Hide resolved
@yfarjoun yfarjoun force-pushed the yf_add_cigar_utility_methods branch from 36be5bb to f371b6c Compare September 19, 2024 02:24
@yfarjoun yfarjoun requested a review from msto September 19, 2024 02:24
@yfarjoun
Copy link
Contributor Author

Thanks for the detailed review @msto , much appreciated!

mkdocs is giving me grief and I was unable to make it play nice.... if someone knows why it barfs when I change the indentation or remove the final "comment" line... please let me know. this is the best (passing) state I could put it into....

@@ -368,7 +368,7 @@ class CigarOp(enum.Enum):
code (int): The `~pysam` cigar operator code.
character (int): The single character cigar operator.
consumes_query (bool): True if this operator consumes query bases, False otherwise.
consumes_target (bool): True if this operator consumes target bases, False otherwise.
consumes_reference (bool): True if this operator consumes reference bases, False otherwise.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unrelated fix. if deemed imporatant I could extract it to a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yfarjoun This uses target intentionally - not all alignments are between a read and a reference - what if you align two reads vs. each other, or two contigs vs. each other?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tfenne While not all alignments are between a read and a reference, the CIGAR specification uses the terms "query" and "reference". I agree with Yossi, I find it clearer to have our documentation consistent with spec.

Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the
alignment to step along the query sequence and the reference sequence respectively

Screenshot 2024-09-19 at 11 23 12 AM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I was correcting a broken link.... if this is intentional, I'll bring it back and let's haved the discussion in a different PR (if at all)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% on different PR. Though I'll also argue there for maintaining "target".

Little known fact is that CIGAR actually predates SAM/BAM and Google believes it was first documented as part of exonerate which uses "query" and "target" 😛

@@ -368,7 +368,7 @@ class CigarOp(enum.Enum):
code (int): The `~pysam` cigar operator code.
character (int): The single character cigar operator.
consumes_query (bool): True if this operator consumes query bases, False otherwise.
consumes_target (bool): True if this operator consumes target bases, False otherwise.
consumes_reference (bool): True if this operator consumes reference bases, False otherwise.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yfarjoun This uses target intentionally - not all alignments are between a read and a reference - what if you align two reads vs. each other, or two contigs vs. each other?

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
Comment on lines 563 to 567
If the Cigar contains no alignment operators that consume sequence bases, or
only clipping operators, the start and end offsets will be the same value
(indicating an empty region). This shared value will be the offset to the first
base consumed by a non-clipping operator or the length of the read sequence if
there is no such base.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that either a) this function should return Optional[...] and return None when invoked on a cigar without alignment operations, or raise an error in that situation. There is no aligned start/end if there are no alignment operations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue in favor of Optional vs raising an error, but would be comfortable either way

fgpyo/sam/__init__.py Show resolved Hide resolved
tests/fgpyo/sam/test_cigar.py Outdated Show resolved Hide resolved
tests/fgpyo/sam/test_cigar.py Outdated Show resolved Hide resolved
@yfarjoun yfarjoun changed the title - add cigar utility method get_alignment_offsets Feat: add cigar utility method get_alignment_offsets Sep 19, 2024
@yfarjoun yfarjoun requested a review from tfenne September 19, 2024 16:32
@yfarjoun yfarjoun force-pushed the yf_add_cigar_utility_methods branch from 6ab8245 to b07c3e4 Compare September 19, 2024 16:43
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Show resolved Hide resolved
@msto msto changed the title Feat: add cigar utility method get_alignment_offsets feat: add cigar utility method get_alignment_offsets Sep 19, 2024
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
Copy link
Member

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yfarjoun: @msto and I caught up in person and would request the following changes to this PR:

  1. Could you please remove the reverse argument, and in the docstring mention that if you have a CIGAR from a negative strand read you might want to cigar.reversed().query_alignment_offsets()?
  2. Can we please stick to just a Tuple[int, int] for the return type?
  3. Instead of returning Optional .. could it just raise an exception at the end if it was called on a pathological CIGAR for which we can't generate this? That seems exceptional since it would be an empty cigar or a cigar that is all-clipping.

Comment on lines 592 to 594
"""Gets the 0-based, end-exclusive positions of the first and last aligned base in the
query. The resulting range will contain the range of positions in the SEQ string for
the bases that are aligned. If no bases are aligned, the return value will be None.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Gets the 0-based, end-exclusive positions of the first and last aligned base in the
query. The resulting range will contain the range of positions in the SEQ string for
the bases that are aligned. If no bases are aligned, the return value will be None.
"""
Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.
The resulting range will contain the range of positions in the SEQ string for
the bases that are aligned. If no bases are aligned, the return value will be None.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change is to adhere to Google style guidelines of a single summary sentence separated from the longer description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@yfarjoun
Copy link
Contributor Author

Thanks for the comments. done.

Copy link
Contributor

coderabbitai bot commented Oct 18, 2024

Walkthrough

The changes introduced in the pull request include the addition of a new method, query_alignment_offsets, to the Cigar class located in fgpyo/sam/__init__.py. This method calculates the 0-based, end-exclusive positions of the first and last aligned bases in a query by iterating through the elements of the Cigar representation. It adjusts the start and end offsets based on the nature of the elements, specifically whether they are clipping or part of the alignment. If no bases are aligned, the method raises a ValueError. The implementation includes detailed docstrings for clarity.

Additionally, the test suite for the Cigar class has been expanded in tests/fgpyo/sam/test_cigar.py. Three new test functions have been added: test_query_alignment_offsets, test_query_alignment_offsets_failures, and test_query_alignment_offsets_reversed. These tests utilize parameterization to validate the functionality of the query_alignment_offsets method against various CIGAR strings, ensuring correct behavior and appropriate exception handling. Existing tests remain unchanged, maintaining the integrity of the overall test suite.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 367497d and b90b605.

📒 Files selected for processing (1)
  • tests/fgpyo/sam/test_cigar.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/fgpyo/sam/test_cigar.py

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@yfarjoun yfarjoun force-pushed the yf_add_cigar_utility_methods branch from e9c0b25 to e2ba773 Compare October 18, 2024 21:02
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (3)
fgpyo/sam/__init__.py (2)

550-566: Improve docstring formatting and content.

The docstring has some formatting issues and could be more precise. Please consider the following improvements:

  1. Adjust the indentation to align with the opening quotes.
  2. Be more specific about the return value, mentioning it's a tuple of two integers.
  3. Clarify the exception description.

Here's a suggested revision:

def query_alignment_offsets(self) -> tuple[int, int]:
    """Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.

    The resulting range will contain the positions in the SEQ string for the bases that are
    aligned. If counting from the end of the query is desired, use
    `cigar.reversed().query_alignment_offsets()`.

    Returns:
        A tuple (start, end) containing the start and end positions of the aligned part of the
        query. These offsets are 0-based and open-ended, with respect to the beginning of the
        query sequence.

    Raises:
        ValueError: If the CIGAR string indicates no aligned bases.
    """

Also, consider using Python 3.9+ type hinting for the return value: tuple[int, int].


567-589: LGTM! Implementation looks correct with a minor suggestion.

The implementation of query_alignment_offsets looks correct. It properly handles clipping operators and calculates the alignment offsets accurately. The method also raises a ValueError when no bases are aligned, as specified in the docstring.

Minor suggestion: Consider adding a type hint for the element variable in the for loop for better code readability:

for element in self.elements:

to:

for element in self.elements:  # type: CigarElement

This change is optional but can improve code clarity.

tests/fgpyo/sam/test_cigar.py (1)

62-62: Consistent use of Tuple from the typing module

In the function definition, maybe_range is annotated as Optional[tuple]. For consistency and to leverage the benefits of type hints, consider using Optional[Tuple] from the typing module, which you've already imported.

Apply this diff to improve type annotation consistency:

 def test_get_alignments(cigar_string: str, maybe_range: Optional[tuple]) -> None:
+def test_get_alignments(cigar_string: str, maybe_range: Optional[Tuple]) -> None:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 8c3c152 and e2ba773.

📒 Files selected for processing (2)
  • fgpyo/sam/init.py (1 hunks)
  • tests/fgpyo/sam/test_cigar.py (2 hunks)
🧰 Additional context used

@@ -547,6 +547,46 @@ def length_on_target(self) -> int:
"""Returns the length of the alignment on the target sequence."""
return sum([elem.length_on_target for elem in self.elements])

def query_alignment_offsets(self) -> tuple:
Copy link
Contributor

@coderabbitai coderabbitai bot Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider renaming the method for clarity and consistency.

Based on past review comments and Python naming conventions, consider renaming the method to alignment_offsets and removing the query_ prefix. This follows the principle that getter methods typically don't use the get_ prefix in Python.

def alignment_offsets(self, reverse: bool = False) -> tuple[int, int]:

If you decide to keep the current name, ensure it's consistent with other method names in the class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query here doesn't mean "get"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yfarjoun, thank you for the clarification. I understand now that "query" refers to the query sequence in this context.

(*•.•)


✏️ Learnings added
Learnt from: yfarjoun
PR: fulcrumgenomics/fgpyo#188
File: fgpyo/sam/__init__.py:550-550
Timestamp: 2024-10-21T13:26:33.869Z
Learning: In this codebase, "query" in method names refers to the query sequence, not the verb "get".

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines 72 to 90
@pytest.mark.parametrize(
("cigar_string", "maybe_range"),
{
("10M", (0, 10)),
("10M10I", (0, 20)),
("10X10I", (0, 20)),
("10X10D", (0, 10)),
("10=10D", (0, 10)),
("10S10M", (0, 10)),
("10H10M", (0, 10)),
("10H10S10M", (0, 10)),
("10H10S10M5S", (5, 15)),
("10H10S10M5S10H", (5, 15)),
("10H", None),
("10S", None),
("10S10H", None),
("5H10S10H", None),
},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Use square brackets instead of curly braces in @pytest.mark.parametrize

Similar to the previous function, the parameter set here is enclosed in curly braces {}. Replace them with square brackets [] to correctly create a list of tuples for test parameters.

Apply this diff to correct the parameterization:

 @pytest.mark.parametrize(
     ("cigar_string", "maybe_range"),
-    {
+    [
         ("10M", (0, 10)),
         ("10M10I", (0, 20)),
         ("10X10I", (0, 20)),
         ("10X10D", (0, 10)),
         ("10=10D", (0, 10)),
         ("10S10M", (0, 10)),
         ("10H10M", (0, 10)),
         ("10H10S10M", (0, 10)),
         ("10H10S10M5S", (5, 15)),
         ("10H10S10M5S10H", (5, 15)),
         ("10H", None),
         ("10S", None),
         ("10S10H", None),
         ("5H10S10H", None),
-    },
+    ],
 )
 def test_get_alignments_reversed(cigar_string: str, maybe_range: Optional[Tuple]) -> None:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@pytest.mark.parametrize(
("cigar_string", "maybe_range"),
{
("10M", (0, 10)),
("10M10I", (0, 20)),
("10X10I", (0, 20)),
("10X10D", (0, 10)),
("10=10D", (0, 10)),
("10S10M", (0, 10)),
("10H10M", (0, 10)),
("10H10S10M", (0, 10)),
("10H10S10M5S", (5, 15)),
("10H10S10M5S10H", (5, 15)),
("10H", None),
("10S", None),
("10S10H", None),
("5H10S10H", None),
},
)
@pytest.mark.parametrize(
("cigar_string", "maybe_range"),
[
("10M", (0, 10)),
("10M10I", (0, 20)),
("10X10I", (0, 20)),
("10X10D", (0, 10)),
("10=10D", (0, 10)),
("10S10M", (0, 10)),
("10H10M", (0, 10)),
("10H10S10M", (0, 10)),
("10H10S10M5S", (5, 15)),
("10H10S10M5S10H", (5, 15)),
("10H", None),
("10S", None),
("10S10H", None),
("5H10S10H", None),
],
)

Comment on lines 39 to 61
@pytest.mark.parametrize(
("cigar_string", "maybe_range"),
{
("10M", (0, 10)),
("10M10I", (0, 20)),
("10X10I", (0, 20)),
("10X10D", (0, 10)),
("10=10D", (0, 10)),
("10S10M", (10, 20)),
("10H10M", (0, 10)),
("10H10S10M", (10, 20)),
("10H10S10M5S", (10, 20)),
("10H10S10M5S10H", (10, 20)),
("10H", None),
("10S", None),
("10S10H", None),
("5H10S10H", None),
("76D", None),
("76I", (0, 76)),
("10P76S", None),
("50S1000N50S", None),
},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Use square brackets instead of curly braces in @pytest.mark.parametrize

In the @pytest.mark.parametrize decorator, the parameter set is enclosed in curly braces {}, which creates a set. Sets are unordered and do not allow duplicate elements, which can lead to unpredictable behavior or errors in tests. Instead, you should use square brackets [] to create a list of tuples.

Apply this diff to correct the parameterization:

 @pytest.mark.parametrize(
     ("cigar_string", "maybe_range"),
-    {
+    [
         ("10M", (0, 10)),
         ("10M10I", (0, 20)),
         ("10X10I", (0, 20)),
         ("10X10D", (0, 10)),
         ("10=10D", (0, 10)),
         ("10S10M", (10, 20)),
         ("10H10M", (0, 10)),
         ("10H10S10M", (10, 20)),
         ("10H10S10M5S", (10, 20)),
         ("10H10S10M5S10H", (10, 20)),
         ("10H", None),
         ("10S", None),
         ("10S10H", None),
         ("5H10S10H", None),
         ("76D", None),
         ("76I", (0, 76)),
         ("10P76S", None),
         ("50S1000N50S", None),
-    },
+    ],
 )
 def test_get_alignments(cigar_string: str, maybe_range: Optional[Tuple]) -> None:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@pytest.mark.parametrize(
("cigar_string", "maybe_range"),
{
("10M", (0, 10)),
("10M10I", (0, 20)),
("10X10I", (0, 20)),
("10X10D", (0, 10)),
("10=10D", (0, 10)),
("10S10M", (10, 20)),
("10H10M", (0, 10)),
("10H10S10M", (10, 20)),
("10H10S10M5S", (10, 20)),
("10H10S10M5S10H", (10, 20)),
("10H", None),
("10S", None),
("10S10H", None),
("5H10S10H", None),
("76D", None),
("76I", (0, 76)),
("10P76S", None),
("50S1000N50S", None),
},
)
@pytest.mark.parametrize(
("cigar_string", "maybe_range"),
[
("10M", (0, 10)),
("10M10I", (0, 20)),
("10X10I", (0, 20)),
("10X10D", (0, 10)),
("10=10D", (0, 10)),
("10S10M", (10, 20)),
("10H10M", (0, 10)),
("10H10S10M", (10, 20)),
("10H10S10M5S", (10, 20)),
("10H10S10M5S10H", (10, 20)),
("10H", None),
("10S", None),
("10S10H", None),
("5H10S10H", None),
("76D", None),
("76I", (0, 76)),
("10P76S", None),
("50S1000N50S", None),
],
)

coderabbitai[bot]
coderabbitai bot previously requested changes Oct 21, 2024
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
tests/fgpyo/sam/test_cigar.py (1)

1-2: Combine imports from the typing module

You can consolidate the imports from the typing module into a single line for brevity and clarity.

Apply this diff to combine the import statements:

-from typing import Optional
-from typing import Tuple
+from typing import Optional, Tuple
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e2ba773 and c420c5e.

📒 Files selected for processing (1)
  • tests/fgpyo/sam/test_cigar.py (2 hunks)
🧰 Additional context used

Comment on lines 39 to 70
@pytest.mark.parametrize(
("cigar_string", "maybe_range"),
[
("10M", (0, 10)),
("10M10I", (0, 20)),
("10X10I", (0, 20)),
("10X10D", (0, 10)),
("10=10D", (0, 10)),
("10S10M", (10, 20)),
("10H10M", (0, 10)),
("10H10S10M", (10, 20)),
("10H10S10M5S", (10, 20)),
("10H10S10M5S10H", (10, 20)),
("10H", None),
("10S", None),
("10S10H", None),
("5H10S10H", None),
("76D", None),
("76I", (0, 76)),
("10P76S", None),
("50S1000N50S", None),
],
)
def test_get_alignments(cigar_string: str, maybe_range: Optional[tuple]) -> None:
cig = Cigar.from_cigarstring(cigar_string)
if not maybe_range:
with pytest.raises(ValueError):
cig.query_alignment_offsets()
else:
ret = cig.query_alignment_offsets()
assert ret == maybe_range

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Refactor tests to reduce code duplication

The test functions test_get_alignments and test_get_alignments_reversed share similar logic, differing mainly in whether they test the reversed CIGAR strings. Consider combining them into a single parameterized test function with an additional parameter to indicate reversal. This will improve maintainability and reduce redundancy.

Here's how you might refactor the tests:

@pytest.mark.parametrize(
    ("cigar_string", "maybe_range", "reverse"),
    [
        # Forward CIGAR strings
        ("10M", (0, 10), False),
        ("10M10I", (0, 20), False),
        ("10X10I", (0, 20), False),
        ("10X10D", (0, 10), False),
        ("10=10D", (0, 10), False),
        ("10S10M", (10, 20), False),
        ("10H10M", (0, 10), False),
        ("10H10S10M", (10, 20), False),
        ("10H10S10M5S", (10, 20), False),
        ("10H10S10M5S10H", (10, 20), False),
        ("10H", None, False),
        ("10S", None, False),
        ("10S10H", None, False),
        ("5H10S10H", None, False),
        ("76D", None, False),
        ("76I", (0, 76), False),
        ("10P76S", None, False),
        ("50S1000N50S", None, False),
        # Reversed CIGAR strings
        ("10M", (0, 10), True),
        ("10M10I", (0, 20), True),
        ("10X10I", (0, 20), True),
        ("10X10D", (0, 10), True),
        ("10=10D", (0, 10), True),
        ("10S10M", (0, 10), True),
        ("10H10M", (0, 10), True),
        ("10H10S10M", (0, 10), True),
        ("10H10S10M5S", (5, 15), True),
        ("10H10S10M5S10H", (5, 15), True),
        ("10H", None, True),
        ("10S", None, True),
        ("10S10H", None, True),
        ("5H10S10H", None, True),
    ],
)
def test_query_alignment_offsets(cigar_string: str, maybe_range: Optional[Tuple[int, int]], reverse: bool) -> None:
    cig = Cigar.from_cigarstring(cigar_string)
    if reverse:
        cig = cig.reversed()
    if maybe_range is None:
        with pytest.raises(ValueError):
            cig.query_alignment_offsets()
    else:
        ret = cig.query_alignment_offsets()
        assert ret == maybe_range

Also applies to: 72-99

@@ -547,6 +547,46 @@ def length_on_target(self) -> int:
"""Returns the length of the alignment on the target sequence."""
return sum([elem.length_on_target for elem in self.elements])

def query_alignment_offsets(self) -> tuple:
Copy link
Contributor

@msto msto Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue Provide a complete type hint for the return value.

Suggested change
def query_alignment_offsets(self) -> tuple:
def query_alignment_offsets(self) -> tuple[int, int]:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I do this I get an error:

fgpyo/sam/__init__.py:550: error: "tuple" is not subscriptable, use "typing.Tuple" instead  [misc]
Found 1 error in 1 file (checked 52 source files)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I switched to Tuple[int, int]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. Generic type hints for the standard library collections were only added in 3.91, and fgpyo still supports 3.8. (See #187)

The main point was to document the shape and type of the tuple's contents.

Footnotes

  1. https://peps.python.org/pep-0585/

query.

The resulting range will contain the range of positions in the SEQ string for
the bases that are aligned. If no bases are aligned, the return value will be None.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue I think this part of the docstring is stale - the current behavior is to throw an exception if no bases are aligned.

Suggested change
the bases that are aligned. If no bases are aligned, the return value will be None.
the bases that are aligned.

Returns:
A tuple (start, stop) containing the start and stop positions
of the aligned part of the query. These offsets are 0-based and open-ended, with
respect to the beginning of the query. If no bases are aligned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
respect to the beginning of the query. If no bases are aligned.
respect to the beginning of the query.

of the aligned part of the query. These offsets are 0-based and open-ended, with
respect to the beginning of the query. If no bases are aligned.

Throws:
Copy link
Contributor

@msto msto Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue The three special sections of a docstring are Args:, Returns: (or Yields:), and Raises:.

Suggested change
Throws:
Raises:

Raises:
List all exceptions that are relevant to the interface followed by a description. Use a similar exception name + colon + space or newline and hanging indent style as described in Args:. You should not document exceptions that get raised if the API specified in the docstring is violated (because this would paradoxically make behavior under violation of the API part of the API).

https://google.github.io/styleguide/pyguide.html#383-functions-and-methods

Comment on lines 62 to 69
def test_get_alignments(cigar_string: str, maybe_range: Optional[tuple]) -> None:
cig = Cigar.from_cigarstring(cigar_string)
if not maybe_range:
with pytest.raises(ValueError):
cig.query_alignment_offsets()
else:
ret = cig.query_alignment_offsets()
assert ret == maybe_range
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion Why not have two separate test functions - one for the cases that return a tuple, and one for the cases that raise an error?

That simplifies a bit by getting rid of the Optional[tuple], which I think is an artifact of the old implementation that returned an Optional instead of raising an exception. And you don't need to pair None with every bad example.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
tests/fgpyo/sam/test_cigar.py (3)

36-55: LGTM! Consider adding a descriptive comment for each test case.

The implementation of test_query_alignment_offsets is correct and covers various CIGAR string scenarios. To improve readability and maintainability, consider adding a brief comment for each test case explaining the CIGAR string's composition and why the expected range is correct.

Here's an example of how you could add comments:

@pytest.mark.parametrize(
    ("cigar_string", "expected_range"),
    [
        ("10M", (0, 10)),  # 10 matches, straightforward case
        ("10M10I", (0, 20)),  # 10 matches followed by 10 insertions
        ("10X10I", (0, 20)),  # 10 mismatches followed by 10 insertions
        ("10X10D", (0, 10)),  # 10 mismatches followed by 10 deletions (not counted in query)
        ("10=10D", (0, 10)),  # 10 matches followed by 10 deletions (not counted in query)
        ("10S10M", (10, 20)),  # 10 soft-clipped bases followed by 10 matches
        ("10H10M", (0, 10)),  # 10 hard-clipped bases (not in query) followed by 10 matches
        ("10H10S10M", (10, 20)),  # 10 hard-clipped, 10 soft-clipped, 10 matches
        ("10H10S10M5S", (10, 20)),  # 10 hard-clipped, 10 soft-clipped, 10 matches, 5 soft-clipped
        ("10H10S10M5S10H", (10, 20)),  # Same as above with 10 hard-clipped at end
        ("76I", (0, 76)),  # 76 insertions, all counted in query
    ],
)

58-76: LGTM! Consider separating the tests for original and reversed CIGARs.

The implementation of test_query_alignment_offsets_failures correctly tests for ValueError in cases where no bases are aligned. To improve clarity and make the tests more focused, consider separating the tests for original and reversed CIGAR strings into two different test functions.

Here's a suggested refactor:

@pytest.mark.parametrize(
    "cigar_string",
    [
        "10H",
        "10S",
        "10S10H",
        "5H10S10H",
        "76D",
        "10P76S",
        "50S1000N50S",
    ],
)
def test_query_alignment_offsets_failures(cigar_string: str) -> None:
    cig = Cigar.from_cigarstring(cigar_string)
    with pytest.raises(ValueError):
        cig.query_alignment_offsets()

@pytest.mark.parametrize(
    "cigar_string",
    [
        "10H",
        "10S",
        "10S10H",
        "5H10S10H",
        "76D",
        "10P76S",
        "50S1000N50S",
    ],
)
def test_query_alignment_offsets_failures_reversed(cigar_string: str) -> None:
    cig = Cigar.from_cigarstring(cigar_string)
    with pytest.raises(ValueError):
        cig.reversed().query_alignment_offsets()

This separation makes it clearer that we're testing both the original and reversed CIGAR strings, and it allows for easier maintenance if the behavior for reversed CIGARs changes in the future.


79-98: LGTM! Consider adding comments and aligning with the non-reversed test.

The implementation of test_query_alignment_offsets_reversed is correct and covers various reversed CIGAR string scenarios. To improve consistency and readability, consider adding comments for each test case (as suggested for the non-reversed test) and aligning the test cases with those in test_query_alignment_offsets.

Here's an example of how you could improve the test:

@pytest.mark.parametrize(
    ("cigar_string", "expected_range"),
    [
        ("10M", (0, 10)),  # 10 matches, straightforward case
        ("10M10I", (0, 20)),  # 10 matches followed by 10 insertions
        ("10X10I", (0, 20)),  # 10 mismatches followed by 10 insertions
        ("10X10D", (0, 10)),  # 10 mismatches followed by 10 deletions (not counted in query)
        ("10=10D", (0, 10)),  # 10 matches followed by 10 deletions (not counted in query)
        ("10S10M", (0, 10)),  # 10 soft-clipped bases followed by 10 matches (reversed)
        ("10H10M", (0, 10)),  # 10 hard-clipped bases (not in query) followed by 10 matches
        ("10H10S10M", (0, 10)),  # 10 hard-clipped, 10 soft-clipped, 10 matches (reversed)
        ("10H10S10M5S", (5, 15)),  # 10 hard-clipped, 10 soft-clipped, 10 matches, 5 soft-clipped (reversed)
        ("10H10S10M5S10H", (5, 15)),  # Same as above with 10 hard-clipped at end
        ("76I", (0, 76)),  # 76 insertions, all counted in query
    ],
)
def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: tuple) -> None:
    cig = Cigar.from_cigarstring(cigar_string)
    ret = cig.reversed().query_alignment_offsets()
    assert ret == expected_range

This aligns the test cases with the non-reversed test, making it easier to compare and understand the differences in behavior between regular and reversed CIGARs.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between c420c5e and 3449432.

📒 Files selected for processing (2)
  • fgpyo/sam/init.py (1 hunks)
  • tests/fgpyo/sam/test_cigar.py (1 hunks)
🧰 Additional context used
🔇 Additional comments (1)
fgpyo/sam/__init__.py (1)

550-588: LGTM! Well-implemented method with clear logic and good error handling.

The query_alignment_offsets method is well-implemented, with clear logic for calculating alignment offsets. The docstring is comprehensive, and the code is well-commented. Good job on raising a ValueError when no bases are aligned.

Consider simplifying the return statement:

-        ret = (start_offset, end_offset)
-        if start_offset == end_offset:
-            raise ValueError(f"Cigar {self.__str__()} has no aligned bases")
-        return ret
+        if start_offset == end_offset:
+            raise ValueError(f"Cigar {self} has no aligned bases")
+        return start_offset, end_offset

This change reduces a variable assignment and uses the implicit __str__ call.

To ensure robustness, consider adding a test case for a CIGAR string with only clipping operations (e.g., "10S20H"). Run the following script to check if such a test exists:

If no results are found, consider adding such a test case to verify the ValueError is raised correctly.

✅ Verification successful

Test Cases Verified

Found test cases for clipping-only CIGAR operations in tests/fgpyo/sam/test_cigar.py, ensuring proper handling of edge cases.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for test cases with only clipping operations in CIGAR
rg -i 'def test.*query_alignment_offsets' tests/fgpyo/sam/test_cigar.py -A 10 | rg -i '([0-9]+[SH])+"\)'

Length of output: 140

Comment on lines 585 to 588
ret = (start_offset, end_offset)
if start_offset == end_offset:
raise ValueError(f"Cigar {self.__str__()} has no aligned bases")
return ret
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just?

Suggested change
ret = (start_offset, end_offset)
if start_offset == end_offset:
raise ValueError(f"Cigar {self.__str__()} has no aligned bases")
return ret
if not alignment_began:
raise ValueError(f"Cigar {self.__str__()} has no aligned bases")
return start_offset, end_offset

("76I", (0, 76)),
],
)
def test_query_alignment_offsets(cigar_string: str, expected_range: tuple) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the type hints here, as well. Please try to avoid generic collection type hints without a parametrized type.

Suggested change
def test_query_alignment_offsets(cigar_string: str, expected_range: tuple) -> None:
def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None:

("10H10S10M5S10H", (5, 15)),
],
)
def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: tuple) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Suggested change
def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: tuple) -> None:
def test_query_alignment_offsets_reversed(cigar_string: str, expected_range: Tuple[int, int]) -> None:

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
("76I", (0, 76)),
],
)
def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None:
def test_query_alignment_offsets(cigar_string: str, expected_range: Tuple[int, int]) -> None:
"""query_alignment_offsets() should return the expected start and stop positions."""

)
def test_query_alignment_offsets_reversed(
cigar_string: str, expected_range: Tuple[int, int]
) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> None:
) -> None:
"""query_alignment_offsets() should return the expected start and stop positions on a reversed CIGAR."""

("50S1000N50S"),
],
)
def test_query_alignment_offsets_failures(cigar_string: str) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_query_alignment_offsets_failures(cigar_string: str) -> None:
def test_query_alignment_offsets_failures(cigar_string: str) -> None:
"""query_alignment_offsets() should raise a ValueError if the CIGAR has no aligned positions."""

@yfarjoun yfarjoun merged commit 12c2d31 into main Oct 24, 2024
9 checks passed
@yfarjoun yfarjoun deleted the yf_add_cigar_utility_methods branch October 24, 2024 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants