
feat: Release v16.0.0 #1491

Merged: 43 commits, Nov 19, 2024

Commits
6a64ddc
remove gatk3
khurrammaqbool May 7, 2024
d800054
fix: Update scheduler script (#1372)
ivadym May 13, 2024
718aad0
changelog
khurrammaqbool May 30, 2024
5eaa7c6
push container
khurrammaqbool May 30, 2024
b8f8af6
stop push container
khurrammaqbool May 30, 2024
19eb398
changelog
khurrammaqbool May 31, 2024
828cf3a
update readthedocs
khurrammaqbool Jun 3, 2024
993b7ca
Merge pull request #1432 from Clinical-Genomics/feat/remove_gatk3
khurrammaqbool Jun 3, 2024
bce33da
add msisensorpro container
khurrammaqbool Jun 5, 2024
8929af5
split commands
khurrammaqbool Jun 5, 2024
45e704e
add ca-certificates
khurrammaqbool Jun 5, 2024
9893456
add path to INSTALL
khurrammaqbool Jun 5, 2024
59c6a59
remove version
khurrammaqbool Jun 5, 2024
98a7520
add container test
khurrammaqbool Jun 5, 2024
28547d5
fix typo
khurrammaqbool Jun 5, 2024
c0f33a5
refactor
khurrammaqbool Jun 12, 2024
7d529e6
fix command
khurrammaqbool Jun 13, 2024
38770a9
Merge pull request #1444 from Clinical-Genomics/feat/add_msisensor_co…
khurrammaqbool Jun 14, 2024
45ff291
chore: Update vcf2cytosure container (#1456)
ivadym Jun 25, 2024
f173c27
feat: update multiqc to 1.22.3 (#1441)
mathiasbio Jun 26, 2024
15762ab
Merge master into develop
ivadym Jun 26, 2024
6f48313
feat: add msisensorpro TN (#1454)
khurrammaqbool Jun 26, 2024
8ab0e4d
fix: update MSI table (#1459)
khurrammaqbool Jun 28, 2024
bb03e67
fix: CNVkit container (#1457)
ivadym Jul 2, 2024
aa6fef7
feat: add Sentieon path argument to config (#1461)
mathiasbio Jul 3, 2024
117ff7e
feat: add msi tn to storage (#1483)
khurrammaqbool Oct 11, 2024
4937d2d
chore: add QC criteria for lymphoma_MRD (#1479)
mathiasbio Oct 16, 2024
f654750
feat: deduplicate with UMIs (#1358)
mathiasbio Oct 16, 2024
ed830b1
fix: msi container (#1486)
khurrammaqbool Oct 16, 2024
5e7bb88
fix: somalier container (#1487)
khurrammaqbool Oct 16, 2024
fedcb8f
Merge branch 'develop' of github.com:Clinical-Genomics/BALSAMIC into …
mathiasbio Oct 17, 2024
0743161
fix: broken cache doc links (#1488)
mathiasbio Oct 17, 2024
2007528
Merge branch 'develop' of github.com:Clinical-Genomics/BALSAMIC into …
mathiasbio Oct 17, 2024
d3dc977
fix: umi coverage qc (#1490)
mathiasbio Oct 17, 2024
ce94f8d
Merge branch 'develop' of github.com:Clinical-Genomics/BALSAMIC into …
mathiasbio Oct 17, 2024
061c4bb
chore: update doc tool versions (#1489)
mathiasbio Oct 17, 2024
6e63310
Merge branch 'develop' of github.com:Clinical-Genomics/BALSAMIC into …
mathiasbio Oct 17, 2024
6b2e222
v16.0.0 changelog
mathiasbio Oct 17, 2024
d4b1de3
add multiqc to release
mathiasbio Oct 18, 2024
af52a45
fix manta path in container
mathiasbio Oct 19, 2024
03ddaf0
fix: vardict memory error (#1492)
khurrammaqbool Oct 21, 2024
7fc67a6
fix: Increase vardict tumor only cores allocation to 18 (#1495)
khurrammaqbool Oct 23, 2024
5699764
fix: tnscope found in (#1497)
mathiasbio Oct 26, 2024
2 changes: 1 addition & 1 deletion .github/workflows/black_linter.yml
@@ -11,4 +11,4 @@ jobs:
- uses: psf/black@stable
with:
options: "--check --verbose"
version: "22.3.0"
version: "23.7.0"
2 changes: 1 addition & 1 deletion .github/workflows/docker_build_publish_develop.yml
@@ -15,7 +15,7 @@ jobs:
strategy:
fail-fast: true
matrix:
container-name: [align_qc, annotate, ascatNgs, cadd, cnvkit, cnvpytor, coverage_qc, delly, gatk, htslib, purecn, somalier, varcall_py3, varcall_py27, vcf2cytosure]
container-name: [align_qc, annotate, ascatNgs, cadd, cnvkit, cnvpytor, coverage_qc, delly, gatk, htslib, multiqc, msisensorpro, purecn, somalier, varcall_py3, varcall_py27, vcf2cytosure]
steps:
- name: Git checkout
id: git_checkout
2 changes: 1 addition & 1 deletion .github/workflows/docker_build_publish_release.yml
@@ -11,7 +11,7 @@ jobs:
strategy:
fail-fast: true
matrix:
container-name: [align_qc, annotate, ascatNgs, cadd, cnvkit, cnvpytor, coverage_qc, delly, gatk, htslib, purecn, somalier, varcall_py3, varcall_py27, vcf2cytosure]
container-name: [align_qc, annotate, ascatNgs, cadd, cnvkit, cnvpytor, coverage_qc, delly, gatk, htslib, msisensorpro, multiqc, purecn, somalier, varcall_py3, varcall_py27, vcf2cytosure]
steps:
- name: Git checkout
id: git_checkout
4 changes: 2 additions & 2 deletions .github/workflows/urls_check.yml
@@ -13,7 +13,7 @@ jobs:
id: git_checkout
uses: actions/checkout@v3
- name: Link Checker
uses: lycheeverse/lychee-action@v1.8.0
uses: lycheeverse/lychee-action@v2.0.2
with:
args: --verbose './BALSAMIC/constants/cache.py' './docs/*.rst'
args: --max-redirects 10 --verbose './BALSAMIC/constants/cache.py' './docs/*.rst'
fail: true
37 changes: 37 additions & 0 deletions BALSAMIC/assets/scripts/cap_base_quality_in_bam.py
@@ -0,0 +1,37 @@
import click
import pysam
import numpy as np


@click.command()
@click.argument("input_bam", type=click.Path(exists=True))
@click.argument("output_bam", type=click.Path())
@click.option(
"--max-quality",
default=70,
type=int,
help="Maximum quality value to cap to.",
)
def cap_base_qualities(input_bam: str, output_bam: str, max_quality: int):
"""
Cap the base qualities in a BAM file.

Args:
input_bam (str): Input BAM file path.
output_bam (str): Output BAM file path.
max_quality (int): Maximum quality value to cap to.
"""
# Open input BAM file for reading
samfile = pysam.AlignmentFile(input_bam, "rb")
out_bam = pysam.AlignmentFile(output_bam, "wb", header=samfile.header)
for read in samfile.fetch():
qualities = np.array(read.query_qualities)
capped_qualities = np.minimum(qualities, max_quality)
# Update the base qualities in the read
read.query_qualities = capped_qualities.tolist()
# Write the modified read to the output BAM file
out_bam.write(read)
samfile.close()
out_bam.close()


if __name__ == "__main__":
cap_base_qualities()
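The capping step above is an elementwise clamp via `np.minimum`. A minimal standalone sketch of that logic, without the pysam I/O (which needs a real, indexed BAM), might look like:

```python
import numpy as np


def cap_qualities(qualities, max_quality=70):
    """Clamp each base quality to max_quality, mirroring the script's np.minimum step."""
    return np.minimum(np.array(qualities), max_quality).tolist()


# Qualities above the cap are reduced; the rest pass through unchanged.
print(cap_qualities([30, 70, 93, 12]))  # → [30, 70, 70, 12]
```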
39 changes: 29 additions & 10 deletions BALSAMIC/assets/scripts/collect_qc_metrics.py
@@ -144,7 +144,14 @@ def get_qc_supported_capture_kit(capture_kit, metrics: List[str]) -> str:
if k != "default":
available_panel_beds.append(k)

return next((i for i in available_panel_beds if i in capture_kit), None)
return next(
(
i
for i in available_panel_beds
if re.search(rf"{re.escape(i)}(?=_\d)", capture_kit)
),
None,
)


def get_requested_metrics(config: dict, metrics: dict) -> dict:
@@ -186,41 +193,53 @@ def get_metric_condition(
return req_metrics


def get_sample_id(multiqc_key: str) -> str:
"""Return the sample ID extracted from a MultiQC data JSON key.

Examples of possible sample formats from "report_saved_raw_data":
tumor.ACCXXXXXX
tumor.ACCXXXXXX_FR
ACCXXXXXX_align_sort_HMYLNDSXX_ACCXXXXXX_S165_L001

Returns:
str: The extracted sample ID in the ACCXXXXXX format.
"""
if "_align_sort_" in multiqc_key:
return multiqc_key.split("_")[0]
return multiqc_key.split(".")[1].split("_")[0]


def get_multiqc_metrics(config: dict, multiqc_data: dict) -> list:
"""Extracts and returns the requested metrics from a multiqc JSON file"""

requested_metrics = get_requested_metrics(config, METRICS)

def extract(data, output_metrics, sample=None, source=None):
def extract(data, output_metrics, multiqc_key=None, source=None):
"""Recursively fetch metrics data from a nested multiqc JSON"""

if isinstance(data, dict):
for k in data:
# Ignore UMI and reverse reads metrics
if "umi" not in k:
if k in requested_metrics:
# example of possible sample-formats below from "report_saved_raw_data":
# tumor.ACCXXXXXX
# tumor.ACCXXXXXX_FR
# extracted below for id to: ACCXXXXXX
output_metrics.append(
Metric(
id=sample.split(".")[1].split("_")[0],
id=get_sample_id(multiqc_key),
input=get_multiqc_data_source(
multiqc_data, sample, source
multiqc_data, multiqc_key, source
),
name=k,
step=source,
value=data[k],
condition=get_metric_condition(
config,
requested_metrics,
sample.split(".")[1].split("_")[0],
get_sample_id(multiqc_key),
k,
),
).model_dump()
)
extract(data[k], output_metrics, k, sample)
extract(data[k], output_metrics, k, multiqc_key)

return output_metrics

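The new lookahead regex accepts a panel-bed name only when it is immediately followed by `_<digit>` (a version-style suffix), which prevents prefix collisions between similarly named kits. A sketch of that check, using hypothetical kit names for illustration:

```python
import re


def matches_panel(panel_bed: str, capture_kit: str) -> bool:
    """True when panel_bed occurs in capture_kit immediately before '_<digit>'."""
    return re.search(rf"{re.escape(panel_bed)}(?=_\d)", capture_kit) is not None


# Hypothetical capture-kit file names (not from the PR):
print(matches_panel("gmsmyeloid", "gmsmyeloid_5.2_hg19_design.bed"))   # → True
print(matches_panel("gmsmyeloid", "gmsmyeloid_plus_hg19_design.bed"))  # → False
```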
44 changes: 44 additions & 0 deletions BALSAMIC/assets/scripts/extend_bedfile.py
@@ -0,0 +1,44 @@
import click


@click.command()
@click.argument("input_bedfile", type=click.Path(exists=True))
@click.argument("output_bedfile", type=click.Path())
@click.option(
"--extend-to-min-region-size",
default=100,
help="Extend regions shorter than this size to this minimum size.",
)
def extend_bedfile(
input_bedfile: str, output_bedfile: str, extend_to_min_region_size: int
):
"""
Process a BED file to ensure regions are at least a minimum size.

Args:
input_bedfile (str): Input BED file path.
output_bedfile (str): Output BED file path.
extend_to_min_region_size (int): Minimum region size to enforce.
"""
with open(input_bedfile, "r") as infile, open(output_bedfile, "w") as outfile:
for line in infile:
fields = line.strip().split("\t")

chrom: str = fields[0]
start = int(fields[1])
end = int(fields[2])

region_length: int = end - start
if region_length < extend_to_min_region_size:
center = (start + end) // 2
half_size = extend_to_min_region_size // 2
start = max(0, center - half_size)
end = center + half_size
if extend_to_min_region_size % 2 != 0:
end += 1

outfile.write(f"{chrom}\t{start}\t{end}\n")


if __name__ == "__main__":
extend_bedfile()
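The centering arithmetic in `extend_bedfile` can be illustrated on its own; this sketch reproduces the same computation outside the file I/O (note that clamping at position 0 can still leave a region shorter than the minimum, as in the original):

```python
def extend_region(start: int, end: int, min_size: int = 100):
    """Re-center a short region and pad it out to min_size, as in extend_bedfile."""
    if end - start >= min_size:
        return start, end
    center = (start + end) // 2
    half = min_size // 2
    new_start = max(0, center - half)
    new_end = center + half
    if min_size % 2 != 0:  # odd sizes get the extra base on the right
        new_end += 1
    return new_start, new_end


print(extend_region(120, 140))  # 20 bp region, centered at 130 → (80, 180)
print(extend_region(0, 300))    # already long enough → (0, 300)
```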
78 changes: 78 additions & 0 deletions BALSAMIC/assets/scripts/immediate_submit.py
@@ -0,0 +1,78 @@
"""Script to submit jobs to a cluster."""
import shutil
from typing import Any, Dict, List, Optional

import click
from snakemake import utils

from BALSAMIC.commands.options import (
OPTION_BENCHMARK,
OPTION_CLUSTER_ACCOUNT,
OPTION_CLUSTER_MAIL,
OPTION_CLUSTER_MAIL_TYPE,
OPTION_CLUSTER_PROFILE,
OPTION_CLUSTER_QOS,
)
from BALSAMIC.constants.cluster import QOS, ClusterProfile
from BALSAMIC.models.scheduler import Scheduler


@click.command()
@click.argument("case_id", nargs=1, required=True, type=click.STRING)
@click.argument("dependencies", nargs=-1, type=click.STRING)
@click.argument("job_script", nargs=1, type=click.Path(exists=True, resolve_path=True))
@OPTION_CLUSTER_ACCOUNT
@OPTION_BENCHMARK
@OPTION_CLUSTER_MAIL_TYPE
@OPTION_CLUSTER_MAIL
@OPTION_CLUSTER_PROFILE
@OPTION_CLUSTER_QOS
@click.option(
"--log-dir",
type=click.Path(exists=True, resolve_path=True),
required=True,
help="Logging directory path",
)
@click.option(
"--script-dir",
type=click.Path(exists=True, resolve_path=True),
required=True,
help="Script directory path",
)
def immediate_submit(
account: str,
case_id: str,
job_script: str,
log_dir: str,
profile: ClusterProfile,
script_dir: str,
benchmark: Optional[bool] = False,
dependencies: Optional[List[str]] = None,
mail_type: Optional[str] = None,
mail_user: Optional[str] = None,
qos: Optional[QOS] = QOS.LOW,
) -> None:
"""
Submits jobs to the cluster. Each job is submitted sequentially, and their respective job IDs are collected
from the output. These job IDs are then forwarded as dependencies to the subsequent jobs.
"""
job_script: str = shutil.copy2(src=job_script, dst=script_dir)
job_properties: Dict[str, Any] = utils.read_job_properties(job_script)
scheduler: Scheduler = Scheduler(
account=account,
benchmark=benchmark,
case_id=case_id,
dependencies=dependencies,
job_properties=job_properties,
job_script=job_script,
log_dir=log_dir,
mail_type=mail_type,
mail_user=mail_user,
profile=profile,
qos=qos,
)
scheduler.submit_job()


if __name__ == "__main__":
immediate_submit()
109 changes: 109 additions & 0 deletions BALSAMIC/assets/scripts/modify_tnscope_infofield.py
@@ -0,0 +1,109 @@
#!/usr/bin/env python
import vcfpy
import click
import sys
import logging
from typing import List, Optional

LOG = logging.getLogger(__name__)


def summarize_ad_to_dp(ad_list):
"""
Summarizes the AD (allelic depth) field into total DP (read depth).

Parameters:
ad_list (list): List of read depths supporting each allele, [ref_depth, alt1_depth, alt2_depth, ...]

Returns:
int: Total read depth (DP) across all alleles.
"""
if ad_list is None:
return 0 # Return 0 if AD field is not present
return sum(ad_list)


@click.command()
@click.argument("input_vcf", type=click.Path(exists=True))
@click.argument("output_vcf", type=click.Path())
def process_vcf(input_vcf: str, output_vcf: str):
"""
Processes the input VCF file and writes the updated information to the output VCF file.

INPUT_VCF: Path to the input VCF file.
OUTPUT_VCF: Path to the output VCF file.
"""

# Open the input VCF file
reader: vcfpy.Reader = vcfpy.Reader.from_path(input_vcf)

# Ensure the sample name is 'TUMOR'
sample_name: str = reader.header.samples.names[0]
if sample_name != "TUMOR":
LOG.error(
f"The first sample is named '{sample_name}', but 'TUMOR' is expected."
)
sys.exit(1)

# Add AF and DP fields to the header if not already present
if "AF" not in reader.header.info_ids():
reader.header.add_info_line(
vcfpy.OrderedDict(
[
("ID", "AF"),
("Number", "A"),
("Type", "Float"),
("Description", "Allele Frequency"),
]
)
)

if "DP" not in reader.header.info_ids():
reader.header.add_info_line(
vcfpy.OrderedDict(
[
("ID", "DP"),
("Number", "1"),
("Type", "Integer"),
("Description", "Total Depth"),
]
)
)

# Open the output VCF file for writing
with vcfpy.Writer.from_path(output_vcf, reader.header) as writer:
# Loop through each record (variant)
for record in reader:
# Get the TUMOR sample data
sample_index: int = reader.header.samples.names.index(sample_name)
tumor_call: vcfpy.Call = record.calls[sample_index]

# Check and process AD field
tumor_ad: Optional[List[int]] = tumor_call.data.get(
"AD", None
) # AD is a list [ref_count, alt_count]
if tumor_ad is None:
LOG.warning(
f"AD field is missing for record at position {record.POS} on {record.CHROM}"
)
else:
record.INFO["DP"] = summarize_ad_to_dp(tumor_ad)

# Check and process AF field
tumor_af: Optional[float] = tumor_call.data.get("AF", None)
if tumor_af is None:
LOG.warning(
f"AF field is missing for record at position {record.POS} on {record.CHROM}"
)
record.INFO["AF"] = [0.0] # Default AF to 0.0 if missing
else:
record.INFO["AF"] = [tumor_af] # Wrap AF in a list

# Write the updated record to the output VCF file
writer.write_record(record)

click.echo(f"VCF file processed and saved to {output_vcf}")


if __name__ == "__main__":
process_vcf()
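The DP summarization above is simply the sum of the per-allele depths in AD; the function below mirrors `summarize_ad_to_dp` so its behavior can be checked in isolation:

```python
from typing import List, Optional


def summarize_ad_to_dp(ad_list: Optional[List[int]]) -> int:
    """Total read depth (DP) is the sum of per-allele depths; missing AD yields 0."""
    if ad_list is None:
        return 0
    return sum(ad_list)


print(summarize_ad_to_dp([48, 12]))  # ref + alt depths → 60
print(summarize_ad_to_dp(None))      # AD absent → 0
```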