Merge pull request #47 from eastgenomics/development

1.2 - development -> main
eastgenomics · Jul 7, 2021 · 755255b · 755255b
2 parents c0fcfea + d283759
commit 755255b
Showing 10 changed files with 1,708 additions and 21,667 deletions.
diff --git a/README.md b/README.md
@@ -13,7 +13,7 @@ Athena is a tool to generate coverage statistics for NGS data, and combine these
 
 Dependencies may be installed from the requirements.txt file using ```pip install -r requirements.txt```.
 This should contains all the required python packages required to generate coverage statistics and reports.
-For optional calculating of variant coverage from VCFs, [BEDtools][bedtools-url] is also required to be installed.
+In addition, [BEDtools][bedtools-url] is also required to be installed and on path.
 
 Tested on Ubuntu 18.04.4 and macOS 10.15.4
 
@@ -22,35 +22,52 @@ Tested on Ubuntu 18.04.4 and macOS 10.15.4
 It is written to take in per base coverage data (as output from tools such as [mosdepth][mosdepth-url]) as input to calculate coverage for target regions defined in a bed file. <br></br>
 
 The general workflow for generating the statistics and report is as follows: <br>
-- Annotate each region of the bed file with the gene, exon and per base coverage data using `annotate_bed.sh`
+- Annotate each region of the bed file with the gene, exon and per base coverage data using `annotate_bed.py`
 - Generate per exon and per gene statistics using `coverage_stats_single.py`
 - Generate HTML coverage report with `coverage_report_single.py`
 
 For DNAnexus cloud platform users, an Athena [dx applet][dx-url] has also been built.
 
 
+### Expected file formats
+
+As a minimum, Athena requires 3 input files. These are a bed file for the gene panel, a file of transcript information and the output of your coverage tool (mosdepth, samtools etc.). These files MUST have the following columns:
+
+- panel bed file: `chromosome  start  end  transcript`
+- transcript file: `chromosome  start  end  gene  transcript  exon`
+- coverage file: `chromosome  start  end  coverage`
+
+n.b. the process for creating the transcript file may be found [here][transcript-file-url].
+
 ### Annotating BED file
-The BED file containing regions of interest is first required to be annotated with gene, exon and coverage information prior to analysis. This may be done using [BEDtools intersect][bedtools-intersect-url], with a file containing transcript to gene and exon information, and then the per base coverage data. <br>
+The BED file containing regions of interest is first required to be annotated with gene, exon and coverage information prior to analysis. This may be done using [BEDtools intersect][bedtools-intersect-url], with a file containing transcript to gene and exon information, and then the per base coverage data. Currently, 100% overlap is required between coordinates in the panel bed file and the transcript annotation file, therefore you must ensure any added flank regions etc. are the same.<br>
 
-Included is a Bash script (`annotate_bed.sh`) to perform the required BED file annotation.
+Included is a Python script (`annotate_bed.py`) to perform the required BED file annotation.
 
 Expected inputs:
 
 ```
--i : Input panel bed file; must have ONLY the following 4 columns chromosome, start position, end position, gene/transcript.
--g : Exons nirvana file, contains required gene and exon information.
--b : Per base coverage file (output from mosdepth or similar).
+-p / --panel_bed : Input panel bed file; must have ONLY the following 4 columns chromosome, start position, end position, gene/transcript
+
+-t / --transcript_file : Transcript annotation file, contains required gene and exon information. Must have ONLY the following 6 columns:
+chromosome, start, end, gene, transcript, exon
+
+-c / --coverage_file : Per base coverage file (output from mosdepth or similar)
+
+-s / -chunk_size : (optional) nrows to split per-base coverage file for intersecting, with <16GB RAM can lead to bedtools intersect failing. Reccomended values: 16GB RAM -> 20000000; 8GB RAM -> 10000000
+
+-n / --output_name : (optional) Prefix for naming output file, if not given will use name from per base coverage file
 
 Example usage:
 
-$ annotate_bed.sh -i panel_bed_file.bed -g exons_nirvana -b {input_file}.per_base.bed
+$ annotate_bed.py -p panel_bed_file.bed -t transcript_file.tsv -c {input_file}.per_base.bed
 ```
 <br>
 This wraps the bedtools intersect commands below. These commands are given as an example, the output file column ordering must match that given in /data/example example_annotated_bed for calculating coverage statistics:
 <br>
 
 ```
-$ bedtools intersect -a beds/sorted_bed_file.bed -b beds/exons_nirvana2010_no_PAR_Y_noflank.bed -wa -wb | awk 'OFS="\t" {if ($4 == $9) print}' | cut -f 1,2,3,8,9,10 > sample1_genes_exons.bed
+$ bedtools intersect -a beds/sorted_bed_file.bed -b beds/exons_nirvana2010_no_PAR_Y_noflank.bed -wa -wb -f 1.0 -r | awk 'OFS="\t" {if ($4 == $9) print}' | cut -f 1,2,3,8,9,10 > sample1_genes_exons.bed
 
     - sorted_bed_file.bed -- bed file defining regions of interest (columns: chromosome, start, end, transcript)
     - exons_nirvana2010_no_PAR_Y.bed -- a bed file containing transcript -> exon and gene information
@@ -108,13 +125,6 @@ $ python3 bin/coverage_report_single.py --gene_stats output/sample1-exon-coverag
 ```
 
 
-### For development
-
-Features to be developed:
-- Generate run level statistics from multiple samples
-- Generate run level report from multiple samples
-- Add interactive elements to tables to increase useability (i.e sorting, filtering, searching)
-
 Any bugs or suggestions for improvements please raise an issue.
 
 
@@ -130,3 +140,4 @@ Any bugs or suggestions for improvements please raise an issue.
 [mosdepth-url]: https://github.com/brentp/mosdepth
 
 [dx-url]: https://github.com/eastgenomics/eggd_athena
+[transcript-file-url]: https://cuhbioinformatics.atlassian.net/wiki/spaces/P/pages/2241101840/Generating+transcripts+file+for+Athena
diff --git a/bin/annotate_bed.py b/bin/annotate_bed.py
@@ -0,0 +1,253 @@
+"""
+Script to annotate a panel bed file with transcript information and per
+base coverage data.
+
+Requires: bedtools
+
+Jethro Rainford
+20/06/2021
+"""
+
+import argparse
+import os
+import pandas as pd
+from pathlib import Path
+import pybedtools as bedtools
+
+from load_data import loadData
+
+
+class annotateBed():
+
+    def add_transcript_info(self, panel_bed, transcript_info_df):
+        """
+        Use pybedtools to annotate panel bed file with coverage data
+        Args:
+            - panel_bed (df): panel bed file regions df
+            - transcript_info_df (df): transcript info file df
+        Returns:
+            - bed_w_transcript (df): panel bed file with transcript information
+        """
+        print("calling bedtools to add transcript info")
+
+        # get total number of transcripts before to ensure none are dropped
+        panel_transcripts = panel_bed.transcript.unique().tolist()
+
+        # turn dfs into BedTools objects
+        bed = bedtools.BedTool.from_dataframe(panel_bed)
+        transcript_info = bedtools.BedTool.from_dataframe(transcript_info_df)
+
+        # intersecting panel bed file with transcript/gene/exon information
+        # requires 100% overlap on panel -> transcript coordinates
+        bed_w_transcript = bed.intersect(
+            transcript_info, wa=True, wb=True, F=1.0
+        )
+
+        # convert pybedtools object to df
+        bed_w_transcript = bed_w_transcript.to_dataframe(names=[
+            "p_chrom", "p_start", "p_end", "p_transcript",
+            "t_chrom", "t_start", "t_end", "t_gene", "t_transcript", "t_exon"
+        ])
+
+        # check for empty file
+        assert len(bed_w_transcript.index) > 0, """Empty file returned from
+            intersecting panel bed and transcript file. Check if flanks are
+            being used as 100% coordinate overlap is currently required."""
+
+        # panel bed file defines transcript to use, filter transcript file for
+        # just those transcripts
+        bed_w_transcript = bed_w_transcript[
+            bed_w_transcript["p_transcript"] == bed_w_transcript["t_transcript"]
+        ]
+
+        # drop duplicate columns
+        bed_w_transcript = bed_w_transcript.drop(columns=[
+            't_chrom', 't_start', 't_end', 'p_transcript'
+        ])
+
+        intersect_transcripts = bed_w_transcript.t_transcript.unique().tolist()
+
+        # ensure no transcripts dropped from panel due to missing from
+        # transcripts file
+        assert len(panel_transcripts) == len(intersect_transcripts), (
+            f"Transcript(s) dropped from panel during intersecting with "
+            f"transcript file. Total before {len(panel_transcripts)}. Total "
+            f"after {len(intersect_transcripts)}. Dropped transcripts: "
+            f"{set(panel_transcripts) - set(intersect_transcripts)}"
+        )
+
+        return bed_w_transcript
+
+
+    def add_coverage(self, bed_w_transcript, coverage_df, chunks=False):
+        """
+        Use pybedtools to add coverage bin data to selected panel regions
+        Args:
+            - bed_w_transcript (df): panel bed file with transcript information
+            - coverage_df (df / list): coverage bin data df / list of dfs if
+                chunks value passed
+        Returns:
+            - bed_w_coverage (df): panel bed with transcript and coverage info
+        """
+        print("calling bedtools to add coverage info")
+
+        # turn dfs into BedTools objects
+        bed_w_transcript = bedtools.BedTool.from_dataframe(bed_w_transcript)
+
+        col_names = [
+            "t_chrom", "t_start", "t_end", "t_gene", "t_transcript", "t_exon",
+            "c_chrom", "cov_start", "cov_end", "cov"
+        ]
+
+        if not chunks:
+            # per-base coverage all in one df
+            coverage_df = bedtools.BedTool.from_dataframe(coverage_df)
+
+            bed_w_coverage = bed_w_transcript.intersect(
+                coverage_df, wa=True, wb=True
+            )
+            bed_w_coverage = bed_w_coverage.to_dataframe(names=col_names)
+        else:
+            # coverage data in chunks, loop over each df and intersect
+            bed_w_coverage = pd.DataFrame(columns=col_names)
+
+            for num, df in enumerate(coverage_df):
+                print(f"intersecting {num + 1}/{len(coverage_df)} coverage chunks")
+                # read each to bedtools object, intersect and add back to df
+                chunk_df = bedtools.BedTool.from_dataframe(df)
+
+                bed_w_coverage_chunk = bed_w_transcript.intersect(
+                    chunk_df, wa=True, wb=True
+                )
+
+                bed_w_coverage_chunk = bed_w_coverage_chunk.to_dataframe(
+                    names=col_names
+                )
+
+                bed_w_coverage = pd.concat(
+                    [bed_w_coverage, bed_w_coverage_chunk],
+                    ignore_index=True
+                )
+
+        # check again for empty output of bedtools, can happen due to memory
+        # maxing out and doesn't seem to raise an exception...
+        assert len(bed_w_coverage) > 0, """Error intersecting with coverage
+            data, empty file generated. Is this the correct coverage data for
+            the panel used? bedtools may also have reached memory limit and
+            died, try re-running with --chunk_size 1000000"""
+
+        # drop duplicate chromosome col and rename
+        bed_w_coverage.drop(columns=["c_chrom"], inplace=True)
+
+        bed_w_coverage.columns = [
+            "chrom", "exon_start", "exon_end", "gene", "tx", "exon",
+            "cov_start", "cov_end", "cov"
+        ]
+
+        return bed_w_coverage
+
+
+def write_file(bed_w_coverage, outfile):
+    """
+    Write annotated bed to file
+    Args:
+        - bed_w_coverage (df): bed file with transcript and coverage info
+        - output_prefix (str): prefix for naming output file
+    Outputs: annotated_bed.tsv
+    """
+    # tiny function but want this separate for writing a wrapper script later
+    bed_w_coverage.to_csv(outfile, sep="\t", header=False, index=False)
+    print(f"annotated bed file written to {outfile}")
+
+
+def parse_args():
+    """
+    Parse cmd line arguments
+
+    Args: None
+
+    Returns:
+        - args (arguments): args passed from cmd line
+    """
+    parser = argparse.ArgumentParser(
+        description='Annotate panel bed file with transcript & coverage data.'
+    )
+    parser.add_argument(
+        '--panel_bed', '-p',
+        help='panel bed file'
+    )
+    parser.add_argument(
+        '--transcript_file', '-t',
+        help='file with gene and exon information'
+    )
+    parser.add_argument(
+        '--coverage_file', '-c',
+        help='per base coverage data file'
+    )
+    parser.add_argument(
+        '--chunk_size', '-s', type=int,
+        help='number lines to read per-base coverage file in one go'
+    )
+    parser.add_argument(
+        '--output_name', '-n',
+        help='name preifx for output file, if none will use coverage file'
+    )
+
+    args = parser.parse_args()
+
+    return args
+
+
+def main():
+    annotate = annotateBed()
+    load = loadData()  # class of functions for reading in data
+
+    args = parse_args()
+
+    if not args.output_name:
+        # output name not defined, use sample identifier from coverage file
+        args.output_name = Path(args.coverage_file).name.split('_')[0]
+
+    # set dir for writing to
+    bin_dir = os.path.dirname(os.path.abspath(__file__))
+    out_dir = os.path.join(bin_dir, "../output/")
+    outfile_name = f"{args.output_name}_annotated.bed"
+    outfile = os.path.join(out_dir, outfile_name)
+
+    # read in files
+    panel_bed_df = load.read_panel_bed(args.panel_bed)
+    transcript_info_df = load.read_transcript_info(args.transcript_file)
+    pb_coverage_df = load.read_coverage_data(
+        args.coverage_file, args.chunk_size
+    )
+
+    # add transcript info
+    bed_w_transcript = annotate.add_transcript_info(
+        panel_bed_df, transcript_info_df
+    )
+
+    # add coverage
+    if args.chunk_size:
+        # per-base coverage split to multiple dfs to limit memory usage
+        bed_w_coverage = annotate.add_coverage(
+            bed_w_transcript, pb_coverage_df, chunks=True
+        )
+    else:
+        bed_w_coverage = annotate.add_coverage(
+            bed_w_transcript, pb_coverage_df, chunks=False
+        )
+
+    # sense check generated file isn't empty, should be caught earlier
+    assert len(bed_w_coverage.index) > 0, (
+        'An error has occured: annotated bed file is empty. This is likely ',
+        'due to an error in regions defined in bed file (i.e. different ',
+        'transcripts to those in the transcripts file). Start debugging by ',
+        'intersecting files manually...'
+    )
+
+    write_file(bed_w_coverage, outfile)
+
+
+if __name__ == "__main__":
+
+    main()
diff --git a/bin/annotate_bed.sh b/bin/annotate_bed.sh
diff --git a/bin/coverage_report_single.py b/bin/coverage_report_single.py
diff --git a/bin/coverage_stats_single.py b/bin/coverage_stats_single.py
@@ -7,6 +7,7 @@
 """
 
 import argparse
+from functools import partial
 import os
 import re
 import sys
@@ -165,17 +166,15 @@ def cov_stats(self, data, thresholds):
                 )
                 coords = list(np.unique(coords))
 
-                if len(coords) > 1:
-                    # more than one region for exon in bed, exit as will
-                    # be incorrectly calculated
-                    print(
-                        "More than one region is present in the bed file for "
-                        "exon {} of {}: {}.\n".format(exon, gene, coords),
-                        "Currently each exon MUST be in one pair of start / "
-                        "end coordinates else coverage values will be "
-                        "incorrect for those regions. Exiting now."
-                    )
-                    sys.exit()
+                # if more than one region for exon in bed, exit as will
+                # be incorrectly calculated
+                assert len(coords) == 1, (
+                    "More than one region is present in the bed file for "
+                    "exon {} of {}: {}.\n".format(exon, gene, coords),
+                    "Currently each exon MUST be in one pair of start / "
+                    "end coordinates else coverage values will be "
+                    "incorrect for those regions. Exiting now."
+                )
 
                 start = exon_cov.iloc[0]
                 end = exon_cov.iloc[-1]
@@ -280,11 +279,9 @@ def summary_stats(self, cov_stats, thresholds):
 
             # calculate gene coverage values
             min = gene_cov["min"].min()
-            mean = sum(
-                [x * y for x, y in zip(
-                    gene_cov["mean"], gene_cov["exon_frac"]
-                )]
-            )
+            mean = round(sum(
+                [x * y for x, y in zip(gene_cov["mean"], gene_cov["exon_frac"])]
+            ), 12)
             max = gene_cov["max"].max()
 
             # average coverage % at given thresholds, round to 12 dp to
@@ -468,11 +465,22 @@ def main():
         # slices, add each to pool with threshold values and
         # concatenates together when finished
         print("Generating per base exon stats")
+        try:
+            # map threshold arg to function since same is passed to all, then
+            # use imap to pass each df to worker to generate stats
+            stats_w_arg = partial(single.cov_stats, thresholds=thresholds)
+            pool_output = pool.imap_unordered(stats_w_arg, split_dfs)
+        except AssertionError:
+            # more than one region for each exon found => exit
+            pool.close()
+            pool.terminate()
+        else:
+            cov_stats = pd.concat(pool_output, ignore_index=True)
 
-        cov_stats = pd.concat(
-            pool.starmap(
-                single.cov_stats, map(lambda e: (e, thresholds), split_dfs)
-            ), ignore_index=True)
+            # imap_unordered() returns everything out of order (funnily enough)
+            # sort by gene and exon to be nicely formatted
+            cov_stats.exon = cov_stats.exon.astype(int)
+            cov_stats = cov_stats.sort_values(['gene', 'exon'])
 
     # split up output coverage stats df for multiprocessing
     split_stats_dfs = np.asanyarray(
@@ -489,6 +497,9 @@ def main():
                     lambda e: (e, thresholds), split_stats_dfs
                 )), ignore_index=True)
 
+    cov_summary = cov_summary.sort_values(['gene'])
+    cov_summary = cov_summary.drop(columns=["exon"])
+
     # write tables to output files
     single.write_outfiles(
         cov_stats, cov_summary, args.outfile, flagstat, build

diff --git a/bin/load_data.py b/bin/load_data.py
@@ -1,8 +1,9 @@
-from pathlib import Path
 import os
+import pandas as pd
+from pathlib import Path
 import sys
 
-import pandas as pd
+from version import VERSION
 
 
 class loadData():
@@ -11,8 +12,11 @@ def __init__(self):
             "chrom": str,
             "exon_start": 'Int64',
             "exon_end": 'Int64',
+            "start": 'Int64',
+            "end": 'Int64',
             "gene": str,
             "tx": str,
+            "transcript": str,
             "exon": 'Int64',
             "exon_len": 'Int64',
             "min": 'Int64',
@@ -36,6 +40,96 @@ def filter_dtypes(self, df):
         return {k: v for k, v in self.dtypes.items() if k in df.columns}
 
 
+    def read_panel_bed(self, bed_file):
+        """
+        Read in panel bed file to dataframe
+        Args: bed_file (file): file handler of bed file
+        Returns: panel_bed_df (df): df of panel bed file
+        """
+        print("reading panel bed file")
+        panel_bed = pd.read_csv(
+            bed_file, sep="\t", dtype=self.dtypes, names=[
+                "chrom", "start", "end", "transcript"
+            ]
+        )
+
+        # strip chr from chrom if present
+        panel_bed["chrom"] = panel_bed["chrom"].apply(
+            lambda x: str(x).replace("chr", "")
+        )
+
+        return panel_bed
+
+
+    def read_transcript_info(self, transcript_file):
+        """
+        Read in file that contains transcript -> gene symbol, exon no.
+        Args: transcript_file (file): file handler
+        Returns: transcript_info_df (df): df of transcript info
+        """
+        print("reading transcript information file")
+        transcript_info_df = pd.read_csv(
+            transcript_file, sep="\t", dtype=self.dtypes, names=[
+                "chrom", "start", "end", "gene", "transcript", "exon"
+            ]
+        )
+
+        # strip chr from chrom if present
+        transcript_info_df["chrom"] = transcript_info_df["chrom"].apply(
+            lambda x: str(x).replace("chr", "")
+        )
+
+        return transcript_info_df
+
+
+    def read_coverage_data(self, coverage_file, chunk_size=None):
+        """
+        Read in per-base coverage file (i.e. output of mosdepth)
+        Args:
+            - coverage_file (file): file handler
+            - chunk_size (int): this can be 50 million + line file, use chunks
+                to read in df and return an iterable to stop bedtools breaking
+        Returns: pb_coverage_df (df / list): df of coverage data / list of dfs
+                if chunk_size value passed
+        """
+        print("reading coverage file, this might take a while...")
+
+        if chunk_size:
+            # build list of dataframes split by given chunk size
+            pb_coverage_df = []
+
+            for df in pd.read_csv(
+                coverage_file, sep="\t", compression="infer",
+                dtype=self.dtypes,
+                chunksize=chunk_size, names=["chrom", "start", "end", "cov"]
+            ):
+                # strip chr prefix if present
+                df["chrom"] = df["chrom"].apply(
+                    lambda x: str(x).replace("chr", "")
+                )
+                # add to list of chunk dfs
+                pb_coverage_df.append(df)
+
+            print(
+                f"per-base coverage data read in to {len(pb_coverage_df)} "
+                f"of {chunk_size} rows"
+            )
+        else:
+            # read file into one df
+            pb_coverage_df = pd.read_csv(
+                coverage_file, sep="\t", compression="infer",
+                dtype=self.dtypes,
+                names=["chrom", "start", "end", "cov"]
+            )
+
+            # strip chr from chrom if present
+            pb_coverage_df["chrom"] = pb_coverage_df["chrom"].apply(
+                lambda x: str(x).replace("chr", "")
+            )
+
+        return pb_coverage_df
+
+
     def read_exon_stats(self, exon_stats):
         """
         Read exon stats file from coverage_single_stats into df
@@ -269,8 +363,7 @@ def get_snp_vcfs(snp_vcfs):
         if snp_vcfs:
             # get names of SNP vcfs used to display in report
             vcfs = ", ".join([Path(x).stem for x in snp_vcfs])
-            vcfs = "<br>VCF(s) of known variants included in report: <b>{}</b>\
-                </br>".format(vcfs)
+            vcfs = f"VCF(s) of known variants included in report: <b>{vcfs}</b>"
         else:
             vcfs = ""
 
@@ -280,25 +373,13 @@ def get_snp_vcfs(snp_vcfs):
     @staticmethod
     def get_athena_ver():
         """
-        Attempt to get version of Athena from dir name to display in
-        report footer, will only work for zip/tar
+        Return version of Athena to display in report footer
 
         Args: None
         Returns:
             - version (str): version of Athena
         """
-        bin_dir = os.path.dirname(os.path.abspath(__file__))
-
-        try:
-            path = str(os.path.join(bin_dir, "../")).split("/")
-            version = [s for s in path if "athena" in s][0].split("-")[1]
-            version = "({})".format(version)
-        except Exception:
-            print("Error getting version from dir name, continuing.")
-            version = ""
-            pass
-
-        return version
+        return VERSION
 
 
     @staticmethod

diff --git a/bin/version.py b/bin/version.py
@@ -0,0 +1 @@
+VERSION = '1.2.0'
diff --git a/data/example/Example_coverage_report.html b/data/example/Example_coverage_report.html
diff --git a/data/static/js/jquery-3.5.1.min.js b/data/static/js/jquery-3.5.1.min.js
diff --git a/data/templates/single_template.html b/data/templates/single_template.html