diff --git a/README.md b/README.md index 36311834..f8201a5a 100644 --- a/README.md +++ b/README.md @@ -13,7 +13,7 @@ Athena is a tool to generate coverage statistics for NGS data, and combine these Dependencies may be installed from the requirements.txt file using ```pip install -r requirements.txt```. This should contains all the required python packages required to generate coverage statistics and reports. -For optional calculating of variant coverage from VCFs, [BEDtools][bedtools-url] is also required to be installed. +In addition, [BEDtools][bedtools-url] is also required to be installed and on path. Tested on Ubuntu 18.04.4 and macOS 10.15.4 @@ -22,35 +22,52 @@ Tested on Ubuntu 18.04.4 and macOS 10.15.4 It is written to take in per base coverage data (as output from tools such as [mosdepth][mosdepth-url]) as input to calculate coverage for target regions defined in a bed file.

The general workflow for generating the statistics and report is as follows:
-- Annotate each region of the bed file with the gene, exon and per base coverage data using `annotate_bed.sh` +- Annotate each region of the bed file with the gene, exon and per base coverage data using `annotate_bed.py` - Generate per exon and per gene statistics using `coverage_stats_single.py` - Generate HTML coverage report with `coverage_report_single.py` For DNAnexus cloud platform users, an Athena [dx applet][dx-url] has also been built. +### Expected file formats + +As a minimum, Athena requires 3 input files. These are a bed file for the gene panel, a file of transcript information and the output of your coverage tool (mosdepth, samtools etc.). These files MUST have the following columns: + +- panel bed file: `chromosome start end transcript` +- transcript file: `chromosome start end gene transcript exon` +- coverage file: `chromosome start end coverage` + +n.b. the process for creating the transcript file may be found [here][transcript-file-url]. + ### Annotating BED file -The BED file containing regions of interest is first required to be annotated with gene, exon and coverage information prior to analysis. This may be done using [BEDtools intersect][bedtools-intersect-url], with a file containing transcript to gene and exon information, and then the per base coverage data.
+The BED file containing regions of interest is first required to be annotated with gene, exon and coverage information prior to analysis. This may be done using [BEDtools intersect][bedtools-intersect-url], with a file containing transcript to gene and exon information, and then the per base coverage data. Currently, 100% overlap is required between coordinates in the panel bed file and the transcript annotation file, therefore you must ensure any added flank regions etc. are the same.
-Included is a Bash script (`annotate_bed.sh`) to perform the required BED file annotation. +Included is a Python script (`annotate_bed.py`) to perform the required BED file annotation. Expected inputs: ``` --i : Input panel bed file; must have ONLY the following 4 columns chromosome, start position, end position, gene/transcript. --g : Exons nirvana file, contains required gene and exon information. --b : Per base coverage file (output from mosdepth or similar). +-p / --panel_bed : Input panel bed file; must have ONLY the following 4 columns chromosome, start position, end position, gene/transcript + +-t / --transcript_file : Transcript annotation file, contains required gene and exon information. Must have ONLY the following 6 columns: +chromosome, start, end, gene, transcript, exon + +-c / --coverage_file : Per base coverage file (output from mosdepth or similar) + +-s / -chunk_size : (optional) nrows to split per-base coverage file for intersecting, with <16GB RAM can lead to bedtools intersect failing. Reccomended values: 16GB RAM -> 20000000; 8GB RAM -> 10000000 + +-n / --output_name : (optional) Prefix for naming output file, if not given will use name from per base coverage file Example usage: -$ annotate_bed.sh -i panel_bed_file.bed -g exons_nirvana -b {input_file}.per_base.bed +$ annotate_bed.py -p panel_bed_file.bed -t transcript_file.tsv -c {input_file}.per_base.bed ```
This wraps the bedtools intersect commands below. These commands are given as an example, the output file column ordering must match that given in /data/example example_annotated_bed for calculating coverage statistics:
``` -$ bedtools intersect -a beds/sorted_bed_file.bed -b beds/exons_nirvana2010_no_PAR_Y_noflank.bed -wa -wb | awk 'OFS="\t" {if ($4 == $9) print}' | cut -f 1,2,3,8,9,10 > sample1_genes_exons.bed +$ bedtools intersect -a beds/sorted_bed_file.bed -b beds/exons_nirvana2010_no_PAR_Y_noflank.bed -wa -wb -f 1.0 -r | awk 'OFS="\t" {if ($4 == $9) print}' | cut -f 1,2,3,8,9,10 > sample1_genes_exons.bed - sorted_bed_file.bed -- bed file defining regions of interest (columns: chromosome, start, end, transcript) - exons_nirvana2010_no_PAR_Y.bed -- a bed file containing transcript -> exon and gene information @@ -108,13 +125,6 @@ $ python3 bin/coverage_report_single.py --gene_stats output/sample1-exon-coverag ``` -### For development - -Features to be developed: -- Generate run level statistics from multiple samples -- Generate run level report from multiple samples -- Add interactive elements to tables to increase useability (i.e sorting, filtering, searching) - Any bugs or suggestions for improvements please raise an issue. @@ -130,3 +140,4 @@ Any bugs or suggestions for improvements please raise an issue. [mosdepth-url]: https://github.com/brentp/mosdepth [dx-url]: https://github.com/eastgenomics/eggd_athena +[transcript-file-url]: https://cuhbioinformatics.atlassian.net/wiki/spaces/P/pages/2241101840/Generating+transcripts+file+for+Athena diff --git a/bin/annotate_bed.py b/bin/annotate_bed.py new file mode 100644 index 00000000..39f5af3a --- /dev/null +++ b/bin/annotate_bed.py @@ -0,0 +1,253 @@ +""" +Script to annotate a panel bed file with transcript information and per +base coverage data. + +Requires: bedtools + +Jethro Rainford +20/06/2021 +""" + +import argparse +import os +import pandas as pd +from pathlib import Path +import pybedtools as bedtools + +from load_data import loadData + + +class annotateBed(): + + def add_transcript_info(self, panel_bed, transcript_info_df): + """ + Use pybedtools to annotate panel bed file with coverage data + Args: + - panel_bed (df): panel bed file regions df + - transcript_info_df (df): transcript info file df + Returns: + - bed_w_transcript (df): panel bed file with transcript information + """ + print("calling bedtools to add transcript info") + + # get total number of transcripts before to ensure none are dropped + panel_transcripts = panel_bed.transcript.unique().tolist() + + # turn dfs into BedTools objects + bed = bedtools.BedTool.from_dataframe(panel_bed) + transcript_info = bedtools.BedTool.from_dataframe(transcript_info_df) + + # intersecting panel bed file with transcript/gene/exon information + # requires 100% overlap on panel -> transcript coordinates + bed_w_transcript = bed.intersect( + transcript_info, wa=True, wb=True, F=1.0 + ) + + # convert pybedtools object to df + bed_w_transcript = bed_w_transcript.to_dataframe(names=[ + "p_chrom", "p_start", "p_end", "p_transcript", + "t_chrom", "t_start", "t_end", "t_gene", "t_transcript", "t_exon" + ]) + + # check for empty file + assert len(bed_w_transcript.index) > 0, """Empty file returned from + intersecting panel bed and transcript file. Check if flanks are + being used as 100% coordinate overlap is currently required.""" + + # panel bed file defines transcript to use, filter transcript file for + # just those transcripts + bed_w_transcript = bed_w_transcript[ + bed_w_transcript["p_transcript"] == bed_w_transcript["t_transcript"] + ] + + # drop duplicate columns + bed_w_transcript = bed_w_transcript.drop(columns=[ + 't_chrom', 't_start', 't_end', 'p_transcript' + ]) + + intersect_transcripts = bed_w_transcript.t_transcript.unique().tolist() + + # ensure no transcripts dropped from panel due to missing from + # transcripts file + assert len(panel_transcripts) == len(intersect_transcripts), ( + f"Transcript(s) dropped from panel during intersecting with " + f"transcript file. Total before {len(panel_transcripts)}. Total " + f"after {len(intersect_transcripts)}. Dropped transcripts: " + f"{set(panel_transcripts) - set(intersect_transcripts)}" + ) + + return bed_w_transcript + + + def add_coverage(self, bed_w_transcript, coverage_df, chunks=False): + """ + Use pybedtools to add coverage bin data to selected panel regions + Args: + - bed_w_transcript (df): panel bed file with transcript information + - coverage_df (df / list): coverage bin data df / list of dfs if + chunks value passed + Returns: + - bed_w_coverage (df): panel bed with transcript and coverage info + """ + print("calling bedtools to add coverage info") + + # turn dfs into BedTools objects + bed_w_transcript = bedtools.BedTool.from_dataframe(bed_w_transcript) + + col_names = [ + "t_chrom", "t_start", "t_end", "t_gene", "t_transcript", "t_exon", + "c_chrom", "cov_start", "cov_end", "cov" + ] + + if not chunks: + # per-base coverage all in one df + coverage_df = bedtools.BedTool.from_dataframe(coverage_df) + + bed_w_coverage = bed_w_transcript.intersect( + coverage_df, wa=True, wb=True + ) + bed_w_coverage = bed_w_coverage.to_dataframe(names=col_names) + else: + # coverage data in chunks, loop over each df and intersect + bed_w_coverage = pd.DataFrame(columns=col_names) + + for num, df in enumerate(coverage_df): + print(f"intersecting {num + 1}/{len(coverage_df)} coverage chunks") + # read each to bedtools object, intersect and add back to df + chunk_df = bedtools.BedTool.from_dataframe(df) + + bed_w_coverage_chunk = bed_w_transcript.intersect( + chunk_df, wa=True, wb=True + ) + + bed_w_coverage_chunk = bed_w_coverage_chunk.to_dataframe( + names=col_names + ) + + bed_w_coverage = pd.concat( + [bed_w_coverage, bed_w_coverage_chunk], + ignore_index=True + ) + + # check again for empty output of bedtools, can happen due to memory + # maxing out and doesn't seem to raise an exception... + assert len(bed_w_coverage) > 0, """Error intersecting with coverage + data, empty file generated. Is this the correct coverage data for + the panel used? bedtools may also have reached memory limit and + died, try re-running with --chunk_size 1000000""" + + # drop duplicate chromosome col and rename + bed_w_coverage.drop(columns=["c_chrom"], inplace=True) + + bed_w_coverage.columns = [ + "chrom", "exon_start", "exon_end", "gene", "tx", "exon", + "cov_start", "cov_end", "cov" + ] + + return bed_w_coverage + + +def write_file(bed_w_coverage, outfile): + """ + Write annotated bed to file + Args: + - bed_w_coverage (df): bed file with transcript and coverage info + - output_prefix (str): prefix for naming output file + Outputs: annotated_bed.tsv + """ + # tiny function but want this separate for writing a wrapper script later + bed_w_coverage.to_csv(outfile, sep="\t", header=False, index=False) + print(f"annotated bed file written to {outfile}") + + +def parse_args(): + """ + Parse cmd line arguments + + Args: None + + Returns: + - args (arguments): args passed from cmd line + """ + parser = argparse.ArgumentParser( + description='Annotate panel bed file with transcript & coverage data.' + ) + parser.add_argument( + '--panel_bed', '-p', + help='panel bed file' + ) + parser.add_argument( + '--transcript_file', '-t', + help='file with gene and exon information' + ) + parser.add_argument( + '--coverage_file', '-c', + help='per base coverage data file' + ) + parser.add_argument( + '--chunk_size', '-s', type=int, + help='number lines to read per-base coverage file in one go' + ) + parser.add_argument( + '--output_name', '-n', + help='name preifx for output file, if none will use coverage file' + ) + + args = parser.parse_args() + + return args + + +def main(): + annotate = annotateBed() + load = loadData() # class of functions for reading in data + + args = parse_args() + + if not args.output_name: + # output name not defined, use sample identifier from coverage file + args.output_name = Path(args.coverage_file).name.split('_')[0] + + # set dir for writing to + bin_dir = os.path.dirname(os.path.abspath(__file__)) + out_dir = os.path.join(bin_dir, "../output/") + outfile_name = f"{args.output_name}_annotated.bed" + outfile = os.path.join(out_dir, outfile_name) + + # read in files + panel_bed_df = load.read_panel_bed(args.panel_bed) + transcript_info_df = load.read_transcript_info(args.transcript_file) + pb_coverage_df = load.read_coverage_data( + args.coverage_file, args.chunk_size + ) + + # add transcript info + bed_w_transcript = annotate.add_transcript_info( + panel_bed_df, transcript_info_df + ) + + # add coverage + if args.chunk_size: + # per-base coverage split to multiple dfs to limit memory usage + bed_w_coverage = annotate.add_coverage( + bed_w_transcript, pb_coverage_df, chunks=True + ) + else: + bed_w_coverage = annotate.add_coverage( + bed_w_transcript, pb_coverage_df, chunks=False + ) + + # sense check generated file isn't empty, should be caught earlier + assert len(bed_w_coverage.index) > 0, ( + 'An error has occured: annotated bed file is empty. This is likely ', + 'due to an error in regions defined in bed file (i.e. different ', + 'transcripts to those in the transcripts file). Start debugging by ', + 'intersecting files manually...' + ) + + write_file(bed_w_coverage, outfile) + + +if __name__ == "__main__": + + main() diff --git a/bin/annotate_bed.sh b/bin/annotate_bed.sh deleted file mode 100755 index 08fdb160..00000000 --- a/bin/annotate_bed.sh +++ /dev/null @@ -1,71 +0,0 @@ -#!/bin/bash - -input_bed="" -gene_file="" -bp_coverage="" - -Help() -{ - # Display Help - echo " - This script may be used to perform bedtools intersect commands to generate the required annotated bed file - for generating single sample coverage statistics." - echo "" - echo "Usage:" - echo "" - echo "-i Input panel bed file; must have columns chromosome, start position, end position, transcript." - echo "-g Exons nirvana file, contains required gene and exon information." - echo "-b Per base coverage file (output from mosdepth or similar)." - echo "-h Print this Help." - echo "" -} - -# display help message on -h -while getopts ":i:g:b:h" option; do - case $option in - i) input_bed="$OPTARG" - ;; - g) gene_file="$OPTARG" - ;; - b) bp_coverage="$OPTARG" - ;; - h) # display Help - Help - exit 1 - ;; - \?) # incorrect option - echo "Error: Invalid option, please see usage below." - Help - exit - ;; - \*) # incorrect option - ;; - esac -done - -# check for missing args -if [ -z $input_bed ] || - [ -z $gene_file ] || - [ -z $bp_coverage ]; then - - echo "Error: Missing arguments, please see usage below." - Help - exit 0 -fi - -# check for empty and non existent input files -for file in $input_bed $gene_file $bp_coverage; do - [ ! -s $file ] && echo "$file does not exist or is empty. Exiting." && exit; -done - -# name outfile from mosdepth coverage file -outfile=$(basename $bp_coverage) -outfile=${outfile/.per-base.bed.gz/_annotated.bed} - -# add gene and exon annotation to panel bed file from exons nirvana tsv -bedtools intersect -f 1.0 -r -a $input_bed -b $gene_file -wa -wb | awk 'OFS="\t" {if ($4 == $9) print}' | cut -f 1,2,3,8,9,10 > ${tmp}.txt - -# add coverage annotation from per base coverage bed file -bedtools intersect -wa -wb -a $tmp.txt -b $bp_coverage | cut -f 1,2,3,4,5,6,8,9,10 > $outfile - -echo "Done. Output file: " $outfile diff --git a/bin/coverage_report_single.py b/bin/coverage_report_single.py index 463a9f10..fccd95d6 100644 --- a/bin/coverage_report_single.py +++ b/bin/coverage_report_single.py @@ -50,15 +50,15 @@ def img2str(self, plt): - img (str): HTML formatted string of plot """ buffer = BytesIO() - plt.savefig(buffer, format='png') + plt.savefig(buffer, format='png', dpi=65, transparent=True) buffer.seek(0) image_png = buffer.getvalue() buffer.close() graphic = base64.b64encode(image_png) data_uri = graphic.decode('utf-8') - img_tag = "".format( - data_uri + img_tag = ( + f"" ) return img_tag @@ -66,36 +66,27 @@ def img2str(self, plt): def low_exon_plot(self, low_raw_cov): """ - Plot bp coverage of exon, used for those where coverage is given - threshold + Generate array of low exon plot values to pass into report Args: - low_raw_cov (df): df of raw coverage for exons with low coverage - - threshold (int): defined threshold level (default: 20) Returns: - - fig (figure): plots of low coverage regions + - low_exon_plots (str): list of plot values in div tags """ - print("Generating plots of low covered regions") + if len(low_raw_cov.index) == 0: + # empty df passed, likely from difference in total cores and plots + return # get list of tuples of genes and exons to define plots genes = low_raw_cov.drop_duplicates( ["gene", "exon"])[["gene", "exon"]].values.tolist() genes = [tuple(exon) for exon in genes] - if len(genes) == 0: - # everything above threshold, don't generate plots - fig = "

All regions in panel above threshold, no plots\ - to show.

" - - return fig - # sort list of genes/exons by gene and exon genes = sorted(genes, key=lambda element: (element[0], element[1])) - plot_titles = [str(x[0]) + " exon: " + str(int(x[1])) for x in genes] - low_raw_cov["exon_len"] =\ low_raw_cov["exon_end"] - low_raw_cov["exon_start"] @@ -103,31 +94,9 @@ def low_exon_plot(self, low_raw_cov): low_raw_cov["cov_end"] + low_raw_cov["cov_start"]) / 2 )) - # set no. rows to no. of plots / no of columns to define grid - columns = 4 - rows = math.ceil(len(genes) / 4) - - # variable height depeendent on no. of plots - v_space = (1 / rows) * 0.25 - - # define grid to add plots to - fig = make_subplots( - rows=rows, cols=columns, print_grid=False, - horizontal_spacing=0.04, vertical_spacing=v_space, - subplot_titles=plot_titles - ) - - # counter for grid - row_no = 1 - col_no = 1 + low_exon_plots = [] # array to add string data of plots to for gene in genes: - # make plot for each gene / exon - if row_no // 5 == 1: - # counter for grid, gets to 5th entry & starts new row - col_no += 1 - row_no = 1 - # get rows for current gene and exon exon_cov = low_raw_cov.loc[( low_raw_cov["gene"] == gene[0] @@ -165,59 +134,23 @@ def low_exon_plot(self, low_raw_cov): pos_row, ignore_index=True ) - # build list of first and last point for threshold line - xval = [x for x in range( - exon_cov_unbinned["cov_start"].iloc[0], - exon_cov_unbinned["cov_end"].iloc[-1] - )] - xval = xval[::len(xval) - 1] - yval = [self.threshold] * 2 - - # info field for hovering on plot line - label = 'position: %{x}
coverage: %{y}' - - # generate plot and threshold line to display - if sum(exon_cov_unbinned["cov"]) != 0: - plot = go.Scatter( - x=exon_cov_unbinned["cov_start"], y=exon_cov_unbinned["cov"], - mode="lines", - hovertemplate=label - ) - else: - # if any plots have no coverage, just display empty plot - # very hacky way by making data point transparent but - # ¯\_(ツ)_/¯ - plot = go.Scatter( - x=exon_cov_unbinned["cov_start"], y=exon_cov_unbinned["cov"], - mode="markers", marker={"opacity": 0} - ) - - threshold_line = go.Scatter( - x=xval, y=yval, hoverinfo='skip', mode="lines", - line=dict(color='rgb(205, 12, 24)', width=1) - ) + if sum(exon_cov_unbinned["cov"]) == 0: + continue - # add to subplot grid - fig.add_trace(plot, col_no, row_no) - fig.add_trace(threshold_line, col_no, row_no) + # build div str of plot data to pass to template + x_vals = str(exon_cov_unbinned['cov_start'].tolist()).strip('[]') + y_vals = str(exon_cov_unbinned['cov'].tolist()).strip('[]') + title = f"{gene[0]} exon {gene[1]}" - row_no = row_no + 1 + gene_data = ( + f"""'
{title},{x_vals},{y_vals}
'""" + ) - # set height of grid by no. rows and scale value of 325 - height = (rows * 300) + 150 + low_exon_plots.append(gene_data) - # update plot formatting - fig.update_xaxes(nticks=3, ticks="", showgrid=True, tickformat=',d') - fig.update_yaxes(title='coverage', title_standoff=0) - fig.update_xaxes(title='exon position', color='#FFFFFF') - fig["layout"].update( - height=height, showlegend=False, margin=dict(l=50, r=0) - ) + low_exon_plots = ','.join(low_exon_plots) - # write plots to html string - fig = fig.to_html(full_html=False) - - return fig + return low_exon_plots def all_gene_plots(self, raw_coverage): @@ -230,7 +163,7 @@ def all_gene_plots(self, raw_coverage): - threshold (int): defined threshold level (default: 20) Returns: - - all-plots (figure): grid of all full gene- exon plots + - all-plots (str): str of lists of all plots with gene symbol """ all_plots = "" @@ -253,7 +186,7 @@ def all_gene_plots(self, raw_coverage): # make subplot grid size of no. of exons, height variable # splits large genes to several rows and maintains height - height = math.ceil(len(exons) / 30) * 4 + height = math.ceil(len(exons) / 30) * 4.5 fig = plt.figure(figsize=(30, height)) # generate grid with space for each exon @@ -272,7 +205,7 @@ def all_gene_plots(self, raw_coverage): axs = axs.flatten() - fig.suptitle(gene, fontweight="bold") + fig.suptitle(gene, fontweight="bold", fontsize=14) count = 0 for exon in exons: @@ -311,7 +244,7 @@ def all_gene_plots(self, raw_coverage): # threshold line axs[count].plot( [0, 100], [self.threshold_val, self.threshold_val], - color='red', linestyle='-', linewidth=2 + color='red', linestyle='-', linewidth=2, rasterized=True ) else: axs[count].plot( @@ -323,16 +256,21 @@ def all_gene_plots(self, raw_coverage): axs[count].plot( [exon_cov["exon_start"], exon_cov["exon_end"]], [self.threshold_val, self.threshold_val], color='red', - linestyle='-', linewidth=1 + linestyle='-', linewidth=1, rasterized=True ) # add labels xlab = str( exon_cov["exon_end"].iloc[0] - exon_cov["exon_start"].iloc[0] - ) + "\nbp" + ) + " bp" + + if len(exons) > 20: + # drop bp to new line for better spacing + xlab = xlab.replace("bp", "\nbp") + axs[count].title.set_text(exon) - axs[count].set_xlabel(xlab) + axs[count].set_xlabel(xlab, fontsize=13) count += 1 @@ -340,6 +278,7 @@ def all_gene_plots(self, raw_coverage): for i in range(column_no * rows): if i in [x * column_no for x in range(rows)]: # first plot of line, keep ticks and labels + axs[i].tick_params(axis='y', labelsize=12) continue else: axs[i].yaxis.set_ticks_position('none') @@ -352,11 +291,15 @@ def all_gene_plots(self, raw_coverage): plt.ylim(bottom=0, top=ymax) # remove outer white margins - fig.tight_layout(h_pad=1.2) + fig.tight_layout(h_pad=1.4) # convert plot png to html string and append to one string img = self.img2str(plt) - all_plots += img + "

" + + # add img to str list with gene symbol for filtering in table + # expects to be a string of lists to write in report + img_str = f'["{gene}", "{img}" ], ' + all_plots += img_str plt.close(fig) @@ -404,6 +347,7 @@ def summary_gene_plot(self, cov_summary): genes100pct = len(summary_data.iloc[:-100]) summary_data = summary_data.iloc[-100:] + # generate the plot plt.bar( summary_data["gene"], [int(x) for x in summary_data[self.threshold]], @@ -411,8 +355,8 @@ def summary_gene_plot(self, cov_summary): ) if genes100pct is not None: - genes100pct = str(genes100pct) # more than 100 genes, add title inc. 100% covered not shown + genes100pct = str(genes100pct) axs.set_title( r"$\bf{" + genes100pct + "}$" + " genes covered 100% at " + r"$\bf{" + self.threshold + "}$" + @@ -427,7 +371,7 @@ def summary_gene_plot(self, cov_summary): plt.text(1.005, 0.91, '95%', transform=axs.transAxes) # plot formatting - axs.tick_params(labelsize=6, length=0) + axs.tick_params(labelsize=8, length=0) plt.xticks(rotation=55, color="#565656") # adjust whole plot margins @@ -446,13 +390,34 @@ def summary_gene_plot(self, cov_summary): plt.legend( handles=[green, orange, red], loc='upper center', - bbox_to_anchor=(0.5, -0.1), - fancybox=True, shadow=True, ncol=12, fontsize=12 + bbox_to_anchor=(0.5, -0.18), + fancybox=True, shadow=True, ncol=12, fontsize=14 ) vals = np.arange(0, 110, 10).tolist() plt.yticks(vals, vals) - axs.tick_params(axis='both', which='major', labelsize=8) + + if len(summary_data.index) > 250: + # x axis label overlap on lots of genes => skip every third + # shouldn't be a huge amount more than this to stop overlap + axs.set_xticks(axs.get_xticks()[::3]) + axs.tick_params(axis='both', which='major', labelsize=10) + plt.figtext( + 0.5, 0.1, + "Some gene labels are not shown due to high number of genes", + ha="center", fontsize=12 + ) + elif len(summary_data.index) > 125: + # x axis label overlap on lots of genes => skip every second + axs.set_xticks(axs.get_xticks()[::2]) + axs.tick_params(axis='both', which='major', labelsize=10) + plt.figtext( + 0.5, 0.1, + "Some gene labels are not shown due to high number of genes", + ha="center", fontsize=12 + ) + else: + axs.tick_params(axis='both', which='major', labelsize=10) plt.xlabel("") plt.ylabel("% coverage ({})".format(self.threshold), fontsize=11) @@ -471,7 +436,10 @@ def summary_gene_plot(self, cov_summary): class styleTables(): """Functions for styling tables for displaying in report""" - def __init__(self, cov_stats, cov_summary, threshold, threshold_cols, vals): + def __init__( + self, cov_stats: pd.DataFrame, cov_summary: pd.DataFrame, + threshold: str, threshold_cols: list, vals: list + ) -> None: self.cov_stats = cov_stats self.cov_summary = cov_summary self.threshold = threshold @@ -496,13 +464,9 @@ def style_sub_threshold(self): """ Styling of sub threshold stats df for displaying in report - Args: - - cov_stats (df): df of per exon coverage stats - - threshold (str): low coverage threshold value - - threshold_cols (list): threshold values for coverage Returns: - - sub_threshold_stats (str): HTML formatted str of cov stats - table + - sub_threshold_stats (list): list of sub threshold coverage stats + - low_exon_columns (list): list of column headers for report - gene_issues (int): total number of genes under threshold - exon_issues (int): total number of exons under threshold """ @@ -510,22 +474,26 @@ def style_sub_threshold(self): "gene", "tx", "chrom", "exon", "exon_len", "exon_start", "exon_end", "min", "mean", "max" ] - column.extend(self.threshold_cols) + dtypes = { + 'gene': str, 'tx': str, 'chrom': str, 'exon': int, 'exon_len': int, + 'exon_start': int, 'exon_end': int, 'min': int, 'mean': float, + 'max': int + } + + for col in self.threshold_cols: + # add dtype for threshold columns + dtypes[col] = float + sub_threshold = pd.DataFrame(columns=column) + sub_threshold.astype(dtype=dtypes) # get all exons with <100% coverage at threshold for i, row in self.cov_stats.iterrows(): if int(row[self.threshold]) < 100: sub_threshold = sub_threshold.append(row, ignore_index=True) - # pandas is terrible and forces floats, change back to int - dtypes = { - 'chrom': str, 'exon': int, 'exon_len': int, 'exon_start': int, - 'exon_end': int, 'min': int, 'max': int - } - if not sub_threshold.empty: # some low covered regions identified sub_threshold = sub_threshold.astype(dtypes) @@ -538,6 +506,9 @@ def style_sub_threshold(self): # reset index to fix formatting sub_threshold_stats = sub_threshold_stats.reindex(self.vals, axis=1) sub_threshold_stats.reset_index(inplace=True) + sub_threshold_stats.index = np.arange( + 1, len(sub_threshold_stats.index) + 1 + ) gene_issues = len(list(set(sub_threshold_stats["gene"].tolist()))) exon_issues = len(sub_threshold_stats["exon"]) @@ -553,64 +524,29 @@ def style_sub_threshold(self): columns=self.column_names ) - # reindex & set to begin at 1 - sub_threshold_stats.index = np.arange( - 1, len(sub_threshold_stats.index) + 1 - ) + # add index as column to have numbered rows in report + sub_threshold_stats.insert(0, ' ', sub_threshold_stats.index) - # create slices of sub_threshold stats df to add styling to - slice_ranges = { - "x0": (10, 0), "x10": (30, 10), "x30": (50, 30), "x50": (70, 50), - "x70": (90, 70), "x90": (95, 90), "x95": (99, 95), "x99": (101, 99) - } + # threshold_cols -> list of strings, add mean to loop over for rounding + round_cols = ['Mean'] + self.threshold_cols - sub_slice = {} + # limit to 2dp using math.floor, use of round() with + # 2dp may lead to inaccuracy such as 99.99 => 100.00 + for col in round_cols: + sub_threshold_stats[col] = sub_threshold_stats[col].map( + lambda col: math.floor(col * 100) / 100 + ) - for key, val in slice_ranges.items(): - sub_slice[key] = pd.IndexSlice[sub_threshold_stats.loc[( - sub_threshold_stats[self.threshold] < val[0] - ) & ( - sub_threshold_stats[ - self.threshold] >= val[1])].index, self.threshold] - - # df column index of threshold - col_idx = sub_threshold_stats.columns.get_loc(self.threshold) - - # make dict for rounding coverage columns to 2dp - rnd = {} - for col in list(sub_threshold_stats.columns[10:]): - rnd[col] = '{0:.2f}%' - - # set threshold column widths as a fraction of 40% table width - t_width = str(40 / len(self.threshold_cols)) + "%" - - # apply colours to coverage cell based on value, 0 is given solid red - s = sub_threshold_stats.style.apply(lambda x: [ - "background-color: #b30000" if x[self.threshold] == 0 and - idx == col_idx else "" for idx, v in enumerate(x) - ], axis=1)\ - .bar(subset=sub_slice["x0"], color='#b30000', vmin=0, vmax=100)\ - .bar(subset=sub_slice["x10"], color='#990000', vmin=0, vmax=100)\ - .bar(subset=sub_slice["x30"], color='#C82538', vmin=0, vmax=100)\ - .bar(subset=sub_slice["x50"], color='#FF4500', vmin=0, vmax=100)\ - .bar(subset=sub_slice["x70"], color='#FF4500', vmin=0, vmax=100)\ - .bar(subset=sub_slice["x90"], color='#FF4500', vmin=0, vmax=100)\ - .bar(subset=sub_slice["x95"], color='#FFBF00', vmin=0, vmax=100)\ - .bar(subset=sub_slice["x99"], color='#007600', vmin=0, vmax=100)\ - .format(rnd)\ - .set_table_attributes('table border="1"\ - class="dataframe table table-hover table-bordered"')\ - .set_uuid("low_exon_table")\ - .set_properties(**{'font-size': '0.85vw', 'table-layout': 'auto'})\ - .set_properties(subset=self.threshold_cols, **{'width': t_width})\ - - sub_threshold_stats["Mean"] = sub_threshold_stats["Mean"].apply( - lambda x: int(x) - ) + # generate list of dicts with column headers for styling + low_exon_columns = [] - sub_threshold_stats = s.render() + for col in sub_threshold_stats.columns: + low_exon_columns.append({'title': col}) - return sub_threshold_stats, gene_issues, exon_issues + # convert df to list of lists to pass to report + sub_threshold_stats = sub_threshold_stats.values.tolist() + + return sub_threshold_stats, low_exon_columns, gene_issues, exon_issues def style_total_stats(self): @@ -635,6 +571,7 @@ def style_total_stats(self): total_stats = total_stats.reindex(self.vals, axis=1) total_stats.reset_index(inplace=True) total_stats.index = np.arange(1, len(total_stats.index) + 1) + total_stats.insert(0, 'index', total_stats.index) total_stats = total_stats.rename(columns=self.column_names) @@ -646,15 +583,9 @@ def style_total_stats(self): total_stats[col] = total_stats[col].map( lambda col: math.floor(col * 100) / 100 ) - # CSS table class for styling tables - style = ( - '', - '
' - ) - total_stats = total_stats.to_html(justify='left').replace( - style[0], style[1] - ) + # turn gene stats table into list of lists + total_stats = total_stats.values.tolist() return total_stats @@ -666,74 +597,42 @@ def style_cov_summary(self): - cov_summary (df): df of gene coverage stats - threshold_cols (list): list of threshold values Returns: - - gene_stats (str): HTML formatted str of gene summary df + - gene_stats (list): HTML formatted str of gene summary df - total_genes (int): total number of genes """ # rename columns for displaying in report - cov_summary = self.cov_summary.drop(columns=["exon"]) - cov_summary = cov_summary.rename(columns=self.column_names) + gene_stats = self.cov_summary.copy() + gene_stats = gene_stats.rename(columns=self.column_names) # get values to display in report - total_genes = len(cov_summary["Gene"].tolist()) + total_genes = len(gene_stats["Gene"].tolist()) # limit to 2dp using math.floor, use of round() with # 2dp may lead to inaccuracy such as 99.99 => 100.00 round_cols = ['Mean'] + self.threshold_cols for col in round_cols: - cov_summary[col] = cov_summary[col].map( + gene_stats[col] = gene_stats[col].map( lambda col: math.floor(col * 100) / 100 ) # reset index to start at 1 - cov_summary.index = np.arange(1, len(cov_summary.index) + 1) + gene_stats.index = np.arange(1, len(gene_stats.index) + 1) + gene_stats.insert(0, 'index', gene_stats.index) - # CSS table class for styling tables - style = ( - '
', - '
' - ) - - # generate HTML strings from table objects to write to report - gene_stats = cov_summary.to_html(justify='left').replace( - style[0], style[1] - ) + # turn gene stats table into list of lists + gene_stats = gene_stats.values.tolist() return gene_stats, total_genes - def check_snp_tables(self, snps_low_cov, snps_high_cov): - """ - Check styled tables, if empty replace with appropriate text - - Args: - - snps_low_cov (str): HTML styled table of low covered snps - - snps_high_cov (str): HTML styled table of covered snps - - Returns: - - snps_low_cov (str): HTML styled table of low covered snps - - snps_high_cov (str): HTML styled table of covered snps - """ - if snps_low_cov is None: - snps_low_cov = "No variants with coverage < {}".format( - self.threshold - ) - - if snps_high_cov is None: - snps_high_cov = "No variants covered at {}/b>".format( - self.threshold - ) - - return snps_low_cov, snps_high_cov - - def style_snps_cov(self, snps_cov): """ Add styling to tables of SNPs covered above / beneath threshold Args: - snps_cov (df): df of snps above / below threshold Returns: - - snps_cov (str): HTML formatted str of snps df + - snps_cov (list): list of snps to render in report - total_snps (int): total number of snps in df """ if not snps_cov.empty: @@ -746,27 +645,16 @@ def style_snps_cov(self, snps_cov): uuid = "var_low_cov" snps_cov.index = np.arange(1, len(snps_cov.index) + 1) - total_snps = len(snps_cov.index) - snps_cov = snps_cov.style\ - .set_table_attributes( - 'class="dataframe table table-striped"')\ - .set_uuid(uuid)\ - .set_properties(**{ - 'font-size': '0.80vw', 'table-layout': 'auto' - })\ - .set_properties(subset=["VCF", "Gene"], **{'width': '10%'})\ - .set_properties(subset=["Exon"], **{'width': '7.5%'})\ - .set_properties(subset=["Chromosome"], **{'width': '10%'})\ - .set_properties(subset=["Position"], **{'width': '12.5%'})\ - .set_properties(subset=["Ref"], **{'width': '20%'})\ - .set_properties(subset=["Alt"], **{'width': '20%'})\ - .set_properties(subset=["Coverage"], **{'width': '10%'}) - - snps_cov = snps_cov.render() + # reset index to start at 1 + snps_cov.index = np.arange(1, len(snps_cov.index) + 1) + snps_cov.insert(0, 'index', snps_cov.index) + + # turn gene stats table into list of lists + snps_cov = snps_cov.values.tolist() else: - snps_cov = None + snps_cov = [] total_snps = 0 return snps_cov, total_snps @@ -779,52 +667,22 @@ def style_snps_no_cov(snps_no_cov): Args: - snps_no_cov (df): df of snps with no coverage values Returns: - - snps_no_cov (str): HTML formatted str of snps with no cov + - snps_no_cov (list): list of snps for rendering in report - snps_out_panel (int): total number snps with no cov """ # if variants from vcf found that span exon boundaries if not snps_no_cov.empty: - # manually add div and styling around rendered table, allows - # to be fully absent from the report if the table is empty - snps_no_cov.index = np.arange(1, len(snps_no_cov) + 1) - # get number of variants to display in report snps_out_panel = len(snps_no_cov.index) - html_string = snps_no_cov.style\ - .set_table_attributes( - 'class="dataframe table table-striped"')\ - .set_uuid("var_no_cov")\ - .set_properties(**{ - 'font-size': '0.80vw', 'table-layout': 'auto' - })\ - .set_properties(subset=["VCF"], **{ - 'width': '7.5%' - })\ - .set_properties(subset=[ - "Chromosome", "Position", "Ref", "Alt" - ], **{'width': '10%'}) - - html_string = html_string.render() - - snps_no_cov = """ -
Variants included in the first table below either fully\ - or partially span panel region(s). These are most likely\ - large structural variants and as such do not have\ - coverage data available. See the "info" column for details\ - on the variant. -
-
Table of variants spanning panel regions(s)   - -
-
- {} -
-
- """.format(html_string) + # reset index to start at 1 + snps_no_cov.index = np.arange(1, len(snps_no_cov.index) + 1) + snps_no_cov.insert(0, 'index', snps_no_cov.index) + + # turn gene stats table into list of lists + snps_no_cov = snps_no_cov.values.tolist() else: - snps_no_cov = "" + snps_no_cov = [] snps_out_panel = 0 return snps_no_cov, snps_out_panel @@ -1073,8 +931,8 @@ def write_summary(self, cov_summary, threshold, panel_pct_coverage): summary_text = """
  • Clinical report summary:
  • -
    """ sub90 = "" @@ -1087,7 +945,10 @@ def write_summary(self, cov_summary, threshold, panel_pct_coverage): if gene[self.threshold] < 90: # build string of genes with <90% coverage at threshold - sub90 += "{} ({}); ".format(gene["gene"], gene["tx"]) + sub90 += ( + f'{gene["gene"]} ({gene["tx"]}) ' + f'{math.floor(gene[self.threshold] * 100)/100.0}%; ' + ) summary_text = summary_text.strip(" ;") + "." @@ -1097,20 +958,15 @@ def write_summary(self, cov_summary, threshold, panel_pct_coverage): else: sub90 = sub90.strip(" ;") + "." - summary_text += """ -

    Genes with coverage at {} less than 90%: - {}""".format(self.threshold, sub90) - - summary_text += """ -

    {} % of this panel was sequenced to a depth of {} or - greater.
    """.format(pct_cov, self.threshold) + summary_text += ( + f"

    Genes with coverage at {self.threshold} " + f"less than 90%: {sub90}" + ) - # add closing div and copy button for summary text - summary_text += """
    -
    """ + summary_text += ( + f"

    {pct_cov} % of this panel was sequenced to a depth " + f"of {self.threshold} or greater.
    " + ) return summary_text @@ -1153,13 +1009,25 @@ def generate_report(self, cov_stats, cov_summary, snps_low_cov, vals = ["min", "mean", "max"] vals.extend(threshold_cols) + # generate html formatted list of table headings for tables + gene_table_headings = [" ", "Gene", "Transcript"] + vals + exon_table_headings = gene_table_headings.copy() + exon_table_headings[3:3] = ['Chr', 'Exon', 'Length', 'Start', 'End'] + + gene_table_headings = "\n".join( + [f"{x.capitalize()}" for x in gene_table_headings] + ) + exon_table_headings = "\n".join( + [f"{x.capitalize()}" for x in exon_table_headings] + ) + styling = styleTables( cov_stats, cov_summary, self.threshold, threshold_cols, vals ) calculate = calculateValues(self.threshold) # apply styling to tables for displaying in report - sub_threshold_stats, gene_issues,\ + sub_threshold_stats, low_exon_columns, gene_issues,\ exon_issues = styling.style_sub_threshold() total_stats = styling.style_total_stats() @@ -1168,9 +1036,6 @@ def generate_report(self, cov_stats, cov_summary, snps_low_cov, snps_low_cov, snps_not_covered = styling.style_snps_cov(snps_low_cov) snps_high_cov, snps_covered = styling.style_snps_cov(snps_high_cov) snps_no_cov, snps_out_panel = styling.style_snps_no_cov(snps_no_cov) - snps_low_cov, snps_high_cov = styling.check_snp_tables( - snps_low_cov, snps_high_cov - ) # get values to display in report fully_covered_genes = total_genes - gene_issues @@ -1188,6 +1053,8 @@ def generate_report(self, cov_stats, cov_summary, snps_low_cov, report_vals["fully_covered_genes"] = str(fully_covered_genes) report_vals["gene_issues"] = str(gene_issues) report_vals["threshold"] = self.threshold + report_vals["gene_table_headings"] = gene_table_headings + report_vals["exon_table_headings"] = exon_table_headings report_vals["exon_issues"] = str(exon_issues) report_vals["build"] = build report_vals["panel"] = panel @@ -1205,8 +1072,8 @@ def generate_report(self, cov_stats, cov_summary, snps_low_cov, # add tables & plots to template html_string = self.build_report( html_template, total_stats, gene_stats, sub_threshold_stats, - snps_low_cov, snps_high_cov, snps_no_cov, fig, all_plots, - summary_plot, report_vals, bootstrap + low_exon_columns, snps_low_cov, snps_high_cov, snps_no_cov, fig, + all_plots, summary_plot, report_vals, bootstrap ) # write output report to file @@ -1214,7 +1081,7 @@ def generate_report(self, cov_stats, cov_summary, snps_low_cov, def build_report(self, html_template, total_stats, gene_stats, - sub_threshold_stats, snps_low_cov, snps_high_cov, + sub_threshold_stats, low_exon_columns, snps_low_cov, snps_high_cov, snps_no_cov, fig, all_plots, summary_plot, report_vals, bootstrap ): @@ -1230,7 +1097,7 @@ def build_report(self, html_template, total_stats, gene_stats, - snsp_high_cov (df): table of snps with cov > threshold - snps_no_cov (df): variants that span exon boundaries (i.e SVs) - fig (figure): grid of low coverage exon plots (plotly) - - all-plots (figure): grid of all full gene- exon plots + - all-plots (figure): grid of all full gene-exon plots - summary_plot (figure): gene summary plot - % at threshold - report_vals (dict): values to display in report text Returns: @@ -1261,14 +1128,17 @@ def build_report(self, html_template, total_stats, gene_stats, fully_covered_genes=report_vals["fully_covered_genes"], name=report_vals["name"], sub_threshold_stats=sub_threshold_stats, + low_exon_columns=low_exon_columns, low_cov_plots=fig, all_plots=all_plots, summary_plot=summary_plot, gene_stats=gene_stats, + gene_table_headings=report_vals["gene_table_headings"], + exon_table_headings=report_vals["exon_table_headings"], total_stats=total_stats, - snps_high_cov=snps_high_cov, - snps_low_cov=snps_low_cov, - snps_no_cov=snps_no_cov, + snps_high_cov_data=snps_high_cov, + snps_low_cov_data=snps_low_cov, + snps_no_cov_data=snps_no_cov, total_snps=report_vals["total_snps"], snps_covered=report_vals["snps_covered"], snps_pct_covered=report_vals["snps_pct_covered"], @@ -1308,6 +1178,7 @@ def write_report(self, html_string, output_name): file = open(outfile, 'w') file.write(html_string) file.close() + print(f"Output report written to {outfile}") def load_files(load, threshold, exon_stats, gene_stats, raw_coverage, snp_vcfs, panel): @@ -1517,43 +1388,68 @@ def main(): # generate summary plot summary_plot = plots.summary_gene_plot(cov_summary) - # generate plot of sub optimal regions - fig = plots.low_exon_plot(low_raw_cov) + if len(cov_summary.index) < int(args.limit) or int(args.limit) == -1: + # generate plots of each full gene + print("Generating full gene plots") + if num_cores == 1: + # specified one core, generate plots slowly + all_plots = plots.all_gene_plots(raw_coverage) + else: + raw_coverage = raw_coverage.sort_values( + ["gene", "exon"], ascending=[True, True] + ) + # get unique list of genes + genes = raw_coverage.drop_duplicates(["gene"])["gene"].values.tolist() - if num_cores == 1: - print("blarg") - all_plots = plots.all_gene_plots(raw_coverage) + # split gene list equally for seperate processes + gene_array = np.array_split(np.array(genes), num_cores) - elif len(cov_summary.index) < int(args.limit) or int(args.limit) == -1: - # generate plots of each full gene - print("Generating full gene plots") + # split df into seperate dfs by genes in each list + split_dfs = np.asanyarray( + [raw_coverage[raw_coverage["gene"].isin(x)] for x in gene_array], + dtype=object + ) - raw_coverage = raw_coverage.sort_values( - ["gene", "exon"], ascending=[True, True] - ) + with multiprocessing.Pool(num_cores) as pool: + # use a pool to spawn multiple processes + # uses number of cores defined and splits processing of df + # slices, add each to pool with threshold values + all_plots = pool.map(plots.all_gene_plots, split_dfs) + all_plots = "".join(all_plots) + else: + all_plots = "
    Full gene plots have been omitted from this report\ + due to the high number of genes in the panel.
    " + + if len(low_raw_cov.index) > 0: + # some low covered regions, generate plots + print("Generating plots of low covered regions") # get unique list of genes - genes = raw_coverage.drop_duplicates(["gene"])["gene"].values.tolist() + genes = low_raw_cov.drop_duplicates(["gene"])["gene"].values.tolist() + print(f"Plots for {len(genes)} to generate") # split gene list equally for seperate processes gene_array = np.array_split(np.array(genes), num_cores) # split df into seperate dfs by genes in each list split_dfs = np.asanyarray( - [raw_coverage[raw_coverage["gene"].isin(x)] for x in gene_array], + [low_raw_cov[low_raw_cov["gene"].isin(x)] for x in gene_array], dtype=object ) with multiprocessing.Pool(num_cores) as pool: # use a pool to spawn multiple processes # uses number of cores defined and splits processing of df - # slices, add each to pool with threshold values and - # concatenates together when finished - all_plots = ''.join(pool.map(plots.all_gene_plots, split_dfs)) + # slices, add each to pool with threshold values + fig = pool.map(plots.low_exon_plot, split_dfs) + + # can return None => remove before joining + fig = [fig_str for fig_str in fig if fig_str] + fig = ",".join(fig) else: - all_plots = "
    Full gene plots have been omitted from this report\ - due to the high number of genes in the panel.
    " + fig = "

    All regions in panel above threshold, no plots\ + to show.

    " if args.summary: # summary text to be included diff --git a/bin/coverage_stats_single.py b/bin/coverage_stats_single.py index 119b455c..f31acfb3 100644 --- a/bin/coverage_stats_single.py +++ b/bin/coverage_stats_single.py @@ -7,6 +7,7 @@ """ import argparse +from functools import partial import os import re import sys @@ -165,17 +166,15 @@ def cov_stats(self, data, thresholds): ) coords = list(np.unique(coords)) - if len(coords) > 1: - # more than one region for exon in bed, exit as will - # be incorrectly calculated - print( - "More than one region is present in the bed file for " - "exon {} of {}: {}.\n".format(exon, gene, coords), - "Currently each exon MUST be in one pair of start / " - "end coordinates else coverage values will be " - "incorrect for those regions. Exiting now." - ) - sys.exit() + # if more than one region for exon in bed, exit as will + # be incorrectly calculated + assert len(coords) == 1, ( + "More than one region is present in the bed file for " + "exon {} of {}: {}.\n".format(exon, gene, coords), + "Currently each exon MUST be in one pair of start / " + "end coordinates else coverage values will be " + "incorrect for those regions. Exiting now." + ) start = exon_cov.iloc[0] end = exon_cov.iloc[-1] @@ -280,11 +279,9 @@ def summary_stats(self, cov_stats, thresholds): # calculate gene coverage values min = gene_cov["min"].min() - mean = sum( - [x * y for x, y in zip( - gene_cov["mean"], gene_cov["exon_frac"] - )] - ) + mean = round(sum( + [x * y for x, y in zip(gene_cov["mean"], gene_cov["exon_frac"])] + ), 12) max = gene_cov["max"].max() # average coverage % at given thresholds, round to 12 dp to @@ -468,11 +465,22 @@ def main(): # slices, add each to pool with threshold values and # concatenates together when finished print("Generating per base exon stats") + try: + # map threshold arg to function since same is passed to all, then + # use imap to pass each df to worker to generate stats + stats_w_arg = partial(single.cov_stats, thresholds=thresholds) + pool_output = pool.imap_unordered(stats_w_arg, split_dfs) + except AssertionError: + # more than one region for each exon found => exit + pool.close() + pool.terminate() + else: + cov_stats = pd.concat(pool_output, ignore_index=True) - cov_stats = pd.concat( - pool.starmap( - single.cov_stats, map(lambda e: (e, thresholds), split_dfs) - ), ignore_index=True) + # imap_unordered() returns everything out of order (funnily enough) + # sort by gene and exon to be nicely formatted + cov_stats.exon = cov_stats.exon.astype(int) + cov_stats = cov_stats.sort_values(['gene', 'exon']) # split up output coverage stats df for multiprocessing split_stats_dfs = np.asanyarray( @@ -489,6 +497,9 @@ def main(): lambda e: (e, thresholds), split_stats_dfs )), ignore_index=True) + cov_summary = cov_summary.sort_values(['gene']) + cov_summary = cov_summary.drop(columns=["exon"]) + # write tables to output files single.write_outfiles( cov_stats, cov_summary, args.outfile, flagstat, build diff --git a/bin/load_data.py b/bin/load_data.py index 9c4a8377..e20d28f4 100644 --- a/bin/load_data.py +++ b/bin/load_data.py @@ -1,8 +1,9 @@ -from pathlib import Path import os +import pandas as pd +from pathlib import Path import sys -import pandas as pd +from version import VERSION class loadData(): @@ -11,8 +12,11 @@ def __init__(self): "chrom": str, "exon_start": 'Int64', "exon_end": 'Int64', + "start": 'Int64', + "end": 'Int64', "gene": str, "tx": str, + "transcript": str, "exon": 'Int64', "exon_len": 'Int64', "min": 'Int64', @@ -36,6 +40,96 @@ def filter_dtypes(self, df): return {k: v for k, v in self.dtypes.items() if k in df.columns} + def read_panel_bed(self, bed_file): + """ + Read in panel bed file to dataframe + Args: bed_file (file): file handler of bed file + Returns: panel_bed_df (df): df of panel bed file + """ + print("reading panel bed file") + panel_bed = pd.read_csv( + bed_file, sep="\t", dtype=self.dtypes, names=[ + "chrom", "start", "end", "transcript" + ] + ) + + # strip chr from chrom if present + panel_bed["chrom"] = panel_bed["chrom"].apply( + lambda x: str(x).replace("chr", "") + ) + + return panel_bed + + + def read_transcript_info(self, transcript_file): + """ + Read in file that contains transcript -> gene symbol, exon no. + Args: transcript_file (file): file handler + Returns: transcript_info_df (df): df of transcript info + """ + print("reading transcript information file") + transcript_info_df = pd.read_csv( + transcript_file, sep="\t", dtype=self.dtypes, names=[ + "chrom", "start", "end", "gene", "transcript", "exon" + ] + ) + + # strip chr from chrom if present + transcript_info_df["chrom"] = transcript_info_df["chrom"].apply( + lambda x: str(x).replace("chr", "") + ) + + return transcript_info_df + + + def read_coverage_data(self, coverage_file, chunk_size=None): + """ + Read in per-base coverage file (i.e. output of mosdepth) + Args: + - coverage_file (file): file handler + - chunk_size (int): this can be 50 million + line file, use chunks + to read in df and return an iterable to stop bedtools breaking + Returns: pb_coverage_df (df / list): df of coverage data / list of dfs + if chunk_size value passed + """ + print("reading coverage file, this might take a while...") + + if chunk_size: + # build list of dataframes split by given chunk size + pb_coverage_df = [] + + for df in pd.read_csv( + coverage_file, sep="\t", compression="infer", + dtype=self.dtypes, + chunksize=chunk_size, names=["chrom", "start", "end", "cov"] + ): + # strip chr prefix if present + df["chrom"] = df["chrom"].apply( + lambda x: str(x).replace("chr", "") + ) + # add to list of chunk dfs + pb_coverage_df.append(df) + + print( + f"per-base coverage data read in to {len(pb_coverage_df)} " + f"of {chunk_size} rows" + ) + else: + # read file into one df + pb_coverage_df = pd.read_csv( + coverage_file, sep="\t", compression="infer", + dtype=self.dtypes, + names=["chrom", "start", "end", "cov"] + ) + + # strip chr from chrom if present + pb_coverage_df["chrom"] = pb_coverage_df["chrom"].apply( + lambda x: str(x).replace("chr", "") + ) + + return pb_coverage_df + + def read_exon_stats(self, exon_stats): """ Read exon stats file from coverage_single_stats into df @@ -269,8 +363,7 @@ def get_snp_vcfs(snp_vcfs): if snp_vcfs: # get names of SNP vcfs used to display in report vcfs = ", ".join([Path(x).stem for x in snp_vcfs]) - vcfs = "
    VCF(s) of known variants included in report: {}\ -
    ".format(vcfs) + vcfs = f"VCF(s) of known variants included in report: {vcfs}" else: vcfs = "" @@ -280,25 +373,13 @@ def get_snp_vcfs(snp_vcfs): @staticmethod def get_athena_ver(): """ - Attempt to get version of Athena from dir name to display in - report footer, will only work for zip/tar + Return version of Athena to display in report footer Args: None Returns: - version (str): version of Athena """ - bin_dir = os.path.dirname(os.path.abspath(__file__)) - - try: - path = str(os.path.join(bin_dir, "../")).split("/") - version = [s for s in path if "athena" in s][0].split("-")[1] - version = "({})".format(version) - except Exception: - print("Error getting version from dir name, continuing.") - version = "" - pass - - return version + return VERSION @staticmethod diff --git a/bin/version.py b/bin/version.py new file mode 100644 index 00000000..ee65984a --- /dev/null +++ b/bin/version.py @@ -0,0 +1 @@ +VERSION = '1.2.0' diff --git a/data/example/Example_coverage_report.html b/data/example/Example_coverage_report.html index c0d224c5..7f9097e5 100644 --- a/data/example/Example_coverage_report.html +++ b/data/example/Example_coverage_report.html @@ -2,7 +2,31 @@ - @@ -79,7 +136,7 @@

    Report details

    It contains the following sections: $summary_plot -

    +




    -

    Exons with sub-optimal coverage

    -
    +

    Exons with <100% coverage at $threshold

    Of the $total_genes genes included in the panel, $exon_issues exons in - $gene_issues genes had sub optimal-coverage.
    -
    + $gene_issues genes had sub optimal-coverage.

    - - $sub_threshold_stats -
    -

    +
    + + - To assess what portion of the exon(s) have sub-optimal coverage, the plots below display - an interactive window on hovering, showing the coverage at that current base.
    - The top right also contains a toolbar, with functions such as panning and zooming. +
    +
    + +
    +

    + + To assess what portion of the exon(s) have sub-optimal coverage, the plots below display + an interactive window on hovering, showing the coverage at that current base.
    + The top right also contains a toolbar, with functions such as panning and zooming. + +

    + + + + + + + + +
    + +




    - $low_cov_plots -

    Per gene coverage summary

    @@ -132,40 +205,50 @@

    Per gene coverage summary

    The table below contains metrics for each gene averaged across all of its exons. Plots of each gene are also included below.

    - -
    - - $gene_stats -
    -
    -
    - -
    - $all_plots -
    + + + + + $gene_table_headings + + +
    + +

    + + + + + + + + +
    +
    -

    +




    - -

    Coverage for all regions of all genes

    + +

    Per exon coverage


    - The following section provides coverage metrics for each exon of each gene.
    - This is a very large table with comprehensive coverage metrics of all target regions. + The following section provides coverage metrics for every exon of each gene.

    - -
    -
    - - $total_stats -
    -
    + + + + $exon_table_headings + + +
    -

    + +


    +

    Coverage of Known Variants


    Below are tables giving coverage of known variants. The low coverage table gives those not covered at $threshold, @@ -175,26 +258,72 @@

    Coverage of Known Variants

  • $snps_covered ($snps_pct_covered %) were covered at or above $threshold
  • $snps_not_covered ($snps_pct_not_covered %) variants were not covered at $threshold
  • $snps_out_panel ($snps_pct_out_panel %) variants spanned region boundaries
  • -
    - $vcfs -
    +
  • $vcfs
  • +

    + - $snps_no_cov - Table of variants with low coverage (< $threshold)   - -
    - - $snps_low_cov +
    + Variants spanning gene / exon boundaries (e.g. larger structural variants) + +
    + + + + + + + + + + +
    VCFChromosomePositionRefAltInfo
    -
    -
    - Table of variants with high coverage (>= $threshold)   - -
    - - $snps_high_cov + + +


    + + +
    + Table of variants with low coverage (< $threshold) +
    + + + + + + + + + + + + +
    VCFGeneExonChromosomePositionRefAltCoverage
    + +


    + +
    + Table of variants with high coverage (>= $threshold)   + + + + + + + + + + + + + + + +
    VCFGeneExonChromosomePositionRefAltCoverage
    +
    +

    @@ -205,17 +334,31 @@

    Coverage of Known Variants

    Report created with Athena $version $logo - + + + + + + + + + - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + - - \ No newline at end of file +