Skip to content

Commit

Permalink
Merge pull request #47 from eastgenomics/development
Browse files Browse the repository at this point in the history
1.2 - development -> main
mattgarner authored Jul 7, 2021
2 parents c0fcfea + d283759 commit 755255b
Showing 10 changed files with 1,708 additions and 21,667 deletions.
43 changes: 27 additions & 16 deletions README.md
Original file line number Diff line number Diff line change
@@ -13,7 +13,7 @@ Athena is a tool to generate coverage statistics for NGS data, and combine these

Dependencies may be installed from the requirements.txt file using ```pip install -r requirements.txt```.
This should contains all the required python packages required to generate coverage statistics and reports.
For optional calculating of variant coverage from VCFs, [BEDtools][bedtools-url] is also required to be installed.
In addition, [BEDtools][bedtools-url] is also required to be installed and on path.

Tested on Ubuntu 18.04.4 and macOS 10.15.4

@@ -22,35 +22,52 @@ Tested on Ubuntu 18.04.4 and macOS 10.15.4
It is written to take in per base coverage data (as output from tools such as [mosdepth][mosdepth-url]) as input to calculate coverage for target regions defined in a bed file. <br></br>

The general workflow for generating the statistics and report is as follows: <br>
- Annotate each region of the bed file with the gene, exon and per base coverage data using `annotate_bed.sh`
- Annotate each region of the bed file with the gene, exon and per base coverage data using `annotate_bed.py`
- Generate per exon and per gene statistics using `coverage_stats_single.py`
- Generate HTML coverage report with `coverage_report_single.py`

For DNAnexus cloud platform users, an Athena [dx applet][dx-url] has also been built.


### Expected file formats

As a minimum, Athena requires 3 input files. These are a bed file for the gene panel, a file of transcript information and the output of your coverage tool (mosdepth, samtools etc.). These files MUST have the following columns:

- panel bed file: `chromosome start end transcript`
- transcript file: `chromosome start end gene transcript exon`
- coverage file: `chromosome start end coverage`

n.b. the process for creating the transcript file may be found [here][transcript-file-url].

### Annotating BED file
The BED file containing regions of interest is first required to be annotated with gene, exon and coverage information prior to analysis. This may be done using [BEDtools intersect][bedtools-intersect-url], with a file containing transcript to gene and exon information, and then the per base coverage data. <br>
The BED file containing regions of interest is first required to be annotated with gene, exon and coverage information prior to analysis. This may be done using [BEDtools intersect][bedtools-intersect-url], with a file containing transcript to gene and exon information, and then the per base coverage data. Currently, 100% overlap is required between coordinates in the panel bed file and the transcript annotation file, therefore you must ensure any added flank regions etc. are the same.<br>

Included is a Bash script (`annotate_bed.sh`) to perform the required BED file annotation.
Included is a Python script (`annotate_bed.py`) to perform the required BED file annotation.

Expected inputs:

```
-i : Input panel bed file; must have ONLY the following 4 columns chromosome, start position, end position, gene/transcript.
-g : Exons nirvana file, contains required gene and exon information.
-b : Per base coverage file (output from mosdepth or similar).
-p / --panel_bed : Input panel bed file; must have ONLY the following 4 columns chromosome, start position, end position, gene/transcript
-t / --transcript_file : Transcript annotation file, contains required gene and exon information. Must have ONLY the following 6 columns:
chromosome, start, end, gene, transcript, exon
-c / --coverage_file : Per base coverage file (output from mosdepth or similar)
-s / -chunk_size : (optional) nrows to split per-base coverage file for intersecting, with <16GB RAM can lead to bedtools intersect failing. Reccomended values: 16GB RAM -> 20000000; 8GB RAM -> 10000000
-n / --output_name : (optional) Prefix for naming output file, if not given will use name from per base coverage file
Example usage:
$ annotate_bed.sh -i panel_bed_file.bed -g exons_nirvana -b {input_file}.per_base.bed
$ annotate_bed.py -p panel_bed_file.bed -t transcript_file.tsv -c {input_file}.per_base.bed
```
<br>
This wraps the bedtools intersect commands below. These commands are given as an example, the output file column ordering must match that given in /data/example example_annotated_bed for calculating coverage statistics:
<br>

```
$ bedtools intersect -a beds/sorted_bed_file.bed -b beds/exons_nirvana2010_no_PAR_Y_noflank.bed -wa -wb | awk 'OFS="\t" {if ($4 == $9) print}' | cut -f 1,2,3,8,9,10 > sample1_genes_exons.bed
$ bedtools intersect -a beds/sorted_bed_file.bed -b beds/exons_nirvana2010_no_PAR_Y_noflank.bed -wa -wb -f 1.0 -r | awk 'OFS="\t" {if ($4 == $9) print}' | cut -f 1,2,3,8,9,10 > sample1_genes_exons.bed
- sorted_bed_file.bed -- bed file defining regions of interest (columns: chromosome, start, end, transcript)
- exons_nirvana2010_no_PAR_Y.bed -- a bed file containing transcript -> exon and gene information
@@ -108,13 +125,6 @@ $ python3 bin/coverage_report_single.py --gene_stats output/sample1-exon-coverag
```


### For development

Features to be developed:
- Generate run level statistics from multiple samples
- Generate run level report from multiple samples
- Add interactive elements to tables to increase useability (i.e sorting, filtering, searching)

Any bugs or suggestions for improvements please raise an issue.


@@ -130,3 +140,4 @@ Any bugs or suggestions for improvements please raise an issue.
[mosdepth-url]: https://github.com/brentp/mosdepth

[dx-url]: https://github.com/eastgenomics/eggd_athena
[transcript-file-url]: https://cuhbioinformatics.atlassian.net/wiki/spaces/P/pages/2241101840/Generating+transcripts+file+for+Athena
253 changes: 253 additions & 0 deletions bin/annotate_bed.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,253 @@
"""
Script to annotate a panel bed file with transcript information and per
base coverage data.
Requires: bedtools
Jethro Rainford
20/06/2021
"""

import argparse
import os
import pandas as pd
from pathlib import Path
import pybedtools as bedtools

from load_data import loadData


class annotateBed():

def add_transcript_info(self, panel_bed, transcript_info_df):
"""
Use pybedtools to annotate panel bed file with coverage data
Args:
- panel_bed (df): panel bed file regions df
- transcript_info_df (df): transcript info file df
Returns:
- bed_w_transcript (df): panel bed file with transcript information
"""
print("calling bedtools to add transcript info")

# get total number of transcripts before to ensure none are dropped
panel_transcripts = panel_bed.transcript.unique().tolist()

# turn dfs into BedTools objects
bed = bedtools.BedTool.from_dataframe(panel_bed)
transcript_info = bedtools.BedTool.from_dataframe(transcript_info_df)

# intersecting panel bed file with transcript/gene/exon information
# requires 100% overlap on panel -> transcript coordinates
bed_w_transcript = bed.intersect(
transcript_info, wa=True, wb=True, F=1.0
)

# convert pybedtools object to df
bed_w_transcript = bed_w_transcript.to_dataframe(names=[
"p_chrom", "p_start", "p_end", "p_transcript",
"t_chrom", "t_start", "t_end", "t_gene", "t_transcript", "t_exon"
])

# check for empty file
assert len(bed_w_transcript.index) > 0, """Empty file returned from
intersecting panel bed and transcript file. Check if flanks are
being used as 100% coordinate overlap is currently required."""

# panel bed file defines transcript to use, filter transcript file for
# just those transcripts
bed_w_transcript = bed_w_transcript[
bed_w_transcript["p_transcript"] == bed_w_transcript["t_transcript"]
]

# drop duplicate columns
bed_w_transcript = bed_w_transcript.drop(columns=[
't_chrom', 't_start', 't_end', 'p_transcript'
])

intersect_transcripts = bed_w_transcript.t_transcript.unique().tolist()

# ensure no transcripts dropped from panel due to missing from
# transcripts file
assert len(panel_transcripts) == len(intersect_transcripts), (
f"Transcript(s) dropped from panel during intersecting with "
f"transcript file. Total before {len(panel_transcripts)}. Total "
f"after {len(intersect_transcripts)}. Dropped transcripts: "
f"{set(panel_transcripts) - set(intersect_transcripts)}"
)

return bed_w_transcript


def add_coverage(self, bed_w_transcript, coverage_df, chunks=False):
"""
Use pybedtools to add coverage bin data to selected panel regions
Args:
- bed_w_transcript (df): panel bed file with transcript information
- coverage_df (df / list): coverage bin data df / list of dfs if
chunks value passed
Returns:
- bed_w_coverage (df): panel bed with transcript and coverage info
"""
print("calling bedtools to add coverage info")

# turn dfs into BedTools objects
bed_w_transcript = bedtools.BedTool.from_dataframe(bed_w_transcript)

col_names = [
"t_chrom", "t_start", "t_end", "t_gene", "t_transcript", "t_exon",
"c_chrom", "cov_start", "cov_end", "cov"
]

if not chunks:
# per-base coverage all in one df
coverage_df = bedtools.BedTool.from_dataframe(coverage_df)

bed_w_coverage = bed_w_transcript.intersect(
coverage_df, wa=True, wb=True
)
bed_w_coverage = bed_w_coverage.to_dataframe(names=col_names)
else:
# coverage data in chunks, loop over each df and intersect
bed_w_coverage = pd.DataFrame(columns=col_names)

for num, df in enumerate(coverage_df):
print(f"intersecting {num + 1}/{len(coverage_df)} coverage chunks")
# read each to bedtools object, intersect and add back to df
chunk_df = bedtools.BedTool.from_dataframe(df)

bed_w_coverage_chunk = bed_w_transcript.intersect(
chunk_df, wa=True, wb=True
)

bed_w_coverage_chunk = bed_w_coverage_chunk.to_dataframe(
names=col_names
)

bed_w_coverage = pd.concat(
[bed_w_coverage, bed_w_coverage_chunk],
ignore_index=True
)

# check again for empty output of bedtools, can happen due to memory
# maxing out and doesn't seem to raise an exception...
assert len(bed_w_coverage) > 0, """Error intersecting with coverage
data, empty file generated. Is this the correct coverage data for
the panel used? bedtools may also have reached memory limit and
died, try re-running with --chunk_size 1000000"""

# drop duplicate chromosome col and rename
bed_w_coverage.drop(columns=["c_chrom"], inplace=True)

bed_w_coverage.columns = [
"chrom", "exon_start", "exon_end", "gene", "tx", "exon",
"cov_start", "cov_end", "cov"
]

return bed_w_coverage


def write_file(bed_w_coverage, outfile):
"""
Write annotated bed to file
Args:
- bed_w_coverage (df): bed file with transcript and coverage info
- output_prefix (str): prefix for naming output file
Outputs: annotated_bed.tsv
"""
# tiny function but want this separate for writing a wrapper script later
bed_w_coverage.to_csv(outfile, sep="\t", header=False, index=False)
print(f"annotated bed file written to {outfile}")


def parse_args():
"""
Parse cmd line arguments
Args: None
Returns:
- args (arguments): args passed from cmd line
"""
parser = argparse.ArgumentParser(
description='Annotate panel bed file with transcript & coverage data.'
)
parser.add_argument(
'--panel_bed', '-p',
help='panel bed file'
)
parser.add_argument(
'--transcript_file', '-t',
help='file with gene and exon information'
)
parser.add_argument(
'--coverage_file', '-c',
help='per base coverage data file'
)
parser.add_argument(
'--chunk_size', '-s', type=int,
help='number lines to read per-base coverage file in one go'
)
parser.add_argument(
'--output_name', '-n',
help='name preifx for output file, if none will use coverage file'
)

args = parser.parse_args()

return args


def main():
annotate = annotateBed()
load = loadData() # class of functions for reading in data

args = parse_args()

if not args.output_name:
# output name not defined, use sample identifier from coverage file
args.output_name = Path(args.coverage_file).name.split('_')[0]

# set dir for writing to
bin_dir = os.path.dirname(os.path.abspath(__file__))
out_dir = os.path.join(bin_dir, "../output/")
outfile_name = f"{args.output_name}_annotated.bed"
outfile = os.path.join(out_dir, outfile_name)

# read in files
panel_bed_df = load.read_panel_bed(args.panel_bed)
transcript_info_df = load.read_transcript_info(args.transcript_file)
pb_coverage_df = load.read_coverage_data(
args.coverage_file, args.chunk_size
)

# add transcript info
bed_w_transcript = annotate.add_transcript_info(
panel_bed_df, transcript_info_df
)

# add coverage
if args.chunk_size:
# per-base coverage split to multiple dfs to limit memory usage
bed_w_coverage = annotate.add_coverage(
bed_w_transcript, pb_coverage_df, chunks=True
)
else:
bed_w_coverage = annotate.add_coverage(
bed_w_transcript, pb_coverage_df, chunks=False
)

# sense check generated file isn't empty, should be caught earlier
assert len(bed_w_coverage.index) > 0, (
'An error has occured: annotated bed file is empty. This is likely ',
'due to an error in regions defined in bed file (i.e. different ',
'transcripts to those in the transcripts file). Start debugging by ',
'intersecting files manually...'
)

write_file(bed_w_coverage, outfile)


if __name__ == "__main__":

main()
71 changes: 0 additions & 71 deletions bin/annotate_bed.sh

This file was deleted.

538 changes: 217 additions & 321 deletions bin/coverage_report_single.py

Large diffs are not rendered by default.

51 changes: 31 additions & 20 deletions bin/coverage_stats_single.py
Original file line number Diff line number Diff line change
@@ -7,6 +7,7 @@
"""

import argparse
from functools import partial
import os
import re
import sys
@@ -165,17 +166,15 @@ def cov_stats(self, data, thresholds):
)
coords = list(np.unique(coords))

if len(coords) > 1:
# more than one region for exon in bed, exit as will
# be incorrectly calculated
print(
"More than one region is present in the bed file for "
"exon {} of {}: {}.\n".format(exon, gene, coords),
"Currently each exon MUST be in one pair of start / "
"end coordinates else coverage values will be "
"incorrect for those regions. Exiting now."
)
sys.exit()
# if more than one region for exon in bed, exit as will
# be incorrectly calculated
assert len(coords) == 1, (
"More than one region is present in the bed file for "
"exon {} of {}: {}.\n".format(exon, gene, coords),
"Currently each exon MUST be in one pair of start / "
"end coordinates else coverage values will be "
"incorrect for those regions. Exiting now."
)

start = exon_cov.iloc[0]
end = exon_cov.iloc[-1]
@@ -280,11 +279,9 @@ def summary_stats(self, cov_stats, thresholds):

# calculate gene coverage values
min = gene_cov["min"].min()
mean = sum(
[x * y for x, y in zip(
gene_cov["mean"], gene_cov["exon_frac"]
)]
)
mean = round(sum(
[x * y for x, y in zip(gene_cov["mean"], gene_cov["exon_frac"])]
), 12)
max = gene_cov["max"].max()

# average coverage % at given thresholds, round to 12 dp to
@@ -468,11 +465,22 @@ def main():
# slices, add each to pool with threshold values and
# concatenates together when finished
print("Generating per base exon stats")
try:
# map threshold arg to function since same is passed to all, then
# use imap to pass each df to worker to generate stats
stats_w_arg = partial(single.cov_stats, thresholds=thresholds)
pool_output = pool.imap_unordered(stats_w_arg, split_dfs)
except AssertionError:
# more than one region for each exon found => exit
pool.close()
pool.terminate()
else:
cov_stats = pd.concat(pool_output, ignore_index=True)

cov_stats = pd.concat(
pool.starmap(
single.cov_stats, map(lambda e: (e, thresholds), split_dfs)
), ignore_index=True)
# imap_unordered() returns everything out of order (funnily enough)
# sort by gene and exon to be nicely formatted
cov_stats.exon = cov_stats.exon.astype(int)
cov_stats = cov_stats.sort_values(['gene', 'exon'])

# split up output coverage stats df for multiprocessing
split_stats_dfs = np.asanyarray(
@@ -489,6 +497,9 @@ def main():
lambda e: (e, thresholds), split_stats_dfs
)), ignore_index=True)

cov_summary = cov_summary.sort_values(['gene'])
cov_summary = cov_summary.drop(columns=["exon"])

# write tables to output files
single.write_outfiles(
cov_stats, cov_summary, args.outfile, flagstat, build
117 changes: 99 additions & 18 deletions bin/load_data.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,9 @@
from pathlib import Path
import os
import pandas as pd
from pathlib import Path
import sys

import pandas as pd
from version import VERSION


class loadData():
@@ -11,8 +12,11 @@ def __init__(self):
"chrom": str,
"exon_start": 'Int64',
"exon_end": 'Int64',
"start": 'Int64',
"end": 'Int64',
"gene": str,
"tx": str,
"transcript": str,
"exon": 'Int64',
"exon_len": 'Int64',
"min": 'Int64',
@@ -36,6 +40,96 @@ def filter_dtypes(self, df):
return {k: v for k, v in self.dtypes.items() if k in df.columns}


def read_panel_bed(self, bed_file):
"""
Read in panel bed file to dataframe
Args: bed_file (file): file handler of bed file
Returns: panel_bed_df (df): df of panel bed file
"""
print("reading panel bed file")
panel_bed = pd.read_csv(
bed_file, sep="\t", dtype=self.dtypes, names=[
"chrom", "start", "end", "transcript"
]
)

# strip chr from chrom if present
panel_bed["chrom"] = panel_bed["chrom"].apply(
lambda x: str(x).replace("chr", "")
)

return panel_bed


def read_transcript_info(self, transcript_file):
"""
Read in file that contains transcript -> gene symbol, exon no.
Args: transcript_file (file): file handler
Returns: transcript_info_df (df): df of transcript info
"""
print("reading transcript information file")
transcript_info_df = pd.read_csv(
transcript_file, sep="\t", dtype=self.dtypes, names=[
"chrom", "start", "end", "gene", "transcript", "exon"
]
)

# strip chr from chrom if present
transcript_info_df["chrom"] = transcript_info_df["chrom"].apply(
lambda x: str(x).replace("chr", "")
)

return transcript_info_df


def read_coverage_data(self, coverage_file, chunk_size=None):
"""
Read in per-base coverage file (i.e. output of mosdepth)
Args:
- coverage_file (file): file handler
- chunk_size (int): this can be 50 million + line file, use chunks
to read in df and return an iterable to stop bedtools breaking
Returns: pb_coverage_df (df / list): df of coverage data / list of dfs
if chunk_size value passed
"""
print("reading coverage file, this might take a while...")

if chunk_size:
# build list of dataframes split by given chunk size
pb_coverage_df = []

for df in pd.read_csv(
coverage_file, sep="\t", compression="infer",
dtype=self.dtypes,
chunksize=chunk_size, names=["chrom", "start", "end", "cov"]
):
# strip chr prefix if present
df["chrom"] = df["chrom"].apply(
lambda x: str(x).replace("chr", "")
)
# add to list of chunk dfs
pb_coverage_df.append(df)

print(
f"per-base coverage data read in to {len(pb_coverage_df)} "
f"of {chunk_size} rows"
)
else:
# read file into one df
pb_coverage_df = pd.read_csv(
coverage_file, sep="\t", compression="infer",
dtype=self.dtypes,
names=["chrom", "start", "end", "cov"]
)

# strip chr from chrom if present
pb_coverage_df["chrom"] = pb_coverage_df["chrom"].apply(
lambda x: str(x).replace("chr", "")
)

return pb_coverage_df


def read_exon_stats(self, exon_stats):
"""
Read exon stats file from coverage_single_stats into df
@@ -269,8 +363,7 @@ def get_snp_vcfs(snp_vcfs):
if snp_vcfs:
# get names of SNP vcfs used to display in report
vcfs = ", ".join([Path(x).stem for x in snp_vcfs])
vcfs = "<br>VCF(s) of known variants included in report: <b>{}</b>\
</br>".format(vcfs)
vcfs = f"VCF(s) of known variants included in report: <b>{vcfs}</b>"
else:
vcfs = ""

@@ -280,25 +373,13 @@ def get_snp_vcfs(snp_vcfs):
@staticmethod
def get_athena_ver():
"""
Attempt to get version of Athena from dir name to display in
report footer, will only work for zip/tar
Return version of Athena to display in report footer
Args: None
Returns:
- version (str): version of Athena
"""
bin_dir = os.path.dirname(os.path.abspath(__file__))

try:
path = str(os.path.join(bin_dir, "../")).split("/")
version = [s for s in path if "athena" in s][0].split("-")[1]
version = "({})".format(version)
except Exception:
print("Error getting version from dir name, continuing.")
version = ""
pass

return version
return VERSION


@staticmethod
1 change: 1 addition & 0 deletions bin/version.py
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
VERSION = '1.2.0'
21,619 changes: 469 additions & 21,150 deletions data/example/Example_coverage_report.html

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions data/static/js/jquery-3.5.1.min.js

Large diffs are not rendered by default.

680 changes: 609 additions & 71 deletions data/templates/single_template.html

Large diffs are not rendered by default.

0 comments on commit 755255b

Please sign in to comment.