Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.2 - development -> main #47

Merged
merged 121 commits into from
Jul 7, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
121 commits
Select commit Hold shift + click to select a range
d598f71
add UTF8 charset to template meta tag, fixes weird symbol rendering
jethror1 Jun 11, 2021
539f4d4
add datatables functionality to gene and exon stats tables
jethror1 Jun 11, 2021
9b0ad3a
fix table indexing
jethror1 Jun 11, 2021
bb38758
added % values to summary text
jethror1 Jun 11, 2021
bd503f0
adding local js
jethror1 Jun 14, 2021
5fe460d
changes to write all plots to array, reduce plot file size by ~1/2
jethror1 Jun 19, 2021
c7c0a87
modify template for new plot format
jethror1 Jun 20, 2021
342b8e2
fix formatting of plot str to write
jethror1 Jun 20, 2021
708b99c
add functions for loading bed annotation files
jethror1 Jun 20, 2021
4b175b0
fix plot row autosize
jethror1 Jun 20, 2021
bbd1857
annotate_bed but its now in python
jethror1 Jun 21, 2021
7ecce37
add to args
jethror1 Jun 21, 2021
b03a55e
fix writing to file
jethror1 Jun 21, 2021
30a9cc8
fix window size on plots
jethror1 Jun 21, 2021
ffa5d98
increase plot size
jethror1 Jun 21, 2021
ae46474
improve plot labels
jethror1 Jun 21, 2021
2fa5c66
changes to accomdate moving plotly functionality to JS function, low …
jethror1 Jun 23, 2021
bf90a26
changes to JS for low exon plots to match current styling
jethror1 Jun 26, 2021
a2e66e4
fix sorting in low exon plot area
jethror1 Jun 26, 2021
4249c65
tidy up low_exon_plot() to remove old code
jethror1 Jun 26, 2021
043d634
formatting and comments
jethror1 Jun 27, 2021
6c153d1
update docstring
jethror1 Jun 27, 2021
e9226d2
low exon plot formatting
jethror1 Jun 27, 2021
a5d95ba
comments for datatables/plotly script
jethror1 Jun 27, 2021
f40f5fb
remove unneeded JS
jethror1 Jun 27, 2021
9413974
better styling on summary plot axis
jethror1 Jun 27, 2021
7e07826
revert commented line, remove print
jethror1 Jun 27, 2021
70f7739
remove ws
jethror1 Jun 27, 2021
ca78d73
tidy up html template
jethror1 Jun 28, 2021
072619c
tidy up python script
jethror1 Jun 28, 2021
56343fc
remove unneeded function from load_data.py
jethror1 Jun 28, 2021
4b9111b
remove bad mouthing of pandas, set dtypes for everything
jethror1 Jun 28, 2021
76daf96
changes to move low exon table styling to JS
jethror1 Jun 28, 2021
1f6773e
clean up style_sub_threshold()
jethror1 Jun 28, 2021
eda9efd
update docstring
jethror1 Jun 28, 2021
9d3560a
improve threshold column colouring
jethror1 Jun 28, 2021
6e25cad
revert skipping figure
jethror1 Jun 28, 2021
9b21179
tweak report formatting
jethror1 Jun 28, 2021
806399e
remove skipping plots
jethror1 Jun 28, 2021
2a6c356
formatting, comments
jethror1 Jun 28, 2021
0c689c3
Merge pull request #33 from eastgenomics/1.2-add-dataTables
Yu-jinKim Jun 28, 2021
e5b73ef
improve plot styling
jethror1 Jun 28, 2021
a301f50
Merge branch 'development' into 1.2-mod-plots
jethror1 Jun 28, 2021
2251a4a
add check for empty bed file
jethror1 Jun 28, 2021
41931c1
fix assertion
jethror1 Jun 28, 2021
1ba536c
remove ws
jethror1 Jun 28, 2021
33ea0b0
remove ws
jethror1 Jun 28, 2021
e982e76
remove ws
jethror1 Jun 28, 2021
9036791
Merge pull request #31 from eastgenomics/1.2-summary-text
Aisha-D Jun 29, 2021
3552970
Merge branch 'development' into 1.2-fix-utf8
jethror1 Jun 29, 2021
0b00683
Merge pull request #32 from eastgenomics/1.2-fix-utf8
Aisha-D Jun 29, 2021
89d5446
Merge branch 'development' into 1.2-low-exon-plots
mattgarner Jun 29, 2021
ef5d9d7
Merge pull request #36 from eastgenomics/1.2-low-exon-plots
mattgarner Jun 29, 2021
4feca77
Merge branch 'development' into 1.2-mod-plots
mattgarner Jun 29, 2021
f95a5fb
Merge pull request #34 from eastgenomics/1.2-mod-plots
mattgarner Jun 29, 2021
ec5c2af
- fix y-axis values on low exon plots sometimes being too low
jethror1 Jun 29, 2021
b32ca65
comment & ws
jethror1 Jun 29, 2021
e64a10d
Merge pull request #37 from eastgenomics/1.2-low-exon-plots
mattgarner Jun 29, 2021
69b27f3
fix merge issues
jethror1 Jun 29, 2021
a2c7bbe
move copy to clipboard function to head
jethror1 Jun 29, 2021
8cf2416
change report text
jethror1 Jun 29, 2021
55a71fd
adjust template
jethror1 Jun 29, 2021
e6f83e3
add multiprocessing to generate low exon plots
jethror1 Jun 29, 2021
a1e5684
change example report in `/data/example`
jethror1 Jun 29, 2021
efb56a3
remove old arg
jethror1 Jun 29, 2021
37bdf29
remove ws
jethror1 Jun 29, 2021
9e8f0fa
changes to summary text:
jethror1 Jun 29, 2021
f378362
Merge branch '1.2-fix-merge-issues' of https://github.com/eastgenomic…
jethror1 Jun 29, 2021
3f68a92
improve low exon threshold column colouring for nicer colours
jethror1 Jun 29, 2021
793d61b
Merge pull request #38 from eastgenomics/1.2-fix-merge-issues
mattgarner Jun 30, 2021
ecb16a2
remove call to generate sub threshold plots
jethror1 Jun 30, 2021
04b60a3
Merge pull request #39 from eastgenomics/fix-plot-bug
Jay-Miles Jun 30, 2021
ad37902
add checks, output naming
jethror1 Jun 30, 2021
6569ce2
comment
jethror1 Jun 30, 2021
f549a28
another comment
jethror1 Jun 30, 2021
056eced
remove old annotate script
jethror1 Jun 30, 2021
39daade
update readme
jethror1 Jun 30, 2021
e69ccc2
add extra check for missing transcripts on intersecting
jethror1 Jun 30, 2021
17655d8
remove testing lines
jethror1 Jun 30, 2021
85a225a
add arg to readme
jethror1 Jun 30, 2021
559f867
readme typo
jethror1 Jun 30, 2021
8586fa6
fix typo
jethror1 Jun 30, 2021
dc85a00
readme update
jethror1 Jun 30, 2021
19d5011
switch to non reciprocal overlap, only require panel region covers tr…
jethror1 Jun 30, 2021
6600987
addition of `--chunk-size` arg to split per-base coverage up to not e…
jethror1 Jun 30, 2021
b8e4e7c
Merge pull request #40 from eastgenomics/1.2-annotate-bed
Aisha-D Jun 30, 2021
72de98b
changes to annotate_bed to match old output
jethror1 Jul 1, 2021
064790e
Merge pull request #41 from eastgenomics/1.2-fix_annotate_bed-output
Aisha-D Jul 1, 2021
6c09f19
fix float precision of mean for gene stats
jethror1 Jul 1, 2021
31f23ea
Merge pull request #42 from eastgenomics/fix-float-precision
Aisha-D Jul 1, 2021
29a7a27
fix missing hash on hex code
jethror1 Jul 1, 2021
042353f
simplify version parsing
jethror1 Jul 1, 2021
870c99d
add newline to make pep8 overlord bot happy
jethror1 Jul 1, 2021
ad5c47e
Merge pull request #43 from eastgenomics/1.2-fixes
Aisha-D Jul 1, 2021
dd335cd
fix for appropriately exiting from multi processing on error in gener…
jethror1 Jul 1, 2021
424e79f
added sorting for nicer output
jethror1 Jul 1, 2021
fc1542c
move dropping exon column in gene stats table to stats script
jethror1 Jul 1, 2021
5c40bb6
Merge pull request #45 from eastgenomics/fix-making-child-kill-parent…
Addy81 Jul 1, 2021
9a19503
switch variant tables to use datatables
jethror1 Jul 5, 2021
2bfe57f
add horizontal seperators
jethror1 Jul 5, 2021
57544ab
fix displaying variant vcfs in summary
jethror1 Jul 5, 2021
dab05b7
clean up template
jethror1 Jul 5, 2021
2ac5409
remove unneeded function
jethror1 Jul 5, 2021
3f69f29
fix replacing empty tables with text, remove old css
jethror1 Jul 5, 2021
8a6e509
remove whitespace from template
jethror1 Jul 5, 2021
00fe31a
add function to confitionally hide low exon plots when empty
jethror1 Jul 5, 2021
6919ec6
refactor template
jethror1 Jul 5, 2021
8d0b6fa
PR tidying up
jethror1 Jul 5, 2021
eb23f21
PR changes, JS tidy up, comments
jethror1 Jul 5, 2021
524bfd8
Merge pull request #46 from eastgenomics/1.2-vcf_datatables
Yu-jinKim Jul 5, 2021
228985f
fix rounding for sub-threshold table
jethror1 Jul 6, 2021
bebb5df
add comment for clarity
jethror1 Jul 6, 2021
4d667b6
switch mean column to float in sub-threshold table
jethror1 Jul 6, 2021
5cb31a3
better comment, add type hints
jethror1 Jul 6, 2021
5baf3cf
Merge pull request #48 from eastgenomics/1.2-fix-rounding
Aisha-D Jul 6, 2021
43f70e0
add file formatting to readme
jethror1 Jul 7, 2021
4cd2dd5
add link to page for generating transcript file to readme
jethror1 Jul 7, 2021
8b4b7a6
tidy readme
jethror1 Jul 7, 2021
bd84c05
readme typo
jethror1 Jul 7, 2021
5fa6597
readme update
jethror1 Jul 7, 2021
d283759
Merge pull request #50 from eastgenomics/1.2-update-readme
Addy81 Jul 7, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
43 changes: 27 additions & 16 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ Athena is a tool to generate coverage statistics for NGS data, and combine these

Dependencies may be installed from the requirements.txt file using ```pip install -r requirements.txt```.
This should contains all the required python packages required to generate coverage statistics and reports.
For optional calculating of variant coverage from VCFs, [BEDtools][bedtools-url] is also required to be installed.
In addition, [BEDtools][bedtools-url] is also required to be installed and on path.

Tested on Ubuntu 18.04.4 and macOS 10.15.4

Expand All @@ -22,35 +22,52 @@ Tested on Ubuntu 18.04.4 and macOS 10.15.4
It is written to take in per base coverage data (as output from tools such as [mosdepth][mosdepth-url]) as input to calculate coverage for target regions defined in a bed file. <br></br>

The general workflow for generating the statistics and report is as follows: <br>
- Annotate each region of the bed file with the gene, exon and per base coverage data using `annotate_bed.sh`
- Annotate each region of the bed file with the gene, exon and per base coverage data using `annotate_bed.py`
- Generate per exon and per gene statistics using `coverage_stats_single.py`
- Generate HTML coverage report with `coverage_report_single.py`

For DNAnexus cloud platform users, an Athena [dx applet][dx-url] has also been built.


### Expected file formats

As a minimum, Athena requires 3 input files. These are a bed file for the gene panel, a file of transcript information and the output of your coverage tool (mosdepth, samtools etc.). These files MUST have the following columns:

- panel bed file: `chromosome start end transcript`
- transcript file: `chromosome start end gene transcript exon`
- coverage file: `chromosome start end coverage`

n.b. the process for creating the transcript file may be found [here][transcript-file-url].

### Annotating BED file
The BED file containing regions of interest is first required to be annotated with gene, exon and coverage information prior to analysis. This may be done using [BEDtools intersect][bedtools-intersect-url], with a file containing transcript to gene and exon information, and then the per base coverage data. <br>
The BED file containing regions of interest is first required to be annotated with gene, exon and coverage information prior to analysis. This may be done using [BEDtools intersect][bedtools-intersect-url], with a file containing transcript to gene and exon information, and then the per base coverage data. Currently, 100% overlap is required between coordinates in the panel bed file and the transcript annotation file, therefore you must ensure any added flank regions etc. are the same.<br>

Included is a Bash script (`annotate_bed.sh`) to perform the required BED file annotation.
Included is a Python script (`annotate_bed.py`) to perform the required BED file annotation.

Expected inputs:

```
-i : Input panel bed file; must have ONLY the following 4 columns chromosome, start position, end position, gene/transcript.
-g : Exons nirvana file, contains required gene and exon information.
-b : Per base coverage file (output from mosdepth or similar).
-p / --panel_bed : Input panel bed file; must have ONLY the following 4 columns chromosome, start position, end position, gene/transcript

-t / --transcript_file : Transcript annotation file, contains required gene and exon information. Must have ONLY the following 6 columns:
chromosome, start, end, gene, transcript, exon

-c / --coverage_file : Per base coverage file (output from mosdepth or similar)

-s / -chunk_size : (optional) nrows to split per-base coverage file for intersecting, with <16GB RAM can lead to bedtools intersect failing. Reccomended values: 16GB RAM -> 20000000; 8GB RAM -> 10000000

-n / --output_name : (optional) Prefix for naming output file, if not given will use name from per base coverage file

Example usage:

$ annotate_bed.sh -i panel_bed_file.bed -g exons_nirvana -b {input_file}.per_base.bed
$ annotate_bed.py -p panel_bed_file.bed -t transcript_file.tsv -c {input_file}.per_base.bed
```
<br>
This wraps the bedtools intersect commands below. These commands are given as an example, the output file column ordering must match that given in /data/example example_annotated_bed for calculating coverage statistics:
<br>

```
$ bedtools intersect -a beds/sorted_bed_file.bed -b beds/exons_nirvana2010_no_PAR_Y_noflank.bed -wa -wb | awk 'OFS="\t" {if ($4 == $9) print}' | cut -f 1,2,3,8,9,10 > sample1_genes_exons.bed
$ bedtools intersect -a beds/sorted_bed_file.bed -b beds/exons_nirvana2010_no_PAR_Y_noflank.bed -wa -wb -f 1.0 -r | awk 'OFS="\t" {if ($4 == $9) print}' | cut -f 1,2,3,8,9,10 > sample1_genes_exons.bed

- sorted_bed_file.bed -- bed file defining regions of interest (columns: chromosome, start, end, transcript)
- exons_nirvana2010_no_PAR_Y.bed -- a bed file containing transcript -> exon and gene information
Expand Down Expand Up @@ -108,13 +125,6 @@ $ python3 bin/coverage_report_single.py --gene_stats output/sample1-exon-coverag
```


### For development

Features to be developed:
- Generate run level statistics from multiple samples
- Generate run level report from multiple samples
- Add interactive elements to tables to increase useability (i.e sorting, filtering, searching)

Any bugs or suggestions for improvements please raise an issue.


Expand All @@ -130,3 +140,4 @@ Any bugs or suggestions for improvements please raise an issue.
[mosdepth-url]: https://github.com/brentp/mosdepth

[dx-url]: https://github.com/eastgenomics/eggd_athena
[transcript-file-url]: https://cuhbioinformatics.atlassian.net/wiki/spaces/P/pages/2241101840/Generating+transcripts+file+for+Athena
253 changes: 253 additions & 0 deletions bin/annotate_bed.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,253 @@
"""
Script to annotate a panel bed file with transcript information and per
base coverage data.

Requires: bedtools

Jethro Rainford
20/06/2021
"""

import argparse
import os
import pandas as pd
from pathlib import Path
import pybedtools as bedtools

from load_data import loadData


class annotateBed():

def add_transcript_info(self, panel_bed, transcript_info_df):
"""
Use pybedtools to annotate panel bed file with coverage data
Args:
- panel_bed (df): panel bed file regions df
- transcript_info_df (df): transcript info file df
Returns:
- bed_w_transcript (df): panel bed file with transcript information
"""
print("calling bedtools to add transcript info")

# get total number of transcripts before to ensure none are dropped
panel_transcripts = panel_bed.transcript.unique().tolist()

# turn dfs into BedTools objects
bed = bedtools.BedTool.from_dataframe(panel_bed)
transcript_info = bedtools.BedTool.from_dataframe(transcript_info_df)

# intersecting panel bed file with transcript/gene/exon information
# requires 100% overlap on panel -> transcript coordinates
bed_w_transcript = bed.intersect(
transcript_info, wa=True, wb=True, F=1.0
)

# convert pybedtools object to df
bed_w_transcript = bed_w_transcript.to_dataframe(names=[
"p_chrom", "p_start", "p_end", "p_transcript",
"t_chrom", "t_start", "t_end", "t_gene", "t_transcript", "t_exon"
])

# check for empty file
assert len(bed_w_transcript.index) > 0, """Empty file returned from
intersecting panel bed and transcript file. Check if flanks are
being used as 100% coordinate overlap is currently required."""

# panel bed file defines transcript to use, filter transcript file for
# just those transcripts
bed_w_transcript = bed_w_transcript[
bed_w_transcript["p_transcript"] == bed_w_transcript["t_transcript"]
]

# drop duplicate columns
bed_w_transcript = bed_w_transcript.drop(columns=[
't_chrom', 't_start', 't_end', 'p_transcript'
])

intersect_transcripts = bed_w_transcript.t_transcript.unique().tolist()

# ensure no transcripts dropped from panel due to missing from
# transcripts file
assert len(panel_transcripts) == len(intersect_transcripts), (
f"Transcript(s) dropped from panel during intersecting with "
f"transcript file. Total before {len(panel_transcripts)}. Total "
f"after {len(intersect_transcripts)}. Dropped transcripts: "
f"{set(panel_transcripts) - set(intersect_transcripts)}"
)

return bed_w_transcript


def add_coverage(self, bed_w_transcript, coverage_df, chunks=False):
"""
Use pybedtools to add coverage bin data to selected panel regions
Args:
- bed_w_transcript (df): panel bed file with transcript information
- coverage_df (df / list): coverage bin data df / list of dfs if
chunks value passed
Returns:
- bed_w_coverage (df): panel bed with transcript and coverage info
"""
print("calling bedtools to add coverage info")

# turn dfs into BedTools objects
bed_w_transcript = bedtools.BedTool.from_dataframe(bed_w_transcript)

col_names = [
"t_chrom", "t_start", "t_end", "t_gene", "t_transcript", "t_exon",
"c_chrom", "cov_start", "cov_end", "cov"
]

if not chunks:
# per-base coverage all in one df
coverage_df = bedtools.BedTool.from_dataframe(coverage_df)

bed_w_coverage = bed_w_transcript.intersect(
coverage_df, wa=True, wb=True
)
bed_w_coverage = bed_w_coverage.to_dataframe(names=col_names)
else:
# coverage data in chunks, loop over each df and intersect
bed_w_coverage = pd.DataFrame(columns=col_names)

for num, df in enumerate(coverage_df):
print(f"intersecting {num + 1}/{len(coverage_df)} coverage chunks")
# read each to bedtools object, intersect and add back to df
chunk_df = bedtools.BedTool.from_dataframe(df)

bed_w_coverage_chunk = bed_w_transcript.intersect(
chunk_df, wa=True, wb=True
)

bed_w_coverage_chunk = bed_w_coverage_chunk.to_dataframe(
names=col_names
)

bed_w_coverage = pd.concat(
[bed_w_coverage, bed_w_coverage_chunk],
ignore_index=True
)

# check again for empty output of bedtools, can happen due to memory
# maxing out and doesn't seem to raise an exception...
assert len(bed_w_coverage) > 0, """Error intersecting with coverage
data, empty file generated. Is this the correct coverage data for
the panel used? bedtools may also have reached memory limit and
died, try re-running with --chunk_size 1000000"""

# drop duplicate chromosome col and rename
bed_w_coverage.drop(columns=["c_chrom"], inplace=True)

bed_w_coverage.columns = [
"chrom", "exon_start", "exon_end", "gene", "tx", "exon",
"cov_start", "cov_end", "cov"
]

return bed_w_coverage


def write_file(bed_w_coverage, outfile):
"""
Write annotated bed to file
Args:
- bed_w_coverage (df): bed file with transcript and coverage info
- output_prefix (str): prefix for naming output file
Outputs: annotated_bed.tsv
"""
# tiny function but want this separate for writing a wrapper script later
bed_w_coverage.to_csv(outfile, sep="\t", header=False, index=False)
print(f"annotated bed file written to {outfile}")


def parse_args():
"""
Parse cmd line arguments

Args: None

Returns:
- args (arguments): args passed from cmd line
"""
parser = argparse.ArgumentParser(
description='Annotate panel bed file with transcript & coverage data.'
)
parser.add_argument(
'--panel_bed', '-p',
help='panel bed file'
)
parser.add_argument(
'--transcript_file', '-t',
help='file with gene and exon information'
)
parser.add_argument(
'--coverage_file', '-c',
help='per base coverage data file'
)
parser.add_argument(
'--chunk_size', '-s', type=int,
help='number lines to read per-base coverage file in one go'
)
parser.add_argument(
'--output_name', '-n',
help='name preifx for output file, if none will use coverage file'
)

args = parser.parse_args()

return args


def main():
annotate = annotateBed()
load = loadData() # class of functions for reading in data

args = parse_args()

if not args.output_name:
# output name not defined, use sample identifier from coverage file
args.output_name = Path(args.coverage_file).name.split('_')[0]

# set dir for writing to
bin_dir = os.path.dirname(os.path.abspath(__file__))
out_dir = os.path.join(bin_dir, "../output/")
outfile_name = f"{args.output_name}_annotated.bed"
outfile = os.path.join(out_dir, outfile_name)

# read in files
panel_bed_df = load.read_panel_bed(args.panel_bed)
transcript_info_df = load.read_transcript_info(args.transcript_file)
pb_coverage_df = load.read_coverage_data(
args.coverage_file, args.chunk_size
)

# add transcript info
bed_w_transcript = annotate.add_transcript_info(
panel_bed_df, transcript_info_df
)

# add coverage
if args.chunk_size:
# per-base coverage split to multiple dfs to limit memory usage
bed_w_coverage = annotate.add_coverage(
bed_w_transcript, pb_coverage_df, chunks=True
)
else:
bed_w_coverage = annotate.add_coverage(
bed_w_transcript, pb_coverage_df, chunks=False
)

# sense check generated file isn't empty, should be caught earlier
assert len(bed_w_coverage.index) > 0, (
'An error has occured: annotated bed file is empty. This is likely ',
'due to an error in regions defined in bed file (i.e. different ',
'transcripts to those in the transcripts file). Start debugging by ',
'intersecting files manually...'
)

write_file(bed_w_coverage, outfile)


if __name__ == "__main__":

main()
Loading