#####################################
# #
# HugeSeq #
# The Variant Detection Pipeline #
# #
#####################################
-- DEPENDENCIES
+ STANOVAR version 0.1
+ BEDtools version 2.17.0
+ BreakDancer version 1.1.2
+ BreakSeq Lite version 1.0
+ BWA version 0.7.4
+ CNVnator version 0.2.7
+ GATK version 3.2.2
+ JDK version 1.7.0_03
+ Modules Release 3.2.8
+ Perl
+ Picard Tools version 1.32
+ Pindel version 0.2.4t
+ Python version 2.7
+ Simple Job Manager version 1.0
+ Tabix version 0.2.6
+ vcftools version 0.1.12
+ zlib version 1.2.7
+ ROOT version 5.34.30
+ R version 3.2.0
-- INSTALLATION
HugeSeq is a modular computational pipeline that runs in a Unix environment in a highly parallel fashion. It was tested on Red Hat Enterprise Linux (RHEL) server v5.6, but it should work on most Linux servers. The batch system it currently supports out-of-the-box is Sun Grid Engine.
Batch System
Many clusters already have Sun Grid Engine (SGE) installed. To install SGE, please refer to the vendor's manual.
Running the analysis pipeline requires submitting many interdependent jobs to the batch scheduling system (e.g. Sun Grid Engine). Therefore, we developed a software program called SJM (Simple Job Manager) to simplify this process, including properly specifying the dependencies, tracking progress of the group of jobs, and responding properly if a job fails.
Supporting a batch system other than SGE requires developing an adaptor in SJM. Please write to us for more details.
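To give a sense of how SJM describes interdependent jobs, here is a minimal sketch of a job description file (the job names and commands are illustrative and are not output generated by HugeSeq):

job_begin
  name   align
  memory 12G
  queue  extended
  cmd    bwa mem hg19.fa sample_1.fq sample_2.fq > sample.sam
job_end

job_begin
  name   make_bam
  memory 4G
  queue  extended
  cmd    samtools view -bS sample.sam > sample.bam
job_end

order align before make_bam

Such a file is typically submitted with a command like "sjm jobfile.sjm". HugeSeq generates the job file for you; see the --jobfile and --submit options in the Usage section.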
Modules Environment
To manage different versions of software and their parameters in the modules, HugeSeq uses a Unix software package called Environment Modules, which provides for the dynamic modification of a user's environment via modulefiles.
To initiate Modules, modify your login profile such as .bash_profile to add the following:
. /path-to-Modules/default/init/sh
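After opening a new shell (or sourcing the profile directly), you can verify that Modules is active by listing the available modulefiles:
> module avail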
Supporting Tools
Install the required software, such as the aligners, variant callers, and manipulation tools listed in the dependencies section above. For details, please refer to the individual software websites. We recommend installing each package separately under a single parent directory, such as ~/apps/BreakSeq and ~/apps/CNVnator.
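For example, the parent directory might look roughly like this once the tools are installed (the exact set of subdirectories depends on which packages and versions you install):
> ls ~/apps
BreakDancer  BreakSeq  bwa  CNVnator  gatk  picard  pindel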
Data Sets
HugeSeq depends on several public data sets for alignment, variant calling, and annotations. They are:
The reference genome (e.g. HG19 in FASTA format: hg19.fa)
The BWA index of the reference genome (e.g. hg19.fa.bwt, hg19.fa.ann, etc)
A .dict dictionary of the contig names and sizes (e.g. hg19.fa.dict)
A .fai fasta index file (e.g. hg19.fa.fai)
For creating the .dict dictionary and .fai index, see the sketch after this list. All the indexes and the dictionary should reside in the reference genome directory that contains the whole-genome FASTA (e.g. hg19.fa).
The breakpoint junctions (i.e. BreakSeq library in FASTA format: bplib.fa)
The SNP annotation
UCSC Known Genes (knownGene)
dbSNP
SIFT (avsift)
RepeatMasker (buildver_rmsk.gff)
The STANOVAR application should be installed and a corresponding module needs to be defined.
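As a sketch, the BWA index, the .fai index, and the .dict dictionary can typically be generated as follows (assuming BWA, SAMtools, and Picard are available; newer Picard releases bundle all tools in a single picard.jar instead of per-tool jars):
> bwa index hg19.fa
> samtools faidx hg19.fa
> java -jar CreateSequenceDictionary.jar R=hg19.fa O=hg19.fa.dict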
Download HugeSeq to your server.
Extract the programs from the compressed archive to a directory, such as ~/app. A directory like ~/app/HugeSeq will then be created, containing the core program and its configuration. As described above, HugeSeq uses the Environment Modules package for configuration. Its modulefile is in the directory /path-to-HugeSeq/modulefiles/hugeseq, named after its version, such as 1.0. To enable Modules to look up the modulefile for the correct settings, modify the login profile as above and add the following:
export MODULEPATH=/path-to-HugeSeq/modulefiles:$MODULEPATH
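For example, assuming the download is a gzipped tarball (the archive name below is a placeholder):
> mkdir -p ~/app
> tar xzf HugeSeq.tar.gz -C ~/app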
In addition, edit the modulefile, such as /path-to-HugeSeq/modulefiles/hugeseq/2.0, changing the program paths to the locations where you installed the required software and the data paths to where you stored the datasets.
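Modulefiles use Tcl syntax. A hypothetical excerpt, with placeholder paths rather than the actual HugeSeq settings, might look like:
#%Module1.0
prepend-path PATH /path-to-apps/bwa-0.7.4
prepend-path PATH /path-to-apps/gatk-3.2.2
setenv       REF  /path-to-data/hg19.fa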
Log out and log back in to your shell to activate the login profile with the latest configuration. You should now be able to run HugeSeq by loading its module:
> module load hugeseq/2.0
After loading the module, you can run HugeSeq simply by typing:
> hugeseq
For the usage of HugeSeq, please refer to the Usage section.
-- USAGE
usage: hugeseq [-h] --reads1 FILE [FILE ...] [--reads2 FILE [FILE ...]]
--output DIR [--account STR] [--tmp DIR] [--readgroup STR]
[--samplename STR] [--bam] [--variants TYPE [TYPE ...]]
[--targeted] [--capture FILE [FILE ...]] [--relax_realignment]
[--reference_calls] [--snp_hapcaller] [--indel_hapcaller]
[--nosnpvqsr] [--noindelvqsr] [--vqsrchrom] [--nobinning]
[--nocleanup] [--novariant] [--alignmentonly] [--cleanuponly]
[--variantonly] [--donealign] [--donebinning] [--donecleanup]
[--donegenotyping] [--donesnpvqsr] [--memory SIZE]
[--queue NAME] [--email NAME] [--threads COUNT]
[--jobfile FILE] [--submit]
Generating the job file for the HugeSeq variant detection pipeline
optional arguments:
-h, --help show this help message and exit
--reads1 FILE [FILE ...]
The FASTQ file(s) for reads 1
--reads2 FILE [FILE ...]
The FASTQ file(s) for reads 2, if paired-end
--output DIR The output directory
--account STR Accounting string for the purpose of cluster
accounting.
--tmp DIR The TMP directory for storing intermediate files
(default=output directory)
--readgroup STR The read group annotation (Default:
@RG\tID:Default\tLB:Library\tPL:Illumina\tSM:SAMPLE)
--samplename STR The SM tag in the read group annotation (Default:
"SAMPLE" in
@RG\tID:Default\tLB:Library\tPL:Illumina\tSM:SAMPLE)
--bam Support for aligned BAMs as input. By default input
(-r) is aligned again. Use --variantonly otherwise.
--variants TYPE [TYPE ...]
gatk breakdancer cnvnator pindel breakseq (default to
all)
--targeted Use GATK in targeted sequencing mode (default: whole-
genome mode)
--capture FILE [FILE ...]
Capture BED file(s) used for targeted genotyping
(default: void, separate multiple files with commas:
capture1.bed,capture2.bed,...)
--relax_realignment Relaxes GATK's realignment when dealing with badly
scored reads (default: false)
--reference_calls Store all reference calls from GATK (default: false)
in gVCF format in addition to a standard VCF file
containing only the variants (valid only for SNV
calling)
--snp_hapcaller Use GATK HaplotypeCaller to discover SNPs (default:
UnifiedGenotyper)
--indel_hapcaller Use GATK HaplotypeCaller to discover Indels (default:
UnifiedGenotyper)
--nosnpvqsr Do not perform VQSR on SNPs (variant quality score
recalibration)
--noindelvqsr Do not perform VQSR on Indels (variant quality score
recalibration)
--vqsrchrom Perform VQSR on individual chromosomes (valid when
binning is performed; default: VQSR on whole-genome VCF)
--nobinning Do not bin the alignments by chromosomes
--nocleanup Do not clean up the alignments
--novariant Do not call variants
--alignmentonly Only align input FASTQ or BAM files (-r)
--cleanuponly Only clean up input BAM files (-r)
--variantonly Only call variants in input BAM files (-r)
--donealign Sequences already aligned using the pipeline
--donebinning Alignments already binned by chromosomes using the
pipeline
--donecleanup Alignments already cleaned using the pipeline
--donegenotyping Variants already called using the pipeline, but VQSR
not yet performed
--donesnpvqsr Processing is started after SNP VQSR (from Indel VQSR)
--memory SIZE Memory size (GB) per job (default: 12)
--queue NAME Queue for jobs (default: extended)
--email NAME Email address to receive emails for ending or aborting
last jobs in the queue
--threads COUNT Number of threads for alignment, only works for SGE
(default: 4)
--jobfile FILE The jobfile name (default: stdout)
--submit Submit the jobs
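For example, a paired-end whole-genome run might be launched with a command like the following (the file names, output path, and read group values are placeholders):
> hugeseq --reads1 sample_1.fq --reads2 sample_2.fq \
          --output /path/to/output \
          --readgroup "@RG\tID:run1\tLB:Library\tPL:Illumina\tSM:sample1" \
          --memory 16 --threads 8 --queue extended \
          --jobfile sample.sjm --submit
Without --submit, the generated job file (sample.sjm here) can be inspected first and submitted to the cluster separately.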