<html>
<head>
<title>IgSimulator 1.0 Manual</title>
<style type="text/css">
.code {
background-color: lightgray;
}
</style>
</head>
<style>
</style>
<body>
<h1>IgSimulator 1.0 manual</h1>
1. <a href = "#intro">What is IgSimulator?</a></br>
2. <a href = "#install">Installation</a></br>
2.1. <a href = "#test_datasets">Verifying your installation</a></br>
3. <a href = "#simulator">IgSimulator</a></br>
3.1. <a href = "#simulator_basic">Basic options</a></br>
3.2. <a href = "#simulator_genes">Ig genes options</a></br>
3.3. <a href = "#simulator_advanced">Advanced options</a></br>
3.4. <a href = "#simulator_examples">Examples</a></br>
3.5. <a href = "#simulator_output">Output files</a></br>
4. <a href = "#repertoire_files">Antibody repertoire representation</a></br>
4.1. <a href = "#clusters_fasta">CLUSTERS.FASTA file format</a></br>
4.2. <a href = "#read_cluster_map">RCM file format</a></br>
5. <a href = "#feedback">Feedback and bug reports</a></br>
<!- ----------------- ->
<a id = "intro"></a>
<h2>1. What is IgSimulator?</h2>
<code>IgSimulator</code> is a tool for simulating antibody repertoires and Ig-Seq libraries.
<code>IgSimulator</code> is designed for testing and benchmarking tools that reconstruct Ig repertoires.
</br>
<!- ---------------------------------------------------------------- ->
<a id = "install"></a>
<h2>2. Installation</h2>
IgSimulator requires the following pre-installed dependencies:
<ul>
<li>64-bit Linux system</li>
<li>g++ (version 4.7 or higher)</li>
<li>Python (version 2.7 or higher)</li>
<li>Additional Python modules</li>
<ul>
<li>BioPython (<a href = "http://biopython.org/wiki/Download">download link</a>)</li>
<li>NumPy and SciPy (including PyLab) (<a href = "http://www.scipy.org/scipylib/download.html">download link</a>)</li>
<li>Matplotlib (<a href = "http://matplotlib.org/downloads.html">download link</a>)</li>
</ul>
</ul>
To install <code>IgSimulator</code>, type:
<pre class = "code">
<code>
make
</code>
</pre>
<a id = "test_datasets"></a>
<h3>2.1. Verifying your installation</h3>
For testing purposes, IgSimulator comes with a toy data set. </br></br>
► To try <code>IgSimulator</code> on the test data set, run:
<pre class="code">
<code>
./ig_simulator.py --test
</code>
</pre>
If the installation is successful, you will find the following information at the end of the log:
<pre class="code">
<code>
======== IgSimulator ends
Main output files:
* Sequences of simulated repertoire were written to &lt;igtools_installation_directory&gt;/ig_simulator_test/repertoire.fasta
* Simulated merged reads were written to &lt;igtools_installation_directory&gt;/ig_simulator_test/merged_reads.fastq
* CLUSTERS.FA for simulated repertoire were written to &lt;igtools_installation_directory&gt;/ig_simulator_test/ideal_repertoire.clusters.fa
* RCM for simulated repertoire were written to &lt;igtools_installation_directory&gt;/ig_simulator_test/ideal_repertoire.rcm
Thank you for using IgSimulator!
Log was written to &lt;igtools_installation_directory&gt;/ig_simulator_test/ig_repertoire_simulation.log
</code>
</pre>
<a id = "simulator"></a>
<h2>3. IgSimulator</h2>
The <code>IgSimulator</code> tool takes simulation parameters as input and constructs a reference repertoire (heavy chain by default), the corresponding Illumina library, and the ideal repertoire.</br></br>
Command line:
<pre class="code">
<code>
./ig_simulator.py [options] --chain-type TYPE --num-bases N1 --num-mutated N2 --repertoire-size N3 -o &lt;output_dir&gt;
</code>
</pre>
<!- --------------------- ->
<a id = "simulator_basic"></a>
<h3>3.1. Basic options:</h3>
<code>-o &lt;output_dir&gt;</code></br>
output directory (required). </br></br>
<code>--num-bases &lt;int&gt;</code></br>
number of base sequences (required).</br></br>
<code>--num-mutated &lt;int&gt;</code></br>
expected number of mutated sequences (required).</br></br>
<code>--repertoire-size &lt;int&gt;</code></br>
expected reference repertoire size (required).</br></br>
<code>--chain-type HC or LC</code></br>
type of chain to simulate: heavy (HC) or light (LC). Default value is 'HC'.</br></br>
<code>--test</code></br>
runs the toy test data set (see Section <a href = "#test_datasets">2.1</a>). The test run is equivalent to the following command line:
<pre class = "code">
<code>
./ig_simulator.py --num-bases 10 --num-mutated 50 --repertoire-size 1000 -o ig_repertoire_simulator_test
</code>
</pre>
<!- --------------------- ->
<a id = "simulator_genes"></a>
<h3>3.2. Ig genes options:</h3>
<code>--vgenes &lt;filename&gt;</code></br>
FASTA file with Ig germline V genes. Default value is <code>&lt;igtools_installation_directory&gt;/data/human_ig_germline_genes/human_IGHV.fa</code> for heavy chain repertoire and <code>&lt;igtools_installation_directory&gt;/data/human_ig_germline_genes/human_IGKV.fa</code> for light chain repertoire.</br></br>
<code>--dgenes &lt;filename&gt;</code></br>
FASTA file with Ig germline D genes. Default value is <code>&lt;igtools_installation_directory&gt;/data/human_ig_germline_genes/human_IGHD.fa</code> for heavy chain repertoire.</br></br>
<code>--jgenes &lt;filename&gt;</code></br>
FASTA file with Ig germline J genes. Default value is <code>&lt;igtools_installation_directory&gt;/data/human_ig_germline_genes/human_IGHJ.fa</code> for heavy chain repertoire and <code>&lt;igtools_installation_directory&gt;/data/human_ig_germline_genes/human_IGKJ.fa</code> for light chain repertoire.</br></br>
<code>--db-type imgt or reg</code><br>
Type of database. The default value is 'imgt', which means that the headers of the FASTA files with V, D, and J genes follow the IMGT format, for example:
<pre class = "code">
<code>
>M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|296 nt|1| | | | |296+0=296| | |
CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCC
CTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACAT
GGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA
</code>
</pre>
In this case, the gene segment name given after the first '|' symbol (IGHV1-18*01) will be used in the output files describing V(D)J recombination (see <a href = "#simulator_output">Output files</a> for more details).<br>
If your database is not in IMGT format, please specify the 'reg' value for this option. In this case, the entire header of each sequence will be used as the gene segment name.
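<br><br>
The naming rule above can be illustrated with a short Python sketch. This is not IgSimulator's own code: the function name <code>gene_segment_name</code> is hypothetical, and the sketch assumes that in 'reg' mode the header text after the leading '&gt;' is taken as the name.

```python
def gene_segment_name(header, db_type="imgt"):
    """Return the gene segment name for one FASTA header line.

    'imgt': the field after the first '|' is the name.
    'reg' : the whole header (without the leading '>') is the name (assumption).
    """
    text = header.lstrip(">").strip()
    if db_type == "imgt":
        fields = text.split("|")
        if len(fields) < 2:
            raise ValueError("header is not in IMGT format: %r" % header)
        return fields[1]
    if db_type == "reg":
        return text
    raise ValueError("unknown db type: %r" % db_type)


imgt_header = ">M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|296 nt|1| | | | |296+0=296| | |"
print(gene_segment_name(imgt_header))            # IGHV1-18*01
print(gene_segment_name(">my_v_gene_7", "reg"))  # my_v_gene_7
```

Running such a check over your own germline files before a simulation can reveal headers that do not match the selected database type.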
<!- ---------------------- ->
<a id = "simulator_advanced"></a>
<h3>3.3. Advanced options:</h3>
<code>--skip-drawing</code></br>
skips visualization of statistics for merged reads. Default value is <code>false</code>.</br>
</br>
<code>--help</code></br>
prints help.</br>
<!- --------------------- ->
<a id = "simulator_examples"></a>
<h3>3.4. Examples</h3>
To simulate a heavy chain data set with 100 base sequences, ~500 mutated sequences, and ~1500 sequences in the final repertoire, together with the corresponding simulated Illumina library, run the following command:
<pre class = "code">
<code>
./ig_simulator.py --chain-type HC --num-bases 100 --num-mutated 500 --repertoire-size 1500 -o ig_simulator_test
</code>
</pre>
If you want to specify paths to your own V/D/J germline gene files instead of using the default IMGT database:
<pre class = "code">
<code>
./ig_simulator.py --chain-type HC --num-bases 100 --num-mutated 500 --repertoire-size 1500 \
--vgenes &lt;path_to_your_vgenes.fasta&gt; --dgenes &lt;path_to_your_dgenes.fasta&gt; --jgenes &lt;path_to_your_jgenes.fasta&gt; -o ig_simulator_test
</code>
</pre>
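When many simulations are launched from a script, the command line can be assembled programmatically. The sketch below is only an illustration: <code>build_command</code> is a hypothetical helper, and it uses only the options documented in Sections 3.1 and 3.2.

```python
import shlex


def build_command(output_dir, num_bases, num_mutated, repertoire_size,
                  chain_type="HC", vgenes=None, dgenes=None, jgenes=None):
    """Assemble the ig_simulator.py argument list from documented options."""
    args = ["./ig_simulator.py",
            "--chain-type", chain_type,
            "--num-bases", str(num_bases),
            "--num-mutated", str(num_mutated),
            "--repertoire-size", str(repertoire_size)]
    # Germline gene files are optional; defaults are used when omitted.
    for flag, path in (("--vgenes", vgenes),
                       ("--dgenes", dgenes),
                       ("--jgenes", jgenes)):
        if path is not None:
            args += [flag, path]
    args += ["-o", output_dir]
    return args


cmd = build_command("ig_simulator_test", 100, 500, 1500)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(a) for a in cmd))
```

The resulting list can be passed to <code>subprocess.run(cmd, check=True)</code> from the IgSimulator installation directory.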
<!- --------------------- ->
<a id = "simulator_output"></a>
<h3>3.5. Output files</h3>
The <code>IgSimulator</code> tool creates the output directory (its name is specified with option <code>-o</code>) and writes the following files to it:
<ul>
<li>Files with sequences</li>
<ul>
<li><b>final_repertoire.fasta</b> - FASTA file with simulated antibody repertoire that will be used as reference for Illumina library simulation.</li>
<li><b>paired_reads1.fq</b> - FASTQ file with left reads constructed using ART read simulator. Reads correspond to simulated Illumina MiSeq library.</li>
<li><b>paired_reads2.fq</b> - FASTQ file with right reads constructed using ART read simulator. Reads correspond to simulated Illumina MiSeq library.</li>
<li><b>merged_reads.fastq</b> - FASTQ file constructed by merging the left and right read files. This file is expected to be the input for the <code>IgRepertoireConstruction</code> tool.</li>
<li><b>reads_vdj_recombination.txt</b> contains information about V(D)J recombination for each read from the <b>merged_reads.fastq</b> file. An example of the <b>reads_vdj_recombination.txt</b> file is given below:</li>
<pre class = "code">
<code>
34_merged_read_antibody_20_multiplicity_1_copy_1-1/1 IGHV3-48*02;IGHD3-16*02;IGHJ6*02
53_merged_read_antibody_34_multiplicity_1_copy_1-1/1 IGHV4-28*02;IGHD3-10*01;IGHJ6*03
59_merged_read_antibody_37_multiplicity_2_copy_2-1/1 IGHV4-28*02;IGHD3-10*01;IGHJ6*03
8_merged_read_antibody_9_multiplicity_4_copy_1-1/1 IGHV3-66*01;IGHD3-22*01;IGHJ6*03
</code>
</pre>
</ul></br>
<li>Files with statistics of the simulated repertoire:</li>
<ul>
<li><b>base_sequences.fasta</b> contains sequences of base repertoire.</li>
<li><b>base_frequencies.txt</b> contains frequencies of base sequences.</li>
<li><b>mutated_sequences.fasta</b> contains sequences of mutated repertoire.</li>
<li><b>mutated_frequencies.txt</b> contains frequencies of mutated sequences.</li>
<li><b>shm_positions.txt</b> contains information about all introduced somatic hypermutations.
Each line of this file corresponds to one mutation and includes two tab-separated fields: 'mutation position' and 'sequence length'.</li>
<li><b>repertoire_vdj_recombination.txt</b> contains information about V(D)J recombination for each constructed antibody. An example of the <b>repertoire_vdj_recombination.txt</b> file is given below:</li>
<pre class = "code">
<code>
antibody_1 IGHV3-13*01;IGHD3-3*02;IGHJ4*02
antibody_2 IGHV4-30-4*02;IGHD4-17*01;IGHJ4*03
antibody_3 IGHV4-30-4*02;IGHD4-17*01;IGHJ4*03
antibody_4 IGHV3-13*02;IGHD3-10*01;IGHJ5*01
</code>
</pre>
</ul></br>
<li>Visualization of the statistics for the simulated repertoire</li>
<ul>
<li><b>base_seq_lens.png</b> - PNG file with a histogram of the base sequence length distribution.
If the number of base sequences (controlled by option <code>--num-bases</code>) is large enough, the distribution is expected to be normal.
This file is created based on statistics from <b>base_repertoire.stats</b>.</li>
<li><b>base_seq_freqs.png</b> - PNG file with a histogram of the base sequence frequency distribution.
This file is created based on statistics from <b>base_multiplicities.txt</b>.</li>
<li><b>mutated_seq_freqs.png</b> - PNG file with a histogram of the mutated sequence frequency distribution in the final repertoire.
This file is created based on statistics from <b>mutated_multiplicities.txt</b>.</li>
<li><b>shm_positions.png</b> - PNG file with histogram of distribution of somatic hypermutations relative positions.
This file is created based on statistics from <b>shm_positions.txt</b>.</li>
<li><b>paired_reads1.aln</b> and <b>paired_reads2.aln</b> show alignment of paired-end reads to reference repertoire.
Files are generated by ART read simulator.</li>
</ul></br>
<li>Files describing the ideal repertoire (see details in section <a href = "#repertoire_files">4</a>):</li>
<ul>
<li><b>ideal_repertoire.clusters.fasta</b> - CLUSTERS.FASTA file corresponding to the ideal clusters for <b>merged_reads.fastq</b>.</li>
<li><b>ideal_repertoire.rcm</b> - RCM file corresponding to the ideal clusters for <b>merged_reads.fastq</b>. This file can be used as the ideal read-cluster map in the <code>IgQUAST</code> tool.</li>
</ul></br>
<li><b>ig_simulator.log</b> - full log of <code>IgSimulator</code> run.</li>
</ul>
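The V(D)J recombination files listed above are plain text and easy to post-process. The following Python sketch counts V gene usage. It assumes that each line holds a name and a ';'-separated gene list split by whitespace, as in the examples above; the function names are hypothetical.

```python
from collections import Counter


def parse_vdj_line(line):
    """Split one line into (name, list of gene segment names)."""
    name, genes = line.split()
    return name, genes.split(";")


def v_gene_usage(lines):
    """Count how often each V gene (the first listed segment) occurs."""
    usage = Counter()
    for line in lines:
        if line.strip():
            _, genes = parse_vdj_line(line)
            usage[genes[0]] += 1
    return usage


example = """\
antibody_1 IGHV3-13*01;IGHD3-3*02;IGHJ4*02
antibody_2 IGHV4-30-4*02;IGHD4-17*01;IGHJ4*03
antibody_3 IGHV4-30-4*02;IGHD4-17*01;IGHJ4*03
antibody_4 IGHV3-13*02;IGHD3-10*01;IGHJ5*01
"""
usage = v_gene_usage(example.splitlines())
print(usage.most_common(1))  # [('IGHV4-30-4*02', 2)]
```

To process a real run, pass an open file such as <code>repertoire_vdj_recombination.txt</code> from the output directory instead of the inline example.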
</br>
<!- ---------------------------------------------------------------- ->
<a id = "repertoire_files"></a>
<h2>4. Antibody repertoire representation</h2>
We use two file formats to represent the repertoire for a set of reads: CLUSTERS.FASTA and RCM.
<a id = "clusters_fasta"></a>
<h3>4.1. CLUSTERS.FASTA file format</h3>
CLUSTERS.FASTA is a FASTA file in which each sequence corresponds to a monoclonal antibody, and the header of each sequence contains the id and size of the corresponding cluster (the set of input reads related to the same monoclonal antibody):
<pre class = "code">
<code>
>cluster___1___size___3
CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGGACG
>cluster___2___size___2
CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGG
>cluster___3___size___1
CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGGAC
</code>
</pre>
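A minimal Python sketch of a reader for this format is given below. It relies only on the header layout shown above (<code>cluster___ID___size___SIZE</code>); the function name <code>read_clusters_fasta</code> is hypothetical, and the sequences in the usage example are shortened for illustration.

```python
def read_clusters_fasta(lines):
    """Return {cluster_id: (size, sequence)} from CLUSTERS.FASTA lines."""
    clusters = {}
    cluster_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header layout: cluster___<id>___size___<size>
            parts = line[1:].split("___")
            if len(parts) != 4 or parts[0] != "cluster" or parts[2] != "size":
                raise ValueError("unexpected header: %r" % line)
            cluster_id = int(parts[1])
            clusters[cluster_id] = (int(parts[3]), "")
        elif cluster_id is None:
            raise ValueError("sequence line before the first header")
        else:
            # Sequences may span several lines; concatenate them.
            size, seq = clusters[cluster_id]
            clusters[cluster_id] = (size, seq + line)
    return clusters


example = """\
>cluster___1___size___3
CCCCTGCAATT
AAAATTG
>cluster___2___size___2
CCCCTGCAATTAAAA
"""
clusters = read_clusters_fasta(example.splitlines())
print(clusters[1])  # (3, 'CCCCTGCAATTAAAATTG')
```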
<a id = "read_cluster_map"></a>
<h3>4.2. RCM file format</h3>
Every line of RCM (read-cluster map) file contains information about read name and corresponding cluster id:
<pre class = "code">
<code>
MISEQ@:53:000000000-A2BMW:1:2114:14345:28882 1
MISEQ@:53:000000000-A2BMW:1:2114:14374:28884 1
MISEQ@:53:000000000-A2BMW:1:2114:14393:28886 1
MISEQ@:53:000000000-A2BMW:1:2114:16454:28882 2
MISEQ@:53:000000000-A2BMW:1:2114:16426:28886 2
MISEQ@:53:000000000-A2BMW:1:2114:15812:28886 3
</code>
</pre>
</br>
<b>NOTE:</b> ids in CLUSTERS.FASTA and RCM files should be consistent.</br></br>
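The consistency requirement can be checked with a short Python sketch. It assumes that the cluster id is the last whitespace-separated field of each RCM line and that the size in a CLUSTERS.FASTA header equals the number of reads mapped to that cluster; both function names are hypothetical.

```python
from collections import Counter


def read_rcm(lines):
    """Return {read_name: cluster_id} from RCM lines."""
    rcm = {}
    for line in lines:
        line = line.strip()
        if line:
            # The cluster id is the last whitespace-separated field.
            read_name, cluster_id = line.rsplit(None, 1)
            rcm[read_name] = int(cluster_id)
    return rcm


def inconsistent_clusters(cluster_sizes, rcm):
    """Return ids whose CLUSTERS.FASTA size differs from the RCM read count."""
    counts = Counter(rcm.values())
    ids = set(cluster_sizes) | set(counts)
    return sorted(i for i in ids
                  if cluster_sizes.get(i, 0) != counts.get(i, 0))


rcm_example = """\
MISEQ@:53:000000000-A2BMW:1:2114:14345:28882 1
MISEQ@:53:000000000-A2BMW:1:2114:14374:28884 1
MISEQ@:53:000000000-A2BMW:1:2114:14393:28886 1
MISEQ@:53:000000000-A2BMW:1:2114:16454:28882 2
MISEQ@:53:000000000-A2BMW:1:2114:16426:28886 2
MISEQ@:53:000000000-A2BMW:1:2114:15812:28886 3
"""
rcm = read_rcm(rcm_example.splitlines())
# Sizes taken from the CLUSTERS.FASTA example in Section 4.1.
print(inconsistent_clusters({1: 3, 2: 2, 3: 1}, rcm))  # []
```

An empty list means that every cluster id appears in both files with matching sizes.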
<!- -------------------------------------------------------------------- ->
<a id = "feedback"></a>
<h2>5. Feedback and bug reports</h2>
Your comments, bug reports, and suggestions are very welcome.
They will help us to further improve IgSimulator.
<br><br>
If you have any trouble running IgSimulator, please send us the log file from the output directory.
<br><br>
Address for communications: <a href="mailto:[email protected]">[email protected]</a>.
</body>
</html>