manual.html

<html>
<head>
    <title>IgSimulator 1.0 Manual</title>
    <style type="text/css">
        .code {
            background-color: lightgray;
        }
    </style>
</head>

  <style>

  </style>

<body>
<h1>IgSimulator 1.0 manual</h1>

1. <a href = "#intro">What are IgSimulator?</a></br>

2. <a href = "#install">Installation</a></br>
&nbsp;&nbsp;&nbsp;&nbsp;2.1. <a href = "#test_datasets">Verifying your installation</a></br>

3. <a href = "#simulator">IgSimulator</a></br>
&nbsp;&nbsp;&nbsp;&nbsp;3.1. <a href = "#simulator_basic">Basic options</a></br>
&nbsp;&nbsp;&nbsp;&nbsp;3.2. <a href = "#simulator_genes">Ig genes options</a></br>
&nbsp;&nbsp;&nbsp;&nbsp;3.3. <a href = "#simulator_advanced">Advanced options</a></br>
&nbsp;&nbsp;&nbsp;&nbsp;3.4. <a href = "#simulator_examples">Examples</a></br>
&nbsp;&nbsp;&nbsp;&nbsp;3.5. <a href = "#simulator_output">Output files</a></br>

4.  <a href = "#repertoire_files">Antibody repertoire representation<a></br>
&nbsp;&nbsp;&nbsp;&nbsp;4.1. <a href = "#clusters_fasta">CLUSTERS.FASTA file format</a></br>
&nbsp;&nbsp;&nbsp;&nbsp;4.2. <a href = "#read_cluster_map">RCM file format</a></br>

5. <a href = "#feedback">Feedback and bug reports</a></br>

<!- ----------------- ->

<a id = "intro"></a>
<h2>1. What is IgSimulator?</h2>
<code>IgSimulator</code> is a tool for simulation of antibody repertoire and Ig-Seq library.
<code>IgSimulator</code> is designed for testing and benchmarking tools for reconstruction of Ig repertoires. 
</br>

<!- ---------------------------------------------------------------- ->

<a id = "install"></a>
<h2>2. Installation</h2>

IgSimulator requires the following pre-installed dependencies:
<ul>
	<li>64-bit Linux system</li>
    <li>g++ (version 4.7 or higher)</li>
	<li>Python (version 2.7 or higher)</li>
	<li>Additional Python modules</li>
    <ul>
        <li>BioPython (<a href = "http://biopython.org/wiki/Download">download link</a>)</li>
        <li>NumPy and SciPy (including PyLab) (<a href = "http://www.scipy.org/scipylib/download.html">download link</a>)</li>
        <li>Matplotlib (<a href = "http://matplotlib.org/downloads.html">download link</a>)</li>
    </ul>
</ul>

To install <code>IgSimulator</code>, type:
<pre class = "code">
    <code>
    make
    </code>
</pre> 

<a id = "test_datasets"></a>
<h3>2.1. Verifying your installation</h3>
For testing purposes, IgSimulator comes with a toy data set. </br></br>

&#9658; To try <code>IgSimulator</code> on test data set, run:
<pre class="code">
<code>
    ./ig_simulator.py --test
</code>
</pre>

If the installation is successful, you will find the following information at the end of the log:

<pre class="code">
    <code>
    ======== IgSimulator ends

    Main output files:
    * Sequences of simulated repertoire were written to &lt;igtools_installation_directory>/ig_simulator_test/repertoire.fasta
    * Simulated merged reads were written to &lt;igtools_installation_directory>/ig_simulator_test/merged_reads.fastq
    * CLUSTERS.FA for simulated repertoire were written to &lt;igtools_installation_directory>/ig_simulator_test/ideal_repertoire.clusters.fa
    * RCM for simulated repertoire were written to &lt;igtools_installation_directory>/ig_simulator_test/ideal_repertoire.rcm

    Thank you for using IgSimulator!

    Log was written to &lt;igtools_installation_directory>/ig_simulator_test/ig_repertoire_simulation.log

    </code>
</pre>

<a id = "simulator"></a>
<h2>3. IgSimulator</h2>
<code>IgSimulator</code> tool takes parameters of the simulation as an input and constructs reference heavy chain repertoire, corresponding Illumina library and ideal repertoire.</br></br>

Command line:
<pre class="code">
<code>
    ./ig_simulator.py [options] --chain-type TYPE --num-bases N1 --num-mutated N2 --repertoire-size N3 -o &lt;output_dir>
</code>
</pre>

<!- --------------------- ->

<a id = "simulator_basic"></a>
<h3>3.1. Basic options:</h3>
<code>-o &lt;output_dir></code></br>
output directory (required). </br></br>

<code>--num-bases &lt;int></code></br>
number of base sequences (required).</br></br>

<code>--num-mutated &lt;int></code></br>
expected number of mutated sequences (required).</br></br>

<code>--repertoire-size &lt;int></code></br>
expected reference repertoire size (required).</br></br>

<code>--chain-type HC or LC</code></br>
type of chain that can be used for repertoire simulation. Default value is 'HC'.</br></br>

<code>--test</code></br>
runs toy test data set (see Section <a href = "#test_datasets">3.4</a>). Command line corresponding to the test run is equivalent to the following line:
<pre class = "code">
    <code>
    ./ig_simulator.py --num-bases 10 --num-mutated 50 --repertoire-size 1000 -o ig_repertoire_simulator_test 
    </code>
</pre>

<!- --------------------- ->

<a id = "simulator_genes"></a>
<h3>3.2. Ig genes options:</h3>
<code>--vgenes &lt;filename></code></br>
FASTA file with Ig germline V genes. Default value is <code>&lt;igtools_installation_directory>/data/human_ig_germline_genes/human_IGHV.fa</code> for heavy chain repertoire and <code>&lt;igtools_installation_directory>/data/human_ig_germline_genes/human_IGKV.fa</code> for light chain repertoire.</br></br>

<code>--dgenes &lt;filename></code></br>
FASTA file with Ig germline D genes. Default value is <code>&lt;igtools_installation_directory>/data/human_ig_germline_genes/human_IGHD.fa</code> for heavy chain repertoire.</br></br>

<code>--jgenes &lt;filename></code></br>
FASTA file with Ig germline J genes. Default value is <code>&lt;igtools_installation_directory>/data/human_ig_germline_genes/human_IGHJ.fa</code> for heavy chain repertoire and <code>&lt;igtools_installation_directory>/data/human_ig_germline_genes/human_IGKJ.fa</code> for light chain repertoire.</br></br>

<code>--db-type imgt or reg</code><br>
Type of dababase. By default, this parameter has 'imgt' value and means that headers of FASTA files with V, D, and J genes are consistent with IMGT format, for example:
<pre class = "code">
    <code>
    >M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|296 nt|1| | | | |296+0=296| | |
    CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCC
    CTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACAT
    GGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA
    </code>
</pre> 
In this case, gene segment name specified after the first '|' symbol (IGHV1-18*01) will be used in output files containing V(D)J recombination (see <a href = "#simulator_output">Output files</a> for more details).<br>
If your database is not in IMGT format, please specify 'reg' value for this option. In this case, entire sequences specified in headers will be used as gene segment names.   
<!- ---------------------- ->

<a id = "simulator_advanced"></a>
<h3>3.3. Advanced options:</h3>

<code>--skip-drawing</code></br>
skips visualization of statistics for merged reads. Default value is <code>false</code>.</br>
</br> 

<code>--help</code></br>
prints help.</br>

<!- --------------------- ->

<a id = "simulator_examples"></a>
<h3>3.4. Examples</h3>
To simulate heavy chain data set with 100 base sequences, ~500 mutated sequences and ~1500 sequences in the final repertoire size and, correspondingly, simulated Illumina library, run the following command: 
<pre class = "code">
    <code>
    ./ig_simulator.py  --chain-type HC --num-bases 100 --num-mutated 500 --repertoire-size 1500 -o ig_simulator_test 
    </code>
</pre>

If you want to additionally specify paths to V/D/J germline genes instead of using default IMGT database:
<pre class = "code">
    <code>
    ./ig_simulator.py --chain-type HC --num-bases 100 --num-mutated 500 --repertoire-size 1500 \\
        --VH &lt;path_to_your_vgenes.fasta> --VD &lt;path_to_your_dgenes.fasta> --JH &lt;path_to_your_jgenes.fasta> -o ig_simulator_test 
    </code>
</pre>

<!- --------------------- ->

<a id = "simulator_output"></a>
<h3>3.5. Output files</h3>
<code>IgSimulator</code> tool creates working directory (which name was specified using option <code>-o</code>) and writes there the following files:
<ul>
    <li>Files with sequences</li>
    <ul>
        <li><b>final_repertoire.fasta</b> - FASTA file with simulated antibody repertoire that will be used as reference for Illumina library simulation.</li>
        <li><b>paired_reads1.fq</b> - FASTQ file with left reads constructed using ART read simulator. Reads correspond to simulated Illumina MiSeq library.</li>  
        <li><b>paired_reads2.fq</b> - FASTQ file with right reads constructed using ART read simulator. Reads correspond to simulated Illumina MiSeq library.</li>  
        <li><b>merged_reads.fastq</b> - FASTQ file consructed as result of merging left and right files with reads. This file is expected to be input for <code>IgRepertoireConstruction</code> tool.</li>  
        <li><b>reads_vdj_recombination.txt</b> contains information about V(D)J recombination for each read from <b>merged_reads.fastq</b> file. Example of <b>reads_vdj_recombination.txt</b> file is given below:</li>
        <pre class = "code">
            <code>
    34_merged_read_antibody_20_multiplicity_1_copy_1-1/1    IGHV3-48*02;IGHD3-16*02;IGHJ6*02
    53_merged_read_antibody_34_multiplicity_1_copy_1-1/1    IGHV4-28*02;IGHD3-10*01;IGHJ6*03
    59_merged_read_antibody_37_multiplicity_2_copy_2-1/1    IGHV4-28*02;IGHD3-10*01;IGHJ6*03
    8_merged_read_antibody_9_multiplicity_4_copy_1-1/1      IGHV3-66*01;IGHD3-22*01;IGHJ6*03
            </code>
        </pre>
    </ul></br>

    <li>Files with statistics of the simulated repertoire:</li>
    <ul>
        <li><b>base_sequences.fasta</b> contains sequences of base repertoire.</li>
        <li><b>base_frequencies.txt</b> contains frequencies of base sequences.</li>
        <li><b>mutated_sequences.fasta</b> contains sequences of mutated repertoire.</li>
        <li><b>mutated_frequencies.txt</b> contains frequencies of mutated sequences.</li>
        <li><b>shm_positions.txt</b> contains information about all introduced somatic hypermutations. 
        Each line corresponds to one mutation and of this file includes two field (separated by 'tab'): 'mutation position' and 'sequence length'.</li>
        <li><b>repertoire_vdj_recombination.txt</b> contains information about V(D)J recombination for each constructed antibody. Example of <b>repertoire_vdj_recombination.txt</b> file is given below:</li>
        <pre class = "code">
            <code>
    antibody_1     IGHV3-13*01;IGHD3-3*02;IGHJ4*02
    antibody_2     IGHV4-30-4*02;IGHD4-17*01;IGHJ4*03
    antibody_3     IGHV4-30-4*02;IGHD4-17*01;IGHJ4*03
    antibody_4     IGHV3-13*02;IGHD3-10*01;IGHJ5*01
            </code>
        </pre>
    </ul></br>
    
    <li>Visialization of the statistics for the simulated repertoire</li>
    <ul> 
        <li><b>base_seq_lens.png</b> - PNG file with histogram of base sequences lengths distribution. 
        If number of base sequences (controlled by option <code>--num-bases</code>) is enough large, distribution is expected to be normal. 
        This file is created based on statistics from <b>base_repertoire.stats</b>.</li>

        <li><b>base_seq_freqs.png</b> - PNG file with histogram of base sequence frequencies distribuition. 
        This file is created based on statistics from <b>base_multiplicities.txt</b>.</li>

        <li><b>mutated_seq_freqs.png</b> - PNG file with histogram of mutated sequence frequencies distribuition in final repertoire. 
        This file is created based on statistics from <b>mutated_multiplicities.txt</b>.</li>

        <li><b>shm_positions.png</b> - PNG file with histogram of distribution of somatic hypermutations relative positions. 
        This file is created based on statistics from <b>shm_positions.txt</b>.</li>

        <li><b>paired_reads1.aln</b> and <b>paired_reads2.aln</b> show alignment of paired-end reads to reference repertoire.
        Files are generated by ART read simulator.</li>
    </ul></br>

    <li>Files described ideal repertoire (see details in section <a href = "#repertoire_files">4</a>):</li>
    <ul>
        <li><b>ideal_repertoire.clusters.fasta</b> - CLUSTERS.FASTA file corresponding ideal clusters for <b>merged_reads.fastq</b>.</li>

        <li><b>ideal_repertoire.rcm</b> - RCM file corresponding ideal clusters for <b>merged_reads.fastq</b>. This file can be used as ideal read-cluster map in <code>IgQUAST</code> tool.</li>
    </ul></br>

    <li><b>ig_simulator.log</b> - full log of <code>IgSimulator</code> run.</li>
</ul> 
</br>
<!- ---------------------------------------------------------------- ->

<a id = "repertoire_files"></a>
<h2>4. Antibody repertoire representation</h2>
We used two formats of files for representation of repertoire for the set of reads: CLUSTERS.FASTA and RCM.

<a id = "clusters_fasta"></a>
<h3>4.1. CLUSTERS.FASTA file format</h3>
CLUSTERS.FASTA is a FASTA file, where each sequence corresponds to the monoclonal antibody and header of sequence contains information about corresponding cluster (set of input reads related to the same monoclonal antibody) id and size:
<pre class = "code">
    <code>
    >cluster___1___size___3
    CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGGACG
    >cluster___2___size___2
    CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGG
    >cluster___3___size___1
    CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGGAC
    </code>
</pre>

<a id = "read_cluster_map"></a>
<h3>4.2. RCM file format</h3>
Every line of RCM (read-cluster map) file contains information about read name and corresponding cluster id:
<pre class = "code">
    <code>
    MISEQ@:53:000000000-A2BMW:1:2114:14345:28882    1
    MISEQ@:53:000000000-A2BMW:1:2114:14374:28884    1
    MISEQ@:53:000000000-A2BMW:1:2114:14393:28886    1
    MISEQ@:53:000000000-A2BMW:1:2114:16454:28882    2
    MISEQ@:53:000000000-A2BMW:1:2114:16426:28886    2
    MISEQ@:53:000000000-A2BMW:1:2114:15812:28886    3
    </code>
</pre>

</br>  
<b>NOTE:</b> ids in CLUSTERS.FASTA and RCM files should be consistent.</br></br> 

<!- -------------------------------------------------------------------- ->
<a id = "feedback"></a>
<h2>5. Feedback and bug reports</h2>
Your comments, bug reports, and suggestions are very welcomed.
They will help us to further improve IgSimulator.
<br><br>
If you have any troubles running IgSimulator, please send us log file from output output directory.
<br><br>
Address for communications: <a href="mailto:igtools_support@googlegroups.com">igtools_support@googlegroups.com</a>. 

</body>