draft of diversity analyzer manual. missing test file was added
yana-safonova committed Jul 21, 2016
1 parent 96f653a commit 1cabe1d
Showing 72 changed files with 551 additions and 1 deletion.
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@ igrec_test
dsf_test
.idea
build/
divan_test
2 changes: 1 addition & 1 deletion diversity_analyzer.py
@@ -25,7 +25,7 @@
import html_report_writer

test_reads = os.path.join(home_directory, "test_dataset/merged_reads.fastq")
-test_dir = os.path.join(home_directory, "cdr_test")
+test_dir = os.path.join(home_directory, "divan_test")

tool_name = "Diversity Analyzer"

355 changes: 355 additions & 0 deletions diversity_analyzer_manual.html
@@ -0,0 +1,355 @@
<head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Diversity Analyzer 1.0 Manual</title>
<style type="text/css">
.code {
background-color: lightgray;
}
</style>
<style>
</style>
</head>
<body>

<h1>Diversity Analyzer 1.0 manual</h1>

1. <a href="#intro">What is Diversity Analyzer?</a><br>

2. <a href="#install">Installation</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;2.1. <a href="#test_datasets">Verifying your installation</a><br>

3. <a href="#usage">Diversity Analyzer usage</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;3.1. <a href="#basic_options">Basic options</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;3.2. <a href="#advanced_options">Advanced options</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;3.3. <a href="#examples">Examples</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;3.4. <a href="#dsf_output">Output files</a><br>

4. <a href="#output_files">Output file formats</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;4.1. <a href="#cdr_details">CDR details file</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;4.2. <a href="#shm_details">SHM details file</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;4.3. <a href="#v_alignments">V alignments file</a><br>

5. <a href = "#plot_descr">Plot description</a><br>

6. <a href="#feedback">Feedback and bug reports</a><br>
<!--- &nbsp;&nbsp;&nbsp;&nbsp;5.1. <a href="#citation">Citation</a><br> --->

<!-- -------- --->
<h2 id = "intro">1. What is Diversity Analyzer?</h2>
<p>
Diversity Analyzer is a tool for diversity analysis of adaptive immune repertoires.
It takes full-length immunosequencing reads or a constructed repertoire as input and performs the following steps:
<ul>
<li>Alignment of input sequences against germline V and J segments.
<b>Please note</b> that reads with low-quality alignments are filtered out at this step and are not analyzed further. </li>
<li>CDR labeling of aligned reads</li>
<li>Computation of SHM positions</li>
<li>Computation of diversity indices for computed CDRs</li>
<li>Visualization of computed diversity statistics</li>
</ul>
</p>

<!-- -------- --->
<a id="install"></a>
<h2>2. Installation</h2>

Diversity Analyzer has the following dependencies:
<ul>
<li>64-bit Linux or macOS system</li>
<li>g++ (version 4.7 or higher) or clang compiler</li>
<li>cmake (version 2.8.8 or higher)</li>
<li>Python 2 (version 2.7 or higher), including:</li>
<ul>
<li><a href = "http://biopython.org/wiki/Download">BioPython</a></li>
<li><a href = "http://matplotlib.org/users/installing.html">Matplotlib</a></li>
<li><a href = "http://www.numpy.org/">NumPy</a></li>
<li><a href = "http://www.scipy.org/install.html">SciPy</a></li>
<li>Seaborn</li>
<li>Pandas</li>
</ul>
</ul>

To install Diversity Analyzer, type:
<pre class="code">
<code>
./prepare_cfg
</code>
</pre>
and:
<pre class="code">
<code>
make
</code>
</pre>

<a id="test_datasets"></a>
<h3>2.1. Verifying your installation</h3>
For testing purposes, Diversity Analyzer comes with toy data sets. <br><br>

&#9658; To try Diversity Analyzer on the test data set, run:
<pre class="code"><code>
./diversity_analyzer.py --test
</code>
</pre>

If the installation of Diversity Analyzer is successful, you will find the following information at the end of the log:

<pre class="code">
<code>
Thank you for using Diversity Analyzer!
Log was written to &lt;your_installation_dir>/divan_test/ig_repertoire_constructor.log
</code>
</pre>

<!-- -------- --->
<h2 id = "usage">3. Diversity Analyzer usage</h2>
<p>
Diversity Analyzer takes full-length immunosequencing reads or a constructed repertoire in FASTQ/FASTA format as input and
analyzes its diversity characteristics: VJ combinations, CDRs, and SHMs.
</p>

To run Diversity Analyzer, type:
<pre class="code">
<code>
./diversity_analyzer.py [options] -i &lt;input_sequences&gt; -o &lt;output_dir&gt;
</code>
</pre>

<!-- --->
<h3 id = "basic_options">3.1. Basic options</h3>

<code>-i &lt;input_sequences&gt;</code><br>
Input sequences in FASTA/FASTQ format (required).

<br><br>

<code>-o / --output &lt;output_dir&gt;</code><br>
Output directory (required).

<br><br>

<code>-t / --threads &lt;int&gt;</code><br>
The number of parallel threads. The default value is <code>16</code>.

<br><br>

<code>--test</code><br>
Runs Diversity Analyzer on the toy test dataset. The test run is equivalent to the following command:
<pre class = "code">
<code>
./diversity_analyzer.py -i test_dataset/merged_reads.fastq -o divan_test
</code>
</pre>

<code>--help</code><br>
Prints the help message.

<br><br>

<!-- --->
<h3 id = "advanced_options">3.2. Advanced options</h3>

<code>--domain &lt;str&gt;</code><br>
Domain system that will be used for CDR computation: <code>imgt</code> or <code>kabat</code>.
Default value is <code>imgt</code>.

<br><br>

<code>--loci / -l &lt;str&gt;</code><br>
Immunological loci against which input reads are aligned; reads with low alignment scores are discarded. <br>
Available values are <code>IGH</code> / <code>IGL</code> / <code>IGK</code> / <code>IG</code> (for all BCRs) /
<code>TRA</code> / <code>TRB</code> / <code>TRG</code> / <code>TRD</code> / <code>TR</code> (for all TCRs) or <code>all</code>.
Default value is <code>IG</code>.

<br><br>

<code>--organism &lt;str&gt;</code><br>
Organism. Currently the only available value is <code>human</code>.
Support for <code>mouse</code>, <code>pig</code>,
<code>rabbit</code>, <code>rat</code> and <code>rhesus_monkey</code> is planned.
Default value is <code>human</code>.

<br><br>

<code>--skip-plots</code><br>
Skips plot drawing.

<!-- --->

<h3 id = "examples">3.3. Examples</h3>
To perform diversity analysis of full-length immunosequencing reads <code>input_reads.fastq</code>,
type:
<pre class="code">
<code>
./diversity_analyzer.py -i input_reads.fastq -o divan_test
</code>
</pre>
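<p>
When the tool is called from a pipeline, the command line can also be assembled programmatically.
The following is a minimal sketch that uses only the options documented above;
<code>build_command</code> and <code>run_analysis</code> are hypothetical helper names, not part of Diversity Analyzer,
and the sketch assumes it is run from the installation directory.
</p>

```python
import subprocess


def build_command(input_path, output_dir, threads=None, loci=None,
                  domain=None, skip_plots=False):
    """Assemble the argument list for diversity_analyzer.py.

    Only options described in this manual are used; options left as
    None are omitted so that the tool's own defaults apply.
    """
    cmd = ["./diversity_analyzer.py", "-i", input_path, "-o", output_dir]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if loci is not None:
        cmd += ["--loci", loci]
    if domain is not None:
        cmd += ["--domain", domain]
    if skip_plots:
        cmd.append("--skip-plots")
    return cmd


def run_analysis(*args, **kwargs):
    # Hypothetical wrapper: raises CalledProcessError if the tool
    # exits with a non-zero status.
    subprocess.check_call(build_command(*args, **kwargs))


# Print the command instead of running it.
print(" ".join(build_command("input_reads.fastq", "divan_test",
                             threads=8, loci="IGH")))
```

<p>
Passing the arguments as a list avoids shell quoting problems for paths that contain spaces.
</p>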

<!-- --->

<h3 id = "dsf_output">3.4. Output files</h3>
Diversity Analyzer creates the working directory (whose name is specified with option <code>-o</code>)
and writes the following files to it:

<ul>
<li><b>cleaned_sequences.fasta</b> &mdash; input sequences that align well against the V and J germline database.
Diversity Analyzer also crops input sequences at the start of the V segment and orients them in the V(D)J direction.
</li>
<li>
<b>cdr_details.txt</b> &mdash; detailed information about CDR labeling of sequences from <b>cleaned_sequences.fasta</b>.
Description of <b>cdr_details.txt</b> file format can be found <a href = "#cdr_details">here</a>.
</li>
<li>
<b>shm_details.txt</b> &mdash; detailed information about SHM labeling of sequences from <b>cleaned_sequences.fasta</b>.
Description of <b>shm_details.txt</b> file format can be found <a href = "#shm_details">here</a>.
</li>
<li>
<b>v_alignment.fasta</b> &mdash; alignment of sequences from <b>cleaned_sequences.fasta</b> against V segments in FASTA format.
Description of <b>v_alignment.fasta</b> file format can be found <a href = "#v_alignments">here</a>.
</li>
</ul>

<ul>
<li>
<b>cdr1s.fasta</b> &mdash; FASTA file with all computed CDR1 sequences.
</li>
<li>
<b>cdr2s.fasta</b> &mdash; FASTA file with all computed CDR2 sequences.
</li>
<li>
<b>cdr3s.fasta</b> &mdash; FASTA file with all computed CDR3 sequences.
</li>
<li>
<b>compressed_cdr3s.fasta</b> &mdash; FASTA file with unique CDR3 sequences.
Abundances of unique CDR3 sequences are specified in header lines.
</li>
</ul>
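<p>
The CDR FASTA files can be post-processed with a few lines of Python.
The sketch below counts how often each CDR3 sequence occurs in <b>cdr3s.fasta</b> and computes the Shannon index of the resulting distribution.
The Shannon index is used here only as an illustration: this manual does not state which diversity indices Diversity Analyzer itself reports,
and all function names are hypothetical.
</p>

```python
import math
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def cdr3_abundances(lines):
    """Map each distinct CDR3 sequence to its number of occurrences."""
    return Counter(seq for _, seq in read_fasta(lines))


def shannon_index(counts):
    """Shannon index (natural logarithm) of an abundance table."""
    total = float(sum(counts.values()))
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Toy input; for real data pass an open file instead, for example
#   with open("divan_test/cdr3s.fasta") as handle:
#       counts = cdr3_abundances(handle)
example = [">read_1", "GCGAGA", ">read_2", "GCGAGA", ">read_3", "GCGAAA"]
counts = cdr3_abundances(example)
print(counts.most_common(1))
print(shannon_index(counts))
```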

<ul>
<li>
<b>annotation_report.html</b> &mdash; summary of Diversity Analyzer results in HTML format.
Example of <b>annotation_report.html</b> file can be found <a href = "docs/divan_docs/annotation_report.html">here</a>.
</li>
<li>
<b>plots</b> &mdash; directory containing plots with diversity statistics.
Please note that the <b>plots</b> directory is not created if option <code>--skip-plots</code> is specified.
</li>
</ul>

<ul>
<li>
<b>diversity_analyzer.log</b> &mdash; a full log of Diversity Analyzer tool.
</li>
</ul>

<!--- -->
<h2 id = "output_files">4. Output file formats</h2>

<h3 id = "cdr_details">4.1. CDR details file</h3>
The <b>cdr_details.txt</b> file is a tab-separated table containing the following fields:
<ul>
<li><code>Read_name</code> &mdash; names of reads from <b>cleaned_sequences.fasta</b> file.
Rows in <b>cdr_details.txt</b> and sequences in <b>cleaned_sequences.fasta</b> are consistently ordered.</li>
<li><code>Chain_type</code> &mdash; type of chain of a sequence: IGH / IGK / IGL / TRA / TRB / TRD or TRG.</li>
<li><code>V_hit</code>, <code>J_hit</code> &mdash; names of V and J gene segments that provide the best alignments.</li>
<li><code>AA_seq</code> &mdash; amino acid sequence. </li>
<li><code>Has_stop_codon</code> &mdash; indicator of presence of stop codon in a sequence:
<code>1</code> - sequence contains stop codon, <code>0</code> - sequence does not contain stop codon.</li>
<li><code>In-frame</code> &mdash; indicator showing whether a sequence is in-frame or not. </li>
<li>
<code>Productive</code> &mdash; indicator of sequence productiveness.
A sequence is considered productive if it is in-frame and does not contain stop codons.
</li>
<li>
<code>CDR1_nucls</code>, <code>CDR1_start</code>, <code>CDR1_end</code> &mdash;
nucleotide sequence, start and end positions of CDR1.
</li>
<li>
<code>CDR2_nucls</code>, <code>CDR2_start</code>, <code>CDR2_end</code> &mdash;
nucleotide sequence, start and end positions of CDR2.
</li>
<li>
<code>CDR3_nucls</code>, <code>CDR3_start</code>, <code>CDR3_end</code> &mdash;
nucleotide sequence, start and end positions of CDR3.
</li>
</ul>
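<p>
Because <b>cdr_details.txt</b> is tab-separated, it can be read with the standard <code>csv</code> module.
The sketch below counts productive sequences per chain type. It rests on two assumptions that the manual does not spell out:
the first row of the file holds the field names listed above, and <code>Productive</code> is encoded as <code>1</code> / <code>0</code>
in the same way as <code>Has_stop_codon</code>. The function name is hypothetical.
</p>

```python
import csv
from collections import Counter


def productive_by_chain(lines):
    """Count productive sequences per chain type.

    `lines` is any iterable of text lines, such as an open
    cdr_details.txt file. Assumes a header row with the field names
    from this manual and a 1/0 encoding of the Productive column.
    """
    counts = Counter()
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["Productive"] == "1":
            counts[row["Chain_type"]] += 1
    return counts


# Toy rows with a subset of the columns; for real data use
#   with open("divan_test/cdr_details.txt") as handle:
#       counts = productive_by_chain(handle)
rows = [
    "Read_name\tChain_type\tProductive",
    "read_1\tIGH\t1",
    "read_2\tIGH\t0",
    "read_3\tIGK\t1",
]
for chain, number in sorted(productive_by_chain(rows).items()):
    print(chain, number)
```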

<h3 id = "shm_details">4.2. SHM details file</h3>
<p>
The <b>shm_details.txt</b> file contains a list of SHMs written consecutively for each sequence from <b>cleaned_sequences.fasta</b>.
Records in <b>shm_details.txt</b> are consistently ordered with respect to <b>cleaned_sequences.fasta</b>.
</p>

<p>
SHMs are reported separately for each sequence and V / J hit.
SHMs in different hits are separated by a line giving the name and length of the sequence,
the name and length of the gene, the segment type (V / J) and the chain type (IGH / IGK / IGL / TRA / TRB / TRG / TRD):
<pre class="code">
<code>
Read_name:1_merged_read Read_length:354 Gene_name:IGHV3-20*01 Gene_length:296 Segment:V Chain_type:IGH
</code>
</pre>
</p>

<p>
For a given hit, SHMs are written in order of increasing position.
Each line corresponds to a single SHM and contains the following fields:
<ul>
<li><code>SHM_type</code> &mdash; type of SHM.
Diversity Analyzer distinguishes three possible types of SHMs: substitution (<code>S</code>),
insertion (<code>I</code>) and deletion (<code>D</code>).
</li>
<li>
<code>Read_pos</code>, <code>Gene_pos</code> &mdash; position of the SHM on the read and on the gene, respectively.
Please note that positions are 1-based.
</li>
<li>
<code>Read_nucl</code>, <code>Gene_nucl</code> &mdash; nucleotide corresponding to SHM on read and gene, respectively.
If SHM corresponds to deletion, value of <code>Read_nucl</code> field will be '<code>-</code>'.
If SHM corresponds to insertion, value of <code>Gene_nucl</code> field will be '<code>-</code>'.
</li>
<li>
<code>Read_aa</code>, <code>Gene_aa</code> &mdash; amino acid corresponding to SHM on read and gene, respectively.
</li>
<li>
<code>Is_synonymous</code> &mdash; indicator showing whether the SHM leaves the amino acid unchanged.
</li>
<li>
<code>To_stop_codon</code> &mdash; indicator showing whether SHM changes amino acid into stop codon.
</li>
</ul>
If a hit contains no SHMs, it is omitted from <b>shm_details.txt</b>.
</p>

Example of the top of <b>shm_details.txt</b> file:
<pre class="code">
<code>
SHM_type Read_pos Gene_pos Read_nucl Gene_nucl Read_aa Gene_aa Is_synonymous To_stop_codon
Read_name:1_merged_read Read_length:354 Gene_name:IGHV3-20*01 Gene_length:296 Segment:V Chain_type:IGH
S 20 20 C T S S 1 0
S 29 29 C T G G 1 0
S 35 35 C A V V 1 0
S 37 37 A G Q R 0 0
S 45 45 A G R G 0 0
Read_name:1_merged_read Read_length:354 Gene_name:IGHJ3*02 Gene_length:50 Segment:J Chain_type:IGH
S 30 335 C A T T 1 0
S 32 337 C T T M 0 0
</code>
</pre>
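<p>
The layout shown in the example can be parsed with a small state machine: the column header line starts with <code>SHM_type</code>,
a line starting with <code>Read_name:</code> opens a new hit, and every other non-empty line is one SHM of the current hit.
The sketch below is a hypothetical helper, not part of Diversity Analyzer. It assumes that fields are separated by whitespace
and that no field is empty or contains whitespace; real files should be checked against these assumptions.
</p>

```python
def parse_shm_details(lines):
    """Parse shm_details.txt lines into a list of (hit_info, shms).

    hit_info is a dict built from the 'Key:value' separator line and
    shms is a list of dicts keyed by the column names of the header
    line. All values are kept as strings.
    """
    columns = None
    hits = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "SHM_type":
            # Column header line.
            columns = fields
        elif fields[0].startswith("Read_name:"):
            # Separator line: start a new hit.
            info = dict(field.split(":", 1) for field in fields)
            hits.append((info, []))
        elif columns is not None and hits:
            # One SHM belonging to the most recent hit.
            hits[-1][1].append(dict(zip(columns, fields)))
    return hits


example = [
    "SHM_type Read_pos Gene_pos Read_nucl Gene_nucl Read_aa Gene_aa Is_synonymous To_stop_codon",
    "Read_name:1_merged_read Read_length:354 Gene_name:IGHV3-20*01 Gene_length:296 Segment:V Chain_type:IGH",
    "S 20 20 C T S S 1 0",
    "S 37 37 A G Q R 0 0",
]
for info, shms in parse_shm_details(example):
    changed = [s for s in shms if s["Is_synonymous"] == "0"]
    print(info["Gene_name"], len(shms), len(changed))
```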



<h3 id = "v_alignments">4.3. V alignments file</h3>
The <b>v_alignment.fasta</b> file contains a list of pairs: an input sequence and its corresponding V hit.
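<p>
A minimal sketch for reading such pairs is given below. It assumes that the records alternate strictly
(input sequence first, then its V hit), which follows the order of the description above but is not otherwise specified in this manual.
Both function names are hypothetical.
</p>

```python
def read_fasta(lines):
    """Return a list of (name, sequence) tuples from FASTA lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif line and records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def alignment_pairs(lines):
    """Group consecutive records into (read_record, v_hit_record) pairs.

    Assumes records alternate: input sequence, then its V hit.
    """
    records = read_fasta(lines)
    if len(records) % 2 != 0:
        raise ValueError("expected an even number of FASTA records")
    return [(records[i], records[i + 1]) for i in range(0, len(records), 2)]


# Toy input; for real data pass an open v_alignment.fasta file.
example = [">read_1", "ACGT", ">V_hit_1", "ACGA"]
for (read_name, _), (gene_name, _) in alignment_pairs(example):
    print(read_name, gene_name)
```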

<!--- -------------------------------------------------------------------- --->
<h2 id = "plot_descr">5. Plot description</h2>

<!--- -------------------------------------------------------------------- --->
<a id="feedback"></a>
<h2>6. Feedback and bug reports</h2>
Your comments, bug reports, and suggestions are very welcome.
They will help us to further improve Diversity Analyzer.
<br><br>
If you have any trouble running Diversity Analyzer, please send us the log file from the output directory.
<br><br>
Address for communications: <a href="mailto:[email protected]">[email protected]</a>.