version 0.0.4

rajewsky-lab · Apr 13, 2017 · 7076e89
1 parent 1af1f0f
commit 7076e89
Showing 14 changed files with 2,055 additions and 1,093 deletions.
diff --git a/README b/README
@@ -1,5 +1,5 @@
 Authors : Marc Friedlaender and Sebastian Mackowiak.
-Date:     11/01/2011
+Date:     12/08/2009
 
 This is miRDeep2 developed by Marc Friedlaender and Sebastian Mackowiak.
 miRDeep2 discovers active known or novel miRNAs from deep sequencing data (Solexa/Illumina, 454, ...).
@@ -23,7 +23,7 @@ Installation:
                 perl install.pl
 
 
-2. without the install.pl script follow the instructions given in Sample Installation
+2. without the install mirdeep script follow the instructions given in Sample Installation
 
 
 
@@ -116,6 +116,7 @@ Script Reference:
 miRDeep2 analyses can be performed using the three scripts miRDeep2.pl, mapper.pl and quantifier.pl.
 
 
+
 name:
 miRDeep2.pl
 
@@ -129,7 +130,7 @@ arf format, an optional fasta file with known miRNAs of the analysing species an
 species
 
 output:
-A spreadsheet and a html file with an overview of all detected miRNAs in the deep sequencing input data.
+A spreadsheet and an html file with an overview of all detected miRNAs in the deep sequencing input data.
 
 
 options:
@@ -170,21 +171,10 @@ results generated (result.html), a copy of the novel and known miRNAs contained
 but in text format which allows easy parsing (result.csv), a copy of the performance survey
 contained in the webpage but in text format (survey.csv) and a copy of the miRNA read signatures
 contained in the pdfs but in text format (output.mrd).
-The ids in files miRBase_mmu_v14.fa and precursors_ref_this_species.fa need to be similar to each other.
-This is usually no problem if you downloaded both files from miRBase.
-Otherwise it can happen that the quantifier fails to produce results. 
-
-
-Example use 2:
 
-As in example use 1, except that the user has already run quantifier.pl and wants to use this
-output to get information on the miRNAs not detected by miRDeep2 included in the html webpage.
-miRBase.mrd is a file generated by quantifier.pl:
 
-miRDeep2.pl reads_collapsed.fa genome.fa reads_collapsed_vs_genome.arf miRBase_mmu_v14.fa miRBase_rno_v14.fa -t Mouse -q miRBase.mrd 2>report.log
-
-This command will generate the same type of files as example use 1 above.
 
+Example use 2:
 
 The user wishes to identify miRNAs in deep sequencing data from an animal with no related species
 in miRBase: 
@@ -218,13 +208,8 @@ Read input file:
 -a              input file is seq.txt format
 -b              input file is qseq.txt format
 -c              input file is fasta format
--e              input file is fastq format
--d              input file is a config file (see miRDeep2 documentation).
-                options -a, -b or -c must be given with option -d.
-
 
 Preprocessing/mapping:
--g              three-letter prefix for reads (by default 'seq')
 -h              parse to fasta format
 -i              convert rna to dna alphabet (to map against genome)
 -j              remove all entries that have a sequence that contains letters other than
@@ -295,52 +280,22 @@ mapper.pl reads.fa -c -h -i -j -k TCGTATGCCGTCTTCTGCTTGT -l 18 -m -s reads_colla
 
 
 
-Example use 5: (experimental)
+Example use 5:
 
 The user has already removed 3' adapters in color space and has mapped the reads against the genome
-using bwa/bowtie resulting in a sam file. 
-Note that each genome locus to which a read was aligned has to occur
-in its own line. Otherwise only the first genome locus of each line will be taken!
-The mapping output file is named mapped.sam. 
-The user wishes to generate the files 'reads_collapsed.fa'
-and 'reads_collapsed_vs_genome.arf' as input to miRDeep2:
-
-perl sam_reads_collapse.pl mapped.sam reads_collapsed.fa
-perl bwa_sam_converter.pl -i mapped.sam -t read_1_to_1.txt -o reads_collapsed_vs_genome.arf
-
-
-If read ids are already collapsed and in correct miRDeep2 format (eg. ">ABC_1_x10", see File Formats at the bottom or consult the online documentation)
-then the sam file just needs to be converted:
-
-
-perl bwa_sam_converter.pl -i mapped.sam -o reads_collapsed_vs_genome.arf
-
-
-Example use 6:
+using the BWA tool. The BWA output file is named reads_vs_genome.sam. Notice that the BWA output
+contains extra fields that are not required for SAM format. Our converter requires these fields and
+thus may not work with all types of SAM files. The user wishes to generate 'reads_collapsed.fa'
+and 'reads_vs_genome.arf' to input to miRDeep2:
 
-The user has sequencing data from different samples e.g. different cell-types. A config.txt file has to be created in which each line 
-designates file locations and a unique 3 letter code. 
-For instance:
-sequencing_data_sample1.fa	sd1
-sequencing_data_sample2.fa	sd2
-sequencing_data_sample3.fa	sd3
-.
-.
-.
+bwa_sam_converter.pl reads_vs_genome.sam reads.fa reads_vs_genome.arf
 
-The use wishes then to pool these files and use the generated files reads.fa and reads_vs_genome.fa for the miRDeep2 analysis. 
-
-
-mapper.pl config.txt -d -c -i -j -l 18 -m -p genome_index -s reads.fa -t reads_vs_genome.arf 
-
-Since the reads_vs_genome.arf still contains the 3 letter code for each read mapped to genome the user can then later on 
-dilute the contribution of the different samples to a predicted or known miRNA.
-It can also be used for example to define 'high confident' predictions if the results are filtered for miRNAs that have sequencing
-evidence from at least two samples.   
+mapper.pl reads.fa -c -i -j -l 18 -m -s reads_collapsed.fa
 
 ############################################################################################################################### 
 
 
+
 name:
 quantifier.pl
 
@@ -360,46 +315,18 @@ A 2 column table file called miRNA_expressed.csv with miRNA identifiers and its
 miRNAs having 0 read counts, a signature file called miRBase.mrd, a file called expression.html that gives an overview of all miRNAs the input data
 and a directory called pdfs that contains for each miRNA a pdf file showing its signature and structure. 
 
-[options]
-
-[mandatory parameters]
-        -u      list all values allowed for the species parameter that have an entry at UCSC 
-
-        -p precursor.fa  miRNA precursor sequences from miRBase
-        -m mature.fa     miRNA sequences from miRBase
-        -r reads.fa      your read sequences
-
-[optional parameters]
-        -c [file]    config.txt file with different sample ids... or just the one sample id 
-        -s [star.fa] optional star sequences from miRBase     
-        -t [species] e.g. Mouse or mmu
-                     if not searching in a specific species all species in your files will be analyzed
-                     else only the species in your dataset is considered
-        -y [time]    optional otherwise its generating a new one
-        -d           if parameter given pdfs will not be generated, otherwise pdfs will be generated
-        -o           if parameter is given reads were not sorted by sample in pdf file, default is sorting
-        -k           also considers precursor-mature mappings that have different ids, eg let7c
-                     would be allowed to map to pre-let7a
-        -n           do not do file conversion again
-        -x           do not do mapping against precursor again
-        -g [int]     number of allowed mismatches when mapping reads to precursors, default 1
-        -e [int]     number of nucleotides upstream of the mature sequence to consider, default 2
-        -f [int]     number of nucleotides downstream of the mature sequence to consider, default 5
-        -j           do not create an output.mrd file and pdfs if specified
-
-        -w           considers the whole precursor as the 'mature sequence'
-
-
+options:
+-t  list all values allowed for the species parameter that have an entry at UCSC 
 
 example usage:
-Assume we want to quantify C.elegans miRNAs then we would run the command
-quantifier.pl -p precursors.fa -m mature.fa -r reads.fa -s star.fa -y now -t cel
+quantifier.pl precursors.fa mature.fa reads.fa star.fa/none species/none timestamp/none pdf
 
 
 
 #####################################################################################################################################
 
 
+
 name:
 make_html.pl
 
@@ -1136,32 +1063,3 @@ options:
 
 notes:
 -
-
-
-##########################
-File Formats
-.fa
-The fasta files that contain sequencing reads used by miRDeep2 are ordinary fasta files with a predefined identifier format. It comprises three values separated by underscore. The first value is a three letter code which is intended to be a tag for the sample a read is coming from. The second value is a running number that is used to make sure that identifiers are uniquely assigned to sequences from the same sample. The third value starts with and 'x' followed by an integer number that indicates the occurrence of a read sequence in a sample. The sequence in a fasta file that is supplied to miRDeep2 is not allowed to contain characters others than A, C, G, T and N. If the id line or the sequence line do not follow these conventions miRDeep2 will abort with a warning message. Example entry from a fasta file that can be supplied to miRDeep2
-
->PAN_123456_x969696
-ATACAATCTACTGTCTTTCCT
-
-.arf
-The arf format is a proprietary file format generated and processed by miRDeep2. It contains information of reads mapped to a reference genome. Each line in such a file contains 13 columns. Example line:
-
-#1                    2     3    4     5                        6        7     8           9           10                       11   12   13
-PAN_123456_x969696    21    1    21    ATACAATCTACTGTCTTTCCT    chr22    21    46508682    46508702    ATACAATCTACTGTCTTTCCT    +    1    mmmmmmmmmmmmmmmmmmmmm
-
-1    read identifier
-2    length of read sequence
-3    start position in read sequence that is mapped
-4    end position in read sequence that is mapped
-5    read sequence
-6    identifier of the genome-part to which a read is mapped to. This is either a scaffold id or a chromosome name
-7    length of the genome sequence a read is mapped to
-8    start position in the genome where a read is mapped to
-9    end position in the genome where a read is mapped to
-10   genome sequence to which a read is mapped
-11   genome strand information. Plus means the read is aligned to the sense-strand of the genome. Minus means it is aligned to the antisense-strand of the genome.
-12   Number of mismatches in the read mapping
-13   Edit string that indicates matches by lowercase 'm' and mismatches by uppercase 'M'
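
The File Formats text removed in the last hunk describes the read identifier layout and the 13 columns of an arf line. As a reading aid, here is a minimal Python sketch of a parser for both, built only from that description. It is not part of miRDeep2. The names `ArfRecord`, `parse_read_id` and `parse_arf_line` are hypothetical, and two points are assumptions: that columns are separated by whitespace, and that the three-character sample tag is alphanumeric.

```python
import re
from typing import NamedTuple, Tuple

# Identifier layout from the File Formats text: <tag>_<running number>_x<count>.
# Assumption: the three-character sample tag is alphanumeric.
READ_ID = re.compile(r"^([A-Za-z0-9]{3})_(\d+)_x(\d+)$")


class ArfRecord(NamedTuple):
    """One arf line; field order follows columns 1 to 13 of the description."""
    read_id: str
    read_length: int
    read_start: int
    read_end: int
    read_seq: str
    genome_id: str
    genome_length: int
    genome_start: int
    genome_end: int
    genome_seq: str
    strand: str
    mismatches: int
    edit_string: str


def parse_read_id(read_id: str) -> Tuple[str, int, int]:
    """Split a read identifier into (sample tag, running number, read count)."""
    match = READ_ID.match(read_id)
    if match is None:
        raise ValueError(f"unexpected read identifier: {read_id!r}")
    tag, number, count = match.groups()
    return tag, int(number), int(count)


def parse_arf_line(line: str) -> ArfRecord:
    """Parse one arf line. Assumption: columns are whitespace-separated."""
    fields = line.split()
    if len(fields) != 13:
        raise ValueError(f"expected 13 columns, got {len(fields)}")
    return ArfRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]), fields[9],
        fields[10], int(fields[11]), fields[12],
    )


# The example line from the File Formats text, rebuilt with tab separators.
EXAMPLE_LINE = "\t".join([
    "PAN_123456_x969696", "21", "1", "21", "ATACAATCTACTGTCTTTCCT",
    "chr22", "21", "46508682", "46508702", "ATACAATCTACTGTCTTTCCT",
    "+", "1", "m" * 21,
])

if __name__ == "__main__":
    record = parse_arf_line(EXAMPLE_LINE)
    print(parse_read_id(record.read_id))
    print(record.genome_id, record.genome_start, record.genome_end, record.strand)
```

The sketch deliberately does not cross-check column 12 against the edit string. The example line in the description reports one mismatch next to an all-lowercase edit string, so how the two columns relate is left open here.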