Using HMMgs:
See detailed step-by-step instructions in Xander_assembler repository (https://github.com/rdpstaff/Xander_assembler)
Build - Build a De Bruijn graph from a set of reads
java -jar hmmgs.jar build <read_file> <bloom_out> <kmerSize> <bloomSizeLog2> [cutoff = 2] [# hashCount = 4] [bitsetSizeLog2 = 30]
read_file
fasta or fastq files containing the reads to build the graph from
bloom_out
file to write the bloom filter to
kmerSize
should be multiple of 3, (recommend 45, minimum 30, maximum 63)
bloomSizeLog2
the size of the bloom filter (or memory needed) is 2^bloomSizeLog2 bits, increase if the predicted false positive rate is greater than 1%
cutoff
minimum number of times a kmer has to be observed in read_file to be included in the final bloom filter
hashCount
number of hash functions, recommend 4
bitsetSizeLog2
the size of one bitSet 2^bitsetSizeLog2, recommend 30
Bloom filter stats, such as the predicted false positive rate, are written to stdout.
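As a rough guide for choosing bloomSizeLog2, the sketch below converts the exponent into memory and estimates a false positive rate with the textbook Bloom filter approximation (1 - e^(-h*n/m))^h, where m is the number of bits, n the number of distinct kmers and h the hash count. This is not necessarily the calculation that hmmgs build reports on stdout. The function names and the kmer count in the example are assumptions made for illustration.

```python
import math


def bloom_memory_bytes(bloom_size_log2: int) -> int:
    """Memory taken by a filter of 2^bloom_size_log2 bits, in bytes."""
    return (1 << bloom_size_log2) // 8


def estimated_false_positive_rate(bloom_size_log2: int, n_kmers: int,
                                  hash_count: int = 4) -> float:
    """Textbook approximation (1 - e^(-h*n/m))^h.

    hmmgs may compute its reported figure differently.
    """
    m_bits = 1 << bloom_size_log2
    return (1.0 - math.exp(-hash_count * n_kmers / m_bits)) ** hash_count


if __name__ == "__main__":
    n_kmers = 200_000_000  # hypothetical number of distinct kmers in the reads
    for log2_size in (30, 32, 34):
        gib = bloom_memory_bytes(log2_size) / 2**30
        fpr = estimated_false_positive_rate(log2_size, n_kmers)
        print(f"bloomSizeLog2={log2_size}: {gib:.3f} GiB, estimated FPR {fpr:.4%}")
```

Each increment of bloomSizeLog2 doubles the memory, so raise it one step at a time until the reported false positive rate drops below 1%.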
Search - Perform local assembly starting at the given start points in a given de Bruijn Graph
output files <kmers>_nucl.fasta, _prot.fasta, search stats written to stdout
java -jar hmmgs.jar search [-h] [-u] [-p <n_nodes>] <k> <limit_in_seconds> <bloom_filter> <for_hmm> <rev_hmm> <kmers>
-u
don't normalize the hmm input
-p n_nodes
prune the search if the score does not improve after n_nodes (default 20, set to 0 to disable pruning)
k
number of best local assemblies to return for each kmer
limit_in_seconds
time limit for individual searches (conservative suggestion = 100)
bloom_filter
bloom filter built using hmmgs build
for_hmm, rev_hmm
hidden markov models, HMMER3 format
kmers
starting points (can use KmerFilter's fast_kmer_filter to identify starting points)
[#threads] experimental, suggested 1 (not thoroughly tested)
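When scripting many searches it can help to assemble the argument list in one place. The following is a minimal sketch that only mirrors the usage line above; the helper name, its keyword arguments and the file names in the example are hypothetical, and the experimental threads argument is left out.

```python
from typing import List


def hmmgs_search_args(k: int, limit_in_seconds: int, bloom_filter: str,
                      for_hmm: str, rev_hmm: str, kmers: str,
                      prune_nodes: int = 20, normalize: bool = True,
                      jar: str = "hmmgs.jar") -> List[str]:
    """Build the argv for 'hmmgs search' in the documented argument order."""
    args = ["java", "-jar", jar, "search"]
    if not normalize:
        args.append("-u")  # don't normalize the hmm input
    args += ["-p", str(prune_nodes)]  # 0 disables pruning
    args += [str(k), str(limit_in_seconds), bloom_filter, for_hmm, rev_hmm, kmers]
    return args


if __name__ == "__main__":
    # File names below are placeholders, not files shipped with hmmgs.
    argv = hmmgs_search_args(1, 100, "reads.bloom", "for.hmm", "rev.hmm", "starts.txt")
    print(" ".join(argv))
```

The returned list can be passed to subprocess.run(argv, check=True) to launch the search.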
Merge - Merge the left and right contigs generated by hmmgs search
java -jar hmmgs.jar merge [options] <hmm> <hmmgs_file> <nucl_contig>
-a,--all Generate all combinations for multiple paths for each starting kmer, instead of just the best
-b,--min-bits <arg> Minimum bits score
-l,--min-length <arg> Minimum length
-o,--out <arg> Write output to file instead of stdout
KmerFilter:
fast_kmer_filter - search a set of reads against a set of reference sequences to identify starting points for assembly
java -jar KmerFilter.jar fast_kmer_filter <kmerSize> <query_file> [name=]<ref_file> ...
-a,--aligned Build trie from aligned sequences
-o,--out <arg> Redirect output to file
-T,--transl-table <arg> Translation table to use when translating
nucleotide to protein sequences
-t,--threads <arg> #Threads to use
<kmerSize> kmer length, should be multiple of 3, (recommend 45, minimum 30, maximum 63)
<query_file> read file to search for starting points in (use the same fasta file used to build the De Bruijn Graph)
<ref_file> 1 or more aligned reference files (aligned using the same HMM that will be used to search) with an optional reference name (e.g. nifh=my_nifh_refs_aligned.fasta)
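Since the same kmerSize has to be passed to both hmmgs build and fast_kmer_filter, a small check before launching either can catch mistakes early. The sketch below only encodes the constraints stated above (multiple of 3, between 30 and 63); the function name is hypothetical and is not part of hmmgs or KmerFilter.

```python
from typing import List


def kmer_size_problems(kmer_size: int) -> List[str]:
    """Return a list of reasons why kmer_size does not fit the documented limits."""
    problems = []
    if kmer_size % 3 != 0:
        problems.append(f"{kmer_size} is not a multiple of 3")
    if not 30 <= kmer_size <= 63:
        problems.append(f"{kmer_size} is outside the range 30 to 63")
    return problems


if __name__ == "__main__":
    for size in (45, 44, 66):
        issues = kmer_size_problems(size)
        print(size, "ok" if not issues else "; ".join(issues))
```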
Other uses:
HMMgs can also be used to extract subgraphs from starting points instead of contigs to perform further analysis with (see edu.msu.cme.rdp.graph.GraphSearch)
HMMgs can also be used to compute base coverage for contigs (generated by hmmgs or other programs) (see edu.msu.cme.rdp.graph.abundance.ReadKmerMapper and base_coverage.py)
NOTES:
When using fast_kmer_filter to identify start points there are two things to be aware of.
1. While the Bloom Filter Builder allows any k-size (hmmgs requires a k divisible by 3, however), fast_kmer_filter requires k <= 63
2. fast_kmer_filter can search for starting points for multiple genes at the same time (each gene requires a scan over the read file, so it is faster to do every gene at once). However, this means the output file is multiplexed and must be demultiplexed before it is used in hmmgs search. This can be done with the following command: grep 'gene_name' <multiplexed_starts_file> | cut -f2- > <demultiplexed_gene_start_points>
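The demultiplexing step can also be sketched in Python. This version assumes the gene name is the first tab-separated column of each line, which the cut -f2- in the command above suggests. Unlike grep, it matches that column exactly instead of matching the name anywhere in the line. The function name and the sample rows are made up for illustration.

```python
from typing import Iterable, Iterator


def demultiplex_starts(lines: Iterable[str], gene_name: str) -> Iterator[str]:
    """Yield start-point rows for one gene with the gene-name column removed."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Keep rows whose first column is the gene; drop that column (like cut -f2-).
        if len(fields) > 1 and fields[0] == gene_name:
            yield "\t".join(fields[1:])


if __name__ == "__main__":
    # Made-up rows; real fast_kmer_filter output has its own columns.
    sample = ["nifh\tread1\tACGT\n", "rplb\tread2\tTTGA\n", "nifh\tread3\tGGCA\n"]
    for row in demultiplex_starts(sample, "nifh"):
        print(row)
```

To work on files, pass an open file handle as lines and write each yielded row, followed by a newline, to the per-gene output file.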