GFClassify was developed to classify sequences into categories, published in CITATION
Copyright 2012 Ribosomal Database Project
This project is distributed under the terms of the GNU GPLv3
---
The program uses Interpolated Context Models (ICMs) trained on a set of categorized sequences to determine the categories of an input set of sequences.
This program requires:
Glimmer 3.0
an installation of BioPython
-------------------
Directories
-------------------
src/ - Contains the source for GFClassify (score.cc and score.h) and a
make file to build the tool.
models/ - Models included with GFClassify, currently amoA, pmoA, nifH 1-5
scripts/ - Helper scripts for testing models and processing GFClassify result
files (biopython required for some scripts)
examples/ - Contains example sequence files
------------------
BUILDING
------------------
Prep:
-------
To build gf_classify, first download Glimmer 3.0 from its website
(http://www.cbcb.umd.edu/software/glimmer/) and
edit the file src/Makefile. Set glimmer_dir to point to the
directory where Glimmer was extracted, then type 'make' inside the src/
directory. It should compile on any platform with gcc.
Information on BioPython
Compiling:
------------
(1) Unpack the files with tar ...
(2) Edit the Makefile so that it points to your Glimmer installation.
Edit 'Makefile' in the src/ directory.
Change the glimmer_dir variable to the path of the Glimmer installation,
e.g. /home/user/glimmer3.02/
If you're not sure what the path is, go into the directory where you
installed Glimmer at the command line and type 'pwd'. The output
of that command is the path of your Glimmer directory.
(3) Change into the src/ directory and type 'make' to compile gf_classify. The binary
will be in the src directory (src/gf_classify).
cd src && make
No install option is included. You can run gf_classify from the src
directory, copy the executable to a directory on your path
(e.g. /usr/local/bin), or modify your path to include the src directory.
NOTE: Building gf_classify will trigger a build of Glimmer. If
Glimmer compilation fails for some reason, the gf_classify build will
fail too.
-----------------------
Example/Testing
-----------------------
1. cd to the root of the gf_classify directory
2. Run GFClassify
src/gf_classify -c -t 10 examples/fg_amoa_pmoa_bg.fna models/amoa.icm models/pmoa.icm models/bg.icm > gf_classify.txt
This command runs gf_classify, telling it to check the reverse complement of
all query sequences as well as the forward orientation and to reject any
classification with a log-likelihood <= 10 nats. By default gf_classify writes
results to stdout; they can be redirected to a file with the > operator on the
command line.
3. Summarizing results:
cut -f 7 gf_classify.txt | sort | uniq -c
The output should be (with the header value 'label' omitted)
984 amoa
10675 bg
979 pmoa
1 rejected
4. Sorting sequences by label
scripts/sort_results.py examples/fg_amoa_pmoa_bg.fna gf_classify.txt
This will create one file per label in the current directory containing
all sequences with that label.
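The summarizing and sorting steps above can also be sketched in Python.
This is a minimal illustration, not the bundled scripts/sort_results.py.
It assumes tab-separated result rows, a header line starting with '#', the
label in the last column, and that the first result column matches the
first word of each fasta header. The function names, the header column
names other than 'label', and the sample rows are made up for the example.

```python
from collections import Counter, defaultdict


def read_labels(result_lines):
    """Map each query id (first column) to its label (last column)."""
    labels = {}
    for line in result_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.split("\t")
        labels[fields[0]] = fields[-1]
    return labels


def group_fasta_by_label(fasta_lines, labels):
    """Group fasta lines by the label of their record; unknown ids are skipped."""
    groups = defaultdict(list)
    current = None
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            current = labels.get(parts[0]) if parts else None
        if current is not None:
            groups[current].append(line)
    return dict(groups)


# Made-up rows for illustration only; real ids and scores will differ.
RESULTS = [
    "#id\tdescription\torientation\tamoa\tpmoa\tbg\tlabel",
    "seq1\tdesc one\t+\t42.0\t3.5\t-7.1\tamoa",
    "seq2\tdesc two\t-\t1.2\t55.9\t-2.0\tpmoa",
    "seq3\tdesc three\t+\t-4.0\t-6.3\t18.4\tbg",
    "seq4\tdesc four\t+\t40.1\t2.2\t-9.9\tamoa",
]
FASTA = [
    ">seq1 desc one", "ACGTACGT",
    ">seq2 desc two", "TTGACCA",
    ">seq3 desc three", "GGGCCC",
    ">seq4 desc four", "ACGTTTGA",
]

labels = read_labels(RESULTS)
counts = Counter(labels.values())          # same idea as sort | uniq -c
groups = group_fasta_by_label(FASTA, labels)

for label in sorted(counts):
    print(counts[label], label)
```

To write one file per label, each list in groups could be joined with
newlines and written to a file named after its label.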
--------------------
Using GFClassify
--------------------
Using gf_classify:
-------------------
gf_classify requires one or more Interpolated Context Models (ICMs) to
categorize a set of input sequences.
ICMs can be created using Glimmer and a training set for each category
of interest. See Glimmer documentation for more
information.
When run with no arguments (or -h) gf_classify will output a brief usage
message.
USAGE: gf_classify [options] <seq_file> <model1> [model2 ...]
The command line program takes a sequence file (in fasta format) and one
or more ICMs on the command line. If <seq_file> is '-', gf_classify will
read sequences from stdin instead of from a file.
[options]
-t log likelihood threshold cutoff. The log likelihood must be greater
than or equal to the threshold, otherwise the input sequence is
rejected.
(more explanation of what this is and how to calculate it)
-c consider both the forward and reverse complement of the sequence
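For scripted runs, the usage line and options above could be assembled
from Python. This is a sketch only: build_command and run_gf_classify are
hypothetical helper names, and the default binary path assumes the script
is started from the root of the gf_classify directory. Only the -t and -c
options described above are used.

```python
import subprocess


def build_command(seq_file, models, threshold=None, both_strands=False,
                  binary="src/gf_classify"):
    """Return the argument list: gf_classify [options] <seq_file> <model1> [model2 ...]."""
    if not models:
        raise ValueError("at least one ICM is required")
    cmd = [binary]
    if both_strands:
        cmd.append("-c")                    # also test the reverse complement
    if threshold is not None:
        cmd.extend(["-t", str(threshold)])  # log likelihood cutoff in nats
    cmd.append(seq_file)                    # '-' reads sequences from stdin
    cmd.extend(models)
    return cmd


def run_gf_classify(cmd, out_path):
    """Run the command and redirect stdout to out_path, like '>' in the shell."""
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


example_cmd = build_command(
    "examples/fg_amoa_pmoa_bg.fna",
    ["models/amoa.icm", "models/pmoa.icm", "models/bg.icm"],
    threshold=10,
    both_strands=True,
)
print(" ".join(example_cmd))
```

Calling run_gf_classify(example_cmd, "gf_classify.txt") would then write
the results to a file, provided the binary has been built.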
--
Output is written to standard output. The first line (starting with a #) contains the column headers. The first three columns of every output row are the query sequence id, the sequence description and the predicted orientation (+/-). These are followed by one column per input ICM, listing the score in nats of the query when tested against that model. If more than one ICM is used, the last column contains the label with the highest probability.
All probabilities are reported in nats (log base e).
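As an illustration of the column layout described above, the sketch below
picks the highest-scoring model column from one result row and converts
its score from nats to bits (bits = nats / ln 2). It assumes tab-separated
columns, more than one ICM (so a trailing label column is present), and
that the header names of the score columns identify the models. It ignores
rejection by the -t threshold. The sample header names and row are made up.

```python
import math


def nats_to_bits(nats):
    """Convert a log value from base e (nats) to base 2 (bits)."""
    return nats / math.log(2)


def best_label(header_line, row_line):
    """Return (model name, score) for the highest-scoring model column."""
    header = header_line.lstrip("#").rstrip("\n").split("\t")
    row = row_line.rstrip("\n").split("\t")
    # Columns 4 up to the second-to-last hold one score per ICM.
    model_names = header[3:-1]
    scores = [float(value) for value in row[3:-1]]
    index = max(range(len(scores)), key=scores.__getitem__)
    return model_names[index], scores[index]


# Made-up header and row for illustration only.
HEADER = "#id\tdescription\torientation\tamoa\tpmoa\tbg\tlabel"
ROW = "seq1\tdesc one\t+\t42.0\t3.5\t-7.1\tamoa"

name, score = best_label(HEADER, ROW)
print(f"{name}: {score} nats = {nats_to_bits(score):.2f} bits")
```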
The included sort_results.py script can be used to sort query sequences into fasta files based on predicted gene label.
Building ICMs:
---------------
ICMs for GFClassify are built using the build-icm program in the Glimmer
package.
The included grid_search.py script can be used to find optimal values for window width, depth, and periodicity given two or more sets of training sequences. The script takes a list of values for each parameter and tests the error rate with k-fold cross validation (the number of folds is specified by the -k argument).
We recommend --window 8,9,10,11,12,13,14,15 --depth 7,8,9 --period 1.
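To give a feel for the size of the recommended search, the sketch below
enumerates every combination of the comma-separated value lists and splits
a training set into k folds. It is not the code of grid_search.py: the
function names are hypothetical, and the fold assignment (every k-th
sequence) is just one simple way to do k-fold cross validation.

```python
from itertools import product


def parse_values(text):
    """Turn a comma-separated option value such as '7,8,9' into a list of ints."""
    return [int(value) for value in text.split(",")]


def parameter_grid(window, depth, period):
    """Return every (window, depth, period) combination to be tested."""
    return list(product(parse_values(window),
                        parse_values(depth),
                        parse_values(period)))


def k_fold_indices(n_items, k):
    """Yield (train, test) index lists; each item is tested exactly once."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out, test in enumerate(folds):
        train = []
        for fold_number, fold in enumerate(folds):
            if fold_number != held_out:
                train.extend(fold)
        yield sorted(train), test


grid = parameter_grid("8,9,10,11,12,13,14,15", "7,8,9", "1")
splits = list(k_fold_indices(10, 5))
print(len(grid), "parameter combinations,", len(splits), "folds each")
```

With the recommended values this gives 8 x 3 x 1 = 24 combinations, each of
which would be evaluated once per fold.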
For more detailed help see the Glimmer 3 documentation.
http://www.cbcb.umd.edu/software/glimmer/glim302notes.pdf