-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path01_annotate_lib.Rmd
164 lines (116 loc) · 5.85 KB
/
01_annotate_lib.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
<link href="http://kevinburke.bitbucket.org/markdowncss/markdown.css" rel="stylesheet"></link>
``` {r setup, include=FALSE}
opts_chunk$set(warnings = FALSE, message = FALSE)
require(ggplot2)
require(plyr)
```
# Step 01: Annotating Library Member Data Frame
## Motivation
Here we are loading all the library member sequences into a DataFrame called `lib_seqs` and annotating them based on their promoter, RBS, CDS, gene, etc. At the end we check to make sure that the library has every item, the sequences all check out, and that the sequences are all the same length.
## Splitting
We want to take `203.norestrict.fa` and split it on the upper/lowercase boundaries. First, lets look at the UC/LC situation for each of the
sequence types in 203 and also the 202 library.
### Comparison of Library Sequences between 203 and 202
The 203 library consists of a promoter, a RBS, an ATG start codon, and then exactly 10 more codons (30 bp) of the sequence in question. So I want to cut at the lc/uc boundary, then cut again at the ATG.
Things are different for the natural UTR, like this:
```
>BBaJ23100-ribF-1
ttgacggctagctcagtcctaggtacagtgctagcTTAATTTCACTGTTTTGAGCCAGACATGAAGCTGATACG...
```
Compared to the designed UTR, like this:
```
>BBaJ23100-ribF-2
ttgacggctagctcagtcctaggtacagtgctagcTTAATATTAAAGAGGAGAAAtactagATGAAGCTGATAC...
```
Here is an example of that same promo/RBS pair in 202:
```console
$ grep -B1 -i TTAATATTAAAGAGGAGAAA 202.norestrict.fa
>BBa_J23100--B0030_RBS
TTGACGGCTAGCTCAGTCCTAGGTACAGTGCTAGCTTAATattaaagaggagaaatta
```
### Aligning them by hand:
```
wt ttgacggctagctcagtcctaggtacagtgctagcTTAAT TTCACTGTTTTGAGCCAGACATGAAGCTGA...
bbbbb.rrrrrrrrrrrrrrrrrrrr MET
rbs ttgacggctagctcagtcctaggtacagtgctagcTTAATATTAAAGAGGAGAAAtactagATGAAGCTGA...
bbbbbrrrrrrrrrrrrrrrrrrrXXXXXXMET
202 TTGACGGCTAGCTCAGTCCTAGGTACAGTGCTAGCTTAATattaaagaggagaaattaCATATG <GFP>
bbbbbrrrrrrrrrrrrrrrrrrrXXX MET
```
`b` for barcode, `r` for RBS, and `X` for variable rbs/cds spacer, and `MET`for start codon.
So, take home messages:
* WT promo/rbs combos in 203 have no spacer between the WT RBS and the MET codon.
* `BBaJ23100/BBaJ23108` RBSes have `TACATG` separating the RBS sequence and the MET. This is different than the same RBSs for 202, which necessarily end in `CATATG`. So they are not equivalent, strictly speaking. Which might complicate comparison of 202 and 203 libraries.
* The equivalent 202 RBSes (`BBa_J23100/BBa_J23108`) have the barcode on the promoter, not the RBS (which is actually more correct.)
So first we need to annotate the library sequences and split them up using this information.
For the 203 sequences, we split into
* *Promoter* - all LC seq + 5 bases (for barcode)
* *RBS* - middle of this sequence
* *CDS* - last 33 bases (ATG + 10 aa)
### Regex Prep for reaidng FASTA library sequences into R
I reformatted `203.norestrict.fa` into a tab-delimited file with perl regexp. I also added column headers.
```bash
echo -en "Name\tPromoter\tGene\tCDS.num\tPromoter.seq\tRBS.seq\tCDS.seq" \
> 203.norestrict.txt
perl -ne 'chomp; s/^>((\w+)-(\w+)-(\d+))/\n$1\t$2\t$3\t$4/;
s/^([atgc]+[ATGC]{5})([ATGCatgc]*?)(ATG[ATGCatgc]{30})$/\t$1\t$2\t$3/;
print $_;' 203.norestrict.fa >> 203.norestrict.txt
```
## Reading Into R
Now we have to read `203.norestrict.txt` into R as a tab-delimited text file.
```{r 01.01-load_libs, cache=TRUE}
lib_seqs <- read.table(file=paste(getwd(),"/data/203.norestrict.txt", sep=''),
sep="\t", header=T)
```
### Annotating `lib_seqs` DataFrame
Now we want to add the library info to each sequence.
Using the leader peptide number (`CDS.num`) we can assign the predetermined attributes to each sequence, like RBS identity, codon usage, and secondary structure.
```{r 01.02-annotate_libs, cache=TRUE}
lib_seqs$CDS.num <- as.integer(lib_seqs$CDS.num)
#split Leader into RBS identities
lib_seqs$RBS = NA
lib_seqs$RBS[which(lib_seqs$CDS.num %in% c(1,5,9,13:22))] <- 'WT'
lib_seqs$RBS[which(lib_seqs$CDS.num %in% c(2,6,10,23:32))] <- 'BB0030'
lib_seqs$RBS[which(lib_seqs$CDS.num %in% c(3,7,11,33:42))] <- 'BB0032'
lib_seqs$RBS[which(lib_seqs$CDS.num %in% c(4,8,12,43:52))] <- 'BB0034'
lib_seqs$RBS <- factor(lib_seqs$RBS, rev(c('BB0030','BB0034','BB0032','WT')))
#split leader into CDS types (WT, min/max rare codons, secondary structure)
lib_seqs$CDS.type <- (as.integer(lib_seqs$CDS.num) -3) %% 10
lib_seqs$CDS.type[which(lib_seqs$CDS.num %in% c(1:4))] <- 'WT'
lib_seqs$CDS.type[which(lib_seqs$CDS.num %in% c(5:8))] <- 'Min Rare'
lib_seqs$CDS.type[which(lib_seqs$CDS.num %in% c(9:12))] <- 'Max Rare'
lib_seqs$CDS.type <- factor(lib_seqs$CDS.type,
rev(c('WT','Min Rare','Max Rare',0:9)),
rev(c('WT','Min Rare','Max Rare',paste('∆G',c(1:10),sep=' '))))
#function to get the length of any factored string field and add RBS length
#to each sequence, since this varies, unlike promoter and CDS
get_len <- function (df,field) {
seq_field <- paste(field,'seq',sep='.')
len_field <- paste(field,'len',sep='.')
df[,len_field] <- nchar(as.character(df[1,seq_field]))
return(df)
}
require(plyr)
lib_seqs <- ddply(lib_seqs, .(RBS.seq), get_len, 'RBS')
```
## Checking `lib_seqs` DataFrame for completeness
First we should check that all the identities are equally distributed:
```{r 01.03-check_counts}
#Promoter
table(lib_seqs$Promoter)
#RBS Type
table(lib_seqs$RBS)
#CDS Type
table(lib_seqs$CDS.type)
#Gene
table(lib_seqs$Gene)
```
Now we can see why there are differences in the RBS lengths:
```{r 01.04-plot_rbs_len, fig.width=7, fig.height=4, message=FALSE, fig.keep=FALSE}
ggplot(lib_seqs, aes(x=RBS, y=RBS.len)) + geom_point()
```
WT sequences are all 20 bp, and the three designed RBSs are 18, 19, and 21 bp each. Good to know.
```{r 01-save, include=FALSE}
save(list=ls(),
file= paste(getwd(),"/rdata/01.Rdata", sep=''))
```