Home
NCBI does not include gene synteny data, and context matters, so I am making this repository to describe how you can build your own database that includes gene synteny information. I will also include simple tools to identify what flanks your favorite genes. A major part of the work is "cleaning" the NCBI databases and records. I seem to find new surprises each time I download another part of NCBI, so the methods I am using may not be enough for your data. Remember to include a sanity check at each step. In this example I am creating a database of all plasmids and archaeal genomes.
All plasmid sequences can be downloaded from the NCBI FTP server and unpacked like this:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/plasmid.*.genomic.gbff.gz
gunzip *.gbff.gz
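As a first sanity check after unpacking, it can help to confirm that each file contains complete records. The following is a minimal sketch, not part of this repository: it assumes the standard GenBank flat file layout, where every record starts with a `LOCUS` line and ends with a `//` line, so a mismatch between the two counts hints at a truncated download. The function name `count_locus_and_terminators` is hypothetical.

```python
def count_locus_and_terminators(path):
    """Return (number of LOCUS lines, number of '//' terminator lines) in a .gbff file."""
    locus = 0
    terminators = 0
    # errors="replace" keeps the scan going if a file contains odd bytes
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            if line.startswith("LOCUS"):
                locus += 1
            elif line.rstrip() == "//":
                terminators += 1
    return locus, terminators
```

Calling this on every unpacked `.gbff` file and looking for files where the two numbers differ (or where both are zero) gives a quick list of files worth downloading again.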
Now we can start identifying "weird entries". The easiest to find are nonsensical genes, such as genes with zero length.
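One way to look for such genes is to scan the `gene` feature lines of the unpacked files and compute the length of each location. The sketch below is an illustration under stated assumptions, not the method used in this repository: it only handles locations written on a single line, it treats coordinates as inclusive (so `10..10` has length 1), and the names `suspicious_genes` and `min_len` are hypothetical. The default threshold of 3 bases is an arbitrary choice that you should adjust for your data.

```python
import re

# A gene feature line: 5 spaces, the key "gene", then the location string
GENE_LINE = re.compile(r"^ {5}gene\s+(\S+)")
# A start..end span, allowing the partial markers < and >
SPAN = re.compile(r"<?(\d+)\.\.>?(\d+)")

def suspicious_genes(lines, min_len=3):
    """Return (record, location, reason) for gene features that look nonsensical.

    min_len is an assumed cutoff, not an NCBI rule.
    """
    record = None
    hits = []
    for line in lines:
        if line.startswith("LOCUS"):
            parts = line.split()
            record = parts[1] if len(parts) > 1 else None
            continue
        match = GENE_LINE.match(line)
        if not match:
            continue
        location = match.group(1)
        spans = [(int(a), int(b)) for a, b in SPAN.findall(location)]
        if not spans:
            # e.g. a single-base location, or a location this sketch cannot parse
            hits.append((record, location, "no start..end span"))
        elif any(end < start for start, end in spans):
            hits.append((record, location, "end before start"))
        elif sum(end - start + 1 for start, end in spans) < min_len:
            hits.append((record, location, "shorter than min_len"))
    return hits
```

The function accepts any iterable of lines, so an open file handle can be passed directly. Every hit should be inspected by hand before it is removed, since a location that spans several lines would be only partly read by this sketch.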