tutorial strawberry
This is a tutorial using smudgeplot version v0.4.0
Previous versions:
- version v0.1.3 - the corresponding version of this page.
- version v0.2.0 - the corresponding version of this page.
- smudgeplot v0.2.1 - the corresponding version of this page.
Let's get the data from the NCBI strawberry project PRJDB3320, originally published in Hirakawa et al. 2014. To complete this tutorial you will need 16 cores, about 300GB of disk space and several hours of computation.
We will warm up on the diploid species Fragaria iinumae and then we will do the famous octoploid strawberry F. ananassa.
F. iinumae has reads in SRA under accession DRR013884. We can download them from the EBI FTP server (it will take a while, it's quite a lot of data):
mkdir -p strawberry_iinumae && cd strawberry_iinumae
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR013/DRR013884/DRR013884_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR013/DRR013884/DRR013884_2.fastq.gz
Great, we got 70.1GB of data, 109G base pairs, corresponding to approximately 500x coverage of the ~200M Fragaria genome. It's useful to have a rough idea of the expected genome coverage.
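The ~500x figure is simply the total number of sequenced bases divided by the genome size. A quick sanity check with the round numbers quoted above (109G bases, ~200M genome) might look like this; the function name expected_coverage is made up for this sketch:

```shell
# Rough expected coverage = total sequenced bases / genome size.
# expected_coverage is a hypothetical helper name, not part of any tool.
expected_coverage() {
    echo $(( $1 / $2 ))   # integer division is good enough for a rough estimate
}

expected_coverage 109000000000 200000000   # prints 545
```

The result is only as good as the genome size guess, so treat it as an order-of-magnitude check rather than a measurement.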
Now we would like to build a database of all 31-mers using FastK:
FastK -v -t10 -k31 -M16 -T4 DRR013884_[12].fastq.gz -Nkmer_db
We expect the coverage to be enormous, therefore we can just choose a really high cutoff, hopefully filtering out all the sequencing errors. Now let's extract all the kmer pairs with coverage >100x.
smudgeplot.py hetmers -L 100 -t 4 -o kmerpairs_L100 --verbose kmer_db
Now we have a file kmerpairs_L100_text.smu that contains kmer pair coverages and their respective frequencies in the dataset.
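Before plotting, it can be reassuring to peek into the file. The sketch below assumes that the text .smu file has three tab-separated columns (minor kmer coverage, major kmer coverage, number of kmer pairs with that coverage combination); that layout is an assumption here, so check it with head on your own file first. The helper name smu_summary is made up for this example.

```shell
# Assumption: the text .smu file has three tab-separated columns:
# minor kmer coverage, major kmer coverage, number of kmer pairs.
# smu_summary is a hypothetical helper name used only in this sketch.
smu_summary() {
    awk -F'\t' '
        { total += $3 }                                           # sum pair counts
        $3 + 0 > best { best = $3 + 0; cov_b = $1; cov_a = $2 }   # most frequent combination
        END { printf "pairs=%.0f top=%s/%s\n", total, cov_b, cov_a }
    ' "$1"
}

# Only runs if the file from the previous step is present.
if [ -f kmerpairs_L100_text.smu ]; then
    smu_summary kmerpairs_L100_text.smu
fi
```

It prints the total number of kmer pairs and the most frequent minor/major coverage combination, which gives a first hint of where the brightest smudge will sit.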
We need to feed the coverage file to the plotting step (smudgeplot.py all in the command below), which will plot the smudgeplot. Specify the output name (-o), a title for the plot so you will know what you are looking at (argument -t), and the coverage file as the last argument. We also expect a high coverage dataset, given how big the files are and how small strawberry genomes usually are, so we can ramp up the coverage range that is tested for 1n coverage by specifying the -cov_max parameter (together with -cov_min).
smudgeplot.py all -cov_min 100 -cov_max 200 -o f_iinumae -t "Fragaria iinumae" kmerpairs_L100_text.smu
And that's it. Smudgeplots are now plotted on both the linear (f_iinumae_smudgeplot.pdf) and the log scale (f_iinumae_smudgeplot_log10.pdf). All the information is embedded in the figures, with smudge sizes also stored in the _smudge_sizes.txt file. Sometimes it's easier to look at the log-scale smudges, sometimes at the non-transformed ones. This time the log transformation gave a clear picture.
Oh heck, this Fragaria species seems to be tetraploid OR a rather homozygous species with a lot of closely related duplications. The reason is that we searched for loci that are exactly one SNP from each other, so if there are two homozygous loci that differ by exactly one nucleotide, they will be picked up as AABB. If the smudge were strong around AAAB, the evidence for tetraploidy would be stronger, since such a paralog structure would be more than peculiar. Alright, is this a tetraploid??
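To make the AABB versus AAAB distinction concrete: a smudge sits at the total coverage of the kmer pair (A + B) and at the normalized minor coverage B / (A + B). The sketch below computes both for a given genotype label; the helper name smudge_position and the 1n coverage of 100x are made up purely for illustration.

```shell
# Where should a smudge sit, given a genotype label and the 1n coverage?
# smudge_position is a hypothetical helper name; 100x is an arbitrary 1n coverage.
smudge_position() {
    awk -v g="$1" -v n="$2" 'BEGIN {
        a = gsub(/A/, "", g)      # copies carrying the major kmer
        b = gsub(/B/, "", g)      # copies carrying the minor kmer
        # total pair coverage, then normalized minor coverage
        printf "%d %.2f\n", (a + b) * n, b / (a + b)
    }'
}

smudge_position AB   100   # prints 200 0.50
smudge_position AABB 100   # prints 400 0.50
smudge_position AAAB 100   # prints 400 0.25
```

AABB and AAAB share the same total coverage and differ only in the minor coverage fraction, which is why a strong AAAB smudge would be a much clearer sign of tetraploidy than AABB alone.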
All the literature I found describes this species as the basal branching Fragaria lineage that is diploid. However, Potter et al. 2000 found an octoploid strawberry plant that was determined as Fragaria iinumae and subsequently reclassified. Maybe the individual that was sequenced is also a hybrid, but more likely it's just the duplication story and a false inference on the side of smudgeplot?! Just to make sure that I have not done something stupid, I also ran GenomeScope (a step I omitted at the beginning).
Histex -G kmer_db > kmer_k31.hist
genomescope.R -i kmer_k31.hist -k 31 -p 2 -o . -n Fiinumae_genomescope
Alright, the haploid genome size estimate is ~200M assuming diploidy, suggesting that it's indeed not tetraploid. Furthermore, the heterozygosity is really low (0.17%), which explains why the duplication signal is relatively stronger than the heterozygosity signal. We also see that the duplication bump (the third peak in the histogram) is indeed quite high. I am very intrigued that so many of the duplications are so recent that they made the smudgeplot so confusing.
I must admit, this was not the best example. I did not know how it was going to end up when I started.
cd ..
Let's see if we will be able to predict the genome ploidy of the cultivar hybrid species F. ananassa.
Here we want to download 4 libraries, so I made this small for loop that will fetch the data:
mkdir -p strawberry_ananassa && cd strawberry_ananassa
for lib in DRR013873 DRR013874 DRR013875 DRR013877; do
    libdir=$(echo "$lib" | cut -b 1-6)
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/"$libdir"/"$lib"/"$lib"_{1,2}.fastq.gz
done
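Since the download takes a long time, it may be worth printing the URLs first to check that they are built the way you expect. This is only a dry-run sketch mirroring the loop above; the helper name fastq_url is made up for this example.

```shell
# Build the FTP URL for one read file of a library, mirroring the loop above.
# fastq_url is a hypothetical helper name used only in this sketch.
fastq_url() {
    acc=$1
    mate=$2
    accdir=$(printf '%.6s' "$acc")   # first six characters, e.g. DRR013
    echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/$accdir/$acc/${acc}_${mate}.fastq.gz"
}

for acc_id in DRR013873 DRR013874 DRR013875 DRR013877; do
    for mate_id in 1 2; do
        fastq_url "$acc_id" "$mate_id"   # only prints, nothing is downloaded
    done
done
```

Once the eight URLs look right, the same output can be piped into wget -i - to start the actual download.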
Let's again count kmers using FastK. This time we are not that sure what we are dealing with, so we want to extract a histogram of kmers as well to determine L:
FastK -v -t10 -k31 -M16 -T4 *_[12].fastq.gz -Nkmer_db
Histex -G kmer_db > Fara_kmer_k31.hist
I would like to see the kmer spectra first before deciding about L. It also helps not to fall into the same trap as with the F. iinumae smudgeplot.
genomescope.R -i Fara_kmer_k31.hist -k 31 -p 2 -o . -n Fananassa_genomescope
The histogram:
The log-log histogram of the same data
It seems that the haploid genome coverage is ~60x (visible on the non-transformed histogram), and on the log-log histogram we also see that the last big bump is somewhere below 1000x. We need to decide on L by looking at the histograms (L=30 seems a reasonable choice to me).
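If you want a number to compare your eyeballed L against, one rough heuristic is to look for the first valley of the histogram, i.e. the coverage where the error kmers stop dominating. The sketch below assumes the histogram file has two whitespace-separated columns (coverage, number of kmers), which is an assumption about the Histex -G output; the helper name first_valley is made up. On noisy data the first uptick can be spurious, so this does not replace looking at the plots.

```shell
# Print the coverage at the first local minimum of a kmer histogram.
# Assumption: two whitespace-separated columns, coverage then kmer count.
# first_valley is a hypothetical helper name used only in this sketch.
first_valley() {
    awk '
        NR > 1 && $2 + 0 > prev_count { print prev_cov; exit }   # counts start rising again
        { prev_cov = $1; prev_count = $2 + 0 }
    ' "$1"
}

# Only runs if the histogram from the previous step is present.
if [ -f Fara_kmer_k31.hist ]; then
    first_valley Fara_kmer_k31.hist
fi
```

The valley is best read as a lower bound for L, since any cutoff between it and the 1n peak still removes most error kmers.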
L=30
smudgeplot.py hetmers -L "$L" -t 4 -o kmerpairs_L"$L" kmer_db
Finally we generate a smudgeplot:
smudgeplot.py all -cov_min 40 -cov_max 150 -o f_ananassa -t "Fragaria x ananassa" kmerpairs_L"$L"_text.smu
Let's look again at the log transformed smudges
Wow, finally something interesting. It looks like we are dealing with an octoploid, given the high abundance of AAAAAABB loci. It also seems that the genome structure is AAAABBCC, where A and B are more closely related than C (because the AAAAAABB smudge is the brightest), and there is a quite bright AAAABB smudge, suggesting a scenario where two haplotypes (let's say CC) have a fixed indel and we therefore detect such loci as hexaploid. Analogously, we could interpret the AABB smudge...
Compared to the previous version, there are two noteworthy changes. We don't "predict" ploidy anymore: smudgeplot never had a good evolutionary model, we always just looked at the ploidy with the most k-mer pairs and said that's the predicted ploidy. With a rule that simple, a user can make that assessment themselves, and at least this time people won't just look at whatever is predicted and take it for granted. The thing is, it's context dependent. This species is certainly octoploid, given how many more 8-ploid k-mer pairs there are in comparison to regular genomes (tons vs. usually none), and that would hold even if there were a few fewer k-mer pairs in the 8-ploid state than in the diploid state. We are working on an evolutionary model that would make a real prediction, but until then we would like users to understand these plots and make the assessment themselves.
Wanna try something without my help? You can do another strawberry species, like the diploid Fragaria nipponica with accession DRR013885. Tweet about it (#smudgeplot), let us know how it went.
Or even better, you can try one of the species we discussed that could be interesting and make a tutorial like this one out of it. We will be happy to add it to the wiki.