NOTE: The script egglib_sliding_windows.py
is no longer recommended. Users have reported an error associated with the egglib polymorphism()
function. Scripts that replace the functionality of egglib_sliding_windows.py
can be found at https://github.com/simonhmartin/genomics_general.
This repository contains code and data for the manuscript "Evaluating the use of ABBA-BABA statistics to locate introgressed loci" by Simon H. Martin ([email protected]), John W. Davey ([email protected]) and Chris D. Jiggins, available on bioRxiv (doi: 10.1101/001347).
The results in the paper were generated as follows. If details are missing or incorrect, please tell us.
Figures were generated using the following scripts, all of which have data files hard coded and so will need modifying for new data sets. compare_f_estimators.r
performs simulations and generates Figure 2 and Figure S1.
./Figure_1.R
./compare_f_estimators.r
./Figures_3_S2_S3.R
./Figure_4.R
./Figure_5.R
./Table_S1.R
The model scripts should be general enough to be used for your own models. If they are not, please let us know. They require the R packages phyclust, plyr, yaml, optparse, ggplot2, gridExtra, reshape and stringr, and for Seq-Gen to be installed and in your path.
Models can be generated using the shared_ancestry_simulator.R
script. A single combined model can be generated like this:
./shared_ancestry_simulator.R -w 10000 -t 60 -c Alternate_t123-0.4_t23-0.2.yml:0.1,Background_t123-0.6_t21-0.4.yml:0.9
This will generate 10000 windows, 10% of which will be generated using the model described in the file Alternate_t123-0.4_t23-0.2.yml
and 90% of which will be generated using Background_t123-0.6_t21-0.4.yml
, using 60 threads. See the model files folders for the YAML files generated for this paper. The CSV files for the models will be made available in a Data Dryad repository on publication and can be made available on request.
A single model, as used for the null models reported in the paper, can be run like this:
./shared_ancestry_simulator.R -w 10000 -t 60 -c Background_t123-0.6_t21-0.4.yml:1
The model YAML files were generated by the run_model_combinations.py
script, using the following commands:
./run_model_combinations.py -m Model_parameter_list.csv -w 10000 -t 30 -s 0.01 -l 5000 -r 5
./run_model_combinations.py -m Model_parameter_list.csv -w 10000 -t 30 -s 0.01 -l 5000 -r 50
Option -m
specifies a file containing the list of models to simulate (available in the repository). Option -w
specifies the number of windows and -t
the number of threads. For Seq-Gen options, -s
defines the scale factor and -l
the length of sequence. -r
defines the recombination rate for ms.
ms produces a tree with branch lengths and a set of variant sites. This set of variant sites is generated assuming an infinite sites model. We produce output using this set of variant sites (in files with '.ms.' in the name) but also generate sequences from the ms tree using Seq-Gen (in files with '.sg.' in the name), which does not assume an infinite sites model. We used the Seq-Gen output for all analyses in the paper.
Summary statistics for the models found in the partition.summary
and dxy.summary
files were generated as follows:
./generate_summary_statistics.R -m model_files_win10000_s0.01_l5000_r5 -l Model_parameter_list.csv -t 10
./generate_summary_statistics.R -m model_files_win10000_s0.01_l5000_r50 -l Model_parameter_list.csv -t 10
Summary files are produced for alternate and null models and for ms and Seq-Gen output. The Seq-Gen files used for the paper analyses are included in the repository.
NOTE: The script egglib_sliding_windows.py
is no longer recommended. Users have reported an error associated with the egglib polymorphism()
function. Scripts that replace the functionality of egglib_sliding_windows.py
can be found at https://github.com/simonhmartin/genomics_general.
Real data for the Heliconius genome, split into 5 kb windows, can be found in the files Heliconius_autosome_windows.csv
and Heliconius_Zchromosome_windows.csv
.
Calculation of summary statistics was done with the python script egglib_sliding_windows.py
, which makes use of the EGGLIB library. Input was a "calls" format file, provided in Martin et al. 2013. Window size is specified with the -w
flag, sliding increment with the -i
flag and minimum number of sites with the -m
flag. The latter is a hard cutoff, and windows with fewer sites are discarded. There is also a soft cutoff between 0 and 1, specified with --minimumExploitableData
. The script will output a column called sitesOverMinExD. At a value of 0.5, this would report the number of sites in the window that had genotype calls for at least 50% of the individuals. To analyse autosomes and Z-linked scaffolds separately, the --include
and --exclude
flags were used, along with the file Hmel1-1_Zupdated_Zscafs.txt
, which provides names of all Z-linked scaffolds provided in Martin et al. 2013. For the Z chromosome analysis, ploidy was specified using the --ploidy
flag, because there were two females in the dataset of Martin et al. 2013. See the commands below.
Commands to calculate values for the whole genome were, for autosomes:
python egglib_sliding_windows.py -i set31.Zupdated.union.geno -o set31.Zupdated.union.autoScafs.PiDxyABBAs.w5m1s5.csv -w 5000 -m 1000 -s 5000 --report 100 -a pi,dxy,ABBABABA,popS,S -p "P1[ag108,ag572,ag112,ag569];P2[am216,am160,am48,am293];P3[tiP86,tiP313,tiP84,tiP57];O[hec273,eth67,ser202,par371];P3a[tiP86,tiP313];P3b[tiP84,tiP57]" --minimumExploitableData 0.5 --exclude Hmel1-1_Zupdated_Zscafs.txt
And for the Z chromosome:
python egglib_sliding_windows.py -i set31.Zupdated.union.geno -o set31.Zupdated.union.chrZscafs.PiDxyABBAs.w5m1s5.csv -w 5000 -m 1000 -s 5000 --report 100 -a pi,dxy,ABBABABA,popS,S -p "P1[ag108,ag572,ag112,ag569];P2[am216,am160,am48,am293];P3[tiP86,tiP313,tiP84,tiP57];O[hec273,eth67,ser202,par371];P3a[tiP86,tiP313];P3b[tiP84,tiP57]" --minimumExploitableData 0.5 --include Hmel1-1_Zupdated_Zscafs.txt --ploidy "2,2,2,2,2,1,2,1,2,2,2,2,2,2,2,2"