-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include SMCSMC #34
base: master
Are you sure you want to change the base?
Include SMCSMC #34
Conversation
Nice to speak with everyone on the call today -- my apologies for not finding the developer documentation earlier. I will edit this to conform to the |
no worries at all- thanks for joining the team! |
Nice chatting with everyone! My homework for the next call will be to perform demographic inferences using SMC++ under the Gutenkunst model. These results will be helpful to compare the parameter estimates from SMC++ to the ones from fastsimcoal and dadi obtained by Chris Kyriazis. SMC++ assumes no gene flow after the populations split. Therefore, I will do the Gutenkunst model analysis including and not including gene flow to see how gene flow would bias the inferences. |
Sounds good @dortegadelv - let me know if you need any help adding it to the two population pipeline thats already in place. I should have a more finalized version up on the github by the end of the week. |
Using the pipeline you have already developed would be great @ckyriazis ! Please let me know once you have the finalized version. Thanks! |
65f2f8e
to
e9a11c5
Compare
My apologies to anyone who received a bunch of emails from this thread -- cleaning up and squashing the history required considerably more git-fu than I am comfortable with. Everything is now squashed in, and
I also had some issues getting flake8 to work with the Snakemake -- and it appears as though the code is not all pep8 compliant in any case, so I haven't tried too hard to make this the case. I think that's all, but when you have a moment please let me know what the next steps would be. I'm looking to do a whole run with the whole pipeline (so far, I've only tested and confirmed that my Thanks everyone, best wishes. |
Hey @Chris1221- I just merged @jradrion's PR with the changes to |
@Chris1221 I can go ahead and review this code and implement the minor changes needed for your PR to play with this most recent merge that @andrewkern just made. No need to edit your PR at this point. |
Thanks @andrewkern and @jradrion -- I appreciate the heads up. I was just thinking that I've actually made a few (very minor) changes to the |
Hopefully you should be able to install conda config --add channels conda-forge
conda config --add channels terhorst
conda install -c luntergroup smcsmc And I have the above code set up to run on a qsub cluster (it's just the |
so I believe the channels should be all set. we'll just need to add |
Actually, @Chris1221, if it's not too much trouble. Would you be able to integrate the changes for smcsmc into the most recent PR merge that Andy just made? This way I can review smcsmc once rather than needing an additional re-review. Sorry for the confusion. |
Yep, no problem @jradrion -- I'll also rerun it just to check and make sure that everything is working before you review. |
Sorry for the delay, the changes are merged in now. I'm just rerunning to make sure that everything is working with the current smcsmc. |
@Chris1221 No worries! I'll give it a run on our machine this afternoon. |
Perfect, thanks. I can confirm that everything is working fine for me with this test config. The number of particles is set to 10, so there won't be any accuracy in this run -- I usually use 5000 or 10000 for reference. {
"seed" : 12345,
"population_id" : 0,
"num_sampled_genomes_per_replicate" : 20,
"num_sampled_genomes_msmc" : "2,8",
"num_sampled_genomes_smcsmc" : "4",
"num_smcsmc_particles": 10,
"num_msmc_iterations" : 20,
"num_smcsmc_iterations": 1,
"replicates" : 1,
"species" : "homo_sapiens",
"model" : "GutenkunstThreePopOutOfAfrica",
"genetic_map" : "HapmapII_GRCh37",
"chrm_list" : "all",
} Also, I think you guys are using a SLURM cluster, but |
n_t/Snakefile
Outdated
'Np': str(num_smcsmc_particles), | ||
# Submission Parameters | ||
'chunks': '100', | ||
'c': '', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete this line to not use an SGE/qsub cluster.
@Chris1221 Sorry, got sidetracked yesterday and I'm just getting to this today. Can you confirm that your conda install of smcsmc did not inadvertently remove smcpp? When I attempted the install via
|
That's extremely strange, do you mind posting the full output and which packages it wants to downgrade? I wonder why it wants to get rid of smcpp -- I don't mention it anywhere in the recipe. |
No problem! I don't have a ton of experience using conda, so let me know if I did something incorrectly.
|
Thanks, that's very helpful. Don't worry, it's definitely not you, I think I must have messed something up. Maybe I was too stringent with the version pinning... If you don't mind a delay, I'll take a look at this today and see if I can figure out what's going on, I'm very confused by the number of packages that it wants to downgrade, almost none of those are necessary. Sorry about this! |
@Chris1221 No worries, let me know when I should try the install again. |
Would you mind trying now @jradrion? I did not explicitly make an
|
Looks great @Chris1221! This install worked and everything is looking good. I probably won't have the opportunity to fully test everything until I return from SMBE. I'll post an update when I get the chance. |
Oh great, thanks @jradrion. Enjoy the conference, I'm attending via twitter ;) |
So I've been having some issues with the Snakefile hanging right around 80% completion, but it looks like the problem is with MSMC. I'll have to dig into this further, because my MSMC rules were previously working. @Chris1221 can you confirm that you were able to run the Snakefile in it's entirety? The reason I ask is that I was also getting a wildcard error in the |
Hi @jradrion, I actually have not been able to run the entire snakefile, though the issues that I've been having have been with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Chris1221 I identified where my Snakefile was hanging, and it's actually at rule compound_smcsmc
. The last message printed to the screen was INFO:smcsmc.model:Waiting for chunk 0 to appear...
. Any thoughts?
Yep that sounds right. That message means that |
Hooray! Okay, I'm very happy that it's working for you. The configuration file that I provided is very much a 'validating the software works' example. To get the accuracy that is shown in my first plots requires the particle count to be increased to ~ 10k and increasing the iterations to ~15-30. But I'm very happy that the software is at least running on the data and the pipeline is working on more than just my local setup. Perhaps I should take out the cluster directive, or recode it into the configuration file, would that perhaps be more appropriate? At least in that case it would run on anyone's computer, albeit slowly. |
It works! Of course, I understand that these are easy-to-test parameters, and that accuracy could likely be dramatically improved. I should have specified that I'm not running this on a cluster, just a Linux box with 80 cores! I think ultimately it would be nice to be able to run it on a cluster, but for now it might simplify things to remove the cluster directive. |
197caf4
to
c0b35c0
Compare
Neat! Makes sense, I've done that in the latest (squashed) commit. I also made the input format for the number of sampled genomes the same as |
Looks good! I'm also running into problems with the smcsmc implementation in plots.py. Are you able to debug that? I think you're going to need the output from smcpp and msmc as well. I can give it a pass if you are not currently able to generate the smcpp and msmc outputs. One issue I see is that the lines to split the sample size from your output file names will not work the same way as they do for the msmc output, as msmc files start with the sample size where yours are all labeled "results.out.csv" in separate directories that indicate the sample sizes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Chris1221 I couldn't find a way to modify your PR, but I have attached a correction to the function in question from plots.py
. This now allows the plots for all methods to be output in the same pdf. This is working on my end but I'm happy to try it again when you commit.
def plot_all_ne_estimates(sp_infiles, smcpp_infiles, msmc_infiles, smcsmc_infiles, outfile,
model, n_samp, generation_time, species,
pop_id = 0, steps=None):
ddb = msprime.DemographyDebugger(**model.asdict())
if steps is None:
end_time = ddb.epochs[-2].end_time + 10000
steps = np.linspace(1,end_time,end_time+1)
num_samples = [0 for _ in range(ddb.num_populations)]
num_samples[pop_id] = n_samp
coal_rate, P = ddb.coalescence_rate_trajectory(steps=steps,
num_samples=num_samples, double_step_validation=False)
steps = steps * generation_time
num_msmc = set([os.path.basename(infile).split(".")[0] for infile in msmc_infiles])
num_smcsmc = set([infile.split("/")[-2].split(".")[0] for infile in smcsmc_infiles])
num_msmc = sorted([int(x) for x in num_msmc])
num_smcsmc = sorted([int(x) for x in num_smcsmc])
f, ax = plt.subplots(1,2+len(num_msmc) + len(num_smcsmc), sharex=True,sharey=True,figsize=(14, 7))
for infile in smcpp_infiles:
nt = pandas.read_csv(infile, usecols=[1, 2], skiprows=0)
line1, = ax[0].plot(nt['x'], nt['y'], alpha=0.8)
ax[0].plot(steps, 1/(2*coal_rate), c="black")
ax[0].set_title("smc++")
for infile in sp_infiles:
nt = pandas.read_csv(infile, sep="\t", skiprows=5)
line2, = ax[1].plot(nt['year'], nt['Ne_median'],alpha=0.8)
ax[1].plot(steps, 1/(2*coal_rate), c="black")
ax[1].set_title("stairwayplot")
plot_counter=2
for i,sample_size in enumerate(num_msmc):
for infile in msmc_infiles:
fn = os.path.basename(infile)
samp = fn.split(".")[0]
if(int(samp) == sample_size):
nt = pandas.read_csv(infile, usecols=[1, 2], skiprows=0)
line3, = ax[plot_counter].plot(nt['x'], nt['y'],alpha=0.8)
ax[plot_counter].plot(steps, 1/(2*coal_rate), c="black")
ax[plot_counter].set_title(f"msmc, ({sample_size} samples)")
plot_counter+=1
for i,sample_size in enumerate(num_smcsmc):
for infile in smcsmc_infiles:
samp = infile.split("/")[-2].split(".")[0]
if(int(samp) == sample_size):
nt = pandas.read_csv(infile, usecols=[1, 2], skiprows=0)
line3, = ax[plot_counter].plot(nt['x'], nt['y'],alpha=0.8)
ax[plot_counter].plot(steps, 1/(2*coal_rate), c="black")
ax[plot_counter].set_title(f"smcsmc, ({sample_size} samples)")
plot_counter+=1
plt.suptitle(f"{species}, population id {pop_id}", fontsize = 16)
for i in range(2+len(num_msmc)+len(num_smcsmc)):
ax[i].set(xscale="log", yscale="log")
ax[i].set_xlabel("time (years ago)")
red_patch = mpatches.Patch(color='black', label='Coalescence rate derived Ne')
ax[0].legend(frameon=False, fontsize=10, handles=[red_patch])
ax[0].set_ylabel("population size")
f.savefig(outfile, bbox_inches='tight', alpha=0.8)
a note for you guys-- you will want to |
Wow, thank you @jradrion! You're right that I'm unable to test it, but thank you for modifying the plotting code so that it works with the output. I'm sure that if it works for you then it would also work for me. I've put it in in the latest push and (@andrewkern) rebased against the current master (with mask). Everything is all squashed now as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one minor nitpick above and a question: is SMCSMC set up to do masking in this PR? If not let's get it there before merging
"species" : "homo_sapiens", | ||
"model" : "GutenkunstThreePopOutOfAfrica", | ||
"genetic_map" : "HapmapII_GRCh37", | ||
"chrm_list" : "chr22,chrX" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nitpick: we are going to want to leave the mask_file
variable in the readme
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, sorry I was too zealous with the merging. I will put it back!
Right, sorry that slipped my mind. I think I can mostly use your |
it would be great to get another set of eyes on what I implemented for |
Sorry for never coming back to this, I did actually fix the masking code shortly after our conversation, a modified and admittedly very hacky version is in the development version of smc2. I subsequently got very distracted by life and passing a phd milestone. I'll do a little bit more testing and release the new version of smc2 on the weekend if that's okay, then I believe we should be good to go here? (I saw in another pr that @andrewkern was looking to clean up the repo a little and this is very much outstanding, my apologies) |
no worries @Chris1221. the repo has moved a bit since you were working on this PR so be sure to fetch/merge upstream code. |
Some changes upstream in stdpopsim mean that your branch won't work @Chris1221. Once the changes in #48 are merged, you should be able to rebase this branch and it should work. You haven't made any changes to the simulation code, so it should all work fine (after #48 is merged and you have rebased). |
Thanks @jeromekelleher, sorry I missed your comment originally. I'll do my best to get to this soon, sorry for the massive delay. |
Hello everyone,
I'm a PhD student working with Gerton Lunter, and here I've added our algorithm
smcsmc
to the analysis pipeline. As the input format is similar tomsmc
, there is not a lot of additional overhead needed.Details about the algorithm may be found here and the implementation is (just released) here.
I have been able to successfully run through the whole pipeline with five replicates of the Gutenkunst model, with comparable accuracy to msmc. Here's a plot with the Gutenkunst model in black, looking at the 2nd (European-acting) population. I ran this through for both 2 and 4 haploids.
I'm working on getting
smcsmc
onconda-forge
but for the moment the easiest way to install it is manually. Since I'm working on getting it up, I haven't included any other manual install scripts likemsmc
.Changes Summary
smcsmc
essentially duplicating the rules for themsmc
analysis with different scripts. The structure is the same where the overall rule iscompound_smcsmc
which creates a plot. I've added a guide to this plot, which is slightly different thanmsmc
.config.json
and put functional ones in then_t/README.md
n_t/plots.py
for plotting replicates ofsmcsmc
.smcsmc
plots toplots.plot_all_ne_estimates
. I haven't had a chance to run the entire pipeline so I haven't had a chance to test this yet.Other
PopSim
related code. This is probably best packaged here, but I haven't figured out the best way to do this yet. One of these functions is found in theSnakefile
and the submodule is justsmcsmc.popsim
. Right now it's pretty basic.smcsmc
can do any number of haplotypes (though performance wise, 8 is a soft cap) butSnakefile
where the output was being put into the config file directory rather than the specified output directory in the config file. Just a typo.Thanks for putting together such an excellent initiative. Gerton and I are hoping to join in the call today and discuss this with everyone.
Thank you all,
Chris