zotu fasta and insect contain more zotus than zotu table #85
Comments
Related to this process as well: I have been meaning to say that I am pretty sure "all_cpus" for the derep processes uses all cores on the computer rather than all cores allotted to the run. I assume this is not the desired behavior, but if it is, no worries.
Oh yeah, this is something I've known about and meant to fix but keep forgetting. I'll make a separate issue for it.
Hey @mhoban, @cajwalsh, have you found what is causing this vsearch behavior, or how we can avoid losing these zOTUs? Relatedly, are you still pursuing the option to use the DADA2 denoiser (issue #34)?
@ericgarciaresearch @cajwalsh can you provide input files that reproduce this problem? This is the relabeled and merged fasta file that goes into vsearch. There should be a copy sitting in
It's still on the list, but I haven't made much progress. One of the issues is that DADA2 prefers that you keep your reads un-merged, so it would require some reworking to fit it into the rest of the pipeline.
@mhoban I sent the relabeled fasta file to your hawaii email. Please let me know if you cannot access it, and please don't make that data public. Some of my remaining questions are: Is the LCA taking the list of zOTUs from the fasta file? Are the zOTUs missing from the zOTU table being correctly dropped by vsearch (as chimeras or by another filter)? If so, should all downstream analyses only consider the zOTUs in the zOTU table? Let me know what you find out!
BLAST and insect use the zOTUs fasta file as their query source (and LCA uses the BLAST results as input), but the final zOTU table by necessity only contains the zOTUs in the dereplicated table ( I'll look into this further and let you know.
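If one chose to restrict downstream results to the zOTUs that actually appear in the table, the filtering could look roughly like the sketch below. It is not part of the pipeline: `keep_tabulated`, the example IDs, and the taxon labels are all made up for illustration, and whether this restriction is the right call is a separate question.

```python
def keep_tabulated(assignments, tabulated_ids):
    """Keep only the zOTU -> taxon entries whose ID is present in the zOTU table."""
    return {zotu: taxon for zotu, taxon in assignments.items() if zotu in tabulated_ids}


# Hypothetical classifier output (e.g. from insect or LCA), keyed by zOTU ID.
assignments = {"Zotu1": "Actinopterygii", "Zotu2": "unclassified", "Zotu3": "Decapoda"}
# Hypothetical set of IDs read from the first column of the zOTU table.
tabulated = {"Zotu1", "Zotu3"}

kept = keep_tabulated(assignments, tabulated)
dropped = sorted(set(assignments) - set(kept))
print("kept:", kept)
print("dropped:", dropped)
```

Reporting the dropped IDs alongside the kept ones makes it easier to check whether anything identified to family or species is being discarded.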
This issue is very likely due to the fact that Here's how vsearch is called (note emphasized
The zOTU table generation step of This is a holdover from eDNAFlow and the pipeline should be updated to allow the user to customize this value.
I tried re-running the otu table generation step with
This looks to be the same issue as #73, so I'm gonna close that one.
Cool, I think that answers my questions. Thanks!!!
Hi @mhoban, sorry, I forgot to ask: are you planning to make that fractional identity (--id) a user-defined variable? I think we might be interested in increasing this threshold, for some datasets at the very least.
Ahh, never mind, I see it: --zotu-identity
Related question: I have been increasing alpha to 5 for COI datasets (based on https://doi.org/10.1186/s12859-021-04115-6), which has generated more zOTUs. I have been keeping track in my notes of which run uses which alpha. I also have a script that harvests the settings and result stats from the output, but I couldn't find where the alpha that was used is reported in the output. I only see the min seq abundance and the denoiser.
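For context on why a larger alpha tends to yield more zOTUs: as I understand the UNOISE formulation, a sequence at edit distance d from a more abundant centroid is treated as a probable error only if its abundance ratio to that centroid is at most β(d) = 1 / 2^(αd + 1). A quick sketch of how that threshold shrinks as alpha grows (`unoise_beta` is an illustrative name, not a vsearch function):

```python
def unoise_beta(d, alpha=2.0):
    """Maximum abundance skew (read / centroid) at which a read with d
    differences is still treated as an error of the centroid."""
    return 1.0 / 2 ** (alpha * d + 1)


for alpha in (2.0, 5.0):
    thresholds = {d: unoise_beta(d, alpha) for d in (1, 2, 3)}
    print(f"alpha={alpha}: {thresholds}")
```

At one difference the threshold is 1/8 with alpha 2 and 1/64 with alpha 5, so with the higher alpha far fewer low-abundance variants get absorbed into a neighbor, and more of them survive as separate zOTUs.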
Of course, I should have done so already. But heads up that there's a (low priority) open issue to change those settings.txt files to yaml format. It may require you to update your script, but it may make things easier in terms of machine-readability.
@ericgarciaresearch this is now done as of 4046795
I noticed recently that the zOTU table was missing entries/rows for zOTUs that were classified by insect and that also appear in the fasta file. E.g. the fasta contains 10675 zOTUs but the table only has 10621, a difference of 54. Many of the missing ones were unclassified/blank or unknown eukaryotes (according to insect), but many were identified to at least order, and some to family or species.
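A quick way to list exactly which zOTUs are in the fasta but not in the table is to compare the two ID sets. This is only a sketch: it assumes the fasta headers start with the zOTU ID (optionally followed by `;`-separated annotations) and that the table is tab-separated with the ID in the first column after one header row. The in-memory data stand in for the real files.

```python
import csv
import io


def fasta_ids(handle):
    """Collect sequence IDs: the first token of each '>' header, cut at ';'."""
    ids = set()
    for line in handle:
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0].split(";")[0])
    return ids


def table_ids(handle, delimiter="\t"):
    """Collect the first-column IDs of a delimited table, skipping the header row."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # header row
    return {row[0] for row in reader if row}


# Toy stand-ins for the zOTU fasta and the zOTU table.
fasta = io.StringIO(">Zotu1\nACGT\n>Zotu2\nACGA\n>Zotu3\nACGC\n")
table = io.StringIO("#OTU ID\tsampleA\tsampleB\nZotu1\t10\t4\nZotu3\t0\t9\n")

missing = sorted(fasta_ids(fasta) - table_ids(table))
print(missing)
```

With real files, replacing the `io.StringIO` objects with `open(path)` handles should be enough.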
I also noticed that there were zOTUs in the table with fewer reads than the minimum abundance specified for vsearch (e.g. 8 is the default, but I had many zOTUs with a total read abundance below 8 across the whole dataset).
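The low-abundance rows can be pulled out of the table by summing each row across samples. Again a sketch under assumptions: a tab-separated table with one header row, the zOTU ID in the first column, and integer read counts in the remaining columns; `low_abundance` is just an illustrative helper.

```python
import csv
import io


def low_abundance(handle, min_abundance, delimiter="\t"):
    """Return {zotu_id: total_reads} for rows whose summed counts fall below min_abundance."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # header row
    flagged = {}
    for row in reader:
        if not row:
            continue
        total = sum(int(count) for count in row[1:])
        if total < min_abundance:
            flagged[row[0]] = total
    return flagged


# Toy table: Zotu2 has only 5 reads in total, below the default minimum of 8.
table = io.StringIO(
    "#OTU ID\tsampleA\tsampleB\nZotu1\t10\t4\nZotu2\t3\t2\nZotu3\t0\t9\n"
)
flagged = low_abundance(table, min_abundance=8)
print(flagged)
```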
This made me suspect that the discrepancy arises after the derep and clustering step of the vsearch process, in either the uchime or the remap/table creation step.
I don't think this is new, as I'm pretty sure I remember looking into this before and then forgetting about it. Also, as you mentioned, it seems less likely to be a problem with the pipeline itself than vsearch doing things we don't expect.