Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zotu fasta and insect contain more zotus than zotu table #85

Closed
cajwalsh opened this issue Aug 16, 2024 · 17 comments · Fixed by #112
Closed

zotu fasta and insect contain more zotus than zotu table #85

cajwalsh opened this issue Aug 16, 2024 · 17 comments · Fixed by #112

Comments

@cajwalsh
Copy link

I noticed recently that the zotu table was missing entries/rows for zOTUs that were classified by insect. They also appear in the fasta file. E.g. the fasta contains 10675 zOTUs but the table only has 10621. Many of these were unclassified/blank or unknown eukaryotes (according to insect), but many were identified to at least order and some to family or to species.

I also noticed that there were zOTUs that had fewer than the minimum abundance specified for vsearch (e.g. 8 is the default but I had many below 8 read abundance across the whole dataset).

This made me suspect that this is coming from after the derep and clustering step of the vsearch process in either the uchime or remap/table creation step.

I don't think this is new as I'm pretty sure I remember looking into this before but forgetting, but also as you mentioned it doesn't seem likely to be something with the pipeline so much as vsearch doing things we don't expect.

@cajwalsh
Copy link
Author

related to this process as well, I have been meaning to say that I am pretty sure that "all_cpus" for the derep processes is using all cores on the computer rather than all cores given to the run to use. I assume this is not the desired functionality but if so no worries.

@mhoban
Copy link
Owner

mhoban commented Aug 19, 2024

related to this process as well, I have been meaning to say that I am pretty sure that "all_cpus" for the derep processes is using all cores on the computer rather than all cores given to the run to use. I assume this is not the desired functionality but if so no worries.

Oh yeah, this is something I've known about and meant to fix but keep forgetting. I'll make a separate issue for it.

@ericgarciaresearch
Copy link

Hey @mhoban , @cajwalsh ,
I have been running/testing your pipeline with a few datasets along with @cbird808 , and I am also seeing this behavior too. My zotu_table.tsv is missing some of the ZOTUs that are present in the zotu fasta and also the lca_taxonomy.tsv

How you found what is causing this vsearch behavior? or how can we avoid losing these ZOTUs?

Also related, are you still pursuing adding the option to use the DADA2 denoiser? (issue #34)

@mhoban
Copy link
Owner

mhoban commented Oct 8, 2024

@ericgarciaresearch @cajwalsh can you provide input files that reproduce this problem? This is the relabeled and merged fasta file that goes into vsearch. There should be a copy sitting in preprocess/merged in your analysis directory

@mhoban
Copy link
Owner

mhoban commented Oct 9, 2024

Also related, are you still pursuing adding the option to use the DADA2 denoiser? (issue #34)

It's still on the list, but I haven't made much progress. One of the issues is that DADA2 prefers that you keep your reads un-merged, so it would require some reworking to fit it into the rest of the pipeline.

@ericgarciaresearch
Copy link

@mhoban I sent the relabeled fasta file to your hawaii email. Please let me know if you cannot access it. Also please don't make that data public.

some of my remaining questions are: Is the LCA taking the list of zotus from the fasta file? are the missing zotus in the zotu table correctly being dropped by vsearch (chimera or other filter)? If yes, should all downstream analyses only consider the zotus in the zotu table?

Let me know what you find out!

@mhoban
Copy link
Owner

mhoban commented Oct 11, 2024

BLAST and insect use the zOTUs fasta file as their query source (and LCA uses the BLAST results as input), but the final zOTU table by necessity only contains the zOTUs in the dereplicated table (output/zotus/zotu_table.tsv).

I'll look into this further and let you know.

@mhoban
Copy link
Owner

mhoban commented Oct 11, 2024

This issue is very likely due to the fact that zotus.fasta is generated in one step and the zotu table is generated in a subsequent step using a similarity threshold (currently hard-coded to 97%). This is one of a number of aspects of the pipeline that are a holdover from its adaptation from eDNAFlow. I will make that threshold user-definable (which I should have done a long time ago). I suspect that tweaking that value will change which zotus end up in the fasta versus the abundance table.

Here's how vsearch is called (note emphasized --id parameter on the last line):


vsearch --threads ${task.cpus} --derep_fulllength ${relabeled_merged} --sizeout --output "${id}_unique.fasta"
vsearch --threads ${task.cpus} --cluster_unoise "${id}_unique.fasta" --centroids "${id}_centroids.fasta" --minsize ${params.minAbundance} --unoise_alpha ${params.alpha}
vsearch --threads ${task.cpus} --uchime3_denovo "${id}_centroids.fasta" --nonchimeras "${id}_zotus.fasta" --relabel Zotu 
vsearch --threads ${task.cpus} --usearch_global ${relabeled_merged} --db "${id}_zotus.fasta" --id 0.97 --otutabout zotu_table.tsv

@mhoban
Copy link
Owner

mhoban commented Oct 11, 2024

The zOTU table generation step of dereplicate currently hard-codes a "fractional identity" of 97%. In vsearch it's passed explicitly, in usearch 0.97 is the default so it isn't specified.

This is a holdover from eDNAFlow and the pipeline should be updated to allow the user to customize this value.

@mhoban
Copy link
Owner

mhoban commented Oct 11, 2024

I tried re-running the otu table generation step with --id 0.99 and all zOTUs from the fasta were included in the table.

@mhoban
Copy link
Owner

mhoban commented Oct 16, 2024

This looks to be the same issue as #73, so I'm gonna close that one.

@ericgarciaresearch
Copy link

Cool, I think that answers my questions, Thanks!!!

@ericgarciaresearch
Copy link

Hi @mhoban , sorry forgot to ask, are you planning to make that fractional identify --id a user-defined variable? I think we might be interested in increasing this threshold for some datasets at the very least

@ericgarciaresearch
Copy link

ahh nevermind, I see it!

--zotu-identity

@ericgarciaresearch
Copy link

related question:
can you report the used --alpha and --zotu-identity in /output/zotus/settings.txt?

I have been increasing alpha to 5 for COI datasets (based on https://doi.org/10.1186/s12859-021-04115-6) which has generated more zotus. I have been keeping track in my notes of which run uses which alpha but I have a script that harvests the settings and results stats from the output but I couldn't find where the used alpha is reported in the output. I only see the min seq abundance and denoiser.

@mhoban
Copy link
Owner

mhoban commented Oct 25, 2024

can you report the used --alpha and --zotu-identity in /output/zotus/settings.txt?

Of course, I should have done so already. But heads up that there's a (low priority) open issue to change those settings.txt files to yaml format. It may require you to update your script, but it may make things easier in terms of machine-readability.

@mhoban
Copy link
Owner

mhoban commented Oct 25, 2024

@ericgarciaresearch this is now done as of 4046795

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants