only 100 cells output from feature barcoding data #136
Comments
Thanks for the report, @crazyhottommy! Pinging @DongzeHE here to help work toward a quick resolution :). |
Hello @crazyhottommy, Thank you very much for choosing alevin-fry. Yes, processing 10X Feature barcoding is a little bit complicated because it uses a different set of barcodes than GEX (see here). Therefore, you could not directly use the 10x whitelist, in your case, passing
Hope the above can help you to solve the problem. Please let us know if there are any other questions! Best, |
Thanks @rob-p and @DongzeHE for the quick reply! I am aware that the feature barcode is different from the RNA. https://divingintogeneticsandgenomics.com/post/how-to-use-salmon-alevin-to-preprocess-cite-seq-data/ For this PBMC dataset, I did not provide --unfiltered-pl with a path for the ADT data, and it worked fine. Thanks for your answer; I will try again with your suggestion. PS: I did read the Solution 2 tutorial before, and, speaking only from my own experience, I still prefer explicit commands to a pre-configured workflow :) Best, |
Oh Ok, if it is CITE-seq instead of 10X feature barcode, then the problem might not come from the barcodes. @rob-p : can simpleaf process 10x5' v2 data now? |
Hi @crazyhottommy, after going through your command again, there are two things we can check:
Let me know if there are any doubts. Thank you very much! Best, |
Right! So we can process 5’ data, but currently only read 1 is used for biological mapping. Incorporating paired end constraints from what remains of read 2 is an imminent feature and should land soon, but right now the thing to do is treat the barcode read in 5’ as essentially fully technical. We should have an FAQ about this (and update it as new features land). |
Thanks, @rob-p and @DongzeHE
simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz --threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index --chemistry 10xv2 --resolution cr-like --expected-ori rc --unfiltered-pl CG000193_Barcode_Whitelist_forCustom_Feature_Barcoding_conjugates_RevA.txt --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_whitelist_quant
without whitelist
simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz --threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index --chemistry 10xv2 --resolution cr-like --expected-ori fw --unfiltered-pl --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant
This gives me 70019 unfiltered cells, which seems to be right.
and with whitelist:
simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz --threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index --chemistry 10xv2 --resolution cr-like --expected-ori fw --unfiltered-pl CG000193_Barcode_Whitelist_forCustom_Feature_Barcoding_conjugates_RevA.txt --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_whitelist_fw_quant
This gives me 0 cells too.
@DongzeHE "I saw that the read length of your read2 file is 90. As you specified --chemistry 10xv2, I guess simpleaf will try to map the entire read2 against the index."
Read2 is 90 bases, and the feature barcode is short:
cat adt.tsv
CCR7 AGTTCAGTCAACCGA
CD45RA TCAATCCTTCCGCTT
CD45RO CTCCGAATCATGTTG
CD161 GTACGCAGTCCTTCT
CD8A GCTGCGCTTTCCATT
IgG CTGGAGCGATTAGAA
PD1 ACAGCGCCGTATTTA
How should I specify the command to restrict it to only the feature barcode length?
@rob-p "the thing to do is treat the barcode read in 5’ as essentially fully technical. " How should one specify it in the command? It is a bit confusing for me to specify the right arguments for different technologies, 10xv2, 10xv3, it will be great to have a FAQ on this! Thanks again for this awesome tool! Tommy |
Hi @crazyhottommy, cool! looks like we are very close to the answer! Based on my understanding of this excellent post, simpleaf can handle 10x Chromium V5 VDJ because it is very similar to the 3' kits. It could be the case that we need to customize it, for example specifying
This means that the feature barcode file only works for 10X barcode assays, not CITE-seq. I apologize for the wrong information I provided. We should not use the feature barcode file provided by 10X.
The increased number of unfiltered cells when specifying the orientation as forward suggests that the problem actually comes from the orientation. However, I am still unsure if mapping the whole read2 to the index makes sense and why the reads can be mapped if they are longer than the indexed sequences. Could you please check the mapping rate? you can find this in the log (json) file exported by
Unfortunately, I think we need to do this using a custom command. Specifically, according to the structure of the feature library, the feature barcode starts at the 11th base and has a length of 15. I confirmed this with the read2 file you shared. Therefore, we can write a small awk command to extract the part we want. Here is an example:
zcat read2.fastq.gz | awk '{if (NR%4==2 || NR%4 == 0) {print substr($0,11,15)} else {print $0}}' > read2_fb_only.fastq
gzip read2_fb_only.fastq
By doing so, you will get a new file named The processed read2 will look like this I am sure that this file can be correctly processed by simpleaf now! Best, |
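To make the effect of the trimming step concrete, here is a minimal sketch that runs the same awk expression on a single made-up FASTQ record. The read name, the filler bases and the quality string are invented for illustration; the start position (11th base) and the length (15) are the values stated in the comment above, and the 15-base barcode is the CCR7 entry from the adt.tsv posted earlier. The function name trim_fb is hypothetical.

```shell
# Trim the sequence and quality lines of a FASTQ stream down to the feature
# barcode. Start (11, 1-based) and length (15) are the values quoted in this
# thread; header and '+' lines pass through unchanged.
trim_fb() {
  awk '{ if (NR % 4 == 2 || NR % 4 == 0) print substr($0, 11, 15); else print $0 }'
}

# One invented record: 10 filler bases, a 15-base barcode, 5 trailing bases.
printf '%s\n' \
  '@synthetic_read_1' \
  'NNNNNNNNNNAGTTCAGTCAACCGATTTTT' \
  '+' \
  'FFFFFFFFFFIIIIIIIIIIIIIIIFFFFF' | trim_fb
```

The sequence and quality lines are cut with the same substr call, so they stay the same length, which a valid FASTQ record requires.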
@DongzeHE : why can't we use a custom geometry flag, and specify the remainder of the read as |
Great, we are making progress! @DongzeHE good luck with your job hunting and thanks for helping out! Let me clarify: these are the 10xv3 PBMC protein ADT FASTQ R2 reads downloaded from https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_fastqs.tar as in the alevin tutorial
@A00228:290:H3FVWDRXX:1:1101:22923:1047 2:N:0:CAGTACTG
TTTCTATCAAGAAAGTCAAAGCACTGCGTTGGTTGCTTTAAGGCCGGTCCTAGCAATCAAAGTATTATGCTTTCGACCCAATACCTGTCTC
+
FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00228:290:H3FVWDRXX:1:1101:9661:1063 2:N:0:CAGTACTG
GGGTGTTCACCCCTGTCTCTTATACACATCTGACGCTGCCGACGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAGGGGGGGGGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFF:::,,,FFFFFFFFFF
zless -S pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz | head -2 | tail | wc -L
91
It is 91 bases.
cat adt.tsv
CD3 AACAAGACCCTTGAG
CD4 TACCCGTAATAGCGT
CD8a ATTGGCACTCAGATG
CD14 GAAAGTCAAAGCACT
CD15 ACGAATCAATCTGTG
CD16 GTCTTTGTCAGTGCA
CD56 GTTGTCCGACAATAC
CD19 TCAACGCTTGGCTAG
CD25 GTGCATTCAACAGTA
CD45RA GATGAGAACAGGTTT
CD45RO TGCATGTCATCGGTG
PD-1 AAGTCGTGAGGCATG
TIGIT TGAAGGCTCATTTGT
CD127 ACATTGACGCAACTA
IgG2a CTCTATTCAGACCAG
IgG1 ACTCACTGGAGTCTC
IgG2b ATCACATCGTTGCCA
The CD3 feature barcode is in the middle of R2 reads and I used
simpleaf quant \
--reads1 $reads1 \
--reads2 $reads2 \
--threads 16 \
--index $AF_SAMPLE_DIR/data/adt_index \
--chemistry 10xv3 --resolution cr-like \
--expected-ori fw --unfiltered-pl \
--t2g-map $AF_SAMPLE_DIR/data/t2g_adt.tsv \
--output alevin_adt
in https://divingintogeneticsandgenomics.com/post/how-to-use-salmon-alevin-to-preprocess-cite-seq-data/ without any problems. Why in 10x5' v2 (my case), one needs to The output by specifying
simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz --threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index --chemistry 10xv2 --resolution cr-like --expected-ori fw --unfiltered-pl --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant
btw, which log should I check for the mapping rate?
ls *
simpleaf_quant_log.json
af_map:
alevin aux_info cmd_info.json libParams logs map.rad unmapped_bc_count.bin
af_quant:
alevin featureDump.txt map.collated.rad permit_map.bin unmapped_bc_count_collated.bin
collate.json generate_permit_list.json permit_freq.bin quant.json
Thanks again! My current conclusions: Although I still need to understand why it is the case... |
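One way to check where a feature barcode sits inside read 2 is to search for a known barcode from adt.tsv in an example read. The sketch below does this with awk's index function, using the first R2 read shown in the comment above and the CD14 barcode from the same comment. The function name barcode_start is hypothetical, and a single read is of course only a spot check, not a survey of the whole library.

```shell
# Print the 1-based position at which a barcode first occurs in a read
# (awk's index returns 0 when the barcode is absent).
barcode_start() {
  awk -v read="$1" -v bc="$2" 'BEGIN { print index(read, bc) }'
}

# First R2 read of the pbmc_1k_protein_v3 antibody library, as posted above.
read2='TTTCTATCAAGAAAGTCAAAGCACTGCGTTGGTTGCTTTAAGGCCGGTCCTAGCAATCAAAGTATTATGCTTTCGACCCAATACCTGTCTC'

# CD14 barcode from adt.tsv.
barcode_start "$read2" 'GAAAGTCAAAGCACT'
```

For this read the command prints 11, meaning the barcode is preceded by 10 other bases. That agrees with the "starts at the 11th base" observation made earlier in the thread for the 5' library.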
Hi @rob-p, here the problem is that the actual feature barcodes that should be used for mapping are contained within the 90-base Read2. Can we use a custom geometry flag to process that? |
just updated my answer in this thread:) |
Hi @crazyhottommy,
Yes, the results look good to me. The confusion is from me about how salmon maps sequences that are longer than the indexed sequences.
To check the mapping rate, in your case, you should check this file:
If, in the future, you use
I agree with your conclusions and have the same confusion. From their chemistry specification (v3 and v5), I cannot tell why we need to use Thank you so much for bearing with us. We will work on providing FAQs soon. Best, |
@DongzeHE: Yes; we can do this. If I understand properly, you're saying the read is longer than it needs to be (90bp), when we need to map only 15bp of it? In that case we can use a custom geometry that specifies that the read consists of only the length 15 sequence of interest. Something like |
Ahhh, right! I lost my mind😅. So in this case, we can do Then, from the example provided by @crazyhottommy, when the indexed feature barcodes are 15bp and read2s are 91bp, providing |
So I'm not sure if the underlying mapper being used here is salmon or piscem. But either way, the mapping process is pretty robust to "aberrant" sequence. So as long as the rest of the read isn't spuriously mapping against the (feature barcode) index, those reads are probably still getting mapped. Regardless, we should probably ignore the parts of the reads we know aren't meaningful anyway ;P. |
Yup yup! Thank you very much for the explanations! 😃 |
Okay, this is very helpful! So the correct command should be:
simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz \
--threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index \
--chemistry 1{b[16]u[10]x:}2{x[10]r[15]x:} --resolution cr-like --expected-ori fw --unfiltered-pl --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant
specify the chemistry using the geometry: simpleaf quant \
--reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz \
--reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz \
--threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index \
--chemistry 1{b[16]u[10]x:}2{x[10]r[15]x:} --resolution cr-like --expected-ori fw \
--unfiltered-pl /mnt/disks/tommy/af_test_workdir/737K-august-2016.txt \
--t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv \
--output /mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant
One has to specify the 10x5' v2 whitelist explicitly. https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist |
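The geometry string in the command above encodes four numbers: the cell barcode length and UMI length on read 1, then the number of bases to skip and the number of bases to map on read 2. As a small sketch, the hypothetical helper below (make_geometry is not part of simpleaf) assembles the string from those numbers, which may help when adapting it to a library with a different offset. It only reproduces the format used in this thread and makes no claim about other layouts.

```shell
# Compose a custom geometry string in the same format as the command above.
# Arguments: cell barcode length, UMI length, bases to skip on read 2,
# number of read 2 bases to map.
make_geometry() {
  printf '1{b[%d]u[%d]x:}2{x[%d]r[%d]x:}\n' "$1" "$2" "$3" "$4"
}

# Values used in this thread: 16 bp barcode, 10 bp UMI, skip 10, map 15.
geometry=$(make_geometry 16 10 10 15)
echo "$geometry"
```

Keeping the value in a variable and passing it in double quotes (for example as "$geometry") is a cautious habit here, since the shell can treat unquoted square brackets as a filename pattern.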
Also, I am curious how the parameters for ADT quantification are specified at https://combine-lab.github.io/alevin-fry-tutorials/2023/running-simpleaf-workflow/ |
Thanks @crazyhottommy @DongzeHE and @rob-p for working through this! I can also confirm that for 10x 5' data with HTOs, the above logic seemingly worked for me. I still do not understand why the RNA portion would be mapped with Note for those not using
Then for the matching RNA FASTQs:
Hi Alevin-fry developers,
I am using simpleaf (simpleaf 0.14.1) to quantify the protein ADT data. Somehow, there are only 109 cells in the final quantification. The corresponding RNA data gives me 10K cells. Am I doing anything wrong here?
It is 10x5' v2 data
This is the command:
version:
This is the log
Files to reproduce the results
Fastq files can be found at: https://drive.google.com/drive/folders/1diN0ybVo1mha27mvARYrAzAOqUoaM6KP?usp=sharing
Let me know if you need any other information.
Thanks a lot for looking into this!
Best,
Tommy