Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

only 100 cells output from feature barcoding data #136

Closed
crazyhottommy opened this issue Feb 16, 2024 · 20 comments
Closed

only 100 cells output from feature barcoding data #136

crazyhottommy opened this issue Feb 16, 2024 · 20 comments

Comments

@crazyhottommy
Copy link

crazyhottommy commented Feb 16, 2024

Hi Alevin-fry developers,

I am using simpleaf (simpleaf 0.14.1) to quantify the protein ADT data. somehow, only 109 cells in the final quantification. The corresponding RNA data gives me 10K cells. Anything I am doing wrong here?

It is 10x5' v2 data

This is the command:

simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_N
LS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1
_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz --threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index --chemist
ry 10xv2 --resolution cr-like --expected-ori rc --unfiltered-pl --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv
 --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_quant

version:

 cat af_quant/quant.json
{
  "alt_resolved_cell_numbers": [],
  "cmd": "/home/mtang/miniconda3/envs/af/bin/alevin-fry quant -i /mnt/disks/tommy/af_test_workdir/NLS164_adt_quant/af_quant -o /mnt/disks/tommy/af_test_workdir/NLS164_adt_quant/af_quant -t 16 -m /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv -r cr-like",
  "dump_eq": false,
  "empty_resolved_cell_numbers": [],
  "num_genes": 7,
  "num_quantified_cells": 109,
  "quant_options": {
    "cmdline": "/home/mtang/miniconda3/envs/af/bin/alevin-fry quant -i /mnt/disks/tommy/af_test_workdir/NLS164_adt_quant/af_quant -o /mnt/disks/tommy/af_test_workdir/NLS164_adt_quant/af_quant -t 16 -m /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv -r cr-like",
    "dump_eq": false,
    "filter_list": null,
    "init_uniform": false,
    "input_dir": "/mnt/disks/tommy/af_test_workdir/NLS164_adt_quant/af_quant",
    "large_graph_thresh": 0,
    "num_bootstraps": 0,
    "num_threads": 16,
    "output_dir": "/mnt/disks/tommy/af_test_workdir/NLS164_adt_quant/af_quant",
    "pug_exact_umi": false,
    "resolution": "CellRangerLike",
    "sa_model": "WinnerTakeAll",
    "small_thresh": 10,
    "summary_stat": false,
    "tg_map": "/mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv",
    "use_mtx": true,
    "version": "0.8.2"
  },
  "resolution_strategy": "CellRangerLike",
  "usa_mode": false,
  "version_str": "0.8.2"

This is the log

cat af_map/logs/salmon_quant.log
[2024-02-16 17:09:52.258] [jointLog] [info] setting maxHashResizeThreads to 16
[2024-02-16 17:09:52.258] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2024-02-16 17:09:52.258] [jointLog] [info] The --mimicBT2, --mimicStrictBT2 and --hardFilter flags imply mapping validation (--validateMappings). Enabling mapping validation.
[2024-02-16 17:09:52.258] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2024-02-16 17:09:52.258] [jointLog] [info] The use of range-factorized equivalence classes does not make sense in conjunction with --hardFilter.  Disabling range-factorized equivalence classes.
[2024-02-16 17:09:52.258] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2024-02-16 17:09:52.258] [jointLog] [info] Using default value of 0.87 for minScoreFraction in Alevin
Using default value of 0.6 for consensusSlack in Alevin
[2024-02-16 17:09:52.258] [jointLog] [info] There is 1 library.
[2024-02-16 17:09:52.258] [jointLog] [info] Loading pufferfish index
[2024-02-16 17:09:52.258] [jointLog] [info] Loading dense pufferfish index.
[2024-02-16 17:09:52.259] [jointLog] [info] done
[2024-02-16 17:09:52.305] [jointLog] [info] Index contained 7 targets
[2024-02-16 17:09:52.305] [jointLog] [info] Number of decoys : 0
[2024-02-16 17:10:37.835] [jointLog] [info] Number uniquely mapped : 14848637
[2024-02-16 17:10:37.890] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2024-02-16 17:10:37.890] [jointLog] [info] Counted 0 total reads in the equivalence classes
[2024-02-16 17:10:37.890] [jointLog] [info] Selectively-aligned 14848637 total fragments out of 18461815
[2024-02-16 17:10:37.890] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2024-02-16 17:10:37.890] [jointLog] [warning] Found 1300 reads with `N` in the UMI sequence and ignored the reads.
Please report on github if this number is too large
[2024-02-16 17:10:37.890] [jointLog] [info] finished sc_align()

Files to reproduce the results

cat adt.tsv

CCR7    AGTTCAGTCAACCGA
CD45RA  TCAATCCTTCCGCTT
CD45RO  CTCCGAATCATGTTG
CD161   GTACGCAGTCCTTCT
CD8A    GCTGCGCTTTCCATT
IgG     CTGGAGCGATTAGAA
PD1     ACAGCGCCGTATTTA

# build index for the adt 
salmon index -t adt.tsv -i adt_index --features -k7

## gene to transcript mapping, it is the same column 1
awk '{print $1"\t"$1;}' adt.tsv > t2g_adt.tsv

Fastq files can be found at: https://drive.google.com/drive/folders/1diN0ybVo1mha27mvARYrAzAOqUoaM6KP?usp=sharing

Let me if you need any other information.
Thanks a lot for looking into this!

Best,
Tommy

@rob-p
Copy link
Contributor

rob-p commented Feb 18, 2024

Thanks for the report, @crazyhottommy! Pinging @DongzeHE here to help work toward a quick resolution :).

@DongzeHE
Copy link
Contributor

DongzeHE commented Feb 18, 2024

Hello @crazyhottommy,

Thank you very much for choosing alevin-fry.

Yes, processing 10X Feature barcoding is a little bit complicated because it uses a different set of barcodes than GEX (see here). Therefore, you could not directly use the 10x whitelist, in your case, passing --unfiltered-pl without specifying a path, when calling alevin-fry/simpleaf. There are two ways to overcome this:

  1. You can download the feature barcode whitelist provided by 10X, the file named CG000193_Barcode_Whitelist_forCustom_Feature_Barcoding conjugates_RevA.txt on this page, pass this to --unfiltered-pl, and later map feature barcodes (the second column) to 10x barcodes (the first column), or vise versa, using this "translation" file (https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/translation/3M-february-2018.txt.gz). By doing so, you will get two lists of unfiltered barcodes representing the two modalities. You can then take the overlap and then call high-quality cells using emptyDrops or any methods you like.
  2. We provide a simpleaf workflow template for processing feature barcode data. You can check this tutorial for details. In that template, we first processed the GEX modality using the 10X whitelist. Then, we convert the barcodes exported by simpleaf (the row names of the GEX gene count matrix) into feature barcodes, and use the converted barcodes as the unfiltered whitelist to process the feature barcode modality.

Hope the above can help you to solve the problem. Please let us know if there are any other questions!

Best,
Dongze

@crazyhottommy
Copy link
Author

crazyhottommy commented Feb 18, 2024

Thanks @rob-p and @DongzeHE for the quick reply! I am aware that the feature barcode is different from the RNA.
I previously wrote a series of posts:

https://divingintogeneticsandgenomics.com/post/how-to-use-salmon-alevin-to-preprocess-cite-seq-data/

For this PBMC dataset, I did not provide the --unfiltered-pl with a path for the ADT data, and it worked fine
and later I translated the barcode as you suggested shown below. That dataset is 3' 10xV3 though (note, I specified --expected-ori fw)

https://divingintogeneticsandgenomics.com/post/cite-seq-downstream-analysis-from-alevin-output-to-seurat-visualization/

Thanks for your answer, and I will try it again with your suggestion.

PS, I did read the Solution 2 tutorial before, and from "only" my experience, I still prefer to use explicit commands rather than a pre-configured workflow:)

Best,
Tommy

@DongzeHE
Copy link
Contributor

Oh Ok, if it is CITE-seq instead of 10X feature barcode, then the problem might not come from the barcodes.

@rob-p : can simpleaf process 10x5' v2 data now?

@DongzeHE
Copy link
Contributor

Hi @crazyhottommy, after going through your command again, there are two things we can check:

  1. I am not sure if you should set --expected-ori rc when processing ADT because based on my experience, the orientation of feature barcodes is forward. Could you try --expected-ori fw?
  2. I saw that the read length of your read2 file is 90. As you specified --chemistry 10xv2, I guess simpleaf will try to map the entire read2 against the index. Also, I noticed that almost all read2s end with AGATCTGAAGAGAGTCG or similar sequences with a very small hamming distance. If the data is from an assay that uses short feature barcodes, for example, the one described at here, as the read2s include the feature barcode and other oligonucleotides, I think we need to make sure that only the "barcode" part is used for mapping.

Let me know if there are any doubts. Thank you very much!

Best,
Dongze

@rob-p
Copy link
Contributor

rob-p commented Feb 18, 2024

Right! So we can process 5’ data, but currently only read 1 is used for biological mapping. Incorporating paired end constraints from what remains of read 2 is an imminent feature and should land soon, but right now the thing to do is treat the barcode read in 5’ as essentially fully technical. We should have an FAQ about this (and update it as new features land).

@crazyhottommy
Copy link
Author

Thanks, @rob-p and @DongzeHE
Things I have tried:

  1. use the white list, but now it gives me no cells...
simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz --threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index --chemistry 10xv2 --resolution cr-like --expected-ori rc --unfiltered-pl CG000193_Barcode_Whitelist_forCustom_Feature_Barcoding_conjugates_RevA.txt --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_whitelist_quant
  1. I tried --expected-ori fw
    Btw, at least for the RNA data, I read Update documentation to include recommended processing for 10x scRNA 5' V2  #118 for 10x5'v2, specify -d rc. alevin library type for 10x Chromium v3 salmon#674

without whitelist

simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz --threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index --chemistry 10xv2 --resolution cr-like --expected-ori fw --unfiltered-pl --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant

This gives me 70019 unfiltered cells, which seems to be right.

and with whitelist:

simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz --threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index --chemistry 10xv2 --resolution cr-like --expected-ori fw --unfiltered-pl CG000193_Barcode_Whitelist_forCustom_Feature_Barcoding_conjugates_RevA.txt --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_whitelist_fw_quant

This gives me 0 cells too.

@DongzeHE "I saw that the read length of your read2 file is 90. As you specified --chemistry 10xv2, I guess simpleaf will try to map the entire read2 against the index."

Read2 is 90 bases, and the feature barcode is short:

cat adt.tsv

CCR7    AGTTCAGTCAACCGA
CD45RA  TCAATCCTTCCGCTT
CD45RO  CTCCGAATCATGTTG
CD161   GTACGCAGTCCTTCT
CD8A    GCTGCGCTTTCCATT
IgG     CTGGAGCGATTAGAA
PD1     ACAGCGCCGTATTTA

How should I specify the command to restrict it to only the feature barcode length?

@rob-p "the thing to do is treat the barcode read in 5’ as essentially fully technical. " How should one specify it in the command?

It is a bit confusing for me to specify the right arguments for different technologies, 10xv2, 10xv3, it will be great to have a FAQ on this!

Thanks again for this awesome tool!

Tommy

@DongzeHE
Copy link
Contributor

DongzeHE commented Feb 18, 2024

Hi @crazyhottommy, cool! looks like we are very close to the answer! Based on my understanding of this excellent post, simpleaf can handle 10x Chromium V5 VDJ because it is very similar to the 3' kits. It could be the case that we need to customize it, for example specifying rc as the orientation when processing the GEX modality. You can confirm this by specifying orientation as fw rc and both and comparing them. I will also do an investigation by myself once I get some free cycles. (Job hunting is exhausting ;P)

use the white list, but now it gives me no cells...

This means that the feature barcode file only works for 10X barcode assays, not CITE-seq. I apologize for the wrong information I provided. We should not use the feature barcode file provided by 10X.

This gives me 70019 unfiltered cells, which seems to be right.

The increased number of unfiltered cells when specifying the orientation as forward suggests that the problem actually comes from the orientation. However, I am still unsure if mapping the whole read2 to the index makes sense and why the reads can be mapped if they are longer than the indexed sequences. Could you please check the mapping rate? you can find this in the log (json) file exported by simpleaf quant.
@rob-p What do you think?

How should I specify the command to restrict it to only the feature barcode length?

Unfortunately, I think we need to do this using a custom command. To be specific, according to the structure of the feature library, feature barcode starts at the 11th base and of length 15. I confirmed with the read2 file you shared.
image

Therefore, what we can do is write a small awk command to take the part we want. Here I show an example.

zcat read2.fastq.gz | awk '{if (NR%4==2 || NR%4 == 0) {print substr($0,11,15)} else {print $0}}' > read2_fb_only.fastq
gzip read1_fb_only.fastq

By doing so, you will get a new file named read2_fb_only.fastq.gz containing only the feature barcode sequences.

The processed read2 will look like this
image

I am sure that this file can be correctly processed by simpleaf now!

Best,
Dongze

@rob-p
Copy link
Contributor

rob-p commented Feb 18, 2024

@DongzeHE : why can't we use a custom geometry flag, and specify the remainder of the read as x:. Your solution will likely work, but we can handle this already with custom geometry I think and without the need for external processing with awk.

@crazyhottommy
Copy link
Author

crazyhottommy commented Feb 18, 2024

Great, we are making progress! @DongzeHE good luck with your job hunting and thanks for helping out!

Let me clarify:

This is the 10xv3 pbmc protein ADT fastq R2 reads downloaded from https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_fastqs.tar as in the alevin tutorial

@A00228:290:H3FVWDRXX:1:1101:22923:1047 2:N:0:CAGTACTG
TTTCTATCAAGAAAGTCAAAGCACTGCGTTGGTTGCTTTAAGGCCGGTCCTAGCAATCAAAGTATTATGCTTTCGACCCAATACCTGTCTC
+
FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00228:290:H3FVWDRXX:1:1101:9661:1063 2:N:0:CAGTACTG
GGGTGTTCACCCCTGTCTCTTATACACATCTGACGCTGCCGACGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAGGGGGGGGGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFF:::,,,FFFFFFFFFF


zless -S pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz | head -2 | tail | wc -L
91

It is 91 bases.

cat adt.tsv
CD3     AACAAGACCCTTGAG
CD4     TACCCGTAATAGCGT
CD8a    ATTGGCACTCAGATG
CD14    GAAAGTCAAAGCACT
CD15    ACGAATCAATCTGTG
CD16    GTCTTTGTCAGTGCA
CD56    GTTGTCCGACAATAC
CD19    TCAACGCTTGGCTAG
CD25    GTGCATTCAACAGTA
CD45RA  GATGAGAACAGGTTT
CD45RO  TGCATGTCATCGGTG
PD-1    AAGTCGTGAGGCATG
TIGIT   TGAAGGCTCATTTGT
CD127   ACATTGACGCAACTA
IgG2a   CTCTATTCAGACCAG
IgG1    ACTCACTGGAGTCTC
IgG2b   ATCACATCGTTGCCA

The CD3 feature barcode is in the middle of R2 reads
Screen Shot 2024-02-18 at 3 04 51 PM

and I used

simpleaf quant \
--reads1 $reads1 \
--reads2 $reads2 \
--threads 16 \
--index  $AF_SAMPLE_DIR/data/adt_index \
--chemistry 10xv3 --resolution cr-like \
--expected-ori fw --unfiltered-pl \
--t2g-map $AF_SAMPLE_DIR/data/t2g_adt.tsv \
--output alevin_adt

in https://divingintogeneticsandgenomics.com/post/how-to-use-salmon-alevin-to-preprocess-cite-seq-data/ without any problems.

Why in 10x5' v2 (my case), one needs to substr the feature barcode as you showed using awk?

The output by specifying --expected-or fw without using the whitelist looks good to me, I checked the mtx file and I get many non-zero counts.

simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz --threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index --chemistry 10xv2 --resolution cr-like --expected-ori fw --unfiltered-pl  --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant

btw, which log should I check for the mapping rate?

ls *
simpleaf_quant_log.json

af_map:
alevin  aux_info  cmd_info.json  libParams  logs  map.rad  unmapped_bc_count.bin

af_quant:
alevin        featureDump.txt            map.collated.rad  permit_map.bin  unmapped_bc_count_collated.bin
collate.json  generate_permit_list.json  permit_freq.bin   quant.json

Thanks again!

my current conclusions:
for 10x5' v2, for the RNA, you need to specify --expected-ori rc
for 10x5' v2, for the protein, you need to specify --expected-ori fw
for 10x3' v3 for the RNA and protein, you need to specify --expected-ori fw.

Although I still need to understand why it is the case...

@DongzeHE
Copy link
Contributor

Hi @rob-p, here the problem is that the actual feature barcodes that should be used for mapping are contained within the 90 bases' Read2. can we use customer geometry flag to process that?

@crazyhottommy
Copy link
Author

just updated my answer in this thread:)

@DongzeHE
Copy link
Contributor

Hi @crazyhottommy,

Why in 10x5' v2 (my case), one needs to substr the feature barcode as you showed using awk?

Yes, the results look good to me. The confusion is from me about how salmon maps sequences that are longer than the indexed sequences.
So, @rob-p : say we indexed some feature barcodes of length 15. In a later stage, we need to map Read2s (of length 90) against the index, how does salmon handle this?

btw, which log should I check for the mapping rate?

To check the mapping rate, in your case, you should check this file:

/mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant/af_map/salmon_quant.log

If, in the future, you use piscem, which is the latest (and faster) mapper we provide in our ecosystem, you can find the log in simpleaf_quant_output_dir/af_map/map_info.json

my current conclusions:

I agree with your conclusions and have the same confusion. From their chemistry specification (v3 and v5), I cannot tell why we need to use rc for the 5' assay. I will also investigate this once I have free cycles.

Thank you so much for bearing with us. We will work on providing FAQs soon.

Best,
Dongze

@rob-p
Copy link
Contributor

rob-p commented Feb 18, 2024

So, @rob-p : say we indexed some feature barcodes of length 15. In a later stage, we need to map Read2s (of length 90) against the index, how does salmon handle this?

@DongzeHE: Yes; we can do this. If I understand properly, you're saying the read is longer than it needs to be (90bp), when we need to map only 15bp of it? In that case we can use a custom geometry that specifies that the read consists of only the length 15 sequence of interest. Something like 2{x[10]r[15]x:} to designate we should skip the first 10 bases, use the next 15 as the read, and discard whatever remains. Does that make sense?

@DongzeHE
Copy link
Contributor

DongzeHE commented Feb 18, 2024

Ahhh, right! I lost my mind😅. So in this case, we can do 1{b[16]u[10]x:}2{x[10]r[15]x:} and pass this to --chemistry. We do not need to use the awk command.

Then, from the example provided by @crazyhottommy, when the indexed feature barcodes are 15bp and read2s are 91bp, providing --chemistry 10xv3 without saying the configuration of read2 was enough for salmon to map reads. What's the magic happening there?

@rob-p
Copy link
Contributor

rob-p commented Feb 18, 2024

So I'm not sure if the underlying mapper being used here is salmon or piscem. But either way, the mapping process is pretty robust to "aberrant" sequence. So as long as the rest of the read isn't spuriously mapping against the (feature barcode) index, those reads are probably still getting mapped. Regardless, we should probably ignore the parts of the reads we know aren't meaningful anyway ;P.

@DongzeHE
Copy link
Contributor

Yup yup! Thank you very much for the explanations! 😃

@crazyhottommy
Copy link
Author

crazyhottommy commented Feb 19, 2024

okay, this is very helpful! so the correct command should be:

simpleaf quant --reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz --reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz \
--threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index \
--chemistry 1{b[16]u[10]x:}2{x[10]r[15]x:}  --resolution cr-like --expected-ori fw --unfiltered-pl  --t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv --output /mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant

somehow specifying --chemistry 10xv2 works too. UPDATE Feb 20, 2024. This is not right. The counts are very big for many of the ADTs.

specify the chemistry using the geometry:

simpleaf quant \
--reads1 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R1_001.fastq.gz \
--reads2 /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/fastq/HKGHGBGXV_1_0427789532_RTD362_Condition1_Batch1_5PADT_NLS164_S1_L001_R2_001.fastq.gz \
--threads 16 --index /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/adt_index \
--chemistry 1{b[16]u[10]x:}2{x[10]r[15]x:}  --resolution cr-like --expected-ori fw \
--unfiltered-pl  /mnt/disks/tommy/af_test_workdir/737K-august-2016.txt \
--t2g-map /mnt/disks/tommy/af_test_workdir/data/IMT009-cite-seq/protein/t2g_adt.tsv \
--output /mnt/disks/tommy/af_test_workdir/NLS164_adt_fw_quant

One has to specify the 10x5' v2 whitelist explicitly. https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist

@crazyhottommy
Copy link
Author

Also, I am curious how the parameters for ADT quantification are specified at https://combine-lab.github.io/alevin-fry-tutorials/2023/running-simpleaf-workflow/
I do not see how one specifies the --chemistry

@jeremymsimon
Copy link

Thanks @crazyhottommy @DongzeHE and @rob-p for working through this! I can also confirm that for 10x 5' data with HTOs, the above logic seemingly worked for me. I still do not understand why the RNA portion would be mapped with -d rc but the HTO would be mapped with -d fw but it's great we have a fix

Note for those not using simpleaf, you can do the following to process the HTO FASTQs:

salmon alevin -l ISR \
   -i hto_index \
   -1 Pool147_8_S8*R1*.fastq.gz \
   -2 Pool147_8_S8*R2*.fastq.gz \
   --read-geometry 2[11-25] \       # <- this effetively skips the first 10bp of R2
   --bc-geometry 1[1-16] \
   --umi-geometry 1[17-26] \
   -o HTO_out \
   --sketch


alevin-fry generate-permit-list \
   -d fw \
   -i HTO_out \
   -o HTO_quant \
   -k


alevin-fry collate \
   -r HTO_out \
   -i HTO_quant


alevin-fry quant \
   -m t2g_hto.tsv \
   -i HTO_quant \
   -o HTO_quant_crlike \
   -r cr-like \
   --use-mtx

Then for the matching RNA FASTQs:

salmon alevin -l ISR \
   --dumpFeatures \
   -i gencode.vM33.annotation_splici_fl85_idx \
   -1 Pool147_1_S1*R1*.fastq.gz \
   -2 Pool147_1_S1*R2*.fastq.gz \
   --chromium \
   -o RNA_out \
   -p 4 \
   --sketch \
   --rad


alevin-fry generate-permit-list \
   -d rc \
   -i RNA_out \
   -o RNA_quant \
   -k


alevin-fry collate \
   -r RNA_out \
   -i RNA_quant \
   -t 16


alevin-fry quant \
   -m gencode.vM33.annotation_splici_fl85_t2g_3col.tsv \
   -i RNA_quant \
   -o RNA_quant_crlike \
   -r cr-like \
   --use-mtx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants