Combining sage + rescoring engines #23

unholyparrot · 2024-11-26T09:46:23Z

Hi,

it would be great if you could add support or other ways to run the pipeline for combining inputs.
For example, I would like to do quantification and apply picked group FDR for the search that was done with sage and later rescored with ms2rescore.

At first glance, this appears to be a viable option. As an example, let's take a look at a common benchmark PXD008425 that
mentioned in the sage example README.

download PXD008425 A, B and AB series and convert them into MZML format using ThermoRawFileParser with default options.
I also download iprg2016_with_labels.fasta provided in ZIP package.
perform a regular sage search with your config
```
sage sage_config.json
```

do a rescore with ms2rescore, which does a rescore and applies Mokapot

ms2rescore --psm-file results.sage.tsv --psm-file-type 'sage' --spectrum-path data/ -f iprg2016_with_labels.fasta -n 40

There is a notable break here. Since ms2rescore renames the columns of the Mokapot output and your inputs expect unambiguous Mokapot results, I've restored the original Mokapot-like form for the rescored files and returned them with original SpecIDs and Sage results.

# reading relabelled by ms2rescore mokapot results
tdf_1 = pd.read_csv("../issues/results.sage.ms2rescore.mokapot.decoy.peptides.txt", sep="\t")
tdf_2 = pd.read_csv("../issues/results.sage.ms2rescore.mokapot.peptides.txt", sep="\t")
# concat decoys and real
peptides = pd.concat([tdf_1, tdf_2], ignore_index=True)
# assign ScanNr from controller in Mokapot-like style
peptides["ScanNr"] = peptides["spectrum_id"].apply(lambda x: int(re.search("(?<=scan=)(\d+?)$", x).group()))
# reading original psm ids
sage = pd.read_csv("../issues/results.sage.tsv", sep="\t")
# parsing run
sage["run"] = sage["filename"].apply(lambda x: x.split(".")[0])
# merge by two values
df = sage.merge(peptides, left_on=['scannr', 'run'], right_on=['spectrum_id', 'run'])
# create suitable names
df["SpecId"] = df["psm_id"]
df["Label"] = df["is_target"]
df["ExpMass"] = df["expmass_y"]
df["CalcMass"] = df["calcmass_y"]
df["FileName"] = df["filename"]
df["Peptide"] = df["peptide_y"]
df["Proteins"] = df["protein_list"].apply(lambda x: ";".join(eval(x)))
# write separate decoys
df[~df["Label"]][
    ["SpecId", "Label", "ExpMass", "CalcMass", "FileName", "Peptide", "Proteins", "mokapot score", "mokapot q-value", "mokapot PEP"]
].to_csv("../issues/as_real_mokapot.proteins.decoy.txt", sep="\t", index=False)
# and normal
df[df["Label"]][
    ["SpecId", "Label", "ExpMass", "CalcMass", "FileName", "Peptide", "Proteins", "mokapot score", "mokapot q-value", "mokapot PEP"]
].to_csv("../issues/as_real_mokapot.proteins.txt", sep="\t", index=False)

now it's time to run picked_group_fdr

when I run it as default with no mention of Percolator\Mokapot, everything goes fine

python3 -u -m picked_group_fdr \
    --fasta iprg2016_with_labels.fasta \
    --sage_results results.sage.tsv \
    --sage_lfq_tsv lfq.tsv \
    --protein_groups_out default_combined_protein.tsv \
    --output_format fragpipe \
    --do_quant --lfq_min_peptide_ratios 1 \
    # the only difference is num of threads
    --methods sage --num_threads 40

when I run it with trying to use rescored results, it seems to make zero difference in results and runtime.

python3 -u -m picked_group_fdr \
    --fasta iprg2016_with_labels.fasta \
    --sage_results results.sage.tsv \
    # inserting rescore results here
    --perc_evidence as_real_mokapot.proteins.txt as_real_mokapot.proteins.decoy.txt \
    --sage_lfq_tsv lfq.tsv \
    --protein_groups_out with_perc_combined_protein.tsv \
    --output_format fragpipe \
    --do_quant --lfq_min_peptide_ratios 1 \
    --methods sage --num_threads 40

It seems that any percolator evidence is ignored, output results are the same (or am I missing the logic and the results are meant to be the same?)

when I try to overcome the ignoring Mokapot output, I reasonably obtain an error

python3 -u -m picked_group_fdr \
    --fasta iprg2016_with_labels.fasta \
    --perc_evidence as_real_mokapot.proteins.txt as_real_mokapot.proteins.decoy.txt \
    --sage_lfq_tsv lfq.tsv \
    --protein_groups_out with_perc_no_sage_combined_protein.tsv \
    --output_format fragpipe \
    --do_quant --lfq_min_peptide_ratios 1 \
    --methods sage --num_threads 40

2024-11-26 09:19:56,178 - INFO - PickedGroupFDR version 0.7.3
Copyright (c) 2020-2023 Matthew The. All rights reserved.
Written by Matthew The ([email protected]) at the
Chair of Proteomics and Bioanalytics at the Technical University of Munich.
2024-11-26 09:19:56,178 - INFO - Issued command: picked_group_fdr.py --fasta iprg2016_with_labels.fasta --perc_evidence as_real_mokapot.proteins.txt as_real_mokapot.proteins.decoy.txt --sage_lfq_tsv lfq.tsv --protein_groups_out with_perc_no_sage_combined_protein.tsv --output_format fragpipe --do_quant --lfq_min_peptide_ratios 1 --methods sage --num_threads 40
2024-11-26 09:19:56,197 - INFO - Protein group level estimation method: Picked Protein Group FDR (best Sage PEP, rescued subset protein grouping, discard shared peptides, picked group target-decoy strategy)
2024-11-26 09:19:56,197 - WARNING - No evidence input file found, skipping method "Picked Protein Group FDR". Check if an appropriate method was specified by the --methods flag.
2024-11-26 09:19:56,197 - INFO - PickedGroupFDR execution took 0.0 seconds wall clock time

This is strange, because --perc_evidence is supposed to be an alternative to --mq_evidence.

if --method is not specified, pipeline skips quantification and fails later

python3 -u -m picked_group_fdr \
    --fasta iprg2016_with_labels.fasta \
    --perc_evidence as_real_mokapot.proteins.txt as_real_mokapot.proteins.decoy.txt \
    --sage_lfq_tsv lfq.tsv \
    --protein_groups_out with_perc_no_sage_no_method_combined_protein.tsv \
    --output_format fragpipe \
    --do_quant --lfq_min_peptide_ratios 1 \
    # manually skipped method
    --num_threads 40

Despite the fact, the --sage_lfq_tsv is provided.

2024-11-26 09:15:10,177 - WARNING - Skipping quantification: need input file with precursor quantifications.

Fails just later here

>>>Traceback (most recent call last):
File "/home/user/sft/miniconda3/envs/pcs/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
File "/home/user/sft/miniconda3/envs/pcs/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
File "/home/user/sft/env_picked_group_fdr/venv/lib/python3.10/site-packages/picked_group_fdr/__main__.py", line 8, in <module>
    main(sys.argv[1:])
File "/home/user/sft/env_picked_group_fdr/venv/lib/python3.10/site-packages/picked_group_fdr/picked_group_fdr.py", line 42, in main
    run_picked_group_fdr(args)
File "/home/user/sft/env_picked_group_fdr/venv/lib/python3.10/site-packages/picked_group_fdr/picked_group_fdr.py", line 246, in run_picked_group_fdr
    run_method(
File "/home/user/sft/env_picked_group_fdr/venv/lib/python3.10/site-packages/picked_group_fdr/picked_group_fdr.py", line 310, in run_method
    (
ValueError: too many values to unpack (expected 3)

In summary, would you be so kind as to recommend the logic for the Mokapot rescored sage results for quantification and picked protein group FDR approach?

Even if you do not support rescoring with ms2rescore, what about combining sage + mokapot?

Versions and packages:

sage: 
    version: 0.14.6
    source: bioconda
ms2rescore:
    version: 3.0.3
    source: pip; conda created clear venv with Python 3.10.8 and ms2rescore installed with pip with venv activated
mokapot:
    version: 0.10.0
    source: goes with ms2rescore
PickedGroupFDR:
    version: 0.7.3
    source: pip; conda created clear venv with Python 3.10.8 and picked_group_fdr installed with pip with venv activated

The text was updated successfully, but these errors were encountered:

MatthewThe · 2024-11-26T12:12:22Z

Thank you for your detailed issue report!

At the moment there is indeed no support for rescored sage results. I will look how hard it would be to make this possible.

unholyparrot · 2024-11-26T13:35:12Z

Thank you for your detailed issue report!

At the moment there is indeed no support for rescored sage results. I will look how hard it would be to make this possible.

Thank you so much! Looking forward to the updates!

MatthewThe · 2024-11-27T10:17:23Z

Could you share your ms2rescore output files? I'm having trouble to get ms2rescore working on my machine...

unholyparrot · 2024-11-27T11:31:39Z

Could you share your ms2rescore output files? I'm having trouble to get ms2rescore working on my machine...

The attachment does not fit in 25 MB. Do you mind if I send the attachment by email to [email protected]?

MatthewThe · 2024-11-27T12:02:01Z

Yes, please do!

MatthewThe · 2024-12-03T22:17:39Z

I have now added support for the ms2rescore mokapot format: https://github.com/kusterlab/picked_group_fdr/tree/develop/data/ms2rescore_example

Please let me know if this works for you.

Note that this does not include quantification support.
The LFQ output from Sage only contains peptides below 1% FDR of the original Sage run, so newly identified peptides below 1% FDR by MS2Rescore simply do not have any quant information to work with.

One option to make quantification with rescoring possible is to build in support for FlashLFQ output, which MS2Rescore actually generates an input file for.
If that is something you are interested in, feel free to create a new issue with this feature request.

EDIT: Note that the 1% FDR quant cutoff is hard-coded in Sage: https://github.com/lazear/sage/blob/73eaf49a4e53179b9bd5329bc1f085835f5f3987/crates/sage/src/lfq.rs#L96

unholyparrot · 2024-12-05T12:53:28Z

Yes, that actually worked for me, thank you very much!

I think support for FlashLFQ would also be very useful in the future. If you want to use FlashLFQ, be aware of the recent Mokapot update.

See wfondrie/mokapot#131 for details. I think as the updated version is in main but not in release yet, the results from ms2rescore for FlashLFQ still contain file with RT in parts of minute, which might confuse you during development.

Do you think, as the Sage FDR filtering for LFQ is hardcoded, it would be possible to include iBAQ calculation for selected group fdr as an analogue? Does this seem possible as we already get all the peptides of the protein group as well as the original fasta database?

I'll try to look into this question and the newly obtained results of this commit and come back with another issue for iBAQ.

Thanks again for the quick fix!

MatthewThe · 2024-12-05T15:10:46Z

Great, glad it worked! I will definitely look into FlashLFQ support for the future.

I'm not sure how we would calculate iBAQ values, it would suffer from the exact same problem of not having quantification information for the newly identified peptides.

If you're thinking to use the ms1_intensity column of results.sage.tsv, this is not the MS1 intensity of the entire precursor peak, but the MS1 intensity at the moment the MS2 spectrum was recorded. This cannot be used for comparing intensities across samples and has been removed in Sage v0.14.5.

MatthewThe added a commit that referenced this issue Dec 3, 2024

Add support for ms2rescore percolator format (#23)

e7929f5

unholyparrot closed this as completed Dec 5, 2024

MatthewThe mentioned this issue Dec 6, 2024

Add support for FlashLFQ #24

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Combining sage + rescoring engines #23

Combining sage + rescoring engines #23

unholyparrot commented Nov 26, 2024

MatthewThe commented Nov 26, 2024

unholyparrot commented Nov 26, 2024

MatthewThe commented Nov 27, 2024

unholyparrot commented Nov 27, 2024

MatthewThe commented Nov 27, 2024

MatthewThe commented Dec 3, 2024 •

edited

Loading

unholyparrot commented Dec 5, 2024

MatthewThe commented Dec 5, 2024

Combining sage + rescoring engines #23

Combining sage + rescoring engines #23

Comments

unholyparrot commented Nov 26, 2024

MatthewThe commented Nov 26, 2024

unholyparrot commented Nov 26, 2024

MatthewThe commented Nov 27, 2024

unholyparrot commented Nov 27, 2024

MatthewThe commented Nov 27, 2024

MatthewThe commented Dec 3, 2024 • edited Loading

unholyparrot commented Dec 5, 2024

MatthewThe commented Dec 5, 2024

MatthewThe commented Dec 3, 2024 •

edited

Loading