Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combining sage + rescoring engines #23

Closed
unholyparrot opened this issue Nov 26, 2024 · 8 comments
Closed

Combining sage + rescoring engines #23

unholyparrot opened this issue Nov 26, 2024 · 8 comments

Comments

@unholyparrot
Copy link

Hi,

it would be great if you could add support or other ways to run the pipeline for combining inputs.
For example, I would like to do quantification and apply picked group FDR for the search that was done with sage and later rescored with ms2rescore.

At first glance, this appears to be a viable option. As an example, let's take a look at a common benchmark PXD008425 that
mentioned in the sage example README.

  1. download PXD008425 A, B and AB series and convert them into MZML format using ThermoRawFileParser with default options.
    I also download iprg2016_with_labels.fasta provided in ZIP package.

  2. perform a regular sage search with your config

    sage sage_config.json
  3. do a rescore with ms2rescore, which does a rescore and applies Mokapot

    ms2rescore --psm-file results.sage.tsv --psm-file-type 'sage' --spectrum-path data/ -f iprg2016_with_labels.fasta -n 40
  4. There is a notable break here. Since ms2rescore renames the columns of the Mokapot output and your inputs expect unambiguous Mokapot results, I've restored the original Mokapot-like form for the rescored files and returned them with original SpecIDs and Sage results.

    # reading relabelled by ms2rescore mokapot results
    tdf_1 = pd.read_csv("../issues/results.sage.ms2rescore.mokapot.decoy.peptides.txt", sep="\t")
    tdf_2 = pd.read_csv("../issues/results.sage.ms2rescore.mokapot.peptides.txt", sep="\t")
    # concat decoys and real
    peptides = pd.concat([tdf_1, tdf_2], ignore_index=True)
    # assign ScanNr from controller in Mokapot-like style
    peptides["ScanNr"] = peptides["spectrum_id"].apply(lambda x: int(re.search("(?<=scan=)(\d+?)$", x).group()))
    # reading original psm ids
    sage = pd.read_csv("../issues/results.sage.tsv", sep="\t")
    # parsing run
    sage["run"] = sage["filename"].apply(lambda x: x.split(".")[0])
    # merge by two values
    df = sage.merge(peptides, left_on=['scannr', 'run'], right_on=['spectrum_id', 'run'])
    # create suitable names
    df["SpecId"] = df["psm_id"]
    df["Label"] = df["is_target"]
    df["ExpMass"] = df["expmass_y"]
    df["CalcMass"] = df["calcmass_y"]
    df["FileName"] = df["filename"]
    df["Peptide"] = df["peptide_y"]
    df["Proteins"] = df["protein_list"].apply(lambda x: ";".join(eval(x)))
    # write separate decoys
    df[~df["Label"]][
        ["SpecId", "Label", "ExpMass", "CalcMass", "FileName", "Peptide", "Proteins", "mokapot score", "mokapot q-value", "mokapot PEP"]
    ].to_csv("../issues/as_real_mokapot.proteins.decoy.txt", sep="\t", index=False)
    # and normal
    df[df["Label"]][
        ["SpecId", "Label", "ExpMass", "CalcMass", "FileName", "Peptide", "Proteins", "mokapot score", "mokapot q-value", "mokapot PEP"]
    ].to_csv("../issues/as_real_mokapot.proteins.txt", sep="\t", index=False)
  5. now it's time to run picked_group_fdr

    • when I run it as default with no mention of Percolator\Mokapot, everything goes fine
    python3 -u -m picked_group_fdr \
        --fasta iprg2016_with_labels.fasta \
        --sage_results results.sage.tsv \
        --sage_lfq_tsv lfq.tsv \
        --protein_groups_out default_combined_protein.tsv \
        --output_format fragpipe \
        --do_quant --lfq_min_peptide_ratios 1 \
        # the only difference is num of threads
        --methods sage --num_threads 40
    • when I run it with trying to use rescored results, it seems to make zero difference in results and runtime.
    python3 -u -m picked_group_fdr \
        --fasta iprg2016_with_labels.fasta \
        --sage_results results.sage.tsv \
        # inserting rescore results here
        --perc_evidence as_real_mokapot.proteins.txt as_real_mokapot.proteins.decoy.txt \
        --sage_lfq_tsv lfq.tsv \
        --protein_groups_out with_perc_combined_protein.tsv \
        --output_format fragpipe \
        --do_quant --lfq_min_peptide_ratios 1 \
        --methods sage --num_threads 40

    It seems that any percolator evidence is ignored, output results are the same (or am I missing the logic and the results are meant to be the same?)

    • when I try to overcome the ignoring Mokapot output, I reasonably obtain an error
    python3 -u -m picked_group_fdr \
        --fasta iprg2016_with_labels.fasta \
        --perc_evidence as_real_mokapot.proteins.txt as_real_mokapot.proteins.decoy.txt \
        --sage_lfq_tsv lfq.tsv \
        --protein_groups_out with_perc_no_sage_combined_protein.tsv \
        --output_format fragpipe \
        --do_quant --lfq_min_peptide_ratios 1 \
        --methods sage --num_threads 40
    2024-11-26 09:19:56,178 - INFO - PickedGroupFDR version 0.7.3
    Copyright (c) 2020-2023 Matthew The. All rights reserved.
    Written by Matthew The ([email protected]) at the
    Chair of Proteomics and Bioanalytics at the Technical University of Munich.
    2024-11-26 09:19:56,178 - INFO - Issued command: picked_group_fdr.py --fasta iprg2016_with_labels.fasta --perc_evidence as_real_mokapot.proteins.txt as_real_mokapot.proteins.decoy.txt --sage_lfq_tsv lfq.tsv --protein_groups_out with_perc_no_sage_combined_protein.tsv --output_format fragpipe --do_quant --lfq_min_peptide_ratios 1 --methods sage --num_threads 40
    2024-11-26 09:19:56,197 - INFO - Protein group level estimation method: Picked Protein Group FDR (best Sage PEP, rescued subset protein grouping, discard shared peptides, picked group target-decoy strategy)
    2024-11-26 09:19:56,197 - WARNING - No evidence input file found, skipping method "Picked Protein Group FDR". Check if an appropriate method was specified by the --methods flag.
    2024-11-26 09:19:56,197 - INFO - PickedGroupFDR execution took 0.0 seconds wall clock time
    

    This is strange, because --perc_evidence is supposed to be an alternative to --mq_evidence.

    • if --method is not specified, pipeline skips quantification and fails later
    python3 -u -m picked_group_fdr \
        --fasta iprg2016_with_labels.fasta \
        --perc_evidence as_real_mokapot.proteins.txt as_real_mokapot.proteins.decoy.txt \
        --sage_lfq_tsv lfq.tsv \
        --protein_groups_out with_perc_no_sage_no_method_combined_protein.tsv \
        --output_format fragpipe \
        --do_quant --lfq_min_peptide_ratios 1 \
        # manually skipped method
        --num_threads 40

    Despite the fact, the --sage_lfq_tsv is provided.

    2024-11-26 09:15:10,177 - WARNING - Skipping quantification: need input file with precursor quantifications.
    

    Fails just later here

    >>>Traceback (most recent call last):
    File "/home/user/sft/miniconda3/envs/pcs/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
    File "/home/user/sft/miniconda3/envs/pcs/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
    File "/home/user/sft/env_picked_group_fdr/venv/lib/python3.10/site-packages/picked_group_fdr/__main__.py", line 8, in <module>
        main(sys.argv[1:])
    File "/home/user/sft/env_picked_group_fdr/venv/lib/python3.10/site-packages/picked_group_fdr/picked_group_fdr.py", line 42, in main
        run_picked_group_fdr(args)
    File "/home/user/sft/env_picked_group_fdr/venv/lib/python3.10/site-packages/picked_group_fdr/picked_group_fdr.py", line 246, in run_picked_group_fdr
        run_method(
    File "/home/user/sft/env_picked_group_fdr/venv/lib/python3.10/site-packages/picked_group_fdr/picked_group_fdr.py", line 310, in run_method
        (
    ValueError: too many values to unpack (expected 3)
    

In summary, would you be so kind as to recommend the logic for the Mokapot rescored sage results for quantification and picked protein group FDR approach?

Even if you do not support rescoring with ms2rescore, what about combining sage + mokapot?

Versions and packages:

sage: 
    version: 0.14.6
    source: bioconda
ms2rescore:
    version: 3.0.3
    source: pip; conda created clear venv with Python 3.10.8 and ms2rescore installed with pip with venv activated
mokapot:
    version: 0.10.0
    source: goes with ms2rescore
PickedGroupFDR:
    version: 0.7.3
    source: pip; conda created clear venv with Python 3.10.8 and picked_group_fdr installed with pip with venv activated
@MatthewThe
Copy link
Member

Thank you for your detailed issue report!

At the moment there is indeed no support for rescored sage results. I will look how hard it would be to make this possible.

@unholyparrot
Copy link
Author

Thank you for your detailed issue report!

At the moment there is indeed no support for rescored sage results. I will look how hard it would be to make this possible.

Thank you so much! Looking forward to the updates!

@MatthewThe
Copy link
Member

Could you share your ms2rescore output files? I'm having trouble to get ms2rescore working on my machine...

@unholyparrot
Copy link
Author

Could you share your ms2rescore output files? I'm having trouble to get ms2rescore working on my machine...

The attachment does not fit in 25 MB. Do you mind if I send the attachment by email to [email protected]?

@MatthewThe
Copy link
Member

Yes, please do!

@MatthewThe
Copy link
Member

MatthewThe commented Dec 3, 2024

I have now added support for the ms2rescore mokapot format: https://github.com/kusterlab/picked_group_fdr/tree/develop/data/ms2rescore_example

Please let me know if this works for you.

Note that this does not include quantification support.
The LFQ output from Sage only contains peptides below 1% FDR of the original Sage run, so newly identified peptides below 1% FDR by MS2Rescore simply do not have any quant information to work with.

One option to make quantification with rescoring possible is to build in support for FlashLFQ output, which MS2Rescore actually generates an input file for.
If that is something you are interested in, feel free to create a new issue with this feature request.

EDIT: Note that the 1% FDR quant cutoff is hard-coded in Sage: https://github.com/lazear/sage/blob/73eaf49a4e53179b9bd5329bc1f085835f5f3987/crates/sage/src/lfq.rs#L96

@unholyparrot
Copy link
Author

Yes, that actually worked for me, thank you very much!

I think support for FlashLFQ would also be very useful in the future. If you want to use FlashLFQ, be aware of the recent Mokapot update.

See wfondrie/mokapot#131 for details. I think as the updated version is in main but not in release yet, the results from ms2rescore for FlashLFQ still contain file with RT in parts of minute, which might confuse you during development.

Do you think, as the Sage FDR filtering for LFQ is hardcoded, it would be possible to include iBAQ calculation for selected group fdr as an analogue? Does this seem possible as we already get all the peptides of the protein group as well as the original fasta database?

I'll try to look into this question and the newly obtained results of this commit and come back with another issue for iBAQ.

Thanks again for the quick fix!

@MatthewThe
Copy link
Member

Great, glad it worked! I will definitely look into FlashLFQ support for the future.

I'm not sure how we would calculate iBAQ values, it would suffer from the exact same problem of not having quantification information for the newly identified peptides.

If you're thinking to use the ms1_intensity column of results.sage.tsv, this is not the MS1 intensity of the entire precursor peak, but the MS1 intensity at the moment the MS2 spectrum was recorded. This cannot be used for comparing intensities across samples and has been removed in Sage v0.14.5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants