Create high confidence datasets for ER signal sequence (true positives and true negatives) #1195

Open
ValWood opened this issue Aug 6, 2024 · 12 comments

@ValWood
Member

ValWood commented Aug 6, 2024

Around 100 in each set?

@ValWood
Member Author

ValWood commented Aug 6, 2024

I suggest using well-known proteins that already have a SigPep for the positive set.

The negative set should be fairly easy too.

@ValWood
Member Author

ValWood commented Aug 6, 2024

Here are 111 likely true positives:
https://www.pombase.org/results/from/id/9f10f4a5-a24d-4da0-8e22-ebd15948e650
Can you see if you agree?

@PCarme

PCarme commented Aug 6, 2024

They seem good to me!

@ValWood
Member Author

ValWood commented Aug 6, 2024

Here is a set of true negatives.
https://www.pombase.org/results/from/id/e8866138-2f9e-4896-8c2b-1501b53b2ae1
I included some mitochondrial proteins, as the predictors seem quite bad at those.

@kimrutherford
Member

Phobius on my desktop predicts 76 of the 111 likely true positives have signal peptides.
It predicts that 2 of the true negatives have signal peptides (SPAC3G9.01 and SPBC16C6.08c), but SignalP predicts that those two don't.

@kimrutherford
Member

> Phobius on my desktop predicts 76 of the 111 likely true positives have signal peptides.

SignalP in "fast" mode predicts 68. It finds one that Phobius doesn't (SPAC959.05c) and there are 9 that Phobius finds that "fast" SignalP doesn't:

  • SPAC12G12.12
  • SPAC1486.02c
  • SPAC630.12
  • SPAC8F11.10c
  • SPBC1105.05
  • SPBC4C3.08
  • SPBC530.16
  • SPBC8D2.17
  • SPCC736.04c
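
The comparison described above is a pair of set differences over gene IDs. A minimal sketch of that bookkeeping follows; the function name `compare_predictions` is hypothetical, and the input is a toy subset of IDs mentioned in this thread, not the full output of either predictor.

```python
def compare_predictions(tool_a: set[str], tool_b: set[str]) -> dict[str, list[str]]:
    """Split two sets of predicted gene IDs into shared and tool-specific hits."""
    return {
        "both": sorted(tool_a & tool_b),
        "only_a": sorted(tool_a - tool_b),
        "only_b": sorted(tool_b - tool_a),
    }


# Toy input (assumption): a few IDs from this thread, not real predictor output.
phobius_hits = {"SPAC12G12.12", "SPAC1486.02c", "SPAC630.12"}
signalp_hits = {"SPAC959.05c"}

result = compare_predictions(phobius_hits, signalp_hits)
print("Phobius only:", result["only_a"])
print("SignalP only:", result["only_b"])
```

With the real per-tool ID lists loaded into the two sets, the same call would give the "found by one tool but not the other" lists directly.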

@kimrutherford
Member

> SignalP in "fast" mode predicts 68.

In slow/accurate mode it finds fewer matches: 63.

In that mode there are 14 found by Phobius that SignalP doesn't report:

  • SPAC12G12.12
  • SPAC1486.02c
  • SPAC17G8.08c
  • SPAC630.12
  • SPAC8F11.10c
  • SPBC1683.08
  • SPBC4B4.08
  • SPBC4C3.08
  • SPBC530.16
  • SPBC8D2.17
  • SPBC947.10
  • SPCC1235.14
  • SPCC548.07c
  • SPCC736.04c

@ValWood
Member Author

ValWood commented Aug 6, 2024

> In that mode there are 14 found by Phobius that SignalP doesn't report:

These are all expected to have signal peptides.

@ValWood
Member Author

ValWood commented Aug 6, 2024

and what is the threshold? Or is it a binary cut-off?

@kimrutherford
Member

> and what is the threshold? Or is it a binary cut-off?

Phobius just reports whether there is a signal peptide or not, but I haven't investigated whether there are any command-line options to tweak things.

For SignalP the cutoff seems to be a likelihood of 0.5.

@kimrutherford
Member

Here are the likelihood scores for the 14 genes that SignalP says don't have signal peptides. Mostly very low. SignalP doesn't report the coordinates if the likelihood is less than the cutoff (0.5).

| gene | description | likelihood |
|---|---|---|
| SPAC17G8.08c | SPAC17G8.08c_gdt2_Golgi Ca_2___H___ antiporter Gdt2 | 0.4427 |
| SPCC548.07c | SPCC548.07c_ght1_plasma membrane high-affinity glucose_proton symporter Ght1 | 0.3333 |
| SPCC736.04c | SPCC736.04c_gma12_Golgi alpha-1,2-galactosyltransferase Gma12 | 0.2774 |
| SPAC1486.02c | SPAC1486.02c_dsc2_Golgi Dsc E3 ligase complex transmembrane subunit, C-terminal UBA domain Dsc2 | 0.1697 |
| SPBC8D2.17 | SPBC8D2.17_gmh4_Golgi alpha-1,6-galactosyltransferase Gmh4 | 0.1669 |
| SPBC947.10 | SPBC947.10_dsc1_Golgi Dsc E3 ligase complex subunit Dsc1 | 0.1666 |
| SPAC630.12 | SPAC630.12_ted2_GPI-remodelling mannose-ethanolamine phosphate phosphodiesterase Ted2 | 0.1666 |
| SPCC1235.14 | SPCC1235.14_ght5_plasma membrane high-affinity glucose_fructose_proton symporter Ght5 | 0.1251 |
| SPBC1683.08 | SPBC1683.08_ght4_plasma membrane hexose_proton symporter, unknown specificity Ght4 | 0.0122 |
| SPAC12G12.12 | SPAC12G12.12_gms2_Golgi UDP-galactose transmembrane transporter Gms2 | 0.0000 |
| SPBC4B4.08 | SPBC4B4.08_ght2_plasma membrane glucose_fructose_proton symporter Ght2 | 0.0000 |
| SPAC8F11.10c | SPAC8F11.10c_pvg1_Golgi pyruvyltransferase Pvg1 | 0.0000 |
| SPBC4C3.08 | SPBC4C3.08_otg2_alpha-1,3-galactosyltransferase Otg2 | 0.0000 |
| SPBC530.16 | SPBC530.16_ksh1_Golgi kish family protein Ksh1 | 0.0000 |
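
To see how sensitive these 14 calls are to the 0.5 cutoff, the scores listed above can be re-thresholded by hand. This is only a sketch: the helper `passing` is hypothetical, it says nothing about whether SignalP itself exposes a cutoff option, and a lower cutoff would also need checking against the true negative set.

```python
# Likelihoods copied from the scores listed above (the 14 genes SignalP rejects).
likelihoods = {
    "SPAC17G8.08c": 0.4427,
    "SPCC548.07c": 0.3333,
    "SPCC736.04c": 0.2774,
    "SPAC1486.02c": 0.1697,
    "SPBC8D2.17": 0.1669,
    "SPBC947.10": 0.1666,
    "SPAC630.12": 0.1666,
    "SPCC1235.14": 0.1251,
    "SPBC1683.08": 0.0122,
    "SPAC12G12.12": 0.0000,
    "SPBC4B4.08": 0.0000,
    "SPAC8F11.10c": 0.0000,
    "SPBC4C3.08": 0.0000,
    "SPBC530.16": 0.0000,
}


def passing(scores: dict[str, float], cutoff: float = 0.5) -> list[str]:
    """Return the gene IDs whose likelihood is at or above the cutoff."""
    return sorted(gene for gene, score in scores.items() if score >= cutoff)


# Hypothetical alternative cutoffs, chosen only for illustration.
for cutoff in (0.5, 0.4, 0.3, 0.15):
    print(f"cutoff {cutoff}: {len(passing(likelihoods, cutoff))} of {len(likelihoods)} pass")
```

Even a cutoff of 0.15 would recover only half of these genes, and five of them score 0.0000, so no threshold change could rescue those.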

@kimrutherford
Member

I've just tried the "111 likely true positives" sequences in DeepSig and there were 69 matches. All of those were also predicted by Phobius. There were 17 predicted by Phobius that were not predicted by DeepSig.

Disappointing!
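
For a side-by-side view, the hit counts reported in this thread can be turned into recall on the positive set. A minimal sketch, assuming all 111 proteins really do have signal peptides (they are described above only as "likely" true positives); the `recall` helper is hypothetical.

```python
POSITIVE_SET_SIZE = 111

# Hit counts as reported in the comments above.
hits = {
    "Phobius": 76,
    "SignalP (fast)": 68,
    "SignalP (slow/accurate)": 63,
    "DeepSig": 69,
}


def recall(true_positives: int, total_positives: int) -> float:
    """Fraction of the positive set that a tool predicted as having a signal peptide."""
    return true_positives / total_positives


for tool, count in hits.items():
    print(f"{tool}: {count}/{POSITIVE_SET_SIZE} = {recall(count, POSITIVE_SET_SIZE):.1%}")
```

Recall alone ignores false positives, so a complete comparison would also need each tool's calls on the true negative set.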
