Fix modifications #66

Closed
manulera opened this issue Jul 19, 2023 · 10 comments

Comments

@manulera
Contributor

No description provided.

@manulera
Contributor Author

manulera commented Aug 10, 2023

Hi @kimrutherford, just summarising what we discussed in the call today. The pipeline generates files for protein modifications similar to the ones it generates for alleles, and they can be used to fix the modifications in the PHAF files and in Canto.

They are in this folder: https://github.com/pombase/allele_qc/tree/master/results

How to use the proposed fixes

For the fixing, the unique identifier of a fix is the combination of systematic_id, sequence_position and reference (reference can probably be omitted, but let's use it just in case, because that's what the script uses as the unique identifier).
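
A minimal sketch of indexing the fixes by that key, assuming the file is read as tab-separated text with the column names shown in protein_modification_auto_fix.tsv. The function name `index_fixes` and the conflict check are illustrative assumptions, not part of the pipeline:

```python
import csv
import io


def index_fixes(tsv_text):
    """Map (systematic_id, sequence_position, reference) to the new position."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fixes = {}
    for row in reader:
        key = (row["systematic_id"], row["sequence_position"], row["reference"])
        new_position = row["change_sequence_position_to"]
        # Assumption: the same key may appear on several rows, but it should
        # never point at two different new positions.
        if fixes.get(key, new_position) != new_position:
            raise ValueError(f"conflicting fixes for {key}")
        fixes[key] = new_position
    return fixes


# Reduced example: only the columns used above, values taken from the file.
example = (
    "systematic_id\tsequence_position\treference\tchange_sequence_position_to\n"
    "SPAC1834.04\tK14\tPMID:14561399\tK15\n"
    "SPCC622.09\tK119\tPMID:17374714\tK120\n"
)
fixes = index_fixes(example)
```

An annotation can then be looked up with its own (systematic_id, sequence_position, reference) triple, and left unchanged if the triple is not in the index.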

Important exceptions

Important for when you write the script that applies the changes from protein_modification_auto_fix.tsv: the file has a column solution_index, explained in pombase/canto#2689 (comment). If this column has a value, it means that the pipeline found two possible solutions, and a decision has to be made.

TLDR: If there is a value in solution_index, do not apply the fix.

Another special case to take into account is described in #62. It can happen that someone has reported a modification on a residue that no longer exists in the current gene structure (probably assigned with a high-throughput pipeline). For those cases, I have set the value of the column change_sequence_position_to to ?. Right now we only have one example:

SPAC57A7.12	ssz1	MOD:00046	experimental evidence	S12	present_during(GO:0000087)	PMID:21712547	4896	2011-06-28	S12	?	old_coords_fix, revision 8148: complement(join(1515089..1516663,1516789..1516914))

But more are likely to happen in the future. These can either be deleted, or kept knowing that they have a sequence error.
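
The two exceptions above could be checked per row roughly like this. It is only a sketch: the function name `fix_decision` and its return labels are made up for illustration, and the rows are assumed to be dicts keyed by the column names of protein_modification_auto_fix.tsv:

```python
def fix_decision(row):
    """Classify one fix row as 'ambiguous', 'unknown_position' or 'apply'."""
    # A value in solution_index means the pipeline found more than one
    # possible solution, so the fix must not be applied automatically.
    if (row.get("solution_index") or "").strip():
        return "ambiguous"
    # "?" marks a residue that no longer exists in the current gene structure.
    if row["change_sequence_position_to"].strip() == "?":
        return "unknown_position"
    return "apply"


ssz1_row = {
    "systematic_id": "SPAC57A7.12",
    "sequence_position": "S12",
    "reference": "PMID:21712547",
    "change_sequence_position_to": "?",
    "solution_index": "",
}
decision = fix_decision(ssz1_row)
```

Rows classified as unknown_position can then be either deleted or kept with their known sequence error, as described above.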

Related to #63

kimrutherford added a commit to pombase/canto that referenced this issue Aug 12, 2023
kimrutherford added a commit to pombase/canto that referenced this issue Aug 12, 2023
@kimrutherford
Member

I've now applied these changes to Canto. I'll apply the fixes to the modifications in SVN next.

08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
28333b01f58bc586: changing SPBC4F6.12 PMID:34133210 MOD:00696 S3, S24, S31, T55, T64, S67, S97, S136, T214 to S3,S24,S31,T55,T64,S67,S97,S136,T214
521475f7c063d784: changing SPBC16G5.15c PMID:18235227 MOD:00046 S321A to S321
5339c3839d6a7634: changing SPAC1834.04 PMID:20299449 MOD:00723 K4 to K5
5339c3839d6a7634: changing SPAC1834.04 PMID:20299449 MOD:00723 K4 to K5
5339c3839d6a7634: changing SPAC1834.04 PMID:20299449 MOD:00723 K9 to K10
5339c3839d6a7634: changing SPAC1834.04 PMID:20299449 MOD:00723 K56 to K57
536dc2e074eee139: changing SPCC4E9.01c PMID:25993311 MOD:00696 S10|S22|S43|S150|S439|S496 to S10,S22,S43,S150,S439,S496
536dc2e074eee139: changing SPCC4E9.01c PMID:25993311 MOD:00696 T60|T70 to T60,T70
767451d8f8ef6abe: changing SPAC6G9.08 PMID:21182284 MOD:00046 S129 to S130
767451d8f8ef6abe: changing SPAC6G9.08 PMID:21182284 MOD:00046 S133 to S134
767451d8f8ef6abe: changing SPAC6G9.08 PMID:21182284 MOD:00046 S359 to S360
767451d8f8ef6abe: changing SPAC6G9.08 PMID:21182284 MOD:00047 T143 to T144
7bf1fc1e6f06a613: changing SPCC338.08 PMID:33836577 MOD:00047 T89, T154, T155 to T89,T154,T155
7bf1fc1e6f06a613: changing SPCC338.08 PMID:33836577 MOD:00046 S77, S151 to S77,S151
884c35ae47e3fec8: changing SPBC1A4.03c PMID:30635402 MOD:00046 S1363, S1364 to S1363,S1364
99f58cdf989ca814: changing SPCC622.08c PMID:19965387 MOD:00046 S121 to S122
99f58cdf989ca814: changing SPAC19G12.06c PMID:19965387 MOD:00046 S121 to S122
9b5edbe6f0efcb45: changing SPAC1834.04 PMID:17369611 MOD:00723 K56 to K57
9b5edbe6f0efcb45: changing SPBC8D2.04 PMID:17369611 MOD:00723 K56 to K57
9b5edbe6f0efcb45: changing SPBC1105.11c PMID:17369611 MOD:00723 K56 to K57
9d9a265db15a87cd: changing SPBP23A10.10 PMID:27191590 MOD:00696 S630, S632 to S630,S632
a09af17a2956146d: changing SPAC1834.04 PMID:31468675 MOD:01148 K14 to K15
a09af17a2956146d: changing SPBC8D2.04 PMID:31468675 MOD:01148 K14 to K15
a09af17a2956146d: changing SPBC1105.11c PMID:31468675 MOD:01148 K14 to K15
b2ae716b0ad7c3cb: changing SPCC338.17c PMID:28438891 MOD:00046 S163, S164,S165, S174, S209, S216, S219,S223, S226, S444, S507, S544, S545, S553 to S163,S164,S165,S174,S209,S216,S219,S223,S226,S444,S507,S544,S545,S553
c0af69aa51ff9eff: changing SPAC1834.04 PMID:11792803 MOD:00046 S10 to S11
c0af69aa51ff9eff: changing SPBC8D2.04 PMID:11792803 MOD:00046 S10 to S11
c0af69aa51ff9eff: changing SPBC1105.11c PMID:11792803 MOD:00046 S10 to S11
d62844597282017d: changing SPBC14C8.07c PMID:9353247 MOD:00047 T10A to T10
d62844597282017d: changing SPBC14C8.07c PMID:9353247 MOD:00047 T46A to T46
d62844597282017d: changing SPBC14C8.07c PMID:9353247 MOD:00047 T60A to T60
d62844597282017d: changing SPBC14C8.07c PMID:9353247 MOD:00047 T104A to T104
d62844597282017d: changing SPBC14C8.07c PMID:9353247 MOD:00047 T134A to T134
d62844597282017d: changing SPBC14C8.07c PMID:9353247 MOD:00047 T374A to T374
db3533d819cff33d: changing SPCC162.07 PMID:23297348 MOD:00046 S220 to S216
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 S202A to S202
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 S229A to S229
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 S244A to S244
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 S278A to S278
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 S294A to S294
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 T393A to T393
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 T831A to T831
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 T908A to T908
e68d23abf86a3c7c: changing SPAC17G8.10c PMID:34674264 MOD:00046 S4, S20, S166, S251, S266 to S4,S20,S166,S251,S266
e865b65eeb6f06b0: changing SPAC1834.04 PMID:29136238 MOD:00696 Y41 to Y42
e865b65eeb6f06b0: changing SPBC8D2.04 PMID:29136238 MOD:00696 Y41 to Y42
e865b65eeb6f06b0: changing SPBC1105.11c PMID:29136238 MOD:00696 Y41 to Y42
f30149c5fcc7f553: changing SPAC17G8.10c PMID:29975113 MOD:01148 K3, K26, K54, K82, K124, K164, K174, K237, K262 to K3,K26,K54,K82,K124,K164,K174,K237,K262
f45b7c9c20201a38: changing SPAC20H4.06c PMID:36361590 MOD:00046 S239, S308, S312 to S239,S308,S312
f45b7c9c20201a38: changing SPCC188.11 PMID:36361590 MOD:00046 S228, S236 to S228,S236
f45b7c9c20201a38: changing SPAC4D7.03 PMID:36361590 MOD:00047 T657, T666, T669 to T657,T666,T669
f7e6c33889ea1fa0: changing SPAC11E3.03 PMID:20935472 MOD:00696 S47 to S87
fe6e8e353ea78411: changing SPAP8A3.08 PMID:10364209 MOD:00046 S2A to S2
fe6e8e353ea78411: changing SPAP8A3.08 PMID:10364209 MOD:00046 S6A to S6

@kimrutherford
Member

kimrutherford commented Aug 12, 2023

I'll apply the fixes to the modifications in SVN next.

That's done too now. I'll check Chado after tomorrow's load.

Edit - these changes were made:

skipping change where new position is unknown: SPAC57A7.12 MOD:00046 S12->? PMID:21712547
external_data/modification_files/PMID_21712547_modifications.tsv: changing SPAC57A7.12 MOD:00046 S500 to S484
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S200 to S184
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S202 to S186
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S212 to S196
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S224 to S208
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S229 to S213
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S232 to S216
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S237 to S221
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S239 to S223
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S309 to S293
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S316 to S300
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S337 to S321
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S345 to S329
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S354 to S338
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S376 to S360
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S381 to S365
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S383 to S367
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S409 to S393
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S411 to S395
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S419 to S403
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S426 to S410
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S445 to S429
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S447 to S431
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S455 to S439
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00047 T129 to T113
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00047 T257 to T241
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00047 T352 to T336
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00047 T375 to T359
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00047 T439 to T423
external_data/modification_files/PMID_30726745_modifications.tsv: changing SPAC3H1.05 MOD:00046 S440 to S410
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S202 to S186
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S212 to S196
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S239 to S223
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S309 to S293
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S337 to S321
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S345 to S329
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S354 to S338
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S381 to S365
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S383 to S367
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S434 to S418
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00047 T129 to T113
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00047 T257 to T241
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00047 T352 to T336
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00047 T375 to T359

@manulera
Contributor Author

manulera commented Aug 14, 2023

Hi @kimrutherford, it seems like most of them went through, except for a few: the one with the "?" (expected), but also some histone_fix ones.

https://github.com/pombase/allele_qc/blob/master/results/protein_modification_auto_fix.tsv

systematic_id	primary_name	modification	evidence	sequence_position	annotation_extension	reference	taxon	date	sequence_error	change_sequence_position_to	auto_fix_comment	solution_index
SPAC1834.03c	hhf1	MOD:00663	Inferred from Sequence or Structural Similarity	K20		PB_REF:0000001	4896	2010-03-11	K20	K21	histone_fix	
SPAC1834.04	hht1	MOD:00663		K14	present_during(GO:0031508)	PMID:14561399	4896	2010-03-11	K14	K15	histone_fix	
SPAC1834.04	hht1	MOD:00663	Inferred from Sequence or Structural Similarity	K4		PB_REF:0000001	4896	2010-03-11	K4	K5	histone_fix	
SPAC1834.04	hht1	MOD:00663		K9	present_during(GO:0031508)	PMID:14561399	4896	2010-03-11	K9	K10	histone_fix	
SPAC57A7.12	ssz1	MOD:00046	experimental evidence	S12	present_during(GO:0000087)	PMID:21712547	4896	2011-06-28	S12	?	old_coords_fix, revision 8148: complement(join(1515089..1516663,1516789..1516914))	
SPBC1105.11c	hht3	MOD:00427	Inferred from Sequence or Structural Similarity	K4		PB_REF:0000001	4896	2010-03-11	K4	K5	histone_fix	
SPBC1105.11c	hht3	MOD:00427	Inferred from Sequence or Structural Similarity	K9		PB_REF:0000001	4896	2010-03-11	K9	K10	histone_fix	
SPBC1105.12	hhf3	MOD:00427	Inferred from Sequence or Structural Similarity	K20		PB_REF:0000001	4896	2010-03-11	K20	K21	histone_fix	
SPBC8D2.03c	hhf2	MOD:00427	Inferred from Sequence or Structural Similarity	K20		PB_REF:0000001	4896	2010-03-11	K20	K21	histone_fix	
SPBC8D2.04	hht2	MOD:00427	Inferred from Sequence or Structural Similarity	K4		PB_REF:0000001	4896	2010-03-11	K4	K5	histone_fix	
SPBC8D2.04	hht2	MOD:00427	Inferred from Sequence or Structural Similarity	K9		PB_REF:0000001	4896	2010-03-11	K9	K10	histone_fix	
SPCC622.09	htb1	MOD:01148	Inferred from Direct Assay	K119		PMID:17374714	4896	2007-07-16	K119	K120	histone_fix	

@manulera
Contributor Author

I think I see why:

  • Some have PB_REF:0000001 as a reference (don't know what that is)
  • Some are missing the evidence value
  • This one I cannot say:
SPCC622.09	htb1	MOD:01148	Inferred from Direct Assay	K119		PMID:17374714	4896	2007-07-16	K119	K120	histone_fix	
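
To list which rows fall under the first two guesses, a check along these lines could be used. This is a sketch only: the two conditions are the guesses from the bullets above, not confirmed causes, and the name `possible_mismatch_reasons` is hypothetical. Rows are assumed to be dicts keyed by the column names of protein_modification_auto_fix.tsv:

```python
def possible_mismatch_reasons(row):
    """Return the guessed reasons why a fix row may not have been applied."""
    reasons = []
    # Guess 1: the reference is not a PMID (e.g. PB_REF:0000001).
    if not row["reference"].startswith("PMID:"):
        reasons.append("non-PMID reference")
    # Guess 2: the evidence column is empty.
    if not row["evidence"].strip():
        reasons.append("missing evidence")
    return reasons


rows = [
    {"systematic_id": "SPAC1834.03c", "sequence_position": "K20",
     "evidence": "Inferred from Sequence or Structural Similarity",
     "reference": "PB_REF:0000001"},
    {"systematic_id": "SPAC1834.04", "sequence_position": "K14",
     "evidence": "", "reference": "PMID:14561399"},
    {"systematic_id": "SPCC622.09", "sequence_position": "K119",
     "evidence": "Inferred from Direct Assay", "reference": "PMID:17374714"},
]
report = {
    (r["systematic_id"], r["sequence_position"]): possible_mismatch_reasons(r)
    for r in rows
}
```

The htb1 K119 row matches neither condition, which is consistent with it still being unexplained.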

@manulera
Contributor Author

Hi @kimrutherford, as I said today, some new alleles have appeared in the allele list that did not exist before, for instance

SPAC13C5.03	D543->stop	tht1	tht1-D543*		nonsense mutation	PMID:9442101

It did not appear before because this allele has no annotations in Canto, so it was dropped and not run through the previous pipeline. Not sure how we want to handle that, maybe you can filter that list before exporting it. I am pretty sure there is a lot of garbage on alleles without annotations.

Also, the mysterious unfixed modification K119 might be related to the ones mentioned in #83?

@kimrutherford
Member

Not sure how we want to handle that, maybe you can filter that list before exporting it. I am pretty sure there is a lot of garbage on alleles without annotations.

Hi Manu. The Canto allele export file has an "annotation_count" column. Could you ignore alleles where that column is zero?

@manulera
Contributor Author

The Canto allele export file has an "anno

Yes, I can use that to filter them out.

manulera added a commit that referenced this issue Aug 16, 2023
@manulera
Contributor Author

Hi @kimrutherford, I did this in 3ef4c14

I am not just removing the alleles that have zero annotations in the Canto file, in case there is an allele in Canto without annotations but with annotations in the PHAF files. See below to check that it makes sense:

https://github.com/pombase/allele_qc/blob/master/filter_alleles_pombase.py
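
As a rough illustration of that rule (not the contents of filter_alleles_pombase.py), an allele is kept if it has annotations in Canto or also appears in the PHAF files. The dict keys other than annotation_count, the (systematic_id, allele_name) matching and the name `keep_allele` are assumptions made for this sketch:

```python
def keep_allele(allele, phaf_alleles):
    """Keep alleles annotated in Canto or present in the PHAF files."""
    # annotation_count comes from the Canto allele export; treat empty as zero.
    has_canto_annotations = int(allele.get("annotation_count") or 0) > 0
    # Assumption: PHAF alleles are matched on (systematic_id, allele_name).
    in_phaf = (allele["systematic_id"], allele["allele_name"]) in phaf_alleles
    return has_canto_annotations or in_phaf


# Example with the allele mentioned earlier in this thread.
tht1 = {
    "systematic_id": "SPAC13C5.03",
    "allele_name": "tht1-D543*",
    "annotation_count": "0",
}
kept_without_phaf = keep_allele(tht1, phaf_alleles=set())
kept_with_phaf = keep_allele(tht1, phaf_alleles={("SPAC13C5.03", "tht1-D543*")})
```

With this rule the tht1-D543* allele is only dropped if it is also absent from the PHAF files.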

@manulera
Contributor Author

I think we can close this one
