Unaccounted Hybridisation states #8

Open
JavierSanchez-Utges opened this issue Apr 5, 2024 · 8 comments

@JavierSanchez-Utges

Many of the proteins in my dataset crash in featurize_protein.py with a KeyError. The hybridisation state of the atom is SP, but only SP2 and SP3 are accounted for in the dictionary. How could this be fixed? Thanks!

@Michael-C-Strobel
Contributor

Thank you for bringing this problem to our attention. This usually happens because OpenBabel returns an SP atom when it infers bonds, which we wouldn't expect for amino acids. You could try minimizing the structures in your dataset or checking for missing atoms.

@JavierSanchez-Utges
Author

JavierSanchez-Utges commented Apr 7, 2024

That is interesting. From what I see, it happens multiple times in my structure dataset, which comprises 2K human protein structures from PDBe. Perhaps the vector representation of hybridisation could be extended from a 2-element to a 3-element vector? But I guess it would then be a different model, with different features.

I have also noticed another KeyError: 'SP3D', e.g., at atom 784 of PDB: 4y88, chain A. An example of KeyError: 'SP' is atom 3545 of PDB: 6en6, chain D.
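
For illustration, the 3-element vector suggested above might look like the following minimal sketch. The names `HYBRIDISATION_STATES` and `one_hot_hybridisation` are hypothetical and are not taken from featurize_protein.py, whose actual dictionary may be laid out differently. As noted in the comment above, a wider feature vector would not be compatible with a model trained on the 2-element encoding.

```python
# Hypothetical category list for illustration only; the real dictionary in
# featurize_protein.py is not shown in this thread and may differ.
HYBRIDISATION_STATES = ("SP", "SP2", "SP3")


def one_hot_hybridisation(state):
    """Return a one-hot list over HYBRIDISATION_STATES.

    Unknown states (for example 'SP3D') still raise KeyError, so nothing
    is silently encoded as an all-zero vector.
    """
    if state not in HYBRIDISATION_STATES:
        raise KeyError(state)
    return [1 if state == known else 0 for known in HYBRIDISATION_STATES]
```

For example, `one_hot_hybridisation("SP")` gives `[1, 0, 0]`, while `one_hot_hybridisation("SP3D")` still raises a KeyError.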

@zwsmith200
Member

Are these modified amino acids? Our policy is that we would rather have GrASP crash when we see something non-standard or low-resolution (when OB fails bond perception) so we aren't silently making predictions on features it has never seen.
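
The crash-rather-than-guess policy described above can be kept while making the failure easier to trace. The sketch below is only an illustration: the mapping `KNOWN_HYBRIDISATION`, the exception class, and the function name are all hypothetical and do not come from the GrASP code base.

```python
class UnsupportedHybridisationError(ValueError):
    """Raised when an atom has a hybridisation state the model never saw."""


# Hypothetical mapping; the thread only says that SP2 and SP3 are covered.
KNOWN_HYBRIDISATION = {"SP2": 0, "SP3": 1}


def hybridisation_index(state, atom_id, structure_id):
    """Look up a hybridisation state, failing with context if it is unknown."""
    try:
        return KNOWN_HYBRIDISATION[state]
    except KeyError:
        # Still a hard failure, but the message now names the offending atom.
        raise UnsupportedHybridisationError(
            f"Unsupported hybridisation {state!r} at atom {atom_id} "
            f"of {structure_id}; check for altLocs, modified residues, "
            f"or failed bond perception."
        ) from None
```

This keeps the behaviour of refusing to featurize unseen states, but reports which atom and structure triggered the failure instead of a bare KeyError.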

@JavierSanchez-Utges
Author

Both of these examples, and a few other atoms that crash due to unaccounted hybridisation states, are altLoc atoms. Could it be that wrong bonds are being inferred because of the multiple alternative locations and the proximity of the atoms? The structures are of really good quality.

Perhaps a step to deal with altlocs might solve this.
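
One possible form of such a step is sketched below, assuming the input is in legacy PDB format, where column 17 of ATOM/HETATM records holds the altLoc identifier. The function name `keep_first_altloc` is hypothetical, and keeping the first altLoc seen per residue is just one simple choice (selecting by occupancy would be another). It has not been tested against GrASP.

```python
def keep_first_altloc(pdb_lines):
    """Keep one alternate location per residue in PDB-format lines.

    For every residue, the first altLoc label encountered is kept and atoms
    carrying any other label are dropped. Kept atoms get a blank altLoc.
    Atom serial numbers are not renumbered, and record types other than
    ATOM/HETATM (e.g. ANISOU) pass through unchanged.
    """
    chosen = {}  # (chain ID, resSeq + iCode) -> altLoc label kept
    kept = []
    for line in pdb_lines:
        if line[:6] in ("ATOM  ", "HETATM") and len(line) > 26:
            altloc = line[16]  # column 17 in the PDB format
            if altloc != " ":
                residue = (line[21], line[22:27])
                if chosen.setdefault(residue, altloc) != altloc:
                    continue  # atom belongs to a discarded conformer
                line = line[:16] + " " + line[17:]
        kept.append(line)
    return kept
```

A typical use would be to read a file with `open(path).read().splitlines()`, pass the lines through `keep_first_altloc`, and write the result to a new file before parsing.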

@zwsmith200
Member

Okay, that makes sense. OB is probably parsing both altLocs, which breaks bond perception. As far as I understand, MDAnalysis doesn't have a standard way to handle these for us to pre-process them. If there aren't too many, I would fix the input structures by hand, or if you find a robust way to handle them, I can look into adding it.

@zwsmith200 zwsmith200 self-assigned this Apr 7, 2024
@zwsmith200
Member

I will add a check/warning that detects altLocs in the parsing code to save time debugging in the future.
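
A check of this kind could look roughly like the sketch below. It is not the check added to GrASP: the function names are hypothetical, and it assumes plain PDB-format text lines, where a non-blank character in column 17 of an ATOM/HETATM record marks an altLoc atom.

```python
import warnings


def find_altlocs(pdb_lines):
    """Return sorted (chain, residue number, residue name) tuples with altLocs."""
    hits = set()
    for line in pdb_lines:
        is_atom = line[:6] in ("ATOM  ", "HETATM") and len(line) > 26
        if is_atom and line[16] != " ":  # column 17 holds the altLoc label
            hits.add((line[21], line[22:27].strip(), line[17:20].strip()))
    return sorted(hits)


def warn_on_altlocs(pdb_lines, structure_id):
    """Emit a warning if any altLoc atoms are present; return the residues."""
    hits = find_altlocs(pdb_lines)
    if hits:
        warnings.warn(
            f"{structure_id}: {len(hits)} residue(s) contain altLoc atoms, "
            f"e.g. chain/residue {hits[0]}; bond perception may fail."
        )
    return hits
```

Returning the list of affected residues as well as warning makes it easy to keep a record of which systems contained altLocs.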

@JavierSanchez-Utges
Author

So, there is this script: https://github.com/harryjubb/pdbtools/blob/master/clean_pdb.py from Harry Jubb's group. It was meant to pre-process structures before running an older version of Arpeggio (https://github.com/harryjubb/arpeggio). It takes PDB format as input and deals with altLocs, chain breaks, etc. I will try running it before parse_files.py and see if that helps with these issues.

@zwsmith200
Member

I recommend printing something when there are altLocs so you have a record of those systems. @bodhivani said she looked at both altLocs when working with them, in case reorganization changed how accessible the site was and/or changed the predictions.
