Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add mol2 parser within MoSDeF ecosystem #562

Merged
merged 90 commits into from
Oct 5, 2021

Conversation

CalCraven
Copy link
Contributor

@CalCraven CalCraven commented Jun 28, 2021

Overview

We have some support for mol2 file formats, but currently use the load_mol2 from mdtraj. We have run into some issues because of various assumptions they make in their readers, as well as the variety implemented in the TRIPOS mol2 file format. See issue #921 and #893. This is also a relevant thread for the rdkit package's issues with their own implementation PR#415.

Details

Current Issues to Address:

  • Replace mdtraj parser
  • Support varying file formatting from different sources
  • Support different system configurations (course-grained, bio-molecules, virtual sites)
  • Support reading into various aspects of MoSDeF (mBuild Compound, mBuild Lattice, gmso Topology)
  • Obvious warnings/errors when reading in nonstandardized formats

Sources of mol2 file formats to act as references

(please note if any important ones are missing)

TRIPOS Standard Record Type Indicators (RTI's) Source

Can be supported with the current state of MoSDeF

Atom – mBuild compound
Bond – mBuild compound
CRYSIN – mBuild lattice
FF_PBC – mBuild compound box
FFCON_DISTANCE – topology
FFCON_ANGLE – topology
FFCON_TORSION – topology
FFCON_RANGE – topology
FFCON_MULTI – topology
SUBSTRUCTURE – subtopology/separate compounds

Could be supported in the future

CENTROID – virtual sites
CENTER_OF_MASS – virtual sites
EXTENSION_POINT – virtual sites
LINE – virtual sites
LSPLANE – virtual sites
NORMAL – virtual sites
MOLECULE – mBuild compound (residue)

Probably will not support

ALT_TYPE
ANCHOR_ATOM
ASSOCIATED_ANNOTATION
COMMENT
DICT
DATA_FILE
RING_CLOSURE
ROTABLE_BOND
SEARCH_DIST – setting constraints
SEARCH_OPTIONS – Running a SYBYL sim
SET – Color schemes
U_FEAT - Unity support
UNITY_ATOM_ATTR - Unity support
UNITY_BOND_ATTR - Unity support

Sample Tripos File Outline Shortened from

@MOLECULE
1ABE.pdb
4651 4696 305 0 0
PROTEIN
AMBER99
@ATOM
1 N 18.3590 26.8690 52.9550 N.pl3 1 GLY2 -0.4157
2 CA 19.6350 26.6320 53.6710 C.3 1 GLY2 -0.0252
3 C 19.9620 27.8960 54.5040 C.2 1 GLY2 0.5973
4 O 19.2860 28.0330 55.5390 O.2 1 GLY2 -0.5679
5 N 20.9160 28.7080 54.0680 N.am 2 LEU3 -0.4157
6 CA 21.3040 29.9430 54.7530 C.3 2 LEU3 -0.0518
@BOND
1 1 4650 1
2 2 1 1
3 2 4638 1
4 2 4636 1
5 3 2 1
6 3 5 am
@SUBSTRUCTURE
1 GLY2 2 RESIDUE 4 A GLY 1 ROOT
2 LEU3 6 RESIDUE 4 A LEU 2
3 LYS4 14 RESIDUE 4 A LYS 2
4 LEU5 23 RESIDUE 4 A LEU 2

Sample mol2 file in mBuild

Betacristobalite

@codecov
Copy link

codecov bot commented Jun 28, 2021

Codecov Report

Merging #562 (bbad54c) into master (9e237e0) will increase coverage by 0.16%.
The diff coverage is 96.84%.

❗ Current head bbad54c differs from pull request most recent head 54e0b77. Consider uploading reports for the commit 54e0b77 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #562      +/-   ##
==========================================
+ Coverage   90.49%   90.65%   +0.16%     
==========================================
  Files          55       57       +2     
  Lines        4410     4507      +97     
==========================================
+ Hits         3991     4086      +95     
- Misses        419      421       +2     
Impacted Files Coverage Δ
gmso/formats/mol2.py 95.83% <95.83%> (ø)
gmso/abc/abstract_site.py 94.11% <100.00%> (+0.89%) ⬆️
gmso/core/topology.py 96.45% <100.00%> (+0.09%) ⬆️
gmso/abc/abstract_potential.py 87.03% <0.00%> (-3.88%) ⬇️
gmso/utils/connectivity.py 97.14% <0.00%> (-0.08%) ⬇️
gmso/external/convert_openmm.py 100.00% <0.00%> (ø)
gmso/abc/metadata_mixin.py 100.00% <0.00%> (ø)
gmso/core/atom.py 95.77% <0.00%> (+1.40%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a8ab67d...54e0b77. Read the comment docs.

@justinGilmer
Copy link
Collaborator

Is this supposed to be a PR or an issue @CalCraven ?

@CalCraven
Copy link
Contributor Author

Is this supposed to be a PR or an issue @CalCraven ?

@justinGilmer
It's a PR. I just wrote up the discussion first to have it all in one place, I'm working on some code outline that will get pushed once the structure is right. But I figured this is a good place to put everything as I'm working on it, especially to document what we will and will not support.

CalCraven and others added 24 commits July 13, 2021 15:18
	* Make sure unyt values are input as Angstroms to proper conversion are done to mBuild

	* Throw an OSError if the filepath doesn't exit that is used for from_mol2
	* read in box information from CRYSIN record type indicator
	* assume box lenghts in angstroms and box angles in degrees
Copy link
Collaborator

@justinGilmer justinGilmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes along the way, but this is looking great @CalCraven !

gmso/core/topology.py Outdated Show resolved Hide resolved
gmso/formats/mol2.py Outdated Show resolved Hide resolved
gmso/formats/mol2.py Outdated Show resolved Hide resolved
gmso/formats/mol2.py Outdated Show resolved Hide resolved
Supported record type indicators include Atom, Bond, FF_PBC, and CRYSIN.
"""
supported_rti = {
"@<TRIPOS>ATOM\n": load_top_sites,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In favor of this too. If the specification does not allow anything else, then this is probably fine. Can people add comments or anything in this file? If so, what umesh suggests will make it robust enough

gmso/formats/mol2.py Outdated Show resolved Hide resolved
gmso/formats/mol2.py Outdated Show resolved Hide resolved
gmso/tests/test_mol2.py Outdated Show resolved Hide resolved
gmso/tests/test_mol2.py Outdated Show resolved Hide resolved
gmso/tests/test_mol2.py Outdated Show resolved Hide resolved
CalCraven and others added 5 commits September 15, 2021 12:56
	* Add .strip() method to strip lines with RTI in order to increase robustness
	* Move a msg error statement for missing file path into conditional statement
	* Use with open as f instead of f=open(path,"r")
	* Add docstring for load_top_sites
	* Add information about skipping unsupported RTI
	* Use regex match for checking for warnings in test-mol2.py
Copy link
Contributor

@bc118 bc118 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. From what I see this should read in proteins, now. Very excited!!

Copy link
Collaborator

@justinGilmer justinGilmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small question and once you resolve the merge conflicts, this should be g2g!

Comment on lines 54 to 68
line = f.readline()
while f:
# check for header character in line
if line.strip().startswith("@<TRIPOS>"):
# if header character in line, send to a function that will direct it properly
line = _parse_record_type_indicator(
f, line, topology, site_type
)
elif line == "":
# check for the end of file
break
else:
# else, skip to next line
line = f.readline()
topology.update_topology()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason for the internal while loop?

Suggested change
line = f.readline()
while f:
# check for header character in line
if line.strip().startswith("@<TRIPOS>"):
# if header character in line, send to a function that will direct it properly
line = _parse_record_type_indicator(
f, line, topology, site_type
)
elif line == "":
# check for the end of file
break
else:
# else, skip to next line
line = f.readline()
topology.update_topology()
for line in f:
# check for header character in line
if line.strip().startswith("@<TRIPOS>"):
# if header character in line, send to a function that will direct it properly
line = _parse_record_type_indicator(
f, line, topology, site_type
)
elif line == "":
continue
else:
continue
topology.update_topology()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried my best to write something that could use a for loop instead of a while loop. It starts to get tricky again when it finds a header in the line. That passes it to the _parse_record_type_indicator function, which reads through the rest of that heading. In the case of a record type indicator (rti) starting a new section without a space after the previous section, that line must be passed back to the parser in order to continue forwards and identify which rti you are dealing with. With a for loop, every time through you must read the next line. In the instance as discussed above:
"""
@ATOM
Atom information
@BOND
bond information
"""

The next line cannot be read automatically because it will already be read when looking to find the end of the ATOM rti section.

Does that make sense? There may be a way to accomplish this while still using a for loops, but I tried exactly what you suggested and I run into this error still.

… conventions of residue_number and residue_name
gmso/formats/mol2.py Outdated Show resolved Hide resolved
gmso/formats/mol2.py Outdated Show resolved Hide resolved
gmso/formats/mol2.py Outdated Show resolved Hide resolved
gmso/formats/mol2.py Outdated Show resolved Hide resolved
gmso/formats/mol2.py Outdated Show resolved Hide resolved
gmso/formats/mol2.py Outdated Show resolved Hide resolved
CalCraven and others added 5 commits September 28, 2021 14:25
Co-authored-by: Umesh Timalsina <umesh.timalsina@vanderbilt.edu>
Co-authored-by: Umesh Timalsina <umesh.timalsina@vanderbilt.edu>
Co-authored-by: Umesh Timalsina <umesh.timalsina@vanderbilt.edu>
@daico007 daico007 merged commit bc1bd58 into mosdef-hub:master Oct 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants