Issue with mol-BBBP Dataset Splitting #487
Comments
@mxqmxqmxq I actually requested the same thing in my issue #485. I think this comes down to there being several definitions of a scaffold split: rdkit/rdkit#6844. Namely, there are a few variants:
Maybe you could run those variants with your code and check which one is closest to the OGB variant? Also, for a simple scaffold split like yours, we have a lightweight function in scikit-fingerprints (https://github.com/scikit-fingerprints/scikit-fingerprints),
@mxqmxqmxq I ran a simple pipeline with ECFP4 (binary, 2048 bits) + RF (default sklearn settings) on different variants of the split, modifying the scikit-fingerprints code locally a bit:
So I could not exactly reproduce the OGB split in any way.
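For anyone who wants to quantify how far a re-created split is from a reference one, here is a minimal sketch. It assumes both splits are available as dicts mapping a partition name to molecule identifiers (row indices or SMILES); the function name `split_overlap` and that dict layout are hypothetical, not part of OGB or scikit-fingerprints.

```python
def split_overlap(candidate, reference):
    """Jaccard overlap per partition between two dataset splits.

    Both arguments are dicts such as {"train": [...], "valid": [...], "test": [...]}
    holding molecule identifiers (hypothetical layout, chosen for illustration).
    """
    result = {}
    for name in reference:
        a = set(candidate.get(name, ()))
        b = set(reference[name])
        union = a | b
        # Two empty partitions are treated as identical.
        result[name] = len(a & b) / len(union) if union else 1.0
    return result


if __name__ == "__main__":
    mine = {"train": [0, 1, 2, 3], "valid": [4], "test": [5]}
    provided = {"train": [0, 1, 2, 4], "valid": [3], "test": [5]}
    for part, score in split_overlap(mine, provided).items():
        print(part, round(score, 3))
```

A value of 1.0 for every partition means the split was reproduced exactly; anything lower shows which partition diverges.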
I noticed that you have provided a pre-split BBBP dataset in a ZIP file, using a scaffold split with an 8:1:1 ratio. However, when I tried re-splitting the dataset myself and running my model, I observed that the results from my custom split were significantly better than those from your provided split result. Could you please confirm if the dataset was indeed split using the standard scaffold method? Additionally, is the source code for this splitting process available? I would appreciate your prompt response.
Before (provided split) AUC: 0.72; current (custom split) AUC: 0.84
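For reference, one common deterministic scaffold split can be sketched as below. This is only an illustration of that variant, not necessarily what was used to produce the provided ZIP. It assumes the scaffold string of each molecule has already been computed (for example with RDKit's `MurckoScaffold.MurckoScaffoldSmiles`); the function name `scaffold_split` and its parameters are hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign molecule indices to train/valid/test so that no scaffold is shared.

    `scaffolds` holds one precomputed scaffold string per molecule
    (assumption: computed upstream, e.g. Bemis-Murcko scaffolds from RDKit).
    """
    # Group molecule indices by scaffold string.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties are broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


if __name__ == "__main__":
    # Toy scaffold labels standing in for real scaffold SMILES.
    toy = ["A", "A", "A", "B", "B", "C", "C", "D"]
    print(scaffold_split(toy, frac_train=0.5, frac_valid=0.25))
```

Details such as the ordering of scaffold groups, tie-breaking, and whether chirality is kept in the scaffold all change the resulting partitions, which is one way two "scaffold splits" of the same data can give quite different AUC values.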