[WIP] Scripts to create pretraining dataset #65

Draft
wants to merge 2 commits into
base: main

peastman
Member

This PR will have the scripts to generate the pretraining dataset discussed in #64. So far I've implemented the dipeptides subset. Let me know if this looks good. @giadefa I'd especially appreciate your feedback on what conformations to include, since you have experience with pretraining on large amounts of semi-empirical data.

The script only takes a few hours to run on my laptop. It generates about 310 MB of output data. I estimate the complete pretraining dataset will be around 10 GB, assuming we include the same molecules as the standard dataset and the same level of sampling for the other subsets.

@giadefa
Member

giadefa commented Jun 21, 2023

Get one conformation per molecule and prefer more molecules keeping the budget constant

@peastman
Member Author

> Get one conformation per molecule and prefer more molecules keeping the budget constant

Computational budget isn't a problem. This method is super cheap. We can include more molecules and also lots of conformations per molecule.

Based on your experience, how large should it be, and how should we select the conformations?

@giadefa
Member

giadefa commented Jun 21, 2023 via email

@peastman
Member Author

Why just one? And again, how large should the dataset be?

@giadefa
Member

giadefa commented Jun 21, 2023 via email

@peastman
Member Author

We don't rely on RDKit to generate the conformations, just the starting points for MD simulations.

@jchodera
Member

If you're going to run MD to generate conformations, we probably do want multiple overdispersed starting points in case crossing torsional barriers is difficult. If the RDKit conformers end up being too similar, this shouldn't be too much of a problem: it's like running more MD, especially if you allow some "burn-in" equilibration before collecting samples from each conformation.

I fear the only way to optimize the selection of N conformers x M snapshots/conformer is to train some models and assess generalization. There's no real a priori way to know what is optimal here, though there are probably reasonable lower bounds (N >= 3, M > 10?).
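As a rough illustration of what such a scan could look like, the sketch below enumerates candidate (N, M) pairs that satisfy the tentative lower bounds above and fit within a per-molecule sample budget. The budget of 400 and the option lists are assumptions chosen for illustration, and `candidate_splits` is a hypothetical helper, not part of this PR. Each resulting pair would still need a trained model and a generalization test to be compared.

```python
from itertools import product

# Tentative lower bounds from the comment above (N >= 3, M > 10).
MIN_CONFORMERS = 3
MIN_SNAPSHOTS = 11


def candidate_splits(budget, conformer_options, snapshot_options):
    """Return (N, M) pairs that meet the lower bounds and satisfy N * M <= budget."""
    return [
        (n, m)
        for n, m in product(conformer_options, snapshot_options)
        if n >= MIN_CONFORMERS and m >= MIN_SNAPSHOTS and n * m <= budget
    ]


# Assumed values for illustration only.
splits = candidate_splits(
    budget=400,
    conformer_options=[1, 3, 5, 10, 20],
    snapshot_options=[10, 20, 40, 80],
)

for n, m in splits:
    print(f"N={n:>2} conformers x M={m:>2} snapshots/conformer = {n * m} samples per molecule")
```

Each printed pair is one dataset variant to generate and train on; the list says nothing about which variant generalizes best.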

@peastman
Member Author

The current code asks RDKit to generate 10 conformers. Starting from each one, it runs MD to generate 10 conformations at each of four temperatures, for a total of 400 conformations per molecule.

@giadefa what is the problem with training on more than 10 million points? ANI-1 has 20 million, and people train on it all the time.
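To make the counts above concrete, here is a minimal sketch of the per-molecule sampling layout (10 RDKit conformers, 4 temperatures, 10 conformations each) and of how many molecules a given total number of points would cover at that sampling level. It only does the bookkeeping: it does not call RDKit or run MD, the temperatures are represented by index because their values are not stated in this thread, and `sampling_plan` and `molecules_for` are hypothetical names rather than functions from the PR.

```python
from itertools import product

NUM_CONFORMERS = 10         # RDKit starting conformers per molecule
NUM_TEMPERATURES = 4        # indexed only; the actual temperatures are not given here
CONFORMATIONS_PER_TEMP = 10  # MD conformations saved per conformer at each temperature


def sampling_plan():
    """Enumerate every (conformer, temperature, conformation) index triple for one molecule."""
    return list(
        product(
            range(NUM_CONFORMERS),
            range(NUM_TEMPERATURES),
            range(CONFORMATIONS_PER_TEMP),
        )
    )


def molecules_for(total_points, per_molecule):
    """Number of molecules covered by total_points at a fixed per-molecule sampling level."""
    return total_points // per_molecule


plan = sampling_plan()
per_molecule = len(plan)
print(per_molecule)                             # 10 * 4 * 10
print(molecules_for(10_000_000, per_molecule))  # molecules covered by 10 million points
print(molecules_for(20_000_000, per_molecule))  # molecules covered by an ANI-1-sized set
```

At 400 conformations per molecule, 10 million points corresponds to 25,000 molecules, and 20 million points to 50,000.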

3 participants