[WIP] Scripts to create pretraining dataset #65

peastman · 2023-06-13T17:50:33Z

This PR will have the scripts to generate the pretraining dataset discussed in #64. So far I've implemented the dipeptides subset. Let me know if this looks good. @giadefa I'd especially appreciate your feedback on what conformations to include, since you have experience on pretraining with large amounts of semi-empirical data.

The script only takes a few hours to run on my laptop. It generates about 310 MB of output data. I estimate the complete pretraining dataset will be around 10 GB, assuming we include the same molecules as the standard dataset and the same level of sampling for the other subsets.

giadefa · 2023-06-21T12:37:19Z

Get one conformation per molecule, and prefer more molecules while keeping the budget constant.

peastman · 2023-06-21T15:07:31Z

> Get one conformation per molecule and prefer more molecules keeping the budget constant

Computational budget isn't a problem. This method is super cheap. We can include more molecules and also lots of conformations per molecule.

Based on your experience, how large should it be, and how should we select the conformations?

giadefa · 2023-06-21T15:13:04Z

Generate conformers however you wish (e.g. with RDKit), but just one per molecule, and use more molecules.


peastman · 2023-06-21T15:19:11Z

Why just one? And again, how large should the dataset be?

giadefa · 2023-06-21T15:24:29Z

RDKit is not very good at generating more than one or two. Given a fixed budget, it is better to have more molecules than more conformations. Realistically, training on more than 10M points starts to be problematic.


peastman · 2023-06-21T15:26:56Z

We don't rely on RDKit to generate the conformations, just starting points for MD simulations.

jchodera · 2023-06-22T07:37:02Z

If you're going to run MD for generating conformations, we probably do want multiple overdispersed starting points in case crossing torsional barriers is difficult. If the RDKit conformers end up being too similar, this shouldn't be too much of a problem---it's like running more MD, especially if you allow some "burn-in" equilibration before collecting samples from each conformation.

I fear the only way to optimize the selection of N conformers x M snapshots/conformer is to train some models and assess generalization. There's no real a priori way to know what is optimal here, though there are probably reasonable lower bounds (N >= 3, M > 10?).
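The "burn-in" idea mentioned above could be sketched roughly as follows: discard the first frames of each trajectory, then keep evenly spaced snapshots from the remainder. The function name `select_snapshots` and the frame counts are hypothetical and are not taken from the PR's scripts.

```python
def select_snapshots(n_frames, burn_in, n_keep):
    """Return indices of n_keep evenly spaced frames after discarding burn_in frames.

    All names and numbers here are illustrative assumptions, not the PR's code.
    """
    usable = n_frames - burn_in
    if n_keep <= 0 or usable < n_keep:
        raise ValueError("not enough frames left after burn-in")
    stride = usable // n_keep
    # Take the last frame of each stride-sized window that follows the burn-in.
    return [burn_in + stride * (i + 1) - 1 for i in range(n_keep)]


# Example: a 1100-frame trajectory, 100 frames of burn-in, M = 10 snapshots.
indices = select_snapshots(n_frames=1100, burn_in=100, n_keep=10)
print(indices)
```

Spacing the retained frames as far apart as the trajectory allows is one simple way to reduce correlation between snapshots; whether it is enough would still have to be judged by training models, as noted above.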

peastman · 2023-06-22T16:27:37Z

The current code asks RDKit to generate 10 conformers. Starting from each one, it runs MD to generate 10 conformations at each of four temperatures, for a total of 400 conformations per molecule.

@giadefa what is the problem with training on more than 10 million points? ANI-1 has 20 million, and people train on it all the time.
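As a rough check on the numbers in this thread, the sampling scheme described above (10 RDKit conformers, 10 MD conformations at each of four temperatures) can be enumerated in a few lines. The temperature values below are placeholders, since the thread only says "four temperatures", and the helper names are hypothetical.

```python
from itertools import product

N_CONFORMERS = 10   # RDKit starting conformers per molecule (from the thread)
N_SNAPSHOTS = 10    # MD conformations per conformer per temperature (from the thread)
# Placeholder values: the thread states there are four temperatures, not which ones.
TEMPERATURES_K = (100, 300, 500, 1000)


def sampling_plan(n_conformers=N_CONFORMERS,
                  temperatures=TEMPERATURES_K,
                  n_snapshots=N_SNAPSHOTS):
    """List every (conformer index, temperature, snapshot index) combination."""
    return list(product(range(n_conformers), temperatures, range(n_snapshots)))


def molecules_within_budget(max_points, points_per_molecule):
    """How many molecules fit under a cap on total training points."""
    return max_points // points_per_molecule


points_per_molecule = len(sampling_plan())
print(points_per_molecule)                                       # 400
print(molecules_within_budget(10_000_000, points_per_molecule))  # 25000
```

At 400 conformations per molecule, a cap of 10 million points would correspond to about 25,000 molecules, which makes the trade-off between molecules and conformations per molecule concrete.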

peastman added 2 commits June 13, 2023 10:43

Script to create dipeptides for pretraining dataset

d213191

Scripts to create solvated amino acids and DES monomers

2ff86b6
