Molecular Sets (MOSES): A benchmarking platform for molecular generation models

Deep generative models such as generative adversarial networks, variational autoencoders, and autoregressive models are rapidly growing in popularity for the discovery of new molecules and materials. In this work, we introduce MOlecular SEtS (MOSES), a benchmarking platform to support research on machine learning for drug discovery. MOSES implements several popular molecular generation models and includes a set of metrics that evaluate the diversity and quality of generated molecules. MOSES is meant to standardize research on molecular generation and facilitate the sharing and comparison of new models. Additionally, we provide a large-scale comparison of existing state-of-the-art models and elaborate on current challenges for generative models that might prove fertile ground for new research. Our platform and source code are freely available here.

For more details, please refer to the paper.

Dataset

We propose a biological molecule benchmark set refined from the ZINC database.

The set is based on the ZINC Clean Leads collection. It contains 4,591,276 molecules in total, filtered by molecular weight in the range from 250 to 350 Daltons, a number of rotatable bonds not greater than 7, and XlogP less than or equal to 3.5. We removed molecules containing charged atoms or atoms besides C, N, S, O, F, Cl, Br, H or cycles longer than 8 atoms. The molecules were filtered via medicinal chemistry filters (MCFs) and PAINS filters.
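The property criteria above can be read as a simple predicate over precomputed descriptors. The sketch below is illustrative only: `MolProps` and `passes_basic_filters` are hypothetical names, not part of MOSES, the descriptor values are assumed to have been computed elsewhere (for example with RDKit), and the range bounds are assumed to be inclusive. MCF and PAINS filtering is not covered.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Elements allowed by the dataset description.
ALLOWED_ATOMS = frozenset({"C", "N", "S", "O", "F", "Cl", "Br", "H"})


@dataclass(frozen=True)
class MolProps:
    """Hypothetical record of precomputed descriptors for one molecule."""
    mol_weight: float        # molecular weight in Daltons
    rotatable_bonds: int     # number of rotatable bonds
    xlogp: float             # XlogP value
    atoms: FrozenSet[str]    # element symbols present in the molecule
    has_charged_atoms: bool  # True if any atom carries a formal charge
    largest_ring_size: int   # 0 if the molecule has no rings


def passes_basic_filters(p: MolProps) -> bool:
    """Return True if the molecule satisfies the property filters listed above.

    Assumption: the 250-350 Da range is inclusive at both ends.
    """
    return (
        250.0 <= p.mol_weight <= 350.0
        and p.rotatable_bonds <= 7
        and p.xlogp <= 3.5
        and p.atoms <= ALLOWED_ATOMS      # no elements outside the allowed set
        and not p.has_charged_atoms
        and p.largest_ring_size <= 8      # no cycles longer than 8 atoms
    )
```

A molecule record that fails any single condition is rejected, which matches the description of the set as the intersection of all listed criteria.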

The dataset contains 1,936,962 molecular structures. For experiments, we also provide training, test, and scaffold test sets containing 250k, 10k, and 10k molecules, respectively. The scaffold test set contains unique Bemis-Murcko scaffolds that are not present in the training and test sets. We use this set to assess how well a model can generate previously unobserved scaffolds.
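The idea behind the scaffold test set can be sketched as a split that holds out whole scaffolds rather than individual molecules. This is a minimal illustration, not the procedure MOSES uses: `scaffold_holdout_split` is a hypothetical helper, the scaffold strings are assumed to be precomputed (for example as Bemis-Murcko scaffold SMILES), and the holdout fraction is arbitrary.

```python
import random
from typing import Dict, List, Tuple


def scaffold_holdout_split(
    scaffold_of: Dict[str, str],
    holdout_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split molecules so held-out scaffolds never appear in the remainder.

    scaffold_of maps a SMILES string to its precomputed scaffold string.
    Returns (remaining_molecules, scaffold_test_molecules).
    """
    # Shuffle the distinct scaffolds reproducibly and pick the held-out ones.
    scaffolds = sorted(set(scaffold_of.values()))
    rng = random.Random(seed)
    rng.shuffle(scaffolds)
    n_holdout = max(1, int(len(scaffolds) * holdout_fraction))
    held_out = set(scaffolds[:n_holdout])

    # Route every molecule by its scaffold, so the two parts share no scaffold.
    remaining: List[str] = []
    scaffold_test: List[str] = []
    for smiles in sorted(scaffold_of):
        if scaffold_of[smiles] in held_out:
            scaffold_test.append(smiles)
        else:
            remaining.append(smiles)
    return remaining, scaffold_test
```

The remaining molecules would then be divided into training and test sets in the usual way, so that neither contains a held-out scaffold.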

Models

Metrics

Besides standard uniqueness and validity metrics, MOSES provides other metrics to assess the overall quality of generated molecules. Fragment similarity (Frag) and Scaffold similarity (Scaff) are cosine similarities between vectors of fragment or scaffold frequencies, respectively, of the generated and test sets. Nearest neighbor similarity (SNN) is the average similarity of generated molecules to the nearest molecule from the test set. Internal diversity (IntDiv) is an average pairwise similarity of generated molecules. Fréchet ChemNet Distance (FCD) measures the difference in distributions of last layer activations of ChemNet.
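Several of these metrics reduce to short computations once fragments, scaffolds, or fingerprints have been extracted. The sketch below is a simplified illustration and not the MOSES implementation: all function names are hypothetical, the cosine measure is written in its similarity form (higher is better, as in the table), fingerprints are represented as sets of on-bits, and Tanimoto similarity is an assumed choice of molecular similarity.

```python
from collections import Counter
from math import sqrt
from typing import List, Set


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two frequency vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def unique_at_k(smiles: List[str], k: int) -> float:
    """Fraction of distinct strings among the first k generated strings."""
    head = smiles[:k]
    return len(set(head)) / len(head) if head else 0.0


def tanimoto(x: Set[int], y: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0


def nearest_neighbor_similarity(gen: List[Set[int]], ref: List[Set[int]]) -> float:
    """Average, over generated fingerprints, of the best similarity to ref."""
    return sum(max(tanimoto(g, r) for r in ref) for g in gen) / len(gen)
```

For Frag or Scaff, the two Counters would hold fragment or scaffold counts for the generated and test sets; for SNN, `ref` would hold the test-set fingerprints.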

Model	Valid (↑)	Unique@1k (↑)	Unique@10k (↑)	FCD Test (↓)	FCD TestSF (↓)	SNN Test (↓)	SNN TestSF (↓)	Frag Test (↑)	Frag TestSF (↑)	Scaff Test (↑)	Scaff TestSF (↑)	IntDiv (↑)	Filters (↑)
CharRNN	0.9598	1.0000	0.9993	0.3233	0.8355	0.4606	0.4492	0.9977	0.9962	0.7964	0.1281	0.8561	0.9920
VAE	0.9528	1.0000	0.9992	0.2540	0.6959	0.4684	0.4547	0.9978	0.9963	0.8277	0.0925	0.8548	0.9925
AAE	0.9341	1.0000	1.0000	1.3511	1.8587	0.4191	0.4113	0.9865	0.9852	0.6637	0.1538	0.8531	0.9759
ORGAN	0.8731	0.9910	0.9260	1.5748	2.4306	0.4745	0.4593	0.9897	0.9883	0.7843	0.0632	0.8526	0.9934
JTN-VAE	1.0000	0.9980	0.9972	4.3769	4.6299	0.3909	0.3902	0.9679	0.9699	0.3868	0.1163	0.8495	0.9566

To compare molecular properties, we computed the Fréchet distance between distributions of molecules in the generated and test sets. Below, we provide plots for lipophilicity (logP), Synthetic Accessibility (SA), Quantitative Estimation of Drug-likeness (QED), Natural Product-likeness (NP), and molecular weight.
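As a rough illustration of comparing one property between two sets, the sketch below computes the squared Fréchet distance under the assumption that each property distribution is approximated by a one-dimensional Gaussian. The README does not state how the distance is computed, so both the Gaussian assumption and the function name `frechet_distance_sq_1d` are hypothetical.

```python
from statistics import fmean, pstdev
from typing import Sequence


def frechet_distance_sq_1d(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Fréchet distance between two 1-D Gaussians fitted to x and y.

    For Gaussians N(mu_x, sd_x^2) and N(mu_y, sd_y^2) this equals
    (mu_x - mu_y)^2 + (sd_x - sd_y)^2.
    """
    mu_x, mu_y = fmean(x), fmean(y)
    sd_x, sd_y = pstdev(x), pstdev(y)  # population standard deviations
    return (mu_x - mu_y) ** 2 + (sd_x - sd_y) ** 2
```

Here `x` and `y` would be the values of a single property, such as logP, for the generated and test molecules.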

[Figures: distributions of logP, SA, NP, QED, and molecular weight for the generated and test sets]

Calculation of metrics for all models

You can calculate all metrics with:

cd scripts
python run.py

If necessary, the dataset will be downloaded and split, and all models will be trained. When the run finishes, metrics.csv containing the metric values will appear in the current directory. You can specify the device and model by running python run.py --device cuda:5 --model aae. For more details, use python run.py --help.

Installation

Docker

Build an image from the Dockerfile: nvidia-docker image build --tag <image_name> moses/, where moses/ is the repository cloned from GitHub.
Create a container from that image, e.g. by running nvidia-docker run -it --name <container_name> --network="host" --shm-size 1G <image_name>
The dataset is already downloaded during image building and the current repository is available at /code inside the docker container.

Manually

Alternatively, install dependencies and MOSES manually:

Install RDKit for metrics calculation.
Install MOSES with python setup.py install
Use git lfs pull to download the dataset

Usage

Training of model

You can train a model with:

cd scripts/<model name>
python train.py --train_load <path to train dataset> --model_save <path to model> --config_save <path to config> --vocab_save <path to vocabulary>

For more details use python train.py --help.

Calculation of metrics for trained model

You can calculate metrics with:

cd scripts/<model name>
python sample.py --model_load <path to model> --config_load <path to config> --vocab_load <path to vocabulary> --n_samples <number of smiles> --gen_save <path to generated smiles>
cd ../metrics
python eval.py --ref_path <path to referenced smiles> --gen_path <path to generated smiles>

All metrics are printed to the screen. For more details, use python sample.py --help and python eval.py --help.
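When the two steps are scripted, the command lines above can be assembled programmatically. The helpers below are a hypothetical convenience, not part of MOSES; they only arrange the flags shown above into argument lists, and all paths are supplied by the caller.

```python
from typing import List


def sample_command(model: str, config: str, vocab: str,
                   n_samples: int, gen_save: str) -> List[str]:
    """Argument list for sample.py, using the flags shown above."""
    return [
        "python", "sample.py",
        "--model_load", model,
        "--config_load", config,
        "--vocab_load", vocab,
        "--n_samples", str(n_samples),
        "--gen_save", gen_save,
    ]


def eval_command(ref_path: str, gen_path: str) -> List[str]:
    """Argument list for eval.py, using the flags shown above."""
    return ["python", "eval.py", "--ref_path", ref_path, "--gen_path", gen_path]


# Possible usage (not executed here): pass each list to subprocess.run with
# check=True and cwd set to the matching scripts/ subdirectory.
```

Passing arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.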

You can also use python run.py --model <model name> to calculate metrics.
