docu: update readme, add image, add feature calc notebook (#90)

* docu: update readme, add image, add feature calc notebook
* readme: add feature calculation example
* readme: add published Molpipeline paper link
* notebooks: add header to feature calculation
basf · Sep 18, 2024 · 0b7852c
1 parent c92225e
commit 0b7852c
Showing 3 changed files with 984 additions and 14 deletions.
diff --git a/.github/molpipeline.png b/.github/molpipeline.png
diff --git a/README.md b/README.md
@@ -1,35 +1,56 @@
 # MolPipeline
-MolPipeline is a Python package providing RDKit functionality in a Scikit-learn like fashion.
+MolPipeline is a Python package for processing molecules with RDKit in scikit-learn.
+
+<p align="center"><img src=".github/molpipeline.png" height="250"/></p>
 
 ## Background
 
-The open-source package [scikit-learn](https://scikit-learn.org/) provides a large variety of machine
+The [scikit-learn](https://scikit-learn.org/) package provides a large variety of machine
 learning algorithms and data processing tools, among which is the `Pipeline` class, allowing users to
 prepend custom data processing steps to the machine learning model.
-`MolPipeline` extends this concept to the field of chemoinformatics by
-wrapping default functionalities of [RDKit](https://www.rdkit.org/), such as reading and writing SMILES strings
+`MolPipeline` extends this concept to the field of cheminformatics by
+wrapping standard [RDKit](https://www.rdkit.org/) functionality, such as reading and writing SMILES strings
 or calculating molecular descriptors from a molecule-object.
 
-A notable difference to the `Pipeline` class of scikit-learn is that the Pipline from `MolPipeline` allows for 
-instances to fail during processing without interrupting the whole pipeline.
-Such behaviour is useful when processing large datasets, where some SMILES strings might not encode valid molecules
-or some descriptors might not be calculable for certain molecules.
+MolPipeline aims to provide:
 
+- Automated end-to-end processing from molecule data sets to deployable machine learning models.
+- Scalable parallel processing and low memory usage through instance-based processing.
+- Standard pipeline building blocks for flexibly building custom pipelines for various
+cheminformatics tasks.
+- Consistent error handling for tracking, logging, and replacing failed instances (e.g., a
+SMILES string that could not be parsed correctly).
+- Integrated and self-contained pipeline serialization for easy deployment and tracking
+in version control.
 
 ## Publications
 
-The publication is freely available [here](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036).
+[Sieg J, Feldmann CW, Hemmerich J, Stork C, Sandfort F, Eiden P, and Mathea M, MolPipeline: A Python package for processing
+molecules with RDKit in scikit-learn, J. Chem. Inf. Model., doi:10.1021/acs.jcim.4c00863, 2024](https://doi.org/10.1021/acs.jcim.4c00863)
+\
+Further links: [ChemRxiv](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036)
+
+Feldmann CW, Sieg J, and Mathea M, Analysis of uncertainty of neural
+fingerprint-based models, 2024
+\
+Further links: [repository](https://github.com/basf/neural-fingerprint-uncertainty)
 
 ## Installation
 ```commandline
 pip install molpipeline
 ```
 
-## Usage
+## Documentation
+
+The [notebooks](notebooks) folder contains basic and advanced examples of how to use MolPipeline.
+
+For an introduction to basic usage, see the [01_getting_started_with_molpipeline notebook](notebooks/01_getting_started_with_molpipeline.ipynb).
 
-See the [notebooks](notebooks) folder for basic and advanced examples of how to use Molpipeline.
+## Quick Start
 
-A basic example of how to use MolPipeline to create a fingerprint-based model is shown below (see also the [notebook](notebooks/01_getting_started_with_molpipeline.ipynb)): 
+### Model building
+
+Create a fingerprint-based prediction model:
 ```python
 from molpipeline import Pipeline
 from molpipeline.any2mol import AutoToMol
@@ -58,8 +79,42 @@ pipeline.predict(["CCC"])
 # output: array([0.29])
 ```
 
-Molpipeline also provides custom estimators for standard cheminformatics tasks that can be integrated into pipelines,
-like clustering for scaffold splits (see also the [notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb)):
+### Feature calculation
+
+Calculating molecular descriptors from SMILES strings is straightforward. For example, physicochemical properties can
+be calculated like this:
+```python
+from molpipeline import Pipeline
+from molpipeline.any2mol import AutoToMol
+from molpipeline.mol2any import MolToRDKitPhysChem
+
+pipeline_physchem = Pipeline(
+    [
+        ("auto2mol", AutoToMol()),
+        (
+            "physchem",
+            MolToRDKitPhysChem(
+                standardizer=None,
+                descriptor_list=["HeavyAtomMolWt", "TPSA", "NumHAcceptors"],
+            ),
+        ),
+    ],
+    n_jobs=-1,
+)
+physchem_matrix = pipeline_physchem.transform(["CCCCCC", "c1ccccc1(O)"])
+physchem_matrix
+# output: array([[72.066,  0.   ,  0.   ],
+#                [88.065, 20.23 ,  1.   ]])
+```
+
+MolPipeline provides further features and descriptors from RDKit, 
+for example Morgan (binary/count) fingerprints and MACCS keys.
+See the [04_feature_calculation notebook](notebooks/04_feature_calculation.ipynb) for more examples.
+
+### Clustering
+
+MolPipeline provides several clustering algorithms as scikit-learn-like estimators. For example, molecules can be
+clustered by their Murcko scaffold. See the [02_scaffold_split_with_custom_estimators notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb) for scaffold splits and further examples.
 
 ```python
 from molpipeline.estimators import MurckoScaffoldClustering