MOLRAPTOR
Molecular data pipeline
Modular pipeline for fetching, curating, and encoding molecular datasets using PubChem data and RDKit's Morgan fingerprinting algorithm.
Pre-stable
MOLRAPTOR is currently in alpha-stage development (v0.1.x). Publication
on PyPI is prepared under the package name molraptor. Public APIs may
change before 1.0.
Workflow
Fetch
Retrieve molecular properties from the PubChem REST API for a list of compound IDs (CIDs).
Curate
Filter and validate the dataset according to required columns and data types defined in the YAML config.
Encode
Generate Morgan fingerprints using RDKit and save ML-ready NumPy arrays and CSV artifacts.
Validate
Verify fingerprint matrix integrity: the matrix has the expected dimensions and contains no missing values.
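As an illustration of the Validate step, the check on dimensions and missing values could be sketched with NumPy as follows. This is a minimal sketch, not MOLRAPTOR's actual implementation: the function name `validate_fingerprints` and its parameters `n_molecules` and `n_bits` are hypothetical and chosen only for this example.

```python
import numpy as np


def validate_fingerprints(matrix, n_molecules, n_bits):
    """Return a list of problems found; an empty list means the matrix passed.

    Hypothetical helper for illustration only, not part of the molraptor API.
    """
    problems = []
    # Check dimensions: one row per molecule, one column per fingerprint bit.
    if matrix.ndim != 2:
        problems.append(f"expected a 2D matrix, got {matrix.ndim}D")
    elif matrix.shape != (n_molecules, n_bits):
        problems.append(f"expected shape {(n_molecules, n_bits)}, got {matrix.shape}")
    # NaN can only occur in floating-point arrays; integer bit vectors cannot hold it.
    if np.issubdtype(matrix.dtype, np.floating) and np.isnan(matrix).any():
        problems.append("matrix contains missing values (NaN)")
    return problems


# Made-up data: 3 molecules with 8-bit fingerprints
# (Morgan fingerprints are commonly much longer, e.g. 2048 bits).
good = np.zeros((3, 8), dtype=np.uint8)
bad = np.full((3, 4), np.nan)

print(validate_fingerprints(good, n_molecules=3, n_bits=8))
print(validate_fingerprints(bad, n_molecules=3, n_bits=8))
```

Returning a list of problems rather than raising on the first failure lets a caller report every issue in a single pass. A stricter pipeline might raise an exception instead.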
Scope
| MOLRAPTOR does | MOLRAPTOR does not |
|---|---|
| Fetch molecular properties from PubChem. | Train machine learning models. |
| Curate and validate chemical datasets. | Perform dimensionality reduction. |
| Generate Morgan fingerprints via RDKit. | Support non-PubChem data sources (yet). |
| Output ML-ready .npy and .csv artifacts. | Handle 3D molecular structures. |
| Log failed CIDs for reproducibility. | Support alternative fingerprint types (yet). |
Quick Example
pip install molraptor
molraptor run --config examples/example_config.yaml
from molraptor import MolraptorConfig, run
config = MolraptorConfig.load("examples/example_config.yaml")
run(config)
Documentation
| Page | Purpose |
|---|---|
| Installation | Supported Python versions, local install, and optional dependencies. |
| Quick Start | Minimal CLI and Python workflow using the bundled example config. |
| CLI Reference | molraptor run, config files, verbose mode, and version checks. |
| Configuration | YAML configuration schema, inputs, and outputs. |
| API Reference | Public Python API generated from docstrings. |
| Release Notes | Version history and validation notes. |
Citation
If you use MOLRAPTOR in your research, please cite it using the metadata in CITATION.cff.
License
This project is licensed under the terms of the
GNU Lesser General Public License v3.0 or later.
SPDX identifier: LGPL-3.0-or-later.