MOLRAPTOR

Molecular data pipeline

Modular pipeline for fetching, curating, and encoding molecular datasets using PubChem data and RDKit's Morgan fingerprinting algorithm.

Pre-stable

MOLRAPTOR is in alpha-stage development (v0.1.x). A PyPI release under the package name molraptor is in preparation. Public APIs may change before 1.0.

Workflow

Input: CSV (PubChem CIDs + labels)
Fetch: PubChem (molecular properties)
Curate: molraptor run (filter + validate)
Encode: RDKit (Morgan fingerprints)
Output: .npy / .csv (ML-ready artifacts)

01 Fetch

Retrieve molecular properties from PubChem REST API for a list of compound IDs (CIDs).
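
As a rough illustration of this step, the sketch below builds a PubChem PUG REST property request and parses the JSON it returns. It is not MOLRAPTOR's own code: the helper names (`build_property_url`, `parse_property_table`, `fetch_properties`) and the chosen properties are assumptions made for this example, and the sample payload is only shaped like a PubChem response.

```python
import json
import urllib.request
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(cids, properties):
    """Build a PUG REST URL requesting `properties` for the given CIDs."""
    cid_part = ",".join(str(int(cid)) for cid in cids)
    prop_part = ",".join(quote(name) for name in properties)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{prop_part}/JSON"


def parse_property_table(payload):
    """Map each CID to its property dict from a decoded PUG REST response."""
    rows = payload["PropertyTable"]["Properties"]
    return {row["CID"]: {k: v for k, v in row.items() if k != "CID"} for row in rows}


def fetch_properties(cids, properties, timeout=30):
    """Perform the HTTP request (hypothetical helper; needs network access)."""
    url = build_property_url(cids, properties)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_property_table(json.load(response))


# Offline demonstration with a payload shaped like a response for aspirin (CID 2244).
sample = {"PropertyTable": {"Properties": [
    {"CID": 2244, "MolecularFormula": "C9H8O4", "MolecularWeight": "180.16"}
]}}
url = build_property_url([2244], ["MolecularFormula", "MolecularWeight"])
table = parse_property_table(sample)
print(url)
print(table)
```

Keeping URL construction and parsing separate from the network call makes both easy to test offline. A real fetcher would likely also batch CIDs and record the ones that fail.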

02 Curate

Filter and validate the dataset according to required columns and data types defined in the YAML config.
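
The idea can be sketched with the standard library alone. The schema below is a hypothetical stand-in for whatever the YAML config declares, not MOLRAPTOR's actual configuration format, and `curate` is an illustrative function, not part of the package's API.

```python
def curate(rows, required):
    """Keep rows where every required column is present and casts to its type.

    `required` maps a column name to a callable type such as int or float.
    Returns (kept, rejected); rejected pairs each row index with a reason.
    """
    kept, rejected = [], []
    for index, row in enumerate(rows):
        clean = dict(row)
        reason = None
        for column, caster in required.items():
            value = row.get(column)
            if value is None or str(value).strip() == "":
                reason = f"missing {column}"
                break
            try:
                clean[column] = caster(value)
            except (TypeError, ValueError):
                reason = f"bad {column}: {value!r}"
                break
        if reason is None:
            kept.append(clean)
        else:
            rejected.append((index, reason))
    return kept, rejected


# Hypothetical schema standing in for the config's required columns and types.
required = {"cid": int, "label": int, "MolecularWeight": float}
rows = [
    {"cid": "2244", "label": "1", "MolecularWeight": "180.16"},
    {"cid": "962", "label": "", "MolecularWeight": "18.015"},   # empty label
    {"cid": "abc", "label": "0", "MolecularWeight": "1.0"},     # non-numeric CID
]
kept, rejected = curate(rows, required)
print(kept)
print(rejected)
```

Returning the rejected rows with a reason, rather than dropping them silently, keeps the curation step auditable.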

03 Encode

Generate Morgan fingerprints using RDKit and save ML-ready NumPy arrays and CSV artifacts.
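
For readers unfamiliar with the encoding step, here is a minimal sketch of Morgan fingerprinting using RDKit's fingerprint generator API. The radius of 2 and the 2048-bit size are assumptions chosen for illustration, not documented MOLRAPTOR defaults, and `morgan_matrix` is a hypothetical helper.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def morgan_matrix(smiles_list, radius=2, n_bits=2048):
    """Return (matrix, failed): one bit-vector row per SMILES that parses."""
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    rows, failed = [], []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)  # returns None when parsing fails
        if mol is None:
            failed.append(smiles)
            continue
        rows.append(generator.GetFingerprintAsNumPy(mol))
    if rows:
        matrix = np.vstack(rows).astype(np.uint8)
    else:
        matrix = np.zeros((0, n_bits), dtype=np.uint8)
    return matrix, failed


# Aspirin and water parse; the third string is deliberately invalid.
matrix, failed = morgan_matrix(["CC(=O)OC1=CC=CC=C1C(=O)O", "O", "not_a_smiles"])
print(matrix.shape, failed)
```

The resulting array can be written with `np.save` to produce a .npy artifact. Inputs that fail to parse are collected and returned so they can be logged.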

04 Validate

Verify that the fingerprint matrix has the expected dimensions and contains no missing values.
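
A check of this kind might look like the following NumPy sketch. The function name and error messages are illustrative assumptions. MOLRAPTOR's own validation may differ.

```python
import numpy as np


def check_fingerprint_matrix(matrix, n_samples, n_bits):
    """Raise ValueError if the matrix has the wrong shape or contains NaNs."""
    if matrix.ndim != 2 or matrix.shape != (n_samples, n_bits):
        raise ValueError(f"expected shape {(n_samples, n_bits)}, got {matrix.shape}")
    # NaN exists only for floating dtypes; integer matrices cannot hold it.
    if np.issubdtype(matrix.dtype, np.floating) and np.isnan(matrix).any():
        raise ValueError("matrix contains missing values (NaN)")
    return True


good = np.zeros((3, 2048), dtype=np.uint8)
bad = np.full((3, 2048), np.nan)

ok = check_fingerprint_matrix(good, 3, 2048)
try:
    check_fingerprint_matrix(bad, 3, 2048)
    bad_error = None
except ValueError as exc:
    bad_error = str(exc)
print(ok, bad_error)
```

Passing the expected row count explicitly lets the check catch compounds that were silently dropped during encoding.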

Scope

| MOLRAPTOR does | MOLRAPTOR does not |
| --- | --- |
| Fetch molecular properties from PubChem. | Train machine learning models. |
| Curate and validate chemical datasets. | Perform dimensionality reduction. |
| Generate Morgan fingerprints via RDKit. | Support non-PubChem data sources (yet). |
| Output ML-ready .npy and .csv artifacts. | Handle 3D molecular structures. |
| Log failed CIDs for reproducibility. | Support alternative fingerprint types (yet). |

Quick Example

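Install the package and run the pipeline from the command line:

pip install molraptor
molraptor run --config examples/example_config.yaml

Or run the same pipeline from Python:

from molraptor import MolraptorConfig, run

config = MolraptorConfig.load("examples/example_config.yaml")
run(config)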

Documentation

| Page | Purpose |
| --- | --- |
| Installation | Supported Python versions, local install, and optional dependencies. |
| Quick Start | Minimal CLI and Python workflow using the bundled example config. |
| CLI Reference | molraptor run, config files, verbose mode, and version checks. |
| Configuration | YAML configuration schema, inputs, and outputs. |
| API Reference | Public Python API generated from docstrings. |
| Release Notes | Version history and validation notes. |

Citation

If you use MOLRAPTOR in your research, please cite it using the metadata in CITATION.cff.

License

This project is licensed under the terms of the GNU Lesser General Public License v3.0 or later. SPDX identifier: LGPL-3.0-or-later.