Skip to content

Usage

This page covers installation, CLI usage, TOML profiles, and Python API examples for CHAMANP.

Installation

pip install chamanp

CHAMANP requires Python 3.11 or 3.12 and depends on RDKit, pandas, and numpy.

The command above installs the Python package and CLI. It does not install the repository-only example files used below, such as examples/example_chamanp.csv or source_data/coconut_taxonomy.json.

Quick Start

The fastest way to try CHAMANP with the included example is from a source checkout of the repository:

git clone https://github.com/NanoBiostructuresRG/chamanp.git
cd chamanp
python -m pip install -e .

The example uses:

  • examples/example_chamanp.csv, a small COCONUT-like molecular table.
  • source_data/coconut_taxonomy.json, a collection taxonomy for the reference COCONUT workflow.

Create example-chamanp.toml in the repository root:

database_path = "examples/example_chamanp.csv"
reports_path = "artifacts/reports"
collection_taxonomy_path = "source_data/coconut_taxonomy.json"
target_collections = ["ChEMBL NPs"]
collection_tag = "chembl_example"
collection_logic = "OR"
morgan_radius = 2
morgan_bits = 1024
selected_properties = [
  "identifier",
  "canonical_smiles",
  "name",
  "molecular_weight",
  "alogp",
  "topological_polar_surface_area",
  "np_likeness",
  "collections",
]
remove_stereo_duplicates = true

Validate the profile:

chamanp check-config example-chamanp.toml

Expected output:

Configuration OK: example-chamanp.toml

Run the preparation workflow:

chamanp run example-chamanp.toml

Expected CLI output:

CHAMANP run completed.
Status: completed
Output directory: artifacts

For this example dataset, the report records 15 input rows, 15 retained compounds for ChEMBL NPs, 0 invalid SMILES rows, and 15 fingerprinted molecules.

For most users, start with these essential outputs:

artifacts/filtered_chembl_example.csv
artifacts/valid_metadata_chembl_example.csv
artifacts/X_chembl_example.npy
artifacts/reports/report_dbprep_chembl_example.txt

Use these audit outputs when you need to inspect intermediate processing:

artifacts/curated_chembl_example.csv
artifacts/invalid_smiles_chembl_example.csv

In this small example, some CSV files may look identical because all rows match ChEMBL NPs and all SMILES can be fingerprinted. In larger datasets, the curated, filtered, valid-metadata, and invalid-SMILES files usually diverge as deduplication, collection filtering, and fingerprint validation occur.

If you installed CHAMANP from PyPI and are not working from a source checkout, use the same TOML structure with your own local CSV and taxonomy JSON paths.

How to Write the TOML Profile

A TOML profile is not generated from the CSV. It is a small configuration file that you write after inspecting your CSV and deciding what subset CHAMANP should prepare.

The CSV provides the molecular data. The TOML profile tells CHAMANP how to use that data:

TOML field How to choose it from your data
database_path Path to your input CSV file.
collection_taxonomy_path Path to the JSON file that lists valid collection names.
target_collections Collection label or labels to extract from the CSV collections column. These names must exist in the taxonomy JSON.
collection_logic Use OR to keep molecules present in any requested collection, or AND to keep only molecules present in all requested collections.
selected_properties Column names from the CSV that should be retained in the output tables.
reports_path Folder where the text report should be written.
collection_tag Short file-safe tag used in output filenames.
morgan_radius and morgan_bits RDKit Morgan fingerprint settings.
remove_stereo_duplicates Whether CHAMANP should collapse stereochemistry-related duplicate structures during curation.

For the included example CSV, the header starts with:

identifier,canonical_smiles,name,molecular_weight,alogp,topological_polar_surface_area,np_likeness,collections

Because canonical_smiles and collections are present, CHAMANP can curate molecules and filter by collection. Because the file contains labels such as ChEMBL NPs, the example TOML can request:

database_path = "examples/example_chamanp.csv"
collection_taxonomy_path = "source_data/coconut_taxonomy.json"
target_collections = ["ChEMBL NPs"]
collection_logic = "OR"

For your own CSV, create a TOML profile by changing the paths, collection names, retained columns, and output tag to match your dataset.

Python API Examples

from chamanp import ChamanpConfig

cfg = ChamanpConfig(
    DATABASE_PATH="examples/example_chamanp.csv",
    REPORTS_PATH="artifacts/reports",
    COLLECTION_TAXONOMY_PATH="source_data/coconut_taxonomy.json",
    TARGET_COLLECTIONS=["ChEMBL NPs"],
    COLLECTION_TAG="chembl",
    COLLECTION_LOGIC="OR",
    MORGAN_RADIUS=2,
    MORGAN_BITS=1024,
    SELECTED_PROPERTIES=[
        "identifier",
        "canonical_smiles",
        "name",
        "molecular_weight",
        "alogp",
        "topological_polar_surface_area",
        "np_likeness",
        "collections",
    ],
    REMOVE_STEREO_DUPLICATES=True,
)
from chamanp import ChamanpConfig, validate_config

cfg = ChamanpConfig.from_toml("my-chamanp-profile.toml")
validate_config(cfg)
from chamanp import ChamanpConfig, run

cfg = ChamanpConfig.from_toml("my-chamanp-profile.toml")
result = run(cfg)
print(result.valid_molecules_count)
print(result.fingerprints_path)
from chamanp import ChamanpConfig, ChamanpResult, run

cfg = ChamanpConfig.from_toml("my-chamanp-profile.toml")
result = run(cfg)

assert isinstance(result, ChamanpResult)
print(result.status)
print(result.report_path)

Public API

Symbol Description
ChamanpConfig Runtime configuration object
ChamanpResult Lightweight result returned by run()
validate_config Validate configuration before execution
run Execute the CHAMANP pipeline
__version__ Package version string

See the API Reference for public API documentation generated from the package docstrings.