API Reference

CHAMANP exposes a minimal, stable public API through four symbols and a version string. Internal implementation modules under chamanp/_core/ and chamanp/_utils/ are private and not part of this contract.

import chamanp
from chamanp import ChamanpConfig, ChamanpResult, validate_config, run

ChamanpConfig

Runtime configuration contract for a CHAMANP pipeline execution.

All fields have defaults matching the current COCONUT reference molecular dataset configuration. Construct a custom configuration by passing field values directly, or load an external profile with from_module or from_toml. Loaded configurations are not preflight-validated until validate_config is called.

Attributes:

Name Type Description
DATABASE_PATH str

Path to the input molecular dataset CSV file. Default: "source_data/coconut_05-2025.csv".

REPORTS_PATH str

Directory for pipeline execution reports. Default: "artifacts/reports".

COLLECTION_TAXONOMY_PATH str

Path to the collection taxonomy JSON file. Default: "source_data/coconut_taxonomy.json".

TARGET_COLLECTIONS list of str

Collection labels to include in the filtered dataset. Default: ["PubChem NPs"].

COLLECTION_TAG str

Short alphanumeric tag used in artifact file names. Default: "pubchem".

COLLECTION_LOGIC str

Logical operator applied when filtering across target collections. Must be "OR" or "AND". Default: "OR".

MORGAN_RADIUS int

Morgan fingerprint radius. Must be an integer >= 0. Default: 2.

MORGAN_BITS int

Morgan fingerprint bit length. Must be a positive integer. Default: 1024.

SELECTED_PROPERTIES list of str

Column names retained from the molecular dataset after curation. Default: the eight columns in DEFAULT_SELECTED_PROPERTIES.

REMOVE_STEREO_DUPLICATES bool

Whether stereochemical duplicates are removed during curation. Default: True.

Examples:

Construct a configuration with default values:

>>> from chamanp import ChamanpConfig
>>> config = ChamanpConfig()
>>> config.COLLECTION_TAG
'pubchem'

Construct a configuration with custom values:

>>> config = ChamanpConfig(
...     DATABASE_PATH="data/my_dataset.csv",
...     COLLECTION_TAXONOMY_PATH="data/taxonomy.json",
...     TARGET_COLLECTIONS=["Marine NPs"],
...     COLLECTION_TAG="marine",
...     COLLECTION_LOGIC="OR",
...     MORGAN_RADIUS=2,
...     MORGAN_BITS=2048,
... )
>>> config.COLLECTION_TAG
'marine'
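
The effect of COLLECTION_LOGIC on TARGET_COLLECTIONS can be illustrated with a small, self-contained sketch. It assumes each dataset row carries a list of collection labels, that "OR" keeps a row belonging to at least one target collection, and that "AND" keeps a row belonging to all of them. The helper name matches_collections and the sample rows are hypothetical and not part of the CHAMANP API.

```python
from typing import Iterable


def matches_collections(row_collections: Iterable[str],
                        targets: Iterable[str],
                        logic: str = "OR") -> bool:
    """Hypothetical helper: decide whether one row passes the collection filter."""
    row = set(row_collections)
    wanted = set(targets)
    if logic == "OR":
        # Keep the row if it belongs to at least one target collection.
        return bool(row & wanted)
    if logic == "AND":
        # Keep the row only if it belongs to every target collection.
        return wanted <= row
    raise ValueError(f"COLLECTION_LOGIC must be 'OR' or 'AND', got {logic!r}")


# Made-up rows: molecule name -> collection labels (assumed data layout).
rows = {
    "mol-1": ["PubChem NPs", "Marine NPs"],
    "mol-2": ["PubChem NPs"],
    "mol-3": ["Marine NPs"],
}
targets = ["PubChem NPs", "Marine NPs"]

kept_or = [name for name, cols in rows.items()
           if matches_collections(cols, targets, "OR")]
kept_and = [name for name, cols in rows.items()
            if matches_collections(cols, targets, "AND")]
```

Under these assumptions, "OR" keeps all three rows and "AND" keeps only mol-1, so "AND" only differs from "OR" when more than one target collection is listed.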

from_module(module) classmethod

Build a ChamanpConfig from a module-like object.

Reads every ChamanpConfig field name as an attribute from module and returns a new ChamanpConfig instance. The loaded configuration is not preflight-validated; call validate_config to validate before execution.

Parameters:

Name Type Description Default
module module-like

An object that exposes all ChamanpConfig field names as uppercase attributes, such as a Python module imported with import config.

required

Returns:

Type Description
ChamanpConfig

A new configuration instance populated from the module attributes.

Raises:

Type Description
AttributeError

If any ChamanpConfig field name is absent from module.

Examples:

Load configuration from the repository-level config.py:

>>> import config
>>> from chamanp import ChamanpConfig
>>> cfg = ChamanpConfig.from_module(config)
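
The attribute-reading behavior described for from_module can be sketched without importing CHAMANP. The example below uses a hypothetical two-field stand-in class, MiniConfig, and assumes the configuration is a dataclass; it is not CHAMANP's actual implementation. A types.SimpleNamespace plays the role of the module-like object.

```python
from dataclasses import dataclass, fields
from types import SimpleNamespace


@dataclass
class MiniConfig:
    """Hypothetical two-field stand-in for ChamanpConfig."""
    COLLECTION_TAG: str = "pubchem"
    MORGAN_BITS: int = 1024

    @classmethod
    def from_module(cls, module):
        # getattr raises AttributeError when a field name is absent,
        # which matches the documented failure mode.
        return cls(**{f.name: getattr(module, f.name) for f in fields(cls)})


# Any object with the right uppercase attributes qualifies as "module-like".
complete = SimpleNamespace(COLLECTION_TAG="marine", MORGAN_BITS=2048)
cfg = MiniConfig.from_module(complete)

incomplete = SimpleNamespace(COLLECTION_TAG="marine")  # MORGAN_BITS is missing
try:
    MiniConfig.from_module(incomplete)
    missing_field_error = None
except AttributeError as exc:
    missing_field_error = str(exc)
```

Because every field is read, a module that omits even one name fails as a whole. In this sketch the class defaults are not used as a fallback.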

from_toml(path) classmethod

Build a ChamanpConfig from a TOML file.

Reads configuration values from a TOML file at path. TOML keys must be lowercase versions of ChamanpConfig field names (for example, database_path maps to DATABASE_PATH). Unknown keys raise a ValueError. The loaded configuration is not preflight-validated; call validate_config to validate before execution.

Parameters:

Name Type Description Default
path str or path-like

File system path to the TOML configuration profile.

required

Returns:

Type Description
ChamanpConfig

A new configuration instance populated from the TOML file.

Raises:

Type Description
FileNotFoundError

If path does not exist.

ValueError

If the file is not valid TOML, or if it contains unknown keys.

Notes

from_toml does not perform preflight validation. File paths referenced in the loaded configuration are not checked for existence until validate_config is called.

Examples:

Load configuration from a user TOML profile:

>>> from chamanp import ChamanpConfig
>>> config = ChamanpConfig.from_toml("my-chamanp-profile.toml")
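
The key-mapping rule (lowercase TOML key to uppercase field name, unknown keys rejected) can be sketched on an already-parsed dictionary, such as the one a TOML parser would return. The function to_field_values, the KNOWN_FIELDS subset, and the profile contents are illustrative assumptions. How omitted keys are handled is not stated above, so the sketch only maps the keys that are present.

```python
# Parsed form of a hypothetical profile containing:
#   database_path = "data/my_dataset.csv"
#   morgan_bits = 2048
parsed = {"database_path": "data/my_dataset.csv", "morgan_bits": 2048}

# Subset of ChamanpConfig field names, kept short for illustration.
KNOWN_FIELDS = ("DATABASE_PATH", "MORGAN_RADIUS", "MORGAN_BITS")


def to_field_values(parsed_toml: dict) -> dict:
    """Map lowercase TOML keys to uppercase field names, rejecting unknown keys."""
    unknown = sorted(
        key for key in parsed_toml
        # A key is accepted only if it is lowercase and names a known field.
        if key != key.lower() or key.upper() not in KNOWN_FIELDS
    )
    if unknown:
        raise ValueError(f"Unknown configuration keys: {', '.join(unknown)}")
    return {key.upper(): value for key, value in parsed_toml.items()}


field_values = to_field_values(parsed)
```

A misspelled key such as morgan_bitz raises ValueError in this sketch rather than being silently ignored, which is the behavior the description above attributes to from_toml.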

required_fields() classmethod

Return the names of all ChamanpConfig fields.

Returns:

Type Description
tuple of str

Names of all configuration fields in declaration order.


ChamanpResult

Frozen summary of a completed CHAMANP pipeline execution.

Returned by run after a successful pipeline execution. All path fields are strings. Count fields may be None if the pipeline did not produce them. Failures raise exceptions instead of producing a result object.

Notes

CHAMANP is currently pre-stable (Alpha). Field names and types may change before a stable release is declared.

Attributes:

Name Type Description
status str

Execution status. Always "completed" for a successful run.

version str

CHAMANP package version at the time of execution.

collection_tag str

Short collection tag used to name output artifacts, taken from ChamanpConfig.COLLECTION_TAG.

curated_path str

File system path to the curated molecular dataset CSV.

filtered_path str

File system path to the collection-filtered dataset CSV.

metadata_path str

File system path to the fingerprint metadata CSV.

fingerprints_path str

File system path to the Morgan fingerprint matrix (.npy).

invalid_smiles_path str

File system path to the invalid-SMILES traceability CSV.

report_path str

File system path to the pipeline execution report.

fingerprint_radius int

Morgan fingerprint radius used during generation, taken from ChamanpConfig.MORGAN_RADIUS.

fingerprint_bits int

Morgan fingerprint bit length used during generation, taken from ChamanpConfig.MORGAN_BITS.

total_input_size int or None

Total number of data rows in the input CSV, excluding the header.

total_after_dedup int or None

Number of rows remaining after stereochemical deduplication.

stereo_removed_count int or None

Number of rows removed during stereochemical deduplication (total_input_size - total_after_dedup).

filtered_count int or None

Number of molecular dataset entries remaining after collection filtering.

valid_molecules_count int or None

Number of molecular dataset entries for which a valid fingerprint was generated (filtered_count - invalid_smiles_count).

invalid_smiles_count int or None

Number of compounds whose SMILES string could not be parsed by RDKit during fingerprint generation.
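
The two arithmetic relationships stated in the count descriptions can be checked after a run. The sketch below works on a plain dictionary of count fields and skips any relationship in which a value is None. The function name count_problems and the sample numbers are hypothetical.

```python
from typing import List, Optional


def count_problems(counts: dict) -> List[str]:
    """Return a message for each documented relationship that does not hold."""
    def get(name: str) -> Optional[int]:
        return counts.get(name)

    problems = []

    total = get("total_input_size")
    kept = get("total_after_dedup")
    removed = get("stereo_removed_count")
    # Only compare when all three values were produced by the pipeline.
    if None not in (total, kept, removed) and removed != total - kept:
        problems.append("stereo_removed_count != total_input_size - total_after_dedup")

    filtered = get("filtered_count")
    valid = get("valid_molecules_count")
    invalid = get("invalid_smiles_count")
    if None not in (filtered, valid, invalid) and valid != filtered - invalid:
        problems.append("valid_molecules_count != filtered_count - invalid_smiles_count")

    return problems


# Made-up numbers that satisfy both relationships.
consistent = {
    "total_input_size": 1000, "total_after_dedup": 900, "stereo_removed_count": 100,
    "filtered_count": 250, "valid_molecules_count": 245, "invalid_smiles_count": 5,
}
# Same numbers, except one count no longer adds up.
inconsistent = dict(consistent, valid_molecules_count=240)

ok = count_problems(consistent)
bad = count_problems(inconsistent)
```

A check like this is a cheap sanity test in downstream scripts; an empty list means the counts agree with the field descriptions.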

to_dict()

Return a plain-dictionary representation of the execution result.

Returns:

Type Description
dict

A dictionary with field names as keys and field values as values, produced by dataclasses.asdict.
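
A frozen dataclass combined with dataclasses.asdict behaves as sketched below. MiniResult is a hypothetical three-field stand-in for ChamanpResult, and the report file name is made up. The JSON step is one possible use of the dictionary, not something CHAMANP is stated to do.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class MiniResult:
    """Hypothetical three-field stand-in for ChamanpResult."""
    status: str
    report_path: str
    filtered_count: Optional[int] = None

    def to_dict(self) -> dict:
        # asdict returns a new plain dict keyed by field name.
        return asdict(self)


result = MiniResult(status="completed",
                    report_path="artifacts/reports/report.txt")

# A plain dict serializes directly; None becomes JSON null.
payload = json.dumps(result.to_dict(), sort_keys=True)

# Frozen instances reject attribute assignment.
try:
    result.status = "failed"
    is_frozen = False
except FrozenInstanceError:
    is_frozen = True
```

Since the dictionary is a copy, modifying it does not affect the frozen result object.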


validate_config

Validate a CHAMANP runtime configuration object.

Checks that required file paths exist, that collection settings are well-formed, and that fingerprint parameters are valid integers. COLLECTION_LOGIC and COLLECTION_TAG values are normalized in place (stripped and upper-cased where applicable) before the validated configuration is returned.

Parameters:

Name Type Description Default
config ChamanpConfig

Configuration to validate. When None, a default ChamanpConfig is constructed and validated.

None

Returns:

Type Description
ChamanpConfig

The validated configuration object, with COLLECTION_LOGIC and COLLECTION_TAG normalized.

Raises:

Type Description
ConfigurationError

If one or more validation checks fail. The error message lists all failing checks.

Examples:

Validate a configuration before running the pipeline:

>>> from chamanp import ChamanpConfig, validate_config
>>> config = ChamanpConfig(
...     DATABASE_PATH="data/coconut.csv",
...     COLLECTION_TAXONOMY_PATH="data/taxonomy.json",
...     TARGET_COLLECTIONS=["PubChem NPs"],
...     COLLECTION_TAG="pubchem",
... )
>>> validated = validate_config(config)
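
The pattern described for validate_config (normalize, run every check, then report all failures in one error) can be sketched as follows. MiniConfig, validate_sketch, and the local ConfigurationError class are stand-ins written for this illustration. The real checks, their messages, and the import path of ConfigurationError are not shown in this reference and may differ.

```python
import os
from dataclasses import dataclass


class ConfigurationError(Exception):
    """Local stand-in for the ConfigurationError named above."""


@dataclass
class MiniConfig:
    """Hypothetical four-field stand-in for ChamanpConfig."""
    DATABASE_PATH: str = "source_data/coconut_05-2025.csv"
    COLLECTION_LOGIC: str = "OR"
    MORGAN_RADIUS: int = 2
    MORGAN_BITS: int = 1024


def validate_sketch(config: MiniConfig) -> MiniConfig:
    errors = []

    # Normalize in place before checking the value.
    config.COLLECTION_LOGIC = str(config.COLLECTION_LOGIC).strip().upper()

    if not os.path.isfile(config.DATABASE_PATH):
        errors.append(f"DATABASE_PATH does not exist: {config.DATABASE_PATH}")
    if config.COLLECTION_LOGIC not in ("OR", "AND"):
        errors.append("COLLECTION_LOGIC must be 'OR' or 'AND'")
    if not isinstance(config.MORGAN_RADIUS, int) or config.MORGAN_RADIUS < 0:
        errors.append("MORGAN_RADIUS must be an integer >= 0")
    if not isinstance(config.MORGAN_BITS, int) or config.MORGAN_BITS <= 0:
        errors.append("MORGAN_BITS must be a positive integer")

    # Collect every failure first so the caller sees all of them at once.
    if errors:
        raise ConfigurationError("; ".join(errors))
    return config


bad = MiniConfig(DATABASE_PATH="no/such/file.csv", COLLECTION_LOGIC=" and ",
                 MORGAN_RADIUS=-1, MORGAN_BITS=0)
try:
    validate_sketch(bad)
    message = ""
except ConfigurationError as exc:
    message = str(exc)
```

In this sketch the " and " value is normalized to "AND" and passes its check, while the missing file and the two fingerprint parameters are reported together in a single message.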

run

Validate and execute CHAMANP, writing configured artifacts to disk.

Calls validate_config on config, then runs the private pipeline implementation. The pipeline curates the molecular dataset, filters by target collections, generates Morgan fingerprints, and writes a summary report.

Parameters:

Name Type Description Default
config ChamanpConfig

Runtime configuration. When None, a default ChamanpConfig is constructed, validated, and used.

None

Returns:

Type Description
ChamanpResult

A frozen result object containing execution status, artifact paths, and summary counts. See ChamanpResult for field descriptions.

Raises:

Type Description
ConfigurationError

If configuration validation fails before execution begins.

Notes

run validates the configuration, instantiates the private pipeline implementation internally, and writes configured artifacts to disk during execution.

The internal pipeline implementation is private and should not be imported or used directly.

Examples:

Run the pipeline with a custom configuration:

>>> from chamanp import ChamanpConfig, run
>>> config = ChamanpConfig(
...     DATABASE_PATH="data/coconut.csv",
...     COLLECTION_TAXONOMY_PATH="data/taxonomy.json",
...     TARGET_COLLECTIONS=["PubChem NPs"],
...     COLLECTION_TAG="pubchem",
... )
>>> result = run(config)