API Reference¶

CHAMANP exposes a minimal, stable public API through four symbols and a version string. Internal implementation modules under chamanp/_core/ and chamanp/_utils/ are private and not part of this contract.

import chamanp
from chamanp import ChamanpConfig, ChamanpResult, validate_config, run

ChamanpConfig¶

Runtime configuration contract for a CHAMANP pipeline execution.

All fields have defaults matching the current COCONUT reference molecular dataset configuration. Construct a custom configuration by passing field values directly, or load an external profile with from_module or from_toml. Loaded configurations are not preflight-validated until validate_config is called.

Attributes:

Name	Type	Description
`DATABASE_PATH`	`str`	Path to the input molecular dataset CSV file. Default: `"source_data/coconut_05-2025.csv"`.
`REPORTS_PATH`	`str`	Directory for pipeline execution reports. Default: `"artifacts/reports"`.
`COLLECTION_TAXONOMY_PATH`	`str`	Path to the collection taxonomy JSON file. Default: `"source_data/coconut_taxonomy.json"`.
`TARGET_COLLECTIONS`	`list of str`	Collection labels to include in the filtered dataset. Default: `["PubChem NPs"]`.
`COLLECTION_TAG`	`str`	Short alphanumeric tag used in artifact file names. Default: `"pubchem"`.
`COLLECTION_LOGIC`	`str`	Logical operator applied when filtering across target collections. Must be `"OR"` or `"AND"`. Default: `"OR"`.
`MORGAN_RADIUS`	`int`	Morgan fingerprint radius. Must be an integer >= 0. Default: `2`.
`MORGAN_BITS`	`int`	Morgan fingerprint bit length. Must be a positive integer. Default: `1024`.
`SELECTED_PROPERTIES`	`list of str`	Column names retained from the molecular dataset after curation. Default: the eight columns in `DEFAULT_SELECTED_PROPERTIES`.
`REMOVE_STEREO_DUPLICATES`	`bool`	Whether stereochemical duplicates are removed during curation. Default: `True`.

Examples:

Construct a configuration with default values:

>>> from chamanp import ChamanpConfig
>>> config = ChamanpConfig()
>>> config.COLLECTION_TAG
'pubchem'

Construct a configuration with custom values:

>>> config = ChamanpConfig(
...     DATABASE_PATH="data/my_dataset.csv",
...     COLLECTION_TAXONOMY_PATH="data/taxonomy.json",
...     TARGET_COLLECTIONS=["Marine NPs"],
...     COLLECTION_TAG="marine",
...     COLLECTION_LOGIC="OR",
...     MORGAN_RADIUS=2,
...     MORGAN_BITS=2048,
... )
>>> config.COLLECTION_TAG
'marine'
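
The `COLLECTION_LOGIC` field is easiest to read with a concrete case. The sketch below is not the library's filtering code. It assumes that `"OR"` keeps an entry belonging to at least one target collection and `"AND"` keeps an entry belonging to all of them. The helper `matches_collections` and the label `"Other"` are hypothetical.

```python
def matches_collections(entry_collections, targets, logic="OR"):
    """Return True if an entry passes the collection filter (hypothetical helper)."""
    entry = set(entry_collections)
    wanted = set(targets)
    if logic == "OR":
        # Assumed meaning: keep the entry if it is in at least one target collection.
        return bool(entry & wanted)
    if logic == "AND":
        # Assumed meaning: keep the entry only if it is in every target collection.
        return wanted <= entry
    raise ValueError(f"COLLECTION_LOGIC must be 'OR' or 'AND', got {logic!r}")


targets = ["PubChem NPs", "Marine NPs"]
only_pubchem = ["PubChem NPs"]
both = ["PubChem NPs", "Marine NPs", "Other"]

kept_with_or = matches_collections(only_pubchem, targets, "OR")
kept_with_and = matches_collections(only_pubchem, targets, "AND")
```

Under these assumptions, an entry listed only in `"PubChem NPs"` passes with `"OR"` but not with `"AND"`.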

`from_module(module)` `classmethod` ¶

Build a ChamanpConfig from a module-like object.

Reads every ChamanpConfig field name as an attribute from module and returns a new ChamanpConfig instance. The loaded configuration is not preflight-validated; call validate_config to validate before execution.

Parameters:

Name	Type	Description	Default
`module`	`module-like`	An object that exposes all `ChamanpConfig` field names as uppercase attributes, such as a Python module imported with `import config`.	required

Returns:

Type	Description
`ChamanpConfig`	A new configuration instance populated from the module attributes.

Raises:

Type	Description
`AttributeError`	If any `ChamanpConfig` field name is absent from module.

Examples:

Load configuration from the repository-level config.py:

>>> import config
>>> from chamanp import ChamanpConfig
>>> cfg = ChamanpConfig.from_module(config)
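
The attribute-reading behavior described for `from_module` can be illustrated without the package installed. This is a minimal sketch, assuming the configuration is a dataclass. `DemoConfig` is a hypothetical two-field stand-in for `ChamanpConfig`, and `types.SimpleNamespace` plays the role of the module-like object.

```python
from dataclasses import dataclass, fields
from types import SimpleNamespace


@dataclass
class DemoConfig:
    """Hypothetical two-field stand-in for ChamanpConfig."""
    DATABASE_PATH: str = "source_data/coconut_05-2025.csv"
    MORGAN_BITS: int = 1024

    @classmethod
    def from_module(cls, module):
        # getattr raises AttributeError when a field name is absent from module.
        return cls(**{f.name: getattr(module, f.name) for f in fields(cls)})


complete = SimpleNamespace(DATABASE_PATH="data/my_dataset.csv", MORGAN_BITS=2048)
cfg = DemoConfig.from_module(complete)

incomplete = SimpleNamespace(DATABASE_PATH="data/my_dataset.csv")
try:
    DemoConfig.from_module(incomplete)
    missing_field_error = None
except AttributeError as exc:
    missing_field_error = exc
```

Because every field is read, defaults never fill in for a missing attribute in this sketch, which matches the documented `AttributeError`.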

`from_toml(path)` `classmethod` ¶

Build a ChamanpConfig from a TOML file.

Reads configuration values from a TOML file at path. TOML keys must be lowercase versions of ChamanpConfig field names (for example, database_path maps to DATABASE_PATH). Unknown keys raise a ValueError. The loaded configuration is not preflight-validated; call validate_config to validate before execution.

Parameters:

Name	Type	Description	Default
`path`	`str or path-like`	File system path to the TOML configuration profile.	required

Returns:

Type	Description
`ChamanpConfig`	A new configuration instance populated from the TOML file.

Raises:

Type	Description
`FileNotFoundError`	If path does not exist.
`ValueError`	If the file is not valid TOML, or if it contains unknown keys.

Notes

from_toml does not perform preflight validation. File paths referenced in the loaded configuration are not checked for existence until validate_config is called.

Examples:

Load configuration from a user TOML profile:

>>> from chamanp import ChamanpConfig
>>> config = ChamanpConfig.from_toml("my-chamanp-profile.toml")
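
The key-mapping rule for TOML profiles (lowercase keys, unknown keys rejected) can be sketched on already-parsed data. The code below is an illustration, not the library's loader. `DemoConfig` and `config_from_toml_data` are hypothetical. It also assumes a flat profile with no TOML sections and that omitted keys fall back to field defaults, neither of which the reference confirms.

```python
from dataclasses import dataclass, fields


@dataclass
class DemoConfig:
    """Hypothetical three-field stand-in for ChamanpConfig."""
    DATABASE_PATH: str = "source_data/coconut_05-2025.csv"
    COLLECTION_TAG: str = "pubchem"
    MORGAN_BITS: int = 1024


def config_from_toml_data(data):
    """Map lowercase TOML keys to uppercase field names; reject unknown keys."""
    key_map = {f.name.lower(): f.name for f in fields(DemoConfig)}
    unknown = sorted(set(data) - set(key_map))
    if unknown:
        raise ValueError(f"Unknown configuration keys: {unknown}")
    return DemoConfig(**{key_map[key]: value for key, value in data.items()})


# A dict shaped like the result of tomllib.load (Python 3.11+) on a flat profile.
profile = {"database_path": "data/my_dataset.csv", "morgan_bits": 2048}
cfg = config_from_toml_data(profile)

try:
    config_from_toml_data({"morgan_bitz": 2048})
    unknown_key_error = None
except ValueError as exc:
    unknown_key_error = exc
```

Rejecting unknown keys catches misspelled options early instead of silently ignoring them.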

`required_fields()` `classmethod` ¶

Return the names of all ChamanpConfig fields.

Returns:

Type	Description
`tuple of str`	Names of all configuration fields in declaration order.
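
One plausible use of `required_fields` is checking a module-like object before passing it to `from_module`. The sketch below assumes the configuration is a dataclass, so `dataclasses.fields` supplies the declaration order. `DemoConfig` is a hypothetical three-field stand-in for `ChamanpConfig`.

```python
from dataclasses import dataclass, fields
from types import SimpleNamespace


@dataclass
class DemoConfig:
    """Hypothetical three-field stand-in for ChamanpConfig."""
    DATABASE_PATH: str = "source_data/coconut_05-2025.csv"
    REPORTS_PATH: str = "artifacts/reports"
    MORGAN_RADIUS: int = 2

    @classmethod
    def required_fields(cls):
        # dataclasses.fields returns the fields in declaration order.
        return tuple(f.name for f in fields(cls))


names = DemoConfig.required_fields()

# Report which field names a module-like object lacks before loading it.
candidate = SimpleNamespace(DATABASE_PATH="data/my_dataset.csv", MORGAN_RADIUS=2)
missing = [name for name in names if not hasattr(candidate, name)]
```

Listing every missing name at once is friendlier than hitting one `AttributeError` per attempt.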

ChamanpResult¶

Frozen summary of a completed CHAMANP pipeline execution.

Returned by run after a successful pipeline run. All path fields are strings. Count fields may be None if the pipeline did not produce them. Execution failures are exception-based and do not produce a result object.

Notes

CHAMANP is currently pre-stable (Alpha). Field names and types may change before a stable release is declared.

Attributes:

Name	Type	Description
`status`	`str`	Execution status. Always `"completed"` for a successful run.
`version`	`str`	CHAMANP package version at the time of execution.
`collection_tag`	`str`	Short collection tag used to name output artifacts, taken from `ChamanpConfig.COLLECTION_TAG`.
`curated_path`	`str`	File system path to the curated molecular dataset CSV.
`filtered_path`	`str`	File system path to the collection-filtered dataset CSV.
`metadata_path`	`str`	File system path to the fingerprint metadata CSV.
`fingerprints_path`	`str`	File system path to the Morgan fingerprint matrix (`.npy`).
`invalid_smiles_path`	`str`	File system path to the invalid-SMILES traceability CSV.
`report_path`	`str`	File system path to the pipeline execution report.
`fingerprint_radius`	`int`	Morgan fingerprint radius used during generation, taken from `ChamanpConfig.MORGAN_RADIUS`.
`fingerprint_bits`	`int`	Morgan fingerprint bit length used during generation, taken from `ChamanpConfig.MORGAN_BITS`.
`total_input_size`	`int or None`	Total number of data rows in the input CSV, excluding the header.
`total_after_dedup`	`int or None`	Number of rows remaining after stereochemical deduplication.
`stereo_removed_count`	`int or None`	Number of rows removed during stereochemical deduplication (`total_input_size - total_after_dedup`).
`filtered_count`	`int or None`	Number of molecular dataset entries remaining after collection filtering.
`valid_molecules_count`	`int or None`	Number of molecular dataset entries for which a valid fingerprint was generated (`filtered_count - invalid_smiles_count`).
`invalid_smiles_count`	`int or None`	Number of compounds whose SMILES string could not be parsed by RDKit during fingerprint generation.
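
The two count identities in the table above can be checked mechanically. The sketch below works on a plain dictionary of count fields, such as one shaped like the `to_dict()` output. `check_counts` is a hypothetical helper, and the numbers are made up for illustration, not results from a real run.

```python
def check_counts(summary):
    """Return a list of inconsistencies among the documented count fields."""
    problems = []
    # Each identity is (derived field, minuend, subtrahend) as given in the table.
    identities = [
        ("stereo_removed_count", "total_input_size", "total_after_dedup"),
        ("valid_molecules_count", "filtered_count", "invalid_smiles_count"),
    ]
    for derived, minuend, subtrahend in identities:
        values = [summary.get(name) for name in (derived, minuend, subtrahend)]
        if any(value is None for value in values):
            # Count fields may be None, so an incomplete identity is skipped.
            continue
        if values[0] != values[1] - values[2]:
            problems.append(f"{derived} != {minuend} - {subtrahend}")
    return problems


# Made-up numbers for illustration only.
summary = {
    "total_input_size": 1000,
    "total_after_dedup": 900,
    "stereo_removed_count": 100,
    "filtered_count": 300,
    "invalid_smiles_count": 5,
    "valid_molecules_count": 295,
}
problems = check_counts(summary)
```

An empty list means both identities hold for the fields that are present.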

`to_dict()` ¶

Return a plain-dictionary representation of the execution result.

Returns:

Type	Description
`dict`	A dictionary with field names as keys and field values as values, produced by `dataclasses.asdict`.
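
Since the result is described as frozen and `to_dict` is described as using `dataclasses.asdict`, the behavior can be sketched with a small frozen dataclass. `DemoResult` is a hypothetical three-field stand-in for `ChamanpResult`. Writing the dictionary out as JSON is one possible use, not something the reference prescribes.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DemoResult:
    """Hypothetical three-field stand-in for ChamanpResult."""
    status: str
    collection_tag: str
    filtered_count: Optional[int] = None

    def to_dict(self):
        # asdict converts the dataclass to a plain dict keyed by field name.
        return asdict(self)


result = DemoResult(status="completed", collection_tag="pubchem")
as_json = json.dumps(result.to_dict(), sort_keys=True)

# A frozen dataclass rejects attribute assignment after construction.
try:
    result.status = "failed"
    is_frozen = False
except FrozenInstanceError:
    is_frozen = True
```

A count left as `None` serializes to JSON `null`, so downstream consumers should handle missing counts.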

validate_config¶

Validate a CHAMANP runtime configuration object.

Checks that required file paths exist, that collection settings are well-formed, and that fingerprint parameters are valid integers. COLLECTION_LOGIC and COLLECTION_TAG values are normalized in place (stripped and upper-cased where applicable) before the validated configuration is returned.

Parameters:

Name	Type	Description	Default
`config`	`ChamanpConfig or None`	Configuration to validate. When `None`, a default `ChamanpConfig` is constructed and validated.	`None`

Returns:

Type	Description
`ChamanpConfig`	The validated configuration object, with `COLLECTION_LOGIC` and `COLLECTION_TAG` normalized.

Raises:

Type	Description
`ConfigurationError`	If one or more validation checks fail. The error message lists all failing checks.

Examples:

Validate a configuration before running the pipeline:

>>> from chamanp import ChamanpConfig, validate_config
>>> config = ChamanpConfig(
...     DATABASE_PATH="data/coconut.csv",
...     COLLECTION_TAXONOMY_PATH="data/taxonomy.json",
...     TARGET_COLLECTIONS=["PubChem NPs"],
...     COLLECTION_TAG="pubchem",
... )
>>> validated = validate_config(config)
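
The documented behavior of reporting all failing checks in one error can be sketched as a collect-then-raise pattern. This is not the library's validator. `DemoConfig`, `validate_demo`, and the local `ConfigurationError` class are hypothetical stand-ins, and the specific checks and message format are assumptions based on the field constraints listed for `ChamanpConfig`.

```python
import os
from dataclasses import dataclass


class ConfigurationError(Exception):
    """Local stand-in for the library's exception of the same name."""


@dataclass
class DemoConfig:
    """Hypothetical four-field stand-in for ChamanpConfig."""
    DATABASE_PATH: str = "source_data/coconut_05-2025.csv"
    COLLECTION_LOGIC: str = "OR"
    MORGAN_RADIUS: int = 2
    MORGAN_BITS: int = 1024


def validate_demo(config):
    """Normalize in place, collect every failure, then raise once."""
    errors = []
    # Normalization happens before the checks, so " or " is accepted as "OR".
    config.COLLECTION_LOGIC = config.COLLECTION_LOGIC.strip().upper()
    if not os.path.isfile(config.DATABASE_PATH):
        errors.append(f"DATABASE_PATH not found: {config.DATABASE_PATH}")
    if config.COLLECTION_LOGIC not in ("OR", "AND"):
        errors.append("COLLECTION_LOGIC must be 'OR' or 'AND'")
    if not isinstance(config.MORGAN_RADIUS, int) or config.MORGAN_RADIUS < 0:
        errors.append("MORGAN_RADIUS must be an integer >= 0")
    if not isinstance(config.MORGAN_BITS, int) or config.MORGAN_BITS <= 0:
        errors.append("MORGAN_BITS must be a positive integer")
    if errors:
        raise ConfigurationError("; ".join(errors))
    return config


bad = DemoConfig(DATABASE_PATH="no/such/dataset.csv", COLLECTION_LOGIC=" or ", MORGAN_BITS=0)
try:
    validate_demo(bad)
    message = ""
except ConfigurationError as exc:
    message = str(exc)
```

Collecting failures before raising lets a user fix a broken profile in one pass instead of one error at a time.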

run¶

Validate and execute CHAMANP, writing configured artifacts to disk.

Calls validate_config on config, then runs the private pipeline implementation. The pipeline curates the molecular dataset, filters by target collections, generates Morgan fingerprints, and writes a summary report.

Parameters:

Name	Type	Description	Default
`config`	`ChamanpConfig or None`	Runtime configuration. When `None`, a default `ChamanpConfig` is constructed, validated, and used.	`None`

Returns:

Type	Description
`ChamanpResult`	A frozen result object containing execution status, artifact paths, and summary counts. See `ChamanpResult` for field descriptions.

Raises:

Type	Description
`ConfigurationError`	If configuration validation fails before execution begins.

Notes

run validates the configuration, instantiates the private pipeline implementation internally, and writes configured artifacts to disk during execution.

The internal pipeline implementation is private and should not be imported or used directly.

Examples:

Run the pipeline with a custom configuration:

>>> from chamanp import ChamanpConfig, run
>>> config = ChamanpConfig(
...     DATABASE_PATH="data/coconut.csv",
...     COLLECTION_TAXONOMY_PATH="data/taxonomy.json",
...     TARGET_COLLECTIONS=["PubChem NPs"],
...     COLLECTION_TAG="pubchem",
... )
>>> result = run(config)
