API Reference

Complete reference for all public classes, pipelines, and functions in HARMONSMILE. Documentation is auto-generated from NumPy-style docstrings in the source code.

Configuration

PubChemConfig

`harmonsmile.PubChemConfig` `dataclass`

Immutable configuration for `PubChemIngest`.

Parameters:

Name	Type	Description	Default
`input_path`	`str`	Path to the input file (CSV, TSV, XLSX). Must not be empty or contain path traversal patterns ('..').	required
`output_path`	`str`	Path to the output CSV file. Must not be empty or contain path traversal patterns ('..').	required
`error_log`	`str`	Path to the error log file. Defaults to 'logs/errors.txt'.	`'logs/errors.txt'`
`cid_col`	`str`	Name of the PubChem CID column. Must not be empty or whitespace-only. Defaults to 'PubChem CID'.	`'PubChem CID'`
`props`	`tuple of str`	PubChem properties to fetch. Must contain at least one valid property name. Defaults to all available properties.	`('SMILES', 'ConnectivitySMILES', 'MolecularFormula', 'MolecularWeight', 'InChI', 'InChIKey', 'XLogP', 'TPSA', 'Charge', 'HBondDonorCount', 'HBondAcceptorCount', 'RotatableBondCount', 'HeavyAtomCount')`

Raises:

Type	Description
`ValueError`	If `input_path` is empty or contains '..'.
`ValueError`	If `output_path` is empty or contains '..'.
`ValueError`	If `cid_col` is empty or whitespace-only.
`ValueError`	If `props` is empty or contains invalid property names.

Examples:

>>> from harmonsmile import PubChemConfig
>>> cfg = PubChemConfig(
...     input_path="examples/example_pubchem.csv",
...     output_path="results/pubchem_harmonized.csv",
... )
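
The validation rules listed under Raises can be illustrated with a frozen dataclass. The sketch below is a minimal stand-in, not harmonsmile's source: `SketchConfig` is a hypothetical name, and combining `frozen=True` with `__post_init__` is only an assumption about how an immutable, self-validating configuration could be written.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a harmonsmile config; not the library's code."""

    input_path: str
    output_path: str
    cid_col: str = "PubChem CID"

    def __post_init__(self) -> None:
        # Reject empty paths and path traversal patterns ('..').
        for name in ("input_path", "output_path"):
            value = getattr(self, name)
            if not value or ".." in value:
                raise ValueError(f"{name} must not be empty or contain '..'")
        # Reject empty or whitespace-only column names.
        if not self.cid_col.strip():
            raise ValueError("cid_col must not be empty or whitespace-only")


cfg = SketchConfig(input_path="data/in.csv", output_path="out/result.csv")

try:
    cfg.cid_col = "CID"  # frozen dataclasses refuse attribute assignment
except FrozenInstanceError:
    print("config is immutable")

try:
    SketchConfig(input_path="../secrets.csv", output_path="out/result.csv")
except ValueError as exc:
    print(exc)
```

Validating in the constructor means a bad path or column name surfaces before any pipeline work starts.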

ChEMBLConfig

`harmonsmile.ChEMBLConfig` `dataclass`

Immutable configuration for `ChEMBLIngest`.

Parameters:

Name	Type	Description	Default
`input_path`	`str`	Path to the input file (CSV, TSV, XLSX). Must not be empty or contain path traversal patterns ('..').	required
`output_path`	`str`	Path to the output CSV file. Must not be empty or contain path traversal patterns ('..').	required
`chembl_id_col`	`str`	Name of the ChEMBL ID column in the input file. Must not be empty or whitespace-only. Defaults to 'ChEMBL ID'.	`'ChEMBL ID'`
`error_log`	`str`	Path to the error log file. Defaults to 'logs/errors.txt'.	`'logs/errors.txt'`

Raises:

Type	Description
`ValueError`	If `input_path` is empty or contains '..'.
`ValueError`	If `output_path` is empty or contains '..'.
`ValueError`	If `chembl_id_col` is empty or whitespace-only.

Examples:

>>> from harmonsmile import ChEMBLConfig
>>> cfg = ChEMBLConfig(
...     input_path="examples/example_chembl.csv",
...     output_path="results/chembl_harmonized.csv",
... )

SMILESConfig

`harmonsmile.SMILESConfig` `dataclass`

Immutable configuration for `SMILESPrep`.

Parameters:

Name	Type	Description	Default
`input_path`	`str`	Path to the input file (CSV, TSV, XLSX). Must not be empty or contain path traversal patterns ('..').	required
`output_path`	`str`	Path to the output CSV file. Must not be empty or contain path traversal patterns ('..').	required
`smiles_col`	`str`	Name of the column containing SMILES strings. Must not be empty or whitespace-only.	required
`error_log`	`str`	Path to the error log file. Defaults to 'logs/errors.txt'.	`'logs/errors.txt'`

Raises:

Type	Description
`ValueError`	If `input_path` is empty or contains '..'.
`ValueError`	If `output_path` is empty or contains '..'.
`ValueError`	If `smiles_col` is empty or whitespace-only.

Examples:

>>> from harmonsmile import SMILESConfig
>>> cfg = SMILESConfig(
...     input_path="examples/example_smiles.csv",
...     smiles_col="SMILES",
...     output_path="results/smiles_harmonized.csv",
... )

Pipelines

PubChemIngest

`harmonsmile.PubChemIngest`

Pipeline for ingesting and harmonizing PubChem compound data.

Fetches properties from the PubChem REST API and appends a standardized SMILES_RDKit column using RDKit canonicalization.

Parameters:

Name	Type	Description	Default
`cfg`	`PubChemConfig`	Pipeline configuration.	required
`client`	`_PubChemClient or None`	PubChem API client. Created automatically if not provided.	`None`
`std`	`RDKitStandardizer or None`	SMILES standardizer. Created automatically if not provided.	`None`

Examples:

>>> from harmonsmile import PubChemIngest, PubChemConfig
>>> cfg = PubChemConfig(
...     input_path="examples/example_pubchem.csv",
...     output_path="results/pubchem_harmonized.csv",
... )
>>> df = PubChemIngest(cfg).run()

`run()`

Execute the PubChem ingestion pipeline.

Returns:

Type	Description
`DataFrame`	DataFrame with original columns plus fetched properties and a standardized SMILES_RDKit column.

Raises:

Type	Description
`ValueError`	If the configured CID column is not found in the input file.
`ValueError`	If the input file has zero rows.
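
How the private `_PubChemClient` builds its requests is not documented on this page. As a rough illustration of what fetching properties by CID involves, the sketch below assembles a request URL following the public PubChem PUG REST pattern. The helper name `property_url` is hypothetical, and whether harmonsmile batches CIDs this way is an assumption.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cids: list[int], props: tuple[str, ...]) -> str:
    """Build a PUG REST property request URL for a batch of CIDs."""
    cid_part = ",".join(str(cid) for cid in cids)
    prop_part = ",".join(quote(prop) for prop in props)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{prop_part}/JSON"


# No network call is made here; the URL is only constructed and printed.
url = property_url([2244, 3672], ("MolecularFormula", "InChIKey"))
print(url)
```

The property names passed in correspond to the entries of `props` in `PubChemConfig`.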

ChEMBLIngest

`harmonsmile.ChEMBLIngest`

Pipeline for ingesting and harmonizing ChEMBL compound data.

Fetches properties from the ChEMBL REST API by ChEMBL ID, applies RDKit canonicalization to produce a standardized SMILES_RDKit column, and saves the result as a CSV file.

Parameters:

Name	Type	Description	Default
`cfg`	`ChEMBLConfig`	Pipeline configuration.	required
`client`	`_ChEMBLClient or None`	ChEMBL API client. Created automatically if not provided.	`None`
`std`	`RDKitStandardizer or None`	SMILES standardizer. Created automatically if not provided.	`None`

Examples:

>>> from harmonsmile import ChEMBLIngest, ChEMBLConfig
>>> cfg = ChEMBLConfig(
...     input_path="examples/example_chembl.csv",
...     output_path="results/chembl_harmonized.csv",
... )
>>> df = ChEMBLIngest(cfg).run()

`run()`

Execute the ChEMBL ingestion pipeline.

Returns:

Type	Description
`DataFrame`	DataFrame with original columns plus fetched and renamed ChEMBL properties and a standardized SMILES_RDKit column.

Raises:

Type	Description
`ValueError`	If the configured ChEMBL ID column is not found in the input file.
`ValueError`	If the input file has zero rows.

SMILESPrep

`harmonsmile.SMILESPrep`

Pipeline for harmonizing SMILES from any tabular source.

Reads a tabular file, applies RDKit canonicalization to the specified SMILES column, and saves the result with an appended SMILES_RDKit column.

Parameters:

Name	Type	Description	Default
`cfg`	`SMILESConfig`	Pipeline configuration.	required
`std`	`RDKitStandardizer or None`	SMILES standardizer. Created automatically if not provided.	`None`

Examples:

>>> from harmonsmile import SMILESPrep, SMILESConfig
>>> cfg = SMILESConfig(
...     input_path="examples/example_smiles.csv",
...     smiles_col="SMILES",
...     output_path="results/smiles_harmonized.csv",
... )
>>> df = SMILESPrep(cfg).run()

`run()`

Execute the SMILES preparation pipeline.

Returns:

Type	Description
`DataFrame`	DataFrame with original columns plus a standardized SMILES_RDKit column.

Raises:

Type	Description
`ValueError`	If the specified SMILES column is not found in the input file.
`ValueError`	If the input file has zero rows.
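
The optional `std` parameter suggests that a substitute standardizer can be injected, for example in unit tests that should not depend on RDKit. The sketch below mimics that pattern on plain dict rows. `MiniPrep`, `EchoStandardizer`, and `LookupStub` are hypothetical names invented for illustration, and the real pipeline operates on DataFrames rather than lists of dicts.

```python
class EchoStandardizer:
    """Hypothetical default; a real standardizer would call RDKit."""

    @staticmethod
    def to_iso_kek(smiles):
        return smiles or None


class LookupStub:
    """Hypothetical test double that answers from a fixed table."""

    def __init__(self, table):
        self.table = table

    def to_iso_kek(self, smiles):
        return self.table.get(smiles)  # None for unknown input


class MiniPrep:
    """Hypothetical miniature of the pipeline pattern, not harmonsmile code."""

    def __init__(self, rows, smiles_col, std=None):
        self.rows = rows
        self.smiles_col = smiles_col
        # Create a default collaborator only when none is injected.
        self.std = std if std is not None else EchoStandardizer()

    def run(self):
        if not self.rows:
            raise ValueError("input has zero rows")
        if self.smiles_col not in self.rows[0]:
            raise ValueError(f"column {self.smiles_col!r} not found")
        # Keep the original columns and append the standardized one.
        return [
            {**row, "SMILES_RDKit": self.std.to_iso_kek(row[self.smiles_col])}
            for row in self.rows
        ]


stub = LookupStub({"c1ccccc1": "C1=CC=CC=C1"})
rows = [{"id": 1, "SMILES": "c1ccccc1"}, {"id": 2, "SMILES": "not-smiles"}]
for row in MiniPrep(rows, "SMILES", std=stub).run():
    print(row)
```

Rows whose SMILES cannot be standardized keep their original columns and receive `None` in the appended column in this sketch; how the library records such rows in `error_log` is not shown here.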

Standardization

RDKitStandardizer

`harmonsmile.RDKitStandardizer`

Standardize SMILES strings using RDKit.

Converts input SMILES to a consistent canonical form following the COCONUT 2.0 convention: canonical + isomeric + Kekulized.

`to_conn_kek(smiles)` `staticmethod`

Convert SMILES to canonical + connectivity-only + Kekulized form.

Stereochemistry is stripped. Useful for connectivity-based comparisons where chirality is not relevant.

Parameters:

Name	Type	Description	Default
`smiles`	`str`	Input SMILES string.	required

Returns:

Type	Description
`str or None`	Standardized SMILES without stereochemistry, or None if invalid.

Examples:

>>> from harmonsmile import RDKitStandardizer
>>> RDKitStandardizer.to_conn_kek("C[C@@H](O)F")
'CC(O)F'
>>> RDKitStandardizer.to_conn_kek("invalid")
>>> RDKitStandardizer.to_conn_kek("")

`to_iso_kek(smiles)` `staticmethod`

Convert SMILES to canonical + isomeric + Kekulized form.

Parameters:

Name	Type	Description	Default
`smiles`	`str`	Input SMILES string.	required

Returns:

Type	Description
`str or None`	Standardized SMILES, or None if input is invalid.

Notes:

Chiral centers (e.g. [C@@H]) are preserved because RDKit encodes tetrahedral stereochemistry independently of kekulization.

E/Z geometry on double bonds (/ and \ in SMILES) is preserved only when RDKit can unambiguously determine the configuration after parsing and sanitization. For some double bonds, particularly those in conjugated systems or where the source SMILES omits directional bonds on one side, RDKit cannot resolve the geometry and silently drops the / and \ notation. This is known RDKit behavior, not a bug in harmonsmile. If E/Z fidelity is critical for your use case, validate SMILES_RDKit against the source SMILES.
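
One coarse way to act on that advice is to flag rows where the source SMILES carried directional bond markers but the standardized string has none. The sketch below is a plain string heuristic under that assumption; `lost_double_bond_geometry` is a hypothetical helper, not part of harmonsmile, and it does not detect inverted geometry or the loss of only some of several stereo bonds.

```python
def lost_double_bond_geometry(source: str, standardized) -> bool:
    """Return True when the source had '/' or '\\' markers but the output has none."""
    if not standardized:
        # Invalid input (None or empty output) is a separate problem.
        return False
    had_markers = "/" in source or "\\" in source
    has_markers = "/" in standardized or "\\" in standardized
    return had_markers and not has_markers


pairs = [
    ("F/C=C/F", "F/C=C/F"),  # geometry kept
    ("F/C=C/F", "FC=CF"),    # geometry dropped
    ("CCO", "CCO"),          # no double-bond stereo to begin with
]
for source, standardized in pairs:
    print(source, standardized, lost_double_bond_geometry(source, standardized))
```

Flagged rows can then be inspected by hand or re-checked with a stereo-aware comparison.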

Examples:

>>> from harmonsmile import RDKitStandardizer
>>> RDKitStandardizer.to_iso_kek("c1ccccc1")
'C1=CC=CC=C1'
>>> RDKitStandardizer.to_iso_kek("invalid")
>>> RDKitStandardizer.to_iso_kek("")

I/O Utilities

load_table

`harmonsmile.load_table(path)`

Load a tabular file into a DataFrame.

Supports CSV, TSV, TXT, XLSX, XLSM, and XLS formats. Automatically detects the delimiter for text files; falls back to a semicolon separator with latin-1 encoding if auto-detection fails.

Parameters:

Name	Type	Description	Default
`path`	`str or PathLike`	Path to the input file.	required

Returns:

Type	Description
`DataFrame`	Loaded DataFrame with cleaned 'id' and 'PubChem CID' columns if present.

Raises:

Type	Description
`FileNotFoundError`	If the file does not exist at the given path.
`ValueError`	If the file format is not supported.
`ValueError`	If the loaded DataFrame has zero rows.
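
The detect-then-fall-back behavior described for text files can be approximated with the standard library. The sketch below is an assumption-laden illustration: it uses `csv.Sniffer` in place of whatever detection `load_table` actually performs, works on raw bytes instead of a path, and `parse_text_table` is a hypothetical name.

```python
import csv
import io


def parse_text_table(raw: bytes) -> list[list[str]]:
    """Parse delimited text, guessing the delimiter with a fixed fallback."""
    try:
        text = raw.decode("utf-8")
        dialect = csv.Sniffer().sniff(text[:4096], delimiters=",;\t|")
        delimiter = dialect.delimiter
    except (UnicodeDecodeError, csv.Error):
        # Fallback named in the description: semicolon separator, latin-1.
        text = raw.decode("latin-1")
        delimiter = ";"
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


comma_rows = parse_text_table(b"id,name\n1,a\n2,b\n")
latin_rows = parse_text_table("id;name\n1;caf\xe9\n".encode("latin-1"))
print(comma_rows)
print(latin_rows)
```

latin-1 can decode any byte sequence, which is why it is a convenient last resort, although text that was actually in another encoding may come out garbled.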

Examples:

>>> from harmonsmile import load_table
>>> df = load_table("examples/example_chembl.csv")
>>> df = load_table("examples/example_pubchem.csv")

save_table

`harmonsmile.save_table(df, path)`

Save a DataFrame to a CSV file.

Parent directories are created automatically if they do not exist.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame to save.	required
`path`	`str or PathLike`	Output file path. Parent directories are created as needed.	required
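
Automatic creation of parent directories can be sketched with `pathlib`. The example below writes plain rows with the standard `csv` module rather than a DataFrame, and `save_rows` is a hypothetical helper shown only to illustrate the directory handling, not the library's implementation.

```python
import csv
import tempfile
from pathlib import Path


def save_rows(rows, path) -> Path:
    """Write rows to CSV, creating missing parent directories first."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with target.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return target


# Write into a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    out = save_rows(
        [["SMILES", "SMILES_RDKit"], ["C1=CC=CC=C1", "C1=CC=CC=C1"]],
        Path(tmp) / "results" / "nested" / "output.csv",
    )
    print(out.exists())
```

Passing `exist_ok=True` makes repeated saves into the same directory safe.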

Examples:

>>> import pandas as pd
>>> from harmonsmile import save_table
>>> df = pd.DataFrame({"SMILES": ["C1=CC=CC=C1"], "SMILES_RDKit": ["C1=CC=CC=C1"]})
>>> save_table(df, "results/output.csv")
