API Reference

Complete reference for all public classes, pipelines, and functions in HARMONSMILE. Documentation is auto-generated from NumPy-style docstrings in the source code.


Configuration

PubChemConfig

harmonsmile.PubChemConfig dataclass

Immutable configuration for PubChemIngest (harmonsmile.pipelines.PubChemIngest).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input_path | str | Path to the input file (CSV, TSV, XLSX). Must not be empty or contain path traversal patterns ('..'). | required |
| output_path | str | Path to the output CSV file. Must not be empty or contain path traversal patterns ('..'). | required |
| error_log | str | Path to the error log file. Defaults to 'logs/errors.txt'. | 'logs/errors.txt' |
| cid_col | str | Name of the PubChem CID column. Must not be empty or whitespace-only. Defaults to 'PubChem CID'. | 'PubChem CID' |
| props | tuple of str | PubChem properties to fetch. Must contain at least one valid property name. Defaults to all available properties. | ('SMILES', 'ConnectivitySMILES', 'MolecularFormula', 'MolecularWeight', 'InChI', 'InChIKey', 'XLogP', 'TPSA', 'Charge', 'HBondDonorCount', 'HBondAcceptorCount', 'RotatableBondCount', 'HeavyAtomCount') |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If input_path is empty or contains '..'. |
| ValueError | If output_path is empty or contains '..'. |
| ValueError | If cid_col is empty or whitespace-only. |
| ValueError | If props is empty or contains invalid property names. |

Examples:

>>> from harmonsmile import PubChemConfig
>>> cfg = PubChemConfig(
...     input_path="examples/example_pubchem.csv",
...     output_path="results/pubchem_harmonized.csv",
... )
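
The validation rules listed under Raises can be pictured with a frozen dataclass that checks its fields in `__post_init__`. This is a minimal sketch under stated assumptions, not harmonsmile's actual implementation: `SketchConfig` and `KNOWN_PROPS` are hypothetical names, and the property set is only a subset of the documented defaults.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption: a small subset of the property names listed in the defaults above.
KNOWN_PROPS = frozenset({"SMILES", "MolecularFormula", "MolecularWeight", "InChIKey"})


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a validated, immutable config (not harmonsmile code)."""

    input_path: str
    output_path: str
    cid_col: str = "PubChem CID"
    props: Tuple[str, ...] = ("SMILES", "MolecularFormula")

    def __post_init__(self) -> None:
        # Reject empty paths and path traversal patterns.
        for name in ("input_path", "output_path"):
            value = getattr(self, name)
            if not value or ".." in value:
                raise ValueError(f"{name} must be non-empty and must not contain '..'")
        # Reject empty or whitespace-only column names.
        if not self.cid_col.strip():
            raise ValueError("cid_col must not be empty or whitespace-only")
        # Reject an empty property tuple or unknown property names.
        unknown = [p for p in self.props if p not in KNOWN_PROPS]
        if not self.props or unknown:
            raise ValueError(f"props must be non-empty and valid; unknown: {unknown}")
```

Because the dataclass is frozen, assigning to a field after construction raises `dataclasses.FrozenInstanceError`, which is one common way to obtain the immutability described above.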

ChEMBLConfig

harmonsmile.ChEMBLConfig dataclass

Immutable configuration for ChEMBLIngest (harmonsmile.pipelines.ChEMBLIngest).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input_path | str | Path to the input file (CSV, TSV, XLSX). Must not be empty or contain path traversal patterns ('..'). | required |
| output_path | str | Path to the output CSV file. Must not be empty or contain path traversal patterns ('..'). | required |
| chembl_id_col | str | Name of the ChEMBL ID column in the input file. Must not be empty or whitespace-only. Defaults to 'ChEMBL ID'. | 'ChEMBL ID' |
| error_log | str | Path to the error log file. Defaults to 'logs/errors.txt'. | 'logs/errors.txt' |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If input_path is empty or contains '..'. |
| ValueError | If output_path is empty or contains '..'. |
| ValueError | If chembl_id_col is empty or whitespace-only. |

Examples:

>>> from harmonsmile import ChEMBLConfig
>>> cfg = ChEMBLConfig(
...     input_path="examples/example_chembl.csv",
...     output_path="results/chembl_harmonized.csv",
... )

SMILESConfig

harmonsmile.SMILESConfig dataclass

Immutable configuration for SMILESPrep (harmonsmile.pipelines.SMILESPrep).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input_path | str | Path to the input file (CSV, TSV, XLSX). Must not be empty or contain path traversal patterns ('..'). | required |
| output_path | str | Path to the output CSV file. Must not be empty or contain path traversal patterns ('..'). | required |
| smiles_col | str | Name of the column containing SMILES strings. Must not be empty or whitespace-only. | required |
| error_log | str | Path to the error log file. Defaults to 'logs/errors.txt'. | 'logs/errors.txt' |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If input_path is empty or contains '..'. |
| ValueError | If output_path is empty or contains '..'. |
| ValueError | If smiles_col is empty or whitespace-only. |

Examples:

>>> from harmonsmile import SMILESConfig
>>> cfg = SMILESConfig(
...     input_path="examples/example_smiles.csv",
...     smiles_col="SMILES",
...     output_path="results/smiles_harmonized.csv",
... )

Pipelines

PubChemIngest

harmonsmile.PubChemIngest

Pipeline for ingesting and harmonizing PubChem compound data.

Fetches properties from the PubChem REST API and appends a standardized SMILES_RDKit column using RDKit canonicalization.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| cfg | PubChemConfig | Pipeline configuration. | required |
| client | _PubChemClient or None | PubChem API client. Created automatically if not provided. | None |
| std | RDKitStandardizer or None | SMILES standardizer. Created automatically if not provided. | None |
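
Optional `client` and `std` arguments that default to None follow a common dependency-injection pattern: the pipeline builds its own collaborators unless the caller supplies them, which makes offline testing possible. The sketch below illustrates that pattern only. `PropertyClient`, `DefaultClient`, `StubClient`, `SketchIngest`, and the `fetch` method are all hypothetical names; the interface of harmonsmile's private client classes is not documented here.

```python
from typing import Dict, List, Optional, Protocol


class PropertyClient(Protocol):
    """Hypothetical client interface (an assumption for this sketch)."""

    def fetch(self, cid: int) -> Dict[str, str]: ...


class DefaultClient:
    """Stand-in for a real network client; deliberately does no I/O here."""

    def fetch(self, cid: int) -> Dict[str, str]:
        raise RuntimeError("network access is not available in this sketch")


class StubClient:
    """In-memory client for tests, backed by a plain dictionary."""

    def __init__(self, records: Dict[int, Dict[str, str]]) -> None:
        self._records = records

    def fetch(self, cid: int) -> Dict[str, str]:
        return self._records.get(cid, {})


class SketchIngest:
    """Hypothetical pipeline that creates its client only when none is given."""

    def __init__(self, client: Optional[PropertyClient] = None) -> None:
        self.client = client if client is not None else DefaultClient()

    def run(self, cids: List[int]) -> List[Dict[str, object]]:
        # Merge each identifier with whatever properties the client returns.
        return [{"cid": cid, **self.client.fetch(cid)} for cid in cids]


stub = StubClient({2244: {"MolecularFormula": "C9H8O4"}})
rows = SketchIngest(client=stub).run([2244, 1])
```

With the stub injected, `rows` holds one record with a formula and one with only its identifier, and no network request is made.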

Examples:

>>> from harmonsmile import PubChemIngest, PubChemConfig
>>> cfg = PubChemConfig(
...     input_path="examples/example_pubchem.csv",
...     output_path="results/pubchem_harmonized.csv",
... )
>>> df = PubChemIngest(cfg).run()

run()

Execute the PubChem ingestion pipeline.

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame with original columns plus fetched properties and a standardized SMILES_RDKit column. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the configured CID column is not found in the input file. |
| ValueError | If the input file has zero rows. |


ChEMBLIngest

harmonsmile.ChEMBLIngest

Pipeline for ingesting and harmonizing ChEMBL compound data.

Fetches properties from the ChEMBL REST API by ChEMBL ID, applies RDKit canonicalization to produce a standardized SMILES_RDKit column, and saves the result as a CSV file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| cfg | ChEMBLConfig | Pipeline configuration. | required |
| client | _ChEMBLClient or None | ChEMBL API client. Created automatically if not provided. | None |
| std | RDKitStandardizer or None | SMILES standardizer. Created automatically if not provided. | None |

Examples:

>>> from harmonsmile import ChEMBLIngest, ChEMBLConfig
>>> cfg = ChEMBLConfig(
...     input_path="examples/example_chembl.csv",
...     output_path="results/chembl_harmonized.csv",
... )
>>> df = ChEMBLIngest(cfg).run()

run()

Execute the ChEMBL ingestion pipeline.

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame with original columns plus fetched and renamed ChEMBL properties and a standardized SMILES_RDKit column. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the configured ChEMBL ID column is not found in the input file. |
| ValueError | If the input file has zero rows. |


SMILESPrep

harmonsmile.SMILESPrep

Pipeline for harmonizing SMILES from any tabular source.

Reads a tabular file, applies RDKit canonicalization to the specified SMILES column, and saves the result with an appended SMILES_RDKit column.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| cfg | SMILESConfig | Pipeline configuration. | required |
| std | RDKitStandardizer or None | SMILES standardizer. Created automatically if not provided. | None |

Examples:

>>> from harmonsmile import SMILESPrep, SMILESConfig
>>> cfg = SMILESConfig(
...     input_path="examples/example_smiles.csv",
...     smiles_col="SMILES",
...     output_path="results/smiles_harmonized.csv",
... )
>>> df = SMILESPrep(cfg).run()

run()

Execute the SMILES preparation pipeline.

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame with original columns plus a standardized SMILES_RDKit column. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the specified SMILES column is not found in the input file. |
| ValueError | If the input file has zero rows. |
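
The read, check, standardize, and append flow described for SMILESPrep can be sketched with the standard library alone. This is an illustration under stated assumptions, not the pipeline's code: `prep_rows` is a hypothetical helper, it works on CSV text instead of a DataFrame, and the `standardize` callable is a stand-in for RDKit canonicalization.

```python
import csv
import io
from typing import Callable, Dict, List, Optional


def prep_rows(
    text: str,
    smiles_col: str,
    standardize: Callable[[str], Optional[str]],
) -> List[Dict[str, Optional[str]]]:
    """Hypothetical sketch: validate the input, then append a SMILES_RDKit field."""
    reader = csv.DictReader(io.StringIO(text))
    # Check the header first, mirroring the "column not found" error.
    if smiles_col not in (reader.fieldnames or []):
        raise ValueError(f"column {smiles_col!r} not found in input")
    rows: List[Dict[str, Optional[str]]] = [dict(row) for row in reader]
    # Then reject inputs that contain a header but no data rows.
    if not rows:
        raise ValueError("input has zero rows")
    for row in rows:
        # The callable may return None for values it cannot standardize.
        row["SMILES_RDKit"] = standardize(row[smiles_col] or "")
    return rows


# Stand-in standardizer (assumption): trims whitespace, maps empty input to None.
def trim_or_none(smiles: str) -> Optional[str]:
    return smiles.strip() or None


result = prep_rows("id,SMILES\n1, CCO \n2,\n", "SMILES", trim_or_none)
```

Passing the standardizer as an argument mirrors the optional `std` parameter: the flow stays the same whichever standardizer is plugged in.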


Standardization

RDKitStandardizer

harmonsmile.RDKitStandardizer

Standardize SMILES strings using RDKit.

Converts input SMILES to a consistent canonical form following the COCONUT 2.0 convention: canonical + isomeric + Kekulized.

to_conn_kek(smiles) staticmethod

Convert SMILES to canonical + connectivity-only + Kekulized form.

Stereochemistry is stripped. Useful for connectivity-based comparisons where chirality is not relevant.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| smiles | str | Input SMILES string. | required |

Returns:

| Type | Description |
| --- | --- |
| str or None | Standardized SMILES without stereochemistry, or None if invalid. |

Examples:

>>> from harmonsmile import RDKitStandardizer
>>> RDKitStandardizer.to_conn_kek("C[C@@H](O)F")
'CC(O)F'
>>> RDKitStandardizer.to_conn_kek("invalid")
>>> RDKitStandardizer.to_conn_kek("")

to_iso_kek(smiles) staticmethod

Convert SMILES to canonical + isomeric + Kekulized form.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| smiles | str | Input SMILES string. | required |

Returns:

| Type | Description |
| --- | --- |
| str or None | Standardized SMILES, or None if input is invalid. |

Notes

Chiral centers (e.g. [C@@H]) are preserved because RDKit encodes tetrahedral stereochemistry independently of kekulization.

E/Z geometry on double bonds (/ and \ in SMILES) is preserved only when RDKit can unambiguously determine the configuration after parsing and sanitization. For some double bonds — particularly those in conjugated systems or where the source SMILES omits directional bonds on one side — RDKit cannot resolve the geometry and silently drops the / and \ notation. This is a known RDKit behavior, not a bug in harmonsmile. If E/Z fidelity is critical for your use case, validate SMILES_RDKit against the source SMILES.
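
One coarse way to act on the validation advice above is to flag rows where the source SMILES contains directional bond characters but the standardized string contains none. The sketch below is a plain string heuristic, not a chemical comparison: `directional_bond_count` and `ez_possibly_lost` are hypothetical helpers, and they only detect the case where every `/` and `\` disappeared. Partial loss on one of several double bonds would need a structure-aware check.

```python
from typing import Optional


def directional_bond_count(smiles: str) -> int:
    """Count the '/' and '\\' directional bond characters in a SMILES string."""
    return smiles.count("/") + smiles.count("\\")


def ez_possibly_lost(source: str, standardized: Optional[str]) -> bool:
    """Return True when the source marks E/Z geometry and the output marks none."""
    if standardized is None:
        # Invalid input is a separate failure mode, not an E/Z loss.
        return False
    return directional_bond_count(source) > 0 and directional_bond_count(standardized) == 0


# Illustrative inputs chosen for this sketch, not outputs produced by harmonsmile.
flagged = ez_possibly_lost("F/C=C/F", "FC=CF")
kept = ez_possibly_lost("F/C=C/F", "F/C=C/F")
```

Rows flagged this way are candidates for manual review; a False result does not prove that the geometry survived unchanged.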

Examples:

>>> from harmonsmile import RDKitStandardizer
>>> RDKitStandardizer.to_iso_kek("c1ccccc1")
'C1=CC=CC=C1'
>>> RDKitStandardizer.to_iso_kek("invalid")
>>> RDKitStandardizer.to_iso_kek("")

I/O Utilities

load_table

harmonsmile.load_table(path)

Load a tabular file into a DataFrame.

Supports CSV, TSV, TXT, XLSX, XLSM, and XLS formats. Automatically detects delimiter for text files; falls back to semicolon separator with latin-1 encoding if auto-detection fails.
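
The detect-then-fall-back behavior for text files can be illustrated with the standard library. This is a sketch under assumptions: the reference does not say how load_table detects the delimiter, so `csv.Sniffer` is a swapped-in technique, `read_rows` is a hypothetical helper, and it returns lists of strings instead of a DataFrame.

```python
import csv
import io
from typing import List


def read_rows(raw: bytes) -> List[List[str]]:
    """Hypothetical sketch: sniff the delimiter, else use ';' with latin-1."""
    try:
        text = raw.decode("utf-8")
        # Restrict the sniffer to a few plausible delimiters (an assumption).
        delimiter = csv.Sniffer().sniff(text[:4096], delimiters=",\t;|").delimiter
    except (UnicodeDecodeError, csv.Error):
        # Fallback described above: semicolon separator with latin-1 encoding.
        text = raw.decode("latin-1")
        delimiter = ";"
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


comma_rows = read_rows(b"a,b\n1,2\n3,4\n")
latin_rows = read_rows(b"name;caf\xe9\nx;y\n")
```

The second call exercises the fallback because the byte `\xe9` is not valid UTF-8 in that position, so decoding fails before any delimiter is sniffed.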

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str or PathLike | Path to the input file. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | Loaded DataFrame with cleaned 'id' and 'PubChem CID' columns if present. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the file does not exist at the given path. |
| ValueError | If the file format is not supported. |
| ValueError | If the loaded DataFrame has zero rows. |

Examples:

>>> from harmonsmile import load_table
>>> df = load_table("examples/example_chembl.csv")
>>> df = load_table("examples/example_pubchem.csv")

save_table

harmonsmile.save_table(df, path)

Save a DataFrame to a CSV file.

Parent directories are created automatically if they do not exist.
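
Automatic creation of parent directories is typically a single `pathlib` call before the file is opened. The sketch below shows that step with the standard library; `save_rows` is a hypothetical helper that writes dictionaries with the `csv` module, and it is not how save_table is known to be implemented.

```python
import csv
from pathlib import Path
from typing import Dict, List, Union


def save_rows(rows: List[Dict[str, str]], path: Union[str, Path]) -> Path:
    """Hypothetical sketch: create missing parent directories, then write CSV."""
    if not rows:
        raise ValueError("nothing to write")
    out = Path(path)
    # parents=True builds the whole chain; exist_ok=True tolerates existing dirs.
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return out
```

Calling `save_rows(rows, "results/nested/output.csv")` in an empty working directory would create `results/nested/` first, so callers do not need to prepare the output folder themselves.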

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | DataFrame to save. | required |
| path | str or PathLike | Output file path. Parent directories are created as needed. | required |

Examples:

>>> import pandas as pd
>>> from harmonsmile import save_table
>>> df = pd.DataFrame({"SMILES": ["C1=CC=CC=C1"], "SMILES_RDKit": ["C1=CC=CC=C1"]})
>>> save_table(df, "results/output.csv")