API Reference
Complete reference for all public classes, pipelines, and functions in HARMONSMILE. Documentation is auto-generated from NumPy-style docstrings in the source code.
Configuration
PubChemConfig
harmonsmile.PubChemConfig
dataclass
Immutable configuration for PubChemIngest.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_path | str | Path to the input file (CSV, TSV, XLSX). Must not be empty or contain path traversal patterns ('..'). | required |
| output_path | str | Path to the output CSV file. Must not be empty or contain path traversal patterns ('..'). | required |
| error_log | str | Path to the error log file. Defaults to 'logs/errors.txt'. | 'logs/errors.txt' |
| cid_col | str | Name of the PubChem CID column. Must not be empty or whitespace-only. Defaults to 'PubChem CID'. | 'PubChem CID' |
| props | tuple of str | PubChem properties to fetch. Must contain at least one valid property name. Defaults to all available properties. | ('SMILES', 'ConnectivitySMILES', 'MolecularFormula', 'MolecularWeight', 'InChI', 'InChIKey', 'XLogP', 'TPSA', 'Charge', 'HBondDonorCount', 'HBondAcceptorCount', 'RotatableBondCount', 'HeavyAtomCount') |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| ValueError | If |
| ValueError | If |
| ValueError | If |
Examples:
>>> from harmonsmile import PubChemConfig
>>> cfg = PubChemConfig(
... input_path="examples/example_pubchem.csv",
... output_path="results/pubchem_harmonized.csv",
... )
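The constraints listed for the configuration fields (non-empty paths, no '..', a non-blank column name, at least one property) can be enforced at construction time. The following is a minimal sketch of that pattern with a frozen dataclass; SketchConfig and _check_path are hypothetical names used only for illustration, the property tuple is shortened, and nothing here should be read as the library's actual implementation.

```python
from dataclasses import FrozenInstanceError, dataclass


def _check_path(name: str, value: str) -> None:
    """Reject empty paths and paths that contain '..'."""
    if not value or not value.strip():
        raise ValueError(f"{name} must not be empty")
    if ".." in value:
        raise ValueError(f"{name} must not contain '..'")


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for PubChemConfig (illustration only)."""

    input_path: str
    output_path: str
    error_log: str = "logs/errors.txt"
    cid_col: str = "PubChem CID"
    props: tuple = ("SMILES", "InChIKey")  # shortened default for the sketch

    def __post_init__(self) -> None:
        # Validation only reads fields, so it works on a frozen dataclass.
        _check_path("input_path", self.input_path)
        _check_path("output_path", self.output_path)
        if not self.cid_col.strip():
            raise ValueError("cid_col must not be empty or whitespace-only")
        if not self.props:
            raise ValueError("props must contain at least one property")


cfg = SketchConfig(
    input_path="examples/example_pubchem.csv",
    output_path="results/pubchem_harmonized.csv",
)
```

Because the dataclass is frozen, assigning to a field after construction raises FrozenInstanceError, which is what "immutable" means in practice here.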
ChEMBLConfig
harmonsmile.ChEMBLConfig
dataclass
Immutable configuration for ChEMBLIngest.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_path | str | Path to the input file (CSV, TSV, XLSX). Must not be empty or contain path traversal patterns ('..'). | required |
| output_path | str | Path to the output CSV file. Must not be empty or contain path traversal patterns ('..'). | required |
| chembl_id_col | str | Name of the ChEMBL ID column in the input file. Must not be empty or whitespace-only. Defaults to 'ChEMBL ID'. | 'ChEMBL ID' |
| error_log | str | Path to the error log file. Defaults to 'logs/errors.txt'. | 'logs/errors.txt' |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| ValueError | If |
| ValueError | If |
Examples:
>>> from harmonsmile import ChEMBLConfig
>>> cfg = ChEMBLConfig(
... input_path="examples/example_chembl.csv",
... output_path="results/chembl_harmonized.csv",
... )
SMILESConfig
harmonsmile.SMILESConfig
dataclass
Immutable configuration for SMILESPrep.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_path | str | Path to the input file (CSV, TSV, XLSX). Must not be empty or contain path traversal patterns ('..'). | required |
| output_path | str | Path to the output CSV file. Must not be empty or contain path traversal patterns ('..'). | required |
| smiles_col | str | Name of the column containing SMILES strings. Must not be empty or whitespace-only. | required |
| error_log | str | Path to the error log file. Defaults to 'logs/errors.txt'. | 'logs/errors.txt' |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| ValueError | If |
| ValueError | If |
Examples:
>>> from harmonsmile import SMILESConfig
>>> cfg = SMILESConfig(
... input_path="examples/example_smiles.csv",
... smiles_col="SMILES",
... output_path="results/smiles_harmonized.csv",
... )
Pipelines
PubChemIngest
harmonsmile.PubChemIngest
Pipeline for ingesting and harmonizing PubChem compound data.
Fetches properties from the PubChem REST API and appends a standardized SMILES_RDKit column using RDKit canonicalization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | PubChemConfig | Pipeline configuration. | required |
| client | _PubChemClient or None | PubChem API client. Created automatically if not provided. | None |
| std | RDKitStandardizer or None | SMILES standardizer. Created automatically if not provided. | None |
Examples:
>>> from harmonsmile import PubChemIngest, PubChemConfig
>>> cfg = PubChemConfig(
... input_path="examples/example_pubchem.csv",
... output_path="results/pubchem_harmonized.csv",
... )
>>> df = PubChemIngest(cfg).run()
run()
Execute the PubChem ingestion pipeline.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with the original columns plus the fetched properties and a standardized SMILES_RDKit column. |

Raises:

| Type | Description |
|---|---|
| ValueError | If the configured CID column is not found in the input file. |
| ValueError | If the input file has zero rows. |
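To make the "fetches properties from the PubChem REST API" step more concrete, here is a small sketch that only builds a PUG REST property URL and makes no network call. The helper name property_url is hypothetical, and how the library's private client actually batches or issues requests is not documented on this page, so treat this as an illustration of the public URL scheme rather than of harmonsmile internals.

```python
from typing import Iterable, Sequence

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cids: Iterable[int], props: Sequence[str]) -> str:
    """Build a PUG REST URL requesting `props` for the given CIDs as JSON."""
    cid_list = [str(int(c)) for c in cids]  # int() rejects non-numeric IDs
    if not cid_list:
        raise ValueError("at least one CID is required")
    if not props:
        raise ValueError("at least one property is required")
    return (
        f"{PUG_REST}/compound/cid/{','.join(cid_list)}"
        f"/property/{','.join(props)}/JSON"
    )


url = property_url([2244, 3672], ["SMILES", "InChIKey", "MolecularWeight"])
print(url)
```

The property names passed in correspond to entries of the props tuple in PubChemConfig.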
ChEMBLIngest
harmonsmile.ChEMBLIngest
Pipeline for ingesting and harmonizing ChEMBL compound data.
Fetches properties from the ChEMBL REST API by ChEMBL ID, applies RDKit canonicalization to produce a standardized SMILES_RDKit column, and saves the result as a CSV file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | ChEMBLConfig | Pipeline configuration. | required |
| client | _ChEMBLClient or None | ChEMBL API client. Created automatically if not provided. | None |
| std | RDKitStandardizer or None | SMILES standardizer. Created automatically if not provided. | None |
Examples:
>>> from harmonsmile import ChEMBLIngest, ChEMBLConfig
>>> cfg = ChEMBLConfig(
... input_path="examples/example_chembl.csv",
... output_path="results/chembl_harmonized.csv",
... )
>>> df = ChEMBLIngest(cfg).run()
run()
Execute the ChEMBL ingestion pipeline.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with original columns plus fetched and renamed ChEMBL properties and a standardized SMILES_RDKit column. |

Raises:

| Type | Description |
|---|---|
| ValueError | If the configured ChEMBL ID column is not found in the input file. |
| ValueError | If the input file has zero rows. |
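As a rough illustration of a lookup "by ChEMBL ID", the sketch below normalizes an identifier and builds the corresponding molecule URL for the public ChEMBL web services, without sending a request. The helper molecule_url and the ID pattern are assumptions made for this example; the request logic of the library's private client is not described on this page.

```python
import re

CHEMBL_API = "https://www.ebi.ac.uk/chembl/api/data"
_CHEMBL_ID = re.compile(r"CHEMBL\d+")


def molecule_url(chembl_id: str) -> str:
    """Return the JSON endpoint for one molecule, validating the ID format."""
    cleaned = chembl_id.strip().upper()
    if not _CHEMBL_ID.fullmatch(cleaned):
        raise ValueError(f"not a ChEMBL ID: {chembl_id!r}")
    return f"{CHEMBL_API}/molecule/{cleaned}.json"


print(molecule_url(" chembl25 "))
```

Validating identifiers before any request is made keeps malformed rows out of the network path, so they can be written to an error log instead.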
SMILESPrep
harmonsmile.SMILESPrep
Pipeline for harmonizing SMILES from any tabular source.
Reads a tabular file, applies RDKit canonicalization to the specified SMILES column, and saves the result with an appended SMILES_RDKit column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | SMILESConfig | Pipeline configuration. | required |
| std | RDKitStandardizer or None | SMILES standardizer. Created automatically if not provided. | None |
Examples:
>>> from harmonsmile import SMILESPrep, SMILESConfig
>>> cfg = SMILESConfig(
... input_path="examples/example_smiles.csv",
... smiles_col="SMILES",
... output_path="results/smiles_harmonized.csv",
... )
>>> df = SMILESPrep(cfg).run()
run()
Execute the SMILES preparation pipeline.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with the original columns plus a standardized SMILES_RDKit column. |

Raises:

| Type | Description |
|---|---|
| ValueError | If the specified SMILES column is not found in the input file. |
| ValueError | If the input file has zero rows. |
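The read, standardize, append flow described for SMILESPrep can be sketched without pandas or RDKit. In the example below, rows are plain dictionaries and the standardizer is an injected callable; append_standardized and placeholder_standardizer are hypothetical names, and the placeholder only strips whitespace, so it stands in for real canonicalization purely to show the shape of the pipeline and its two documented error cases.

```python
from typing import Callable, Optional


def append_standardized(
    rows: list[dict],
    smiles_col: str,
    standardize: Callable[[str], Optional[str]],
    out_col: str = "SMILES_RDKit",
) -> list[dict]:
    """Return copies of `rows` with a standardized SMILES column appended."""
    if not rows:
        raise ValueError("input has zero rows")
    if smiles_col not in rows[0]:
        raise ValueError(f"column {smiles_col!r} not found")
    # Build new dicts so the caller's rows are left unmodified.
    return [{**row, out_col: standardize(row[smiles_col])} for row in rows]


def placeholder_standardizer(smiles: str) -> Optional[str]:
    """Placeholder only: strips whitespace, returns None for empty input."""
    cleaned = smiles.strip()
    return cleaned or None


rows = [{"id": "1", "SMILES": " CCO "}, {"id": "2", "SMILES": ""}]
result = append_standardized(rows, "SMILES", placeholder_standardizer)
```

Passing the standardizer as an argument mirrors the optional std parameter above, which makes it straightforward to substitute a stub in tests.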
Standardization
RDKitStandardizer
harmonsmile.RDKitStandardizer
Standardize SMILES strings using RDKit.
Converts input SMILES to a consistent canonical form following the COCONUT 2.0 convention: canonical + isomeric + Kekulized.
to_conn_kek(smiles)
staticmethod
Convert SMILES to canonical + connectivity-only + Kekulized form.
Stereochemistry is stripped. Useful for connectivity-based comparisons where chirality is not relevant.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| smiles | str | Input SMILES string. | required |

Returns:

| Type | Description |
|---|---|
| str or None | Standardized SMILES without stereochemistry, or None if invalid. |
Examples:
>>> from harmonsmile import RDKitStandardizer
>>> RDKitStandardizer.to_conn_kek("C[C@@H](O)F")
'CC(O)F'
>>> RDKitStandardizer.to_conn_kek("invalid")
>>> RDKitStandardizer.to_conn_kek("")
to_iso_kek(smiles)
staticmethod
Convert SMILES to canonical + isomeric + Kekulized form.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| smiles | str | Input SMILES string. | required |

Returns:

| Type | Description |
|---|---|
| str or None | Standardized SMILES, or None if input is invalid. |
Notes:
Chiral centers (e.g. [C@@H]) are preserved because RDKit encodes tetrahedral stereochemistry independently of kekulization.
E/Z geometry on double bonds (/ and \ in SMILES) is preserved only when RDKit can unambiguously determine the configuration after parsing and sanitization. For some double bonds, particularly those in conjugated systems or where the source SMILES omits directional bonds on one side, RDKit cannot resolve the geometry and silently drops the / and \ notation. This is known RDKit behavior, not a bug in harmonsmile. If E/Z fidelity is critical for your use case, validate SMILES_RDKit against the source SMILES.
Examples:
>>> from harmonsmile import RDKitStandardizer
>>> RDKitStandardizer.to_iso_kek("c1ccccc1")
'C1=CC=CC=C1'
>>> RDKitStandardizer.to_iso_kek("invalid")
>>> RDKitStandardizer.to_iso_kek("")
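The note on E/Z geometry suggests validating SMILES_RDKit against the source SMILES. One coarse way to do that, sketched below, is to flag rows where the source string contains directional bond characters but the standardized string contains none. The function name is hypothetical and the check is deliberately simple: it detects only a complete loss of / and \ notation, not a partial loss or a flipped configuration.

```python
from typing import Optional


def lost_double_bond_geometry(source: str, standardized: Optional[str]) -> bool:
    """True if `source` has '/' or '\\' bonds but `standardized` has none."""
    if standardized is None:
        return False  # invalid SMILES is a separate failure, not an E/Z loss
    had = "/" in source or "\\" in source
    kept = "/" in standardized or "\\" in standardized
    return had and not kept


pairs = [("F/C=C/F", "FC=CF"), ("F/C=C/F", "F/C=C/F"), ("CCO", "CCO")]
flags = [lost_double_bond_geometry(src, std) for src, std in pairs]
print(flags)
```

Rows flagged this way are candidates for manual review; a stricter comparison would need a cheminformatics toolkit to compare the assigned double-bond stereo directly.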
I/O Utilities
load_table
harmonsmile.load_table(path)
Load a tabular file into a DataFrame.
Supports CSV, TSV, TXT, XLSX, XLSM, and XLS formats. Automatically detects delimiter for text files; falls back to semicolon separator with latin-1 encoding if auto-detection fails.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str or PathLike | Path to the input file. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | Loaded DataFrame with cleaned 'id' and 'PubChem CID' columns if present. |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If the file does not exist at the given path. |
| ValueError | If the file format is not supported. |
| ValueError | If the loaded DataFrame has zero rows. |
Examples:
>>> from harmonsmile import load_table
>>> df = load_table("examples/example_chembl.csv")
>>> df = load_table("examples/example_pubchem.csv")
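Delimiter auto-detection with a semicolon fallback can be approximated using the standard library's csv.Sniffer. The sketch below is an assumption about one reasonable way to implement such behavior, not a description of how load_table does it; detect_delimiter is a hypothetical helper, and the latin-1 encoding fallback is left out to keep the example short.

```python
import csv


def detect_delimiter(sample: str, fallback: str = ";") -> str:
    """Guess the delimiter of `sample`, returning `fallback` on failure."""
    try:
        # Restrict candidates so ordinary letters are never chosen.
        return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
    except csv.Error:
        return fallback


comma = detect_delimiter("a,b,c\n1,2,3\n4,5,6\n")
tab = detect_delimiter("a\tb\n1\t2\n3\t4\n")
single_column = detect_delimiter("name\nfoo\nbar\n")
```

A single-column file gives the sniffer nothing to detect, so the fallback value is returned for it.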
save_table
harmonsmile.save_table(df, path)
Save a DataFrame to a CSV file.
Parent directories are created automatically if they do not exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame to save. | required |
| path | str or PathLike | Output file path. Parent directories are created as needed. | required |
Examples:
>>> import pandas as pd
>>> from harmonsmile import save_table
>>> df = pd.DataFrame({"SMILES": ["C1=CC=CC=C1"], "SMILES_RDKit": ["C1=CC=CC=C1"]})
>>> save_table(df, "results/output.csv")
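The "parent directories are created automatically" behavior corresponds to a common pathlib pattern. The sketch below shows it with the standard library's csv module and plain dictionaries instead of a DataFrame; save_rows is a hypothetical helper written for this illustration and is not the library's save_table.

```python
import csv
import tempfile
from pathlib import Path


def save_rows(rows: list[dict], path) -> Path:
    """Write `rows` as CSV at `path`, creating parent directories first."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return target


# Write into a nested directory that does not exist yet.
with tempfile.TemporaryDirectory() as tmp:
    out = save_rows(
        [{"SMILES": "C1=CC=CC=C1", "SMILES_RDKit": "C1=CC=CC=C1"}],
        Path(tmp) / "results" / "nested" / "output.csv",
    )
    written = out.read_text(encoding="utf-8")
```

Using exist_ok=True makes repeated runs against the same output directory safe.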