Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.2.2 - 2026-05-21
Added
- MkDocs + Material theme + mkdocstrings documentation infrastructure.
- docs/index.md: home page mirroring README.md content.
- docs/api.md: auto-generated API reference from NumPy-style docstrings.
- docs/changelog.md: changelog page including CHANGELOG.md via snippets.
- mkdocs.yml: MkDocs configuration with Material theme, mkdocstrings plugin (NumPy style), and navigation structure.
- .github/workflows/docs.yml: new workflow that builds and deploys documentation to GitHub Pages on every push to main.
- pyproject.toml: new docs optional dependency group (mkdocs-material>=9.5, mkdocstrings[python]>=0.25).
- pyproject.toml: Documentation URL added under [project.urls].
Removed
- API.md: replaced by auto-generated docs/api.md.
Internal
- requirements-dev.txt: documentation dependencies added.
- version.py: version bumped to 0.2.2.
0.2.1 - 2026-05-20
Changed
- ci.yml: Python version matrix updated to 3.11 and 3.12 (Python 3.10 dropped; end-of-life October 2026).
- ci.yml: import boundary check replaced with explicit assert statements for all seven key public symbols (PubChemConfig, ChEMBLConfig, SMILESConfig, RDKitStandardizer, PubChemIngest, ChEMBLIngest, SMILESPrep). Same assertions applied inside the wheel smoke install step.
- ci.yml: removed redundant setuptools>=68 and wheel from the install step; both are unnecessary with hatchling as the build backend.
- ci.yml: fixed trailing whitespace in python -m build line.
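The assert-based import boundary check could look roughly like the sketch below. It is an illustration only, not the actual ci.yml step: the helper name missing_public_symbols is hypothetical, and the demo runs against the standard-library json module so that the snippet works without harmonsmile installed.

```python
import importlib

# Symbol names copied from the changelog entry above.
EXPECTED_SYMBOLS = (
    "PubChemConfig", "ChEMBLConfig", "SMILESConfig", "RDKitStandardizer",
    "PubChemIngest", "ChEMBLIngest", "SMILESPrep",
)


def missing_public_symbols(module_name, symbols):
    """Return the names in `symbols` that `module_name` does not expose.

    Hypothetical helper; the real CI step may use plain assert statements.
    """
    module = importlib.import_module(module_name)
    return [name for name in symbols if not hasattr(module, name)]


# Demo on a stdlib module so the sketch runs anywhere.
assert missing_public_symbols("json", ("loads", "dumps")) == []
assert missing_public_symbols("json", ("loads", "no_such_name")) == ["no_such_name"]
```

In a CI job, the same helper could be asserted against the installed package, for example `assert missing_public_symbols("harmonsmile", EXPECTED_SYMBOLS) == []`, once after the wheel install and once after the sdist install.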
Added
- ci.yml: new "Smoke install sdist" step that validates the .tar.gz source distribution independently from the wheel.
Internal
- .vscode/ removed from Git index (was previously tracked despite being listed in .gitignore).
0.2.0 - 2026-05-20
Changed (breaking)
- config.py: Config removed and replaced by three dedicated frozen dataclasses (PubChemConfig, ChEMBLConfig, and SMILESConfig), one per pipeline. All three share the same validation pattern (input_path and output_path must not be empty or contain ".."; pipeline-specific column fields must not be empty or whitespace-only).
- PubChemIngest.__init__: now accepts cfg: PubChemConfig instead of cfg: Config.
- ChEMBLIngest.__init__: now accepts cfg: ChEMBLConfig instead of direct input_path, output_path, and chembl_id_col parameters.
- SMILESPrep.__init__: now accepts cfg: SMILESConfig instead of direct input_path, smiles_col, and output_path parameters.
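The shared validation pattern described above can be sketched with a frozen dataclass and __post_init__. ExampleSMILESConfig is a hypothetical stand-in, not the package's SMILESConfig; the default column name and the error messages are assumptions.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ExampleSMILESConfig:
    """Hypothetical stand-in that mimics the documented validation rules."""

    input_path: str
    output_path: str
    smiles_col: str = "SMILES"  # assumed default, not taken from the package

    def __post_init__(self):
        # Paths must not be empty and must not contain "..".
        for field_name in ("input_path", "output_path"):
            value = str(getattr(self, field_name))
            if not value:
                raise ValueError(f"{field_name} must not be empty")
            if ".." in value:
                raise ValueError(f"{field_name} must not contain '..'")
        # Column fields must not be empty or whitespace-only.
        if not self.smiles_col.strip():
            raise ValueError("smiles_col must not be empty or whitespace-only")


cfg = ExampleSMILESConfig("data/in.csv", "results/out.csv")
```

Because the dataclass is frozen, assigning to cfg.smiles_col after construction raises FrozenInstanceError, so a validated configuration cannot drift into an invalid state later.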
Added
- PubChemConfig: exported in __all__, replaces Config for PubChemIngest.
- ChEMBLConfig: new public config dataclass for ChEMBLIngest with input_path, output_path, chembl_id_col, and error_log fields.
- SMILESConfig: new public config dataclass for SMILESPrep with input_path, output_path, smiles_col, and error_log fields.
Docs
- __init__.py: module docstring updated; classes section and all examples reflect the new config classes.
- __main__.py: docstring examples updated; removed stale --coconut-* example that survived from v0.1.2.
- pipelines.py: all three pipeline docstrings updated with new config types and examples.
- API.md: updated to v0.2.0; Config removed, three new config classes documented with full parameter and validation tables.
- README.md: all Python API examples updated to use new config classes; version badge bumped to v0.2.0.
Tests
- test_config.py: TestConfig renamed to TestPubChemConfig; TestChEMBLConfig and TestSMILESConfig added (10 cases each).
- test_pipelines.py: all pipeline instantiation updated to use new config classes.
- test_security.py: TestConfigSecurity renamed to TestPubChemConfigSecurity; TestChEMBLConfigSecurity and TestSMILESConfigSecurity added.
- test_cli.py: batch ChEMBL and SMILES tests updated to read paths from config object instead of kwargs.
- Final count: 146 tests passing (+27 net vs v0.1.3).
Internal
- _cli.py: internal instantiation updated to use PubChemConfig, ChEMBLConfig, and SMILESConfig.
- version.py: version bumped to 0.2.0.
0.1.3 - 2026-05-19
Changed
- _cli.py: removed deprecated --coconut-in, --coconut-out, and --coconut-smiles flags and their migration logic. These aliases were introduced in v0.1.2 as a compatibility bridge; use --smiles-in, --smiles-out, and --smiles-col instead.
- pipelines.py: removed CoconutPrep deprecated alias for SMILESPrep.
- __init__.py: removed CoconutPrep from imports and __all__.
Fixed
- config.py: Config.__post_init__ now raises ValueError when cid_col is an empty or whitespace-only string.
- config.py: Config.__post_init__ now rejects input_path values containing ".." (path traversal), consistent with the existing check on output_path.
- io.py: load_table now raises FileNotFoundError with a descriptive message when the input file does not exist, instead of propagating a raw pandas or OS error.
- io.py: load_table now raises ValueError when the loaded DataFrame has zero rows.
- io.py: _sanitize_cid no longer raises on unexpected input types (bool, list, dict, etc.); the pd.isna() call is now wrapped in a try/except.
- pipelines.py: PubChemIngest.run(), ChEMBLIngest.run(), and SMILESPrep.run() now raise ValueError early when the input DataFrame is empty, instead of silently producing an empty output file.
- pipelines.py: PubChemIngest.run() now emits a logger.warning when the SMILES column is absent after fetching, making the silent omission of SMILES_RDKit visible.
- pipelines.py: SMILESPrep.run() now creates output directories only after load_table succeeds, avoiding the creation of empty directories when the input file does not exist.
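The two load_table guards (missing file, zero rows) follow a fail-early pattern that can be sketched as below. The real function works on pandas DataFrames; this stand-in uses the standard-library csv module so it runs without dependencies, and the name load_rows and the error messages are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def load_rows(path):
    """Hypothetical stand-in for load_table, using csv instead of pandas."""
    path = Path(path)
    # Guard 1: report a missing file with a descriptive message.
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")
    with path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    # Guard 2: a header-only (or empty) file has no data to process.
    if not rows:
        raise ValueError(f"Input file has no data rows: {path}")
    return rows


# Demo: a header-only file triggers the empty-input guard.
with tempfile.TemporaryDirectory() as tmp:
    header_only = Path(tmp) / "empty.csv"
    header_only.write_text("cid\n", encoding="utf-8")
    try:
        load_rows(header_only)
    except ValueError as exc:
        empty_error = str(exc)
```

Raising before any output directory is created or any network call is made keeps a failed run from leaving empty files or folders behind, which matches the intent of the pipeline fixes listed above.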
Improved
- _cli.py: --help output now includes a concrete usage epilog with examples for every pipeline (PubChem batch, ChEMBL batch, SMILES batch, single-entry CID, single-entry ChEMBL ID, multi-pipeline call).
- _cli.py: the "nothing to run" error message now lists all valid flag combinations and includes usage examples.
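A usage epilog of this kind is typically attached through argparse. The sketch below is an assumption about the approach, not the contents of _cli.py: the flag names come from this changelog, while the epilog wording, the placeholder file names, and the "single entry" group title are invented for illustration.

```python
import argparse

# Placeholder file names; the real epilog text is not reproduced here.
EPILOG = """\
examples:
  harmonsmile --smiles-in input.csv --smiles-out output.csv --smiles-col SMILES
  harmonsmile --pubchem-cid <CID>
  harmonsmile --chembl-id <ID>
"""


def build_parser():
    parser = argparse.ArgumentParser(
        prog="harmonsmile",
        epilog=EPILOG,
        # Keep the epilog's line breaks instead of re-wrapping the text.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    smiles = parser.add_argument_group("SMILES (batch)")
    smiles.add_argument("--smiles-in")
    smiles.add_argument("--smiles-out")
    smiles.add_argument("--smiles-col")
    single = parser.add_argument_group("single entry")  # assumed group title
    single.add_argument("--pubchem-cid")
    single.add_argument("--chembl-id")
    return parser


help_text = build_parser().format_help()
```

RawDescriptionHelpFormatter matters here: with the default formatter, argparse would reflow the example commands into a single paragraph.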
Docs
- config.py: Config docstring updated to document new cid_col and input_path validation in the Raises section.
- io.load_table: Raises section updated to document FileNotFoundError and empty-DataFrame ValueError.
- io.save_table: documented the automatic creation of missing parent directories.
- pipelines.PubChemIngest.run(), ChEMBLIngest.run(), SMILESPrep.run(): Raises sections updated to document empty-DataFrame ValueError.
Tests
- Added 14 new test cases and removed 4 obsolete ones (deprecated coconut CLI aliases), for a net total of 119 tests (+10 vs v0.1.2).
- New cases cover: Config validation (cid_col empty or whitespace, input_path path traversal), _sanitize_cid edge types (bool, list, dict), load_table error paths (nonexistent file, empty file), empty-DataFrame guards in all three pipelines, missing-SMILES-column warning in PubChemIngest, and a new test_pipelines.py module.
0.1.2 - 2026-05-18
Added
- CLI: Single Entry mode: harmonsmile --pubchem-cid <CID> and harmonsmile --chembl-id <ID> fetch and harmonize a single compound without an input file. Output saved automatically to results/CID{cid}_harmonsmile.csv and results/{id}_harmonsmile.csv.
- examples/ directory with real-world fetch scripts and capsaicin datasets for PubChem, ChEMBL, and SMILES batch modes.
- Unit tests for CLI (test_cli.py, 32 tests) covering batch modes, single entry modes, deprecated aliases, mutual exclusion, and validation.
Changed
- CLI: --coconut-in, --coconut-out, --coconut-smiles renamed to --smiles-in, --smiles-out, --smiles-col for source-agnostic naming. Deprecated aliases kept with DeprecationWarning.
- CLI help groups renamed: "COCONUT / independent" → "SMILES (batch)"; "PubChem" → "PubChem (batch)"; "ChEMBL" → "ChEMBL (batch)".
- API.md formalized with complete reference for all public classes and methods.
Fixed
- ChEMBLIngest: duplicate name column in output when input file already contained a name column and ChEMBL API returned pref_name.
Docs
- RDKitStandardizer.to_iso_kek(): added Notes section documenting E/Z geometry behavior: chiral centers are preserved; E/Z on double bonds may be lost for ambiguous cases during kekulization (known RDKit behavior).
0.1.1 - 2026-05-18
Added
- RDKitStandardizer class with two SMILES normalization methods:
  - to_iso_kek(): canonical + isomeric + Kekulized SMILES (COCONUT 2.0 convention)
  - to_conn_kek(): canonical + connectivity-only + Kekulized SMILES
- PubChemIngest pipeline: fetches all available properties from PubChem REST API (SMILES, ConnectivitySMILES, MolecularWeight, MolecularFormula, InChI, InChIKey, XLogP, TPSA, Charge, HBondDonorCount, HBondAcceptorCount, RotatableBondCount, HeavyAtomCount) and appends standardized SMILES_RDKit column.
- ChEMBLIngest pipeline: fetches properties from ChEMBL REST API by ChEMBL ID (canonical_smiles, InChI, InChIKey, MW, MolecularFormula, ALogP, TPSA, HBA, HBD, RotatableBonds, HeavyAtoms, QED, Ro5Violations) and appends standardized SMILES_RDKit column.
- SMILESPrep pipeline: standardizes SMILES from any CSV/Excel file using RDKit; accepts any tabular source (COCONUT, ChEMBL downloads, in-house databases, etc.).
- _PubChemClient with configurable retries, exponential backoff, persistent requests.Session, context manager protocol, and pluggable logger.
- _ChEMBLClient with same design as _PubChemClient: ChEMBL ID format validation, exponential backoff, context manager protocol.
- Config frozen dataclass for pipeline configuration with __post_init__ validation.
- load_table() and save_table() I/O utilities supporting CSV, TSV, XLSX, and XLS formats, with PathLike support.
- version.py as single source of truth for package metadata (__version__, PROJECT_NAME, PROJECT_VERSION, PROJECT_STATUS).
- Command-line interface via harmonsmile entry point and python -m harmonsmile, with paired argument validation, grouped help output, and --version flag.
- pyproject.toml for PyPI packaging (build backend: hatchling).
- SPDX license headers (LGPL-3.0-or-later) in all source files.
- NumPy-style docstrings with Examples in all public modules and classes.
- CITATION.cff for software citation.
- CHANGELOG.md following Keep a Changelog format.
- environment.yml and requirements-dev.txt for reproducible environments.
- Unit test suite with pytest covering standardize, config, io, pubchem, chembl, and security.
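The "configurable retries, exponential backoff" behavior of the two clients can be illustrated with a generic retry helper. This is a sketch of the general technique, not the code in _PubChemClient or _ChEMBLClient: call_with_backoff, its parameters, and the doubling schedule are assumptions.

```python
import time


def call_with_backoff(func, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt, then retry.

    Hypothetical helper. `sleep` is injectable so the delays can be observed
    in a test without actually waiting.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return func()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < retries - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error


# Demo: fail twice, succeed on the third attempt, and record the delays.
delays = []
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = call_with_backoff(flaky, retries=3, base_delay=0.5, sleep=delays.append)
```

With these settings the helper waits 0.5 s after the first failure and 1.0 s after the second, so the recorded delays are [0.5, 1.0] and the result is "ok".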
Security
- _PubChemClient: bounds validation on sleep (0.1–10.0 s) and retries (1–10).
- _PubChemClient.fetch_props(): CID sanitization strips non-numeric characters before URL construction.
- _ChEMBLClient: same bounds validation; ChEMBL ID format validated against ^CHEMBL\d+$ regex before network calls.
- Config: path traversal guard rejects output_path containing "..".
- Config: VALID_PUBCHEM_PROPS allowlist validates requested PubChem properties.
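The input checks listed above are small enough to sketch directly. The bounds and the regex are taken from this changelog; the function names and error messages are hypothetical and do not reflect the private client code.

```python
import re

# Pattern quoted from the changelog entry above.
CHEMBL_ID_PATTERN = re.compile(r"^CHEMBL\d+$")


def validate_client_settings(sleep, retries):
    """Reject values outside the documented bounds (hypothetical helper)."""
    if not 0.1 <= sleep <= 10.0:
        raise ValueError(f"sleep must be between 0.1 and 10.0 s, got {sleep}")
    if not 1 <= retries <= 10:
        raise ValueError(f"retries must be between 1 and 10, got {retries}")
    return sleep, retries


def sanitize_cid(raw):
    """Strip every non-digit character before the value reaches a URL."""
    return re.sub(r"\D", "", str(raw))


def is_valid_chembl_id(value):
    # fullmatch also rejects a trailing newline, which "$" alone would allow.
    return CHEMBL_ID_PATTERN.fullmatch(value) is not None
```

For example, sanitize_cid("2244/../x") returns "2244", and is_valid_chembl_id("chembl25") is False because the pattern is case-sensitive.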
Changed
- License changed from MIT to GNU Lesser General Public License v3.0 or later (LGPL-3.0-or-later).
- __main__.py now delegates to harmonsmile._cli instead of cli.harmonize, making the package self-contained and installable from PyPI.
- Console status messages changed to English for international audience.
- Config is now immutable (frozen=True).
- CoconutPrep renamed to SMILESPrep to reflect its universal scope. CoconutPrep remains available as a deprecated alias and will be removed in a future release.
- PubChemClient renamed to _PubChemClient (private) to prevent direct use that could abuse the PubChem REST API. PubChemClient remains available as a deprecated alias and will be removed in a future release.
- Default props in Config expanded to include all available PubChem properties.
- Development status set to Alpha (3 - Alpha) reflecting first public release.
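One common way to keep a renamed class available as a deprecated alias is a thin subclass that warns on construction. The changelog does not say how the CoconutPrep alias was implemented, so the sketch below is an assumption; the SMILESPrep body here is a minimal placeholder, not the real pipeline.

```python
import warnings


class SMILESPrep:
    """Minimal placeholder for the real pipeline class."""

    def __init__(self, cfg=None):
        self.cfg = cfg


class CoconutPrep(SMILESPrep):
    """Deprecated alias: emits a warning, then behaves like SMILESPrep."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "CoconutPrep is deprecated; use SMILESPrep instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this class
        )
        super().__init__(*args, **kwargs)


# Demo: capture the warning raised when the old name is used.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = CoconutPrep()
```

Existing code keeps working because the alias is still an instance of SMILESPrep, while the warning gives users a release cycle to migrate before the old name is removed.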
Fixed
- Double time.sleep() call in _PubChemClient.fetch_props() that caused unnecessary delays on successful requests.
- Missing column validation in PubChemIngest.run() before initiating network calls.
- Incorrect guard condition for SMILES_RDKit counter in PubChemIngest.run().
- Unguarded Chem.MolToSmiles() call in RDKitStandardizer that could raise unhandled C++ exceptions for unusual aromaticity models.
- Fallback encoding in load_table() changed to latin-1 to correctly handle non-UTF-8 encoded files.
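The latin-1 fallback mentioned in the last item can be sketched as a two-step decode. This illustrates the idea only; load_table() itself reads tabular files, and the helper name read_text_with_fallback is hypothetical.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path, primary="utf-8", fallback="latin-1"):
    """Decode as UTF-8 first; on failure, decode the same bytes as latin-1."""
    data = Path(path).read_bytes()
    try:
        return data.decode(primary)
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this decode cannot fail.
        return data.decode(fallback)


# Demo: the byte 0xE9 ("é" in latin-1) is not valid UTF-8 on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "legacy.csv"
    sample.write_bytes(b"name\ncaf\xe9\n")
    text = read_text_with_fallback(sample)
```

The trade-off is that latin-1 never raises, so a file in some other legacy encoding would load without error but with wrong characters.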
Removed
- Redundant cli/ scripts (harmonize.py, ingest_pubchem.py, prep_coconut.py) superseded by the unified harmonsmile entry point.
- Unused id_col field from Config dataclass.
Future Releases (Planned)
0.3.0 - ML-ready features
- Standardized pipeline to generate ECFP fingerprints (with/without chirality).
- InChI / InChIKey generation for deduplication and robust cross-database matching.