Usage

This page shows how to run HDDFlyzer from the command line and how to interpret the result folder it creates.

Installation

Once v0.1.5 has been merged, tagged, and published through the manual PyPI workflow, install HDDFlyzer from PyPI:

pip install hddflyzer

Python 3.11 or newer is required. RDKit is best installed from conda-forge before HDDFlyzer, for example in a dedicated conda environment:

conda create -n hddflyzer_env python=3.11
conda activate hddflyzer_env
conda install -c conda-forge rdkit
pip install hddflyzer

UMAP support relies on umap-learn, which is listed as an optional dependency and is used when it is available.
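To confirm that an environment meets the requirements above, a small check along the following lines may help. This is a sketch, not part of HDDFlyzer: the function name check_environment is hypothetical, and it assumes that the installed package is importable as hddflyzer. The import names rdkit and umap are the ones used by RDKit and umap-learn.

```python
import importlib.util
import sys


def check_environment() -> dict[str, bool]:
    """Report whether the interpreter and packages mentioned above are present.

    Hypothetical helper for illustration; HDDFlyzer does not ship this function.
    """
    return {
        # The documentation asks for Python 3.11 or newer.
        "python>=3.11": sys.version_info >= (3, 11),
        # RDKit is imported as `rdkit`.
        "rdkit": importlib.util.find_spec("rdkit") is not None,
        # umap-learn is imported as `umap`; it is optional.
        "umap-learn (optional)": importlib.util.find_spec("umap") is not None,
        # Assumption: the package's import name matches its PyPI name.
        "hddflyzer": importlib.util.find_spec("hddflyzer") is not None,
    }


for name, present in check_environment().items():
    print(f"{name:24s} {'ok' if present else 'missing'}")
```

A missing umap-learn entry only affects the UMAP projection; the other entries are needed for the standard workflow.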

Quick Start

Prepare a molecule registry and run the standard workflow for the aocd collection:

hddflyzer data prepare aocd
hddflyzer pipeline run aocd

The aocd value is the dataset tag. HDDFlyzer uses that tag to locate input data and to write outputs under results/aocd/.

Input Data

HDDFlyzer starts from a local molecular collection stored as a CSV file. By default, it looks for input collections in:

examples/

For a tag named aocd, a typical input file is:

examples/valid_metadata_aocd.csv

When you run:

hddflyzer data prepare aocd

HDDFlyzer searches the input directory for a CSV file whose filename contains the tag aocd. You can also pass an explicit CSV path:

hddflyzer data prepare aocd path/to/input.csv

The input table must contain a SMILES column. Columns whose names contain smiles or canonical_smiles are detected automatically. A compound identifier column is optional; HDDFlyzer detects identifier, id, compound_id, or molecule_id when present, and otherwise creates identifiers automatically.
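The column rules above can be illustrated with a short sketch. It is not HDDFlyzer's implementation: the function detect_columns is hypothetical, the case-insensitive matching is an assumption, and the two sample rows (ethanol and benzene) are made up for the example.

```python
import csv
import io

# Identifier column names listed in the documentation above.
ID_CANDIDATES = ("identifier", "id", "compound_id", "molecule_id")


def detect_columns(header):
    """Return (smiles_column, id_column_or_None) for a CSV header.

    Illustrative only: matching is case-insensitive here, which is an
    assumption rather than documented HDDFlyzer behaviour.
    """
    smiles = next((col for col in header if "smiles" in col.lower()), None)
    if smiles is None:
        raise ValueError("the input table must contain a SMILES column")
    ident = next((col for col in header if col.lower() in ID_CANDIDATES), None)
    return smiles, ident


# A made-up two-row input table held in memory instead of a file on disk.
sample = io.StringIO(
    "compound_id,canonical_smiles,source\n"
    "cmp-001,CCO,vendor_a\n"
    "cmp-002,c1ccccc1,vendor_b\n"
)
reader = csv.DictReader(sample)
smiles_col, id_col = detect_columns(reader.fieldnames)
print(f"SMILES column: {smiles_col}, identifier column: {id_col}")
```

When no identifier column is found, the sketch returns None for it, which corresponds to the case where HDDFlyzer creates identifiers automatically.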

To use a different input directory, set HDDFLYZER_DATA_DIR. The first example below uses PowerShell syntax and the second a POSIX shell such as Bash:

$env:HDDFLYZER_DATA_DIR = "C:\path\to\csvs"
hddflyzer data prepare aocd

HDDFLYZER_DATA_DIR=/path/to/csvs hddflyzer data prepare aocd
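The lookup described in this section (an input directory that can be overridden through HDDFLYZER_DATA_DIR, then a CSV whose filename contains the tag) could be sketched as follows. The function find_input_csv is hypothetical, and picking the first match in alphabetical order is an assumption; how HDDFlyzer resolves several matching files is not stated here.

```python
import os
from pathlib import Path


def find_input_csv(tag, default_dir="examples"):
    """Return a CSV path in the data directory whose filename contains `tag`.

    Hypothetical helper that mirrors the documented lookup; it is not the
    code HDDFlyzer runs.
    """
    # The environment variable, when set, overrides the default directory.
    data_dir = Path(os.environ.get("HDDFLYZER_DATA_DIR", default_dir))
    matches = sorted(path for path in data_dir.glob("*.csv") if tag in path.name)
    if not matches:
        raise FileNotFoundError(f"no CSV containing {tag!r} found in {data_dir}")
    # Assumption: take the alphabetically first match when several exist.
    return matches[0]
```

With the default directory and the example file named above, find_input_csv("aocd") would return examples/valid_metadata_aocd.csv.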

Running the Workflow

The canonical workflow follows this shape:

compound collection
  -> registry
  -> descriptors and similarity
  -> dimensionality reduction
  -> visualization
  -> manifest/results

Run the full workflow with:

hddflyzer pipeline run aocd

You can also skip dimensionality reduction or run only selected stages:

hddflyzer pipeline run aocd --skip-dimred
hddflyzer pipeline run aocd --stages chem.features,chem.pruning
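When the pipeline is driven from a script, the command lines above can be assembled programmatically. The sketch below only builds the argument list and uses only the flags shown in this section; the helper name pipeline_command is hypothetical, and whether --skip-dimred and --stages may be combined in one call is not stated here.

```python
import shlex


def pipeline_command(tag, stages=None, skip_dimred=False):
    """Build the argument list for `hddflyzer pipeline run`.

    Hypothetical helper; it only composes the flags shown in this page.
    """
    cmd = ["hddflyzer", "pipeline", "run", tag]
    if skip_dimred:
        cmd.append("--skip-dimred")
    if stages:
        # Stage names are passed as one comma-separated value.
        cmd += ["--stages", ",".join(stages)]
    return cmd


cmd = pipeline_command("aocd", stages=["chem.features", "chem.pruning"])
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run(cmd, check=True) would execute it, provided the hddflyzer command is available on the PATH.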

Understanding the Result Folder

All workflow outputs are written under:

results/<tag>/

For aocd, this becomes:

results/aocd/

The preparation step creates the canonical molecule registry:

results/aocd/registry/molecules.csv

This registry records stable identifiers, raw and canonical SMILES, validity flags, source provenance, and row-level input metadata. Downstream descriptor, similarity, dimensionality-reduction, and visualization steps use this registry as the shared molecule base.
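Because the registry is a plain CSV file, it can be inspected with the standard library. The sketch below reads only the header and the row count, since the exact column names are not listed on this page; summarize_registry is a hypothetical helper, and UTF-8 encoding is an assumption.

```python
import csv
from pathlib import Path


def summarize_registry(path):
    """Return (column_names, row_count) for a registry CSV file.

    Hypothetical helper; it makes no assumption about specific column names.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        n_rows = sum(1 for _ in reader)
    return columns, n_rows


# Path taken from the documentation above; skipped if no run exists yet.
registry = Path("results") / "aocd" / "registry" / "molecules.csv"
if registry.exists():
    columns, n_rows = summarize_registry(registry)
    print(f"{n_rows} molecules, columns: {', '.join(columns)}")
```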

Important result files include:

manifest.json
workflow_summary.md
registry, chemistry, feature, dimensionality-reduction, and figure outputs
operation metadata

Representative outputs include:

canonical molecule registry;
descriptor tables;
Tanimoto similarity matrix;
PCA, t-SNE, and UMAP projection coordinates;
figures;
result manifest and workflow summary.
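To get an overview of what a run produced, the result folder can be walked and its files grouped by top-level subfolder. This is a sketch with a hypothetical helper name (outputs_by_category); it assumes nothing about the subfolder names beyond what exists on disk.

```python
from collections import defaultdict
from pathlib import Path


def outputs_by_category(run_dir):
    """Group files under a run directory by their top-level subfolder.

    Files directly in the run directory (for example manifest.json) are
    listed under "(top level)". Hypothetical helper for illustration.
    """
    run_dir = Path(run_dir)
    groups = defaultdict(list)
    for path in sorted(run_dir.rglob("*")):
        if path.is_file():
            rel = path.relative_to(run_dir)
            category = rel.parts[0] if len(rel.parts) > 1 else "(top level)"
            groups[category].append(rel.as_posix())
    return dict(groups)


run_dir = Path("results") / "aocd"
if run_dir.exists():
    for category, files in outputs_by_category(run_dir).items():
        print(f"{category}: {len(files)} file(s)")
```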

Workflow Modules

Data

hddflyzer data prepare builds the canonical molecule registry from a local collection.

Chemistry

hddflyzer chem computes descriptors and Tanimoto similarity and performs feature pruning.
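For readers unfamiliar with the Tanimoto coefficient, it is the number of fingerprint bits two molecules share divided by the number of bits set in either. The sketch below computes it on plain Python sets of bit positions. The fingerprints are made-up toy values; which fingerprint type HDDFlyzer uses is not stated on this page.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A & B| / |A | B| of two sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / len(union)


# Toy fingerprints (made-up bit positions, not real molecules).
fingerprints = {
    "mol_1": {1, 4, 7, 9},
    "mol_2": {1, 4, 8},
    "mol_3": {2, 3},
}

names = sorted(fingerprints)
# Symmetric similarity matrix with 1.0 on the diagonal.
matrix = [
    [tanimoto(fingerprints[row], fingerprints[col]) for col in names]
    for row in names
]
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```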

Dim. reduction & viz

hddflyzer dimred and hddflyzer viz run PCA, t-SNE, and UMAP projections and generate figures.

Module	Subcommand	Description	Output category
`data`	`prepare`	Build canonical molecule registry	`registry`
`chem`	`features`	Compute molecular descriptors	`chem`
`chem`	`tanimoto`	Compute Tanimoto similarity matrix	`chem`
`chem`	`pruning`	Prune low-variance and correlated features	`chem`
`dimred`	`pca`	PCA projection	`dimred`
`dimred`	`tsne`	t-SNE projection	`dimred`
`dimred`	`umap`	UMAP projection	`dimred`
`viz`	`pca analysis`	Generate PCA figures	`figures`

Common CLI Commands

# Data preparation
hddflyzer data prepare aocd
hddflyzer data prepare aocd path/to/input.csv

# Pipeline control
hddflyzer pipeline run aocd
hddflyzer pipeline run aocd --skip-dimred
hddflyzer pipeline run aocd --stages chem.features,chem.pruning

# Module-level commands
hddflyzer chem tanimoto aocd
hddflyzer chem features aocd
hddflyzer chem pruning aocd
hddflyzer dimred pca aocd
hddflyzer viz pca analysis aocd

Current Scope and Boundaries

HDDFlyzer is not currently:

a docking workflow;
a web dashboard;
a cloud or server workflow;
an automatic clustering system;
an enrichment workflow;
an automatic chemical interpretation engine;
a published PyPI package or public release until the v0.1.5 tag and manual publishing workflow have completed successfully.

Safety Notes

Security defaults

Pickle loading is blocked by default in all public loaders. Use allow_pickle=True only with trusted local files.
Run tags reject empty values, path traversal, absolute paths, and path separators.
Reconstructed artifacts must remain inside results/<tag>/.
update_manifest() rejects files outside the run directory.
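The tag and path rules above can be illustrated with a short sketch. It is not HDDFlyzer's code: validate_tag and ensure_inside are hypothetical names, and the exact checks and error messages are assumptions modelled on the rules listed here.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def validate_tag(tag):
    """Reject empty tags, absolute paths, path separators, and traversal."""
    if not tag or not tag.strip():
        raise ValueError("tag must not be empty")
    # Absolute POSIX paths and anything with a Windows drive or UNC prefix.
    if PurePosixPath(tag).is_absolute() or PureWindowsPath(tag).drive:
        raise ValueError("tag must not be an absolute path")
    if "/" in tag or "\\" in tag:
        raise ValueError("tag must not contain path separators")
    if tag in {".", ".."}:
        raise ValueError("tag must not be a path traversal component")
    return tag


def ensure_inside(run_dir, candidate):
    """Resolve `candidate` against `run_dir` and refuse paths that escape it."""
    root = Path(run_dir).resolve()
    target = (root / candidate).resolve()
    if not target.is_relative_to(root):
        raise ValueError(f"{candidate!r} is outside {root}")
    return target


run_dir = Path("results") / validate_tag("aocd")
print(ensure_inside(run_dir, "registry/molecules.csv"))
```

Resolving both paths before comparing them means that inputs such as ../other/file.csv or an absolute path are rejected even though they are syntactically valid.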