API Overview and Reference

This page shows how to use HDDFlyzer programmatically from Python, then lists the current public API surface generated from NumPy-style docstrings.

HDDFlyzer is local pre-release research software. The objects below represent the supported Python surface at this stage; internal helpers are intentionally not listed here.


API Layers

Layer          Purpose
Pipeline       Execute the workflow and inspect execution summaries.
Results        Reconstruct completed runs and select generated artifacts.
Science        Wrap loaded artifacts as descriptor, similarity, or projection spaces.
Visualization  Resolve visualization inputs and plot from reconstructed results.

Execute a Workflow from Python

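The workflow engine can be called from Python at three levels of control. Each function below runs the pipeline, so in practice you would call only the one whose return type you need: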

from hddflyzer.pipeline import execute_workflow, run_workflow, run_pipeline

# High-level: returns WorkflowExecution
execution = execute_workflow("aocd")

# Mid-level: returns reconstructed WorkflowRun
run = run_workflow("aocd")

# Low-level: returns list[StageResult]
results = run_pipeline("aocd")

Reconstruct Completed Runs

Completed runs can be reconstructed from results/<tag>/manifest.json:

from hddflyzer.results import load_workflow_run

run = load_workflow_run("aocd")

run.workflow_contract
run.outputs(category="chem")
run.outputs(category="dimred")

The reconstructed run exposes the workflow contract, current outputs, output categories, artifact metadata, and semantic artifact selectors.

Select and Load Artifacts

Artifacts can be selected by semantic kind:

artifact = run.artifact(
    kind="descriptor_table",
    category="chem",
    operation="chem.features",
)

loaded = run.load_artifact(
    kind="descriptor_table",
    category="chem",
    operation="chem.features",
)

loaded.data
loaded.metadata

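Pass required="path/fragment.csv" when a kind/category query matches more than one artifact. run.artifact() and run.load_artifact() raise ValueError on multiple matches, and the path fragment narrows the selection to a single file.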
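To illustrate why a path fragment helps, here is a minimal self-contained sketch of the selection rule. It does not call HDDFlyzer: select_one and the example paths are hypothetical, and plain substring matching is only an assumption about how a fragment might be compared. The exception types mirror those documented for WorkflowRun.artifact.

```python
def select_one(paths, required=None):
    """Return the single path that contains every required fragment."""
    if required is None:
        fragments = []
    elif isinstance(required, str):
        fragments = [required]
    else:
        fragments = list(required)
    # Assumption: a fragment matches when it is a substring of the path.
    matches = [p for p in paths if all(f in p for f in fragments)]
    if not matches:
        raise FileNotFoundError(f"no artifact matches {fragments!r}")
    if len(matches) > 1:
        raise ValueError(f"{len(matches)} artifacts match; add a 'required' fragment")
    return matches[0]


# Hypothetical manifest outputs that share the same kind and category.
outputs = [
    "chem/features/full/features.csv",
    "chem/features/sample/features.csv",
]

try:
    select_one(outputs)
except ValueError as exc:
    ambiguous_message = str(exc)

chosen = select_one(outputs, required="features/full/features.csv")
print(chosen)
```

Without a fragment the query is ambiguous and fails loudly; with it, exactly one path remains.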

Supported artifact kinds include:

Kind                    Description
molecule_registry       Canonical molecule registry
descriptor_table        Molecular descriptor matrix
tanimoto_matrix         Pairwise Tanimoto similarity
projection_coordinates  PCA / t-SNE / UMAP coordinates
figure                  Generated plot files
metadata                Operation metadata
workflow_summary        Human-readable run summary
unknown                 Unclassified artifact

Scientific Views

Loaded artifacts can be wrapped as lightweight scientific spaces:

  • DescriptorSpace
  • SimilaritySpace
  • ProjectionSpace

WorkflowRun can load scientific views over existing artifacts:

descriptors = run.descriptor_space(
    category="chem",
    operation="chem.features",
)

similarity = run.similarity_space(category="chem")

projection = run.projection_space(
    category="dimred",
    operation="dimred.pca",
)

These views expose molecule identifiers when available:

descriptors.molecule_ids
similarity.molecule_ids
projection.molecule_ids

These views operate on existing artifacts. They do not recalculate descriptors, similarity matrices, or projections, and they do not create plots.

The science layer also provides molecule identity helpers, cross-space alignment, structural metrics, descriptor-projection correlation, neighborhood preservation, and group comparison for explicitly defined groups.

Alignment and Science Helpers

from hddflyzer.science import (
    align_spaces,
    compare_descriptor_groups,
    descriptor_projection_correlations,
    projection_neighborhood_preservation,
    similarity_projection_correlation,
    similarity_projection_neighbor_overlap,
)

descriptors_aligned, projection_aligned = align_spaces(descriptors, projection)

global_corr      = similarity_projection_correlation(similarity, projection)
neighbor_overlap = similarity_projection_neighbor_overlap(similarity, projection, k=10)
desc_ranking     = descriptor_projection_correlations(descriptors, projection)
local_preserv    = projection_neighborhood_preservation(similarity, projection, k=10)
group_diff       = compare_descriptor_groups(descriptors, labels="class_label")

Note: These helpers operate on existing artifacts. They do not recalculate descriptors, similarity, or projections; they do not generate plots; and they do not perform automatic clustering, enrichment, or chemical interpretation.

Visualization from Reconstructed Results

from hddflyzer.viz import resolve_viz_inputs, plot_hddf_scatters

inputs = resolve_viz_inputs(
    run,
    kind="descriptor_table",
    category="chem",
    required="features/full/features.csv",
)

plot_hddf_scatters(inputs)

plot_hddf_scatters() also accepts a loaded descriptor-table artifact directly.


Reference

Pipeline

PipelineContext dataclass

Runtime options shared by pipeline stages.

Attributes:

Name Type Description
tag str

Dataset tag for the run. Pipeline stages use this tag to resolve inputs and outputs under results/<tag>/.

save_pickle bool, default=False

Whether stages that support optional pickle output should write it.

continue_on_error bool, default=False

Whether the pipeline runner should continue after a failed stage.

options dict

Additional stage options. The core pipeline keeps this mapping generic so stages can receive small, stage-specific values without changing the shared context contract.


StageResult dataclass

Result returned by a pipeline stage.

Attributes:

Name Type Description
name str

Stage name, for example "chem.features".

ok bool

True when the stage completed successfully.

message str, default=""

Optional human-readable status or error message.


Stage

Bases: Protocol

Executable pipeline stage interface.

Stages are small objects with a name and a run method. They are consumed by run_pipeline and return StageResult instances.

Attributes:

Name Type Description
name str

Unique stage name used for selection and reporting.

run(context)

Execute the stage with the given pipeline context.

Parameters:

Name Type Description Default
context PipelineContext

Shared runtime options for the current workflow execution.

required

Returns:

Type Description
StageResult

Result describing whether the stage succeeded.
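As a rough illustration of the Stage interface, the sketch below defines a stage that satisfies the protocol structurally. PipelineContext, StageResult, and Stage are stand-ins that copy the documented field names so the example runs on its own; EchoStage, run_stages, and the "greeting" option are hypothetical and not part of HDDFlyzer.

```python
from dataclasses import dataclass, field
from typing import Protocol


# Stand-ins mirroring the documented fields; the real classes may differ.
@dataclass
class PipelineContext:
    tag: str
    save_pickle: bool = False
    continue_on_error: bool = False
    options: dict = field(default_factory=dict)


@dataclass
class StageResult:
    name: str
    ok: bool
    message: str = ""


class Stage(Protocol):
    name: str

    def run(self, context: PipelineContext) -> StageResult: ...


class EchoStage:
    """Hypothetical stage that only reports the tag it received."""

    name = "demo.echo"

    def run(self, context: PipelineContext) -> StageResult:
        greeting = context.options.get("greeting", "hello")
        return StageResult(name=self.name, ok=True, message=f"{greeting} {context.tag}")


def run_stages(stages: list[Stage], context: PipelineContext) -> list[StageResult]:
    """Run stages in order; stop at the first failure unless told to continue."""
    results = []
    for stage in stages:
        result = stage.run(context)
        results.append(result)
        if not result.ok and not context.continue_on_error:
            break
    return results


results = run_stages([EchoStage()], PipelineContext(tag="aocd"))
print(results[0])
```

Because Stage is a Protocol, EchoStage does not need to inherit from it; having a name and a compatible run method is enough.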


WorkflowExecution dataclass

Complete programmatic result of a workflow execution.

Attributes:

Name Type Description
tag str

Sanitized dataset tag that was executed.

stage_results list of StageResult

Per-stage execution results returned by run_pipeline.

run WorkflowRun or None

Reconstructed workflow run when manifest.json could be loaded. None means execution produced stage results but the run could not be reconstructed.

failed_stages property

list of StageResult: Stage results with ok set to False.

ok property

bool: Whether every recorded stage succeeded.


run_pipeline(tag, stage_names=None, include_sample=True, include_dimred=True, save_pickle=False, continue_on_error=False)

Run an ordered HDDFlyzer pipeline for a dataset tag.

Parameters:

Name Type Description Default
tag str

Dataset tag to process.

required
stage_names iterable of str

Stage names to run. When omitted, the default stage sequence is used.

None
include_sample bool

Whether to include the Tanimoto sampling stage in the default stage sequence.

True
include_dimred bool

Whether to include dimensionality-reduction stages in the default stage sequence.

True
save_pickle bool

Whether stages that support optional pickle output should write it.

False
continue_on_error bool

Whether to continue executing stages after a failed stage.

False

Returns:

Type Description
list of StageResult

Per-stage results in execution order.

Raises:

Type Description
ValueError

If stage_names contains an unknown stage name.

Notes

This function performs the file-based workflow and writes the normal HDDFlyzer outputs under results/<tag>/. It returns stage status only; use execute_workflow when both status and reconstructed results are needed.


run_workflow(tag, stage_names=None, include_sample=True, include_dimred=True, save_pickle=False, continue_on_error=False)

Run a pipeline and return the reconstructed workflow run.

Parameters:

Name Type Description Default
tag str

Dataset tag to process.

required
stage_names iterable of str

Stage names to run. When omitted, the default stage sequence is used.

None
include_sample bool

Whether to include the Tanimoto sampling stage.

True
include_dimred bool

Whether to include dimensionality-reduction stages.

True
save_pickle bool

Whether stages that support optional pickle output should write it.

False
continue_on_error bool

Whether to continue after failed stages.

False

Returns:

Type Description
WorkflowRun

Reconstructed run loaded from results/<tag>/manifest.json.

Raises:

Type Description
RuntimeError

If one or more stages fail and continue_on_error is False.

FileNotFoundError

If the run manifest cannot be found after execution.

ValueError

If the run manifest is invalid or cannot be reconstructed.


execute_workflow(tag, stage_names=None, include_sample=True, include_dimred=True, save_pickle=False, continue_on_error=False)

Run a pipeline and return execution status plus reconstructed results.

Parameters:

Name Type Description Default
tag str

Dataset tag to process.

required
stage_names iterable of str

Stage names to run. When omitted, the default stage sequence is used.

None
include_sample bool

Whether to include the Tanimoto sampling stage.

True
include_dimred bool

Whether to include dimensionality-reduction stages.

True
save_pickle bool

Whether stages that support optional pickle output should write it.

False
continue_on_error bool

Whether to continue after failed stages.

False

Returns:

Type Description
WorkflowExecution

Execution object containing stage results, global success status, and the reconstructed WorkflowRun when available.

Notes

execute_workflow is the common programmatic contract used by the CLI. It does not hide stage failures; inspect execution.stage_results and execution.ok for status.
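A caller might turn an execution into a short report along these lines. The sketch relies only on the documented attribute names (tag, ok, failed_stages, run); summarize_execution is a hypothetical helper, and the SimpleNamespace objects stand in for a real WorkflowExecution so the example runs without the package.

```python
from types import SimpleNamespace


def summarize_execution(execution):
    """Build a short text report from an execution-like object."""
    lines = [f"tag={execution.tag} ok={execution.ok}"]
    for result in execution.failed_stages:
        lines.append(f"FAILED {result.name}: {result.message}")
    if execution.run is None:
        lines.append("run could not be reconstructed")
    return lines


# Stand-in data; a real object would come from execute_workflow(...).
failed = SimpleNamespace(name="dimred.pca", ok=False, message="demo failure")
passed = SimpleNamespace(name="chem.features", ok=True, message="")
execution = SimpleNamespace(
    tag="aocd",
    stage_results=[passed, failed],
    failed_stages=[failed],
    ok=False,
    run=None,
)

report = summarize_execution(execution)
print("\n".join(report))
```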


Results

ResultArtifact dataclass

A result file with workflow and scientific semantics.

Attributes:

Name Type Description
path Path

Absolute or resolved filesystem path for the artifact.

relative_path str

Manifest-relative path inside results/<tag>/.

category str

Workflow area, such as "chem", "dimred", or "viz/figures".

kind str

Semantic artifact kind, for example "descriptor_table" or "projection_coordinates".

operation str or None, default=None

Manifest operation that produced the artifact, when known.

metadata dict or None, default=None

Operation metadata associated with the artifact, when available.


LoadedArtifact dataclass

Loaded data plus its semantic result artifact.

Attributes:

Name Type Description
artifact ResultArtifact

Artifact that was loaded.

data Any

Loaded Python object. The type depends on artifact.kind.

metadata dict

Loader metadata and artifact metadata useful for traceability.


classify_artifact(relative_path, operation=None)

Classify a manifest output path into a semantic artifact kind.

Parameters:

Name Type Description Default
relative_path str

Manifest-relative output path.

required
operation str

Operation name that produced the output, when known.

None

Returns:

Type Description
str

Semantic artifact kind. Unknown paths return "unknown".


load_artifact(artifact, allow_pickle=False)

Load a supported result artifact into Python data.

Parameters:

Name Type Description Default
artifact ResultArtifact

Semantic artifact to load.

required
allow_pickle bool

Whether pickle-backed table artifacts may be loaded. Pickle is disabled by default and should be enabled only for trusted local files.

False

Returns:

Type Description
LoadedArtifact

Loaded data, metadata, and the source artifact.

Raises:

Type Description
FileNotFoundError

If the artifact file or required companion file is missing.

ValueError

If the artifact kind is unsupported, the file format is invalid, pickle loading is not allowed, or the loaded data violates its minimum contract.

Notes

Supported loaded kinds include descriptor tables, projection coordinates, molecule registries, metadata JSON, workflow summaries, and Tanimoto matrices. Tanimoto matrices are loaded with numpy.load(..., allow_pickle=False).


WorkflowRun dataclass

Queryable view of a completed HDDFlyzer run.

WorkflowRun reconstructs a completed run from an existing results/<tag>/manifest.json file. It does not execute pipeline stages, recalculate outputs, or create new state.

Attributes:

Name Type Description
tag str

Dataset tag recorded in the manifest.

manifest_path Path

Path to the reconstructed manifest.json file.

manifest dict

Parsed manifest content.

current_outputs property

list of str: Currently registered manifest output paths.

operations property

list of dict: Operation records from the manifest.

output_categories property

dict: Output paths grouped by workflow area.

workflow_contract property

dict: Workflow contract recorded in the manifest.

artifact(kind=None, category=None, operation=None, required=None)

Return exactly one artifact matching the requested filters.

Parameters:

Name Type Description Default
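kind str | None

Semantic artifact kind to select. Passed to artifacts.

None
category str | None

Output category to select. Passed to artifacts.

None
operation str | None

Producing operation to select. Passed to artifacts.

None
required str | None

Required path fragment used to disambiguate outputs. Passed to artifacts.

None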

Returns:

Type Description
ResultArtifact

The single matching artifact.

Raises:

Type Description
FileNotFoundError

If no artifact matches the filters.

ValueError

If multiple artifacts match the filters.

artifacts(kind=None, category=None, operation=None, required=None)

Return semantic result artifacts derived from manifest outputs.

Parameters:

Name Type Description Default
kind str

Semantic artifact kind to select.

None
category str

Output category to select.

None
operation str

Producing operation to select.

None
required str or iterable of str

Required path fragment or fragments used to disambiguate outputs.

None

Returns:

Type Description
list of ResultArtifact

Matching artifacts with resolved paths, categories, kinds, and operation metadata.

Raises:

Type Description
ValueError

If a manifest output path is absolute, contains traversal, or would resolve outside the run directory.
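The three rejected cases (absolute path, traversal, escape from the run directory) can be checked with pathlib alone. This is a minimal sketch under stated assumptions, not the package's validation code: resolve_output and the example run directory are hypothetical.

```python
from pathlib import Path, PurePosixPath


def resolve_output(run_dir, relative_path):
    """Resolve a manifest-relative path, rejecting unsafe values."""
    candidate = PurePosixPath(relative_path)
    if candidate.is_absolute():
        raise ValueError(f"absolute output path: {relative_path!r}")
    if ".." in candidate.parts:
        raise ValueError(f"path traversal in output path: {relative_path!r}")
    root = Path(run_dir).resolve()
    resolved = root.joinpath(*candidate.parts).resolve()
    # A symlink could still point outside the run directory, so the
    # containment check runs on the fully resolved path.
    if not resolved.is_relative_to(root):
        raise ValueError(f"output escapes run directory: {relative_path!r}")
    return resolved


run_dir = Path("results") / "aocd"  # hypothetical run directory
safe = resolve_output(run_dir, "chem/features/full/features.csv")

rejected = []
for bad in ("/etc/passwd", "../other/manifest.json"):
    try:
        resolve_output(run_dir, bad)
    except ValueError:
        rejected.append(bad)

print(safe)
print(rejected)
```

Path.is_relative_to requires Python 3.9 or later.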

descriptor_space(category=None, operation=None, required=None, allow_pickle=False)

Load a descriptor-table artifact as a scientific descriptor space.

Parameters:

Name Type Description Default
category str | None

Artifact filters used to select one descriptor table.

None
operation str | None

Artifact filters used to select one descriptor table.

None
required str | None

Artifact filters used to select one descriptor table.

None
allow_pickle bool

Whether pickle-backed descriptor tables may be loaded.

False

Returns:

Type Description
DescriptorSpace

Descriptor-space view over an existing loaded artifact.

Notes

This method does not recalculate descriptors.

load_artifact(kind=None, category=None, operation=None, required=None, allow_pickle=False)

Select and load exactly one artifact.

Parameters:

Name Type Description Default
kind str | None

Filters passed to artifact.

None
category str | None

Filters passed to artifact.

None
operation str | None

Filters passed to artifact.

None
required str | None

Filters passed to artifact.

None
allow_pickle bool

Whether pickle-backed table artifacts may be loaded. Enable only for trusted local files.

False

Returns:

Type Description
LoadedArtifact

Loaded data, metadata, and source artifact.

Raises:

Type Description
FileNotFoundError

If no artifact matches or a required file is missing.

ValueError

If multiple artifacts match, loading is unsupported, pickle loading is blocked, or loaded data is invalid.

operation_metadata(operation_name)

Return metadata for the latest recorded operation.

Parameters:

Name Type Description Default
operation_name str

Operation name, for example "chem.features".

required

Returns:

Type Description
dict or None

Operation metadata when present.

operations_by_stage(stage)

Return operation records associated with a workflow stage.

Parameters:

Name Type Description Default
stage str

Workflow stage or area, such as "chem" or "dimred".

required

Returns:

Type Description
list of dict

Matching operation records in manifest order.

outputs(category=None)

Return output paths from the reconstructed manifest.

Parameters:

Name Type Description Default
category str

Workflow category to select. When omitted, all current outputs are returned.

None

Returns:

Type Description
list of str

Manifest-relative output paths.

projection_space(category=None, operation=None, required=None, allow_pickle=False)

Load projection coordinates as a scientific projection space.

Parameters:

Name Type Description Default
category str | None

Artifact filters used to select one projection-coordinate table.

None
operation str | None

Artifact filters used to select one projection-coordinate table.

None
required str | None

Artifact filters used to select one projection-coordinate table.

None
allow_pickle bool

Whether pickle-backed projection tables may be loaded.

False

Returns:

Type Description
ProjectionSpace

Projection-space view over existing dimensionality-reduction coordinates.

Notes

This method does not recalculate PCA, t-SNE, UMAP, or other projections.

similarity_space(category=None, operation=None, required=None, allow_pickle=False)

Load a Tanimoto matrix artifact as a scientific similarity space.

Parameters:

Name Type Description Default
category str | None

Artifact filters used to select one Tanimoto matrix.

None
operation str | None

Artifact filters used to select one Tanimoto matrix.

None
required str | None

Artifact filters used to select one Tanimoto matrix.

None
allow_pickle bool

Present for API consistency. Tanimoto matrices are loaded from .npz with pickle disabled.

False

Returns:

Type Description
SimilaritySpace

Similarity-space view over an existing Tanimoto matrix.

Notes

This method does not recalculate fingerprints or similarity.

summary()

Return a compact programmatic summary of the reconstructed run.

Returns:

Type Description
dict

Summary containing tag, manifest path, operation count, current output count, output categories, and workflow contract.

to_dict()

Return a dictionary representation of this reconstructed run.

Returns:

Type Description
dict

Dictionary containing tag, manifest path, and manifest content.


load_workflow_run(tag, results_dir=None)

Load a completed run from results/<tag>/manifest.json.

Parameters:

Name Type Description Default
tag str

Run tag to reconstruct. Tags are validated and must not contain path separators, traversal, or absolute paths.

required
results_dir Path or str

Root results directory. Defaults to hddflyzer.config.settings.RESULTS_DIR.

None

Returns:

Type Description
WorkflowRun

Reconstructed run backed by the parsed manifest.

Raises:

Type Description
FileNotFoundError

If manifest.json does not exist.

ValueError

If the tag is invalid, the resolved manifest escapes results_dir, or the manifest JSON/structure is invalid.


Science

DescriptorSpace dataclass

Descriptor table interpreted as a molecular descriptor space.

Attributes:

Name Type Description
artifact ResultArtifact

Source descriptor-table artifact.

data DataFrame

Loaded descriptor table.

metadata dict

Loader and operation metadata.

n_molecules int

Number of rows in data.

feature_names tuple of str

Descriptor feature columns, excluding common identity columns.

molecule_ids property

tuple of str: Molecular identifiers when present, else empty.


SimilaritySpace dataclass

Pairwise molecular similarity matrix with aligned identifiers.

Attributes:

Name Type Description
artifact ResultArtifact

Source Tanimoto matrix artifact.

matrix ndarray

Pairwise similarity matrix.

ids tuple of str

Identifiers aligned to matrix rows and columns.

metadata dict

Loader and operation metadata.

n_molecules int

Number of molecules in the similarity matrix.

molecule_ids property

tuple of str: Molecular identifiers aligned to the matrix.


ProjectionSpace dataclass

Dimensionality-reduction coordinates for a molecular collection.

Attributes:

Name Type Description
artifact ResultArtifact

Source projection-coordinate artifact.

data DataFrame

Loaded coordinate table.

metadata dict

Loader and operation metadata.

coordinate_columns tuple of str

Numeric coordinate columns used as the projection axes.

n_molecules int

Number of rows in data.

molecule_ids property

tuple of str: Molecular identifiers when present, else empty.


to_descriptor_space(loaded)

Convert a loaded descriptor table artifact into a descriptor space.

Parameters:

Name Type Description Default
loaded LoadedArtifact

Loaded artifact with kind "descriptor_table".

required

Returns:

Type Description
DescriptorSpace

Scientific view over the loaded descriptor table.

Raises:

Type Description
ValueError

If the artifact kind or data type is incompatible.

Notes

This converter wraps existing loaded data and does not recalculate descriptors.


to_similarity_space(loaded)

Convert a loaded Tanimoto matrix artifact into a similarity space.

Parameters:

Name Type Description Default
loaded LoadedArtifact

Loaded artifact with kind "tanimoto_matrix".

required

Returns:

Type Description
SimilaritySpace

Scientific view over the loaded similarity matrix.

Raises:

Type Description
ValueError

If the artifact kind, data type, or ID alignment is incompatible.

Notes

This converter wraps an existing matrix and does not recalculate fingerprints or similarity.


to_projection_space(loaded)

Convert loaded projection coordinates into a projection space.

Parameters:

Name Type Description Default
loaded LoadedArtifact

Loaded artifact with kind "projection_coordinates".

required

Returns:

Type Description
ProjectionSpace

Scientific view over existing projection coordinates.

Raises:

Type Description
ValueError

If the artifact kind, data type, or coordinate columns are incompatible.

Notes

This converter does not recalculate PCA, t-SNE, UMAP, or other projections.


shared_molecule_ids(space_a, space_b)

Return shared molecule IDs, preserving first-space order.

Parameters:

Name Type Description Default
space_a object

First space. Any object exposing a molecule_ids attribute; its order is preserved in the result.

required
space_b object

Second space. Any object exposing a molecule_ids attribute.

required

Returns:

Type Description
tuple of str

IDs present in both spaces, ordered as in space_a.


has_aligned_molecule_ids(space_a, space_b)

Return whether two spaces have the same non-empty molecule ID order.

Parameters:

Name Type Description Default
space_a object

First space. Any object exposing a molecule_ids attribute.

required
space_b object

Second space. Any object exposing a molecule_ids attribute.

required

Returns:

Type Description
bool

True only when both ID sequences are non-empty and identical.
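The behavior documented for the two identity helpers above can be sketched in a few lines. The functions shared_ids and same_order are hypothetical re-implementations written from the descriptions, and the SimpleNamespace objects are stand-ins for real spaces.

```python
from types import SimpleNamespace


def shared_ids(space_a, space_b):
    """IDs present in both spaces, in the order they appear in space_a."""
    in_b = set(space_b.molecule_ids)
    return tuple(mid for mid in space_a.molecule_ids if mid in in_b)


def same_order(space_a, space_b):
    """True only for identical, non-empty ID sequences."""
    ids_a = tuple(space_a.molecule_ids)
    ids_b = tuple(space_b.molecule_ids)
    return bool(ids_a) and ids_a == ids_b


# Stand-in spaces: anything exposing molecule_ids works for this sketch.
a = SimpleNamespace(molecule_ids=("m3", "m1", "m2"))
b = SimpleNamespace(molecule_ids=("m1", "m2", "m4"))
empty = SimpleNamespace(molecule_ids=())

print(shared_ids(a, b))          # ('m1', 'm2')
print(same_order(a, b))          # False
print(same_order(a, a))          # True
print(same_order(empty, empty))  # False
```

Two empty ID sequences compare equal as tuples, which is why the non-empty check is needed.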


align_spaces(*spaces)

Return spaces filtered and reordered to shared molecule IDs.

Parameters:

Name Type Description Default
*spaces DescriptorSpace, ProjectionSpace, or SimilaritySpace

Two or more scientific spaces to align.

()

Returns:

Type Description
tuple

New space instances of the same types, filtered to shared IDs and ordered according to the first space.

Raises:

Type Description
ValueError

If fewer than two spaces are provided, any space type is unsupported, any space has empty molecule_ids, or no IDs are shared.

Notes

Alignment uses existing data only. It does not recalculate descriptors, similarity matrices, or projections, and it does not mutate the input spaces.
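The alignment contract (filter to shared IDs, follow the first input's order, leave inputs untouched) is sketched below on plain lists. align_rows and the example data are hypothetical, and the sketch assumes molecule IDs are unique within each input.

```python
def align_rows(ids_a, rows_a, ids_b, rows_b):
    """Filter both tables to shared IDs, ordered as in the first table."""
    if not ids_a or not ids_b:
        raise ValueError("both inputs need molecule IDs")
    index_a = {mid: i for i, mid in enumerate(ids_a)}
    index_b = {mid: i for i, mid in enumerate(ids_b)}
    shared = [mid for mid in ids_a if mid in index_b]
    if not shared:
        raise ValueError("no shared molecule IDs")
    # New lists are built; the inputs are not modified.
    new_a = [rows_a[index_a[mid]] for mid in shared]
    new_b = [rows_b[index_b[mid]] for mid in shared]
    return shared, new_a, new_b


desc_ids = ["m1", "m2", "m3"]
desc_rows = [[1.0, 0.2], [2.0, 0.4], [3.0, 0.6]]   # hypothetical descriptors
proj_ids = ["m3", "m1", "m9"]
proj_rows = [[0.5, -0.5], [0.1, 0.9], [7.0, 7.0]]  # hypothetical coordinates

shared, desc_aligned, proj_aligned = align_rows(
    desc_ids, desc_rows, proj_ids, proj_rows
)
print(shared)
print(desc_aligned)
print(proj_aligned)
```

After alignment, row i of each table refers to the same molecule, which is the precondition for the correlation and neighborhood metrics.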


SpaceMetricResult dataclass

Scalar result of a structural comparison between scientific spaces.

Attributes:

Name Type Description
name str

Metric name.

value float

Scalar metric value. Some metrics may return nan when a correlation is undefined.

metadata dict

Metric metadata such as molecule counts, pair counts, or coordinate columns.


DescriptorProjectionCorrelationResult dataclass

Ranked descriptor/projection coordinate correlations.

Attributes:

Name Type Description
data DataFrame

Ranked table with descriptor-coordinate correlation rows.

metadata dict

Result metadata including molecule count, feature names, and coordinate columns.

top_features(n=10)

Return the top ranked descriptor-coordinate rows.

Parameters:

Name Type Description Default
n int

Number of rows to return.

10

Returns:

Type Description
DataFrame

Copy of the top n rows.

Raises:

Type Description
ValueError

If n is not a positive integer.


NeighborhoodPreservationResult dataclass

Per-molecule neighborhood preservation diagnostics.

Attributes:

Name Type Description
data DataFrame

Per-molecule table with neighbor overlap counts, fractions, and neighbor ID lists.

metadata dict

Result metadata including molecule count, k, and coordinate columns.

worst_preserved(n=10)

Return molecules with the lowest neighborhood overlap.

Parameters:

Name Type Description Default
n int

Number of rows to return.

10

Returns:

Type Description
DataFrame

Copy of the lowest-overlap rows, sorted by overlap fraction and molecule ID.

Raises:

Type Description
ValueError

If n is not a positive integer.


DescriptorGroupComparisonResult dataclass

Ranked descriptor differences for explicit groups.

Attributes:

Name Type Description
data DataFrame

Group-feature summary table ranked by absolute deviation from the global descriptor mean.

metadata dict

Result metadata including molecule count, retained groups, feature names, and minimum group size.

top_differences(n=10)

Return the top group-feature differences.

Parameters:

Name Type Description Default
n int

Number of rows to return.

10

Returns:

Type Description
DataFrame

Copy of the top n rows.

Raises:

Type Description
ValueError

If n is not a positive integer.


similarity_projection_correlation(similarity, projection)

Correlate pairwise similarity with projected-space proximity.

Parameters:

Name Type Description Default
similarity SimilaritySpace

Similarity space containing an existing pairwise similarity matrix.

required
projection ProjectionSpace

Projection space containing existing coordinates.

required

Returns:

Type Description
SpaceMetricResult

Scalar correlation result and metadata.

Raises:

Type Description
ValueError

If inputs have wrong types, molecule IDs cannot be aligned, fewer than two aligned molecules are available, or projection coordinates are insufficient.

Notes

The function aligns inputs by molecule ID and uses existing artifacts only. It does not recalculate similarity or projections.
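One plausible way to compute such a metric is shown below. The page does not state which correlation or proximity measure HDDFlyzer uses, so this sketch assumes Pearson correlation and negative Euclidean distance as the proximity; similarity_vs_proximity and the toy data are hypothetical, and inputs are assumed to be aligned already.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; nan when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


def similarity_vs_proximity(matrix, coords):
    """Correlate pairwise similarity with negative projected distance."""
    sims, proximities = [], []
    for i, j in combinations(range(len(coords)), 2):
        sims.append(matrix[i][j])
        proximities.append(-math.dist(coords[i], coords[j]))
    return pearson(sims, proximities)


# Toy data: molecules 0 and 1 are similar, as are 2 and 3.
matrix = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
coords = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)]

value = similarity_vs_proximity(matrix, coords)
undefined = pearson([1.0, 1.0, 1.0], [1.0, 2.0, 3.0])
print(value, undefined)
```

A value near 1 means similar molecules sit close together in the projection. The nan branch corresponds to the undefined-correlation case mentioned for SpaceMetricResult.value.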


similarity_projection_neighbor_overlap(similarity, projection, k=10)

Return mean overlap between similarity and projection neighbors.

Parameters:

Name Type Description Default
similarity SimilaritySpace

Similarity space containing an existing pairwise similarity matrix.

required
projection ProjectionSpace

Projection space containing existing coordinates.

required
k int

Number of neighbors to compare for each molecule.

10

Returns:

Type Description
SpaceMetricResult

Mean neighbor-overlap fraction and metadata.

Raises:

Type Description
ValueError

If inputs have wrong types, molecule IDs cannot be aligned, k is invalid, or projection coordinates are insufficient.

Notes

The function uses existing similarity and projection artifacts only. It does not perform clustering or automatic chemical interpretation.
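The idea of neighbor overlap can be sketched as follows: for each molecule, take its k most similar molecules and its k closest projected molecules, then average the shared fraction. The tie-breaking rule, the bounds on k, and the names top_k and mean_neighbor_overlap are assumptions for illustration only.

```python
import math


def top_k(scores, k, exclude):
    """Indices of the k largest scores, skipping the molecule itself."""
    order = sorted(
        (i for i in range(len(scores)) if i != exclude),
        key=lambda i: (-scores[i], i),  # ties broken by index
    )
    return set(order[:k])


def mean_neighbor_overlap(matrix, coords, k):
    """Mean fraction of similarity neighbors that are also projection neighbors."""
    n = len(coords)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of molecules")
    fractions = []
    for i in range(n):
        sim_neighbors = top_k(matrix[i], k, exclude=i)
        closeness = [-math.dist(coords[i], coords[j]) for j in range(n)]
        proj_neighbors = top_k(closeness, k, exclude=i)
        fractions.append(len(sim_neighbors & proj_neighbors) / k)
    return sum(fractions) / n


matrix = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
faithful = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
scrambled = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.2, 5.0)]

good = mean_neighbor_overlap(matrix, faithful, k=1)
bad = mean_neighbor_overlap(matrix, scrambled, k=1)
print(good, bad)
```

Keeping the per-molecule fractions instead of averaging them gives the kind of table described for NeighborhoodPreservationResult.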


descriptor_projection_correlations(descriptors, projection)

Correlate numeric descriptors with projection coordinates.

Parameters:

Name Type Description Default
descriptors DescriptorSpace

Descriptor space containing numeric descriptor columns.

required
projection ProjectionSpace

Projection space with at least two coordinate columns.

required

Returns:

Type Description
DescriptorProjectionCorrelationResult

Ranked descriptor-coordinate correlations.

Raises:

Type Description
ValueError

If inputs have wrong types, molecule IDs cannot be aligned, no numeric descriptor features exist, or the projection lacks sufficient coordinates.

Notes

Inputs are aligned by molecule ID before calculation. The function uses existing descriptor values and projection coordinates only; it does not recalculate descriptors or projections and does not make automatic chemical interpretations.
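A ranking of this kind might be built as sketched here. The sketch assumes Pearson correlation and ranking by absolute value, neither of which is confirmed by this page; rank_correlations, the feature names, and the values are hypothetical, and the inputs are taken to be aligned already.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; nan when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


def rank_correlations(features, coordinates):
    """Rank (feature, coordinate) pairs by absolute correlation."""
    rows = []
    for feature_name, values in features.items():
        for coord_name, axis in coordinates.items():
            rows.append((feature_name, coord_name, pearson(values, axis)))

    def sort_key(row):
        r = row[2]
        # Undefined correlations sort last.
        return (1, 0.0) if math.isnan(r) else (0, -abs(r))

    rows.sort(key=sort_key)
    return rows


features = {
    "mol_weight": [100.0, 200.0, 300.0, 400.0],
    "logp": [1.0, 3.0, 2.0, 2.5],
    "constant": [1.0, 1.0, 1.0, 1.0],
}
coordinates = {
    "pc1": [-1.5, -0.5, 0.5, 1.5],
    "pc2": [0.3, -0.2, 0.1, -0.2],
}

ranked = rank_correlations(features, coordinates)
for row in ranked:
    print(row)
```

A constant descriptor has no variance, so its correlation is undefined and it lands at the bottom of the ranking.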


projection_neighborhood_preservation(similarity, projection, k=10)

Evaluate local neighbor preservation for each molecule.

Parameters:

Name Type Description Default
similarity SimilaritySpace

Similarity space containing an existing pairwise similarity matrix.

required
projection ProjectionSpace

Projection space containing existing coordinates.

required
k int

Number of neighbors to compare for each molecule.

10

Returns:

Type Description
NeighborhoodPreservationResult

Per-molecule overlap diagnostics and metadata.

Raises:

Type Description
ValueError

If inputs have wrong types, IDs cannot be aligned, k is invalid, or the projection lacks sufficient coordinates.

Notes

This diagnostic compares neighbors from existing similarity and projection spaces. It does not recalculate fingerprints, similarity, projections, or clusters.


compare_descriptor_groups(descriptors, labels, *, min_group_size=2)

Compare numeric descriptors across explicit groups.

Parameters:

Name Type Description Default
descriptors DescriptorSpace

Descriptor space containing numeric descriptor columns.

required
labels str or sequence

Group labels. A string is interpreted as a column name in descriptors.data. A sequence must be aligned with descriptors.molecule_ids and have length descriptors.n_molecules.

required
min_group_size int

Minimum number of molecules required for a group to be included.

2

Returns:

Type Description
DescriptorGroupComparisonResult

Ranked group-feature summary table and metadata.

Raises:

Type Description
ValueError

If descriptors is not a DescriptorSpace, no numeric descriptor features exist, labels are invalid, or min_group_size is invalid.

Notes

This function compares groups explicitly provided by the user or by an existing column. It does not perform clustering, enrichment, or automatic chemical interpretation.
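The ranking by deviation from the global mean can be illustrated on plain lists. This sketch assumes the global mean is taken over all molecules, including those in dropped groups, and uses raw descriptor values without scaling; the real function may differ on both points. group_deviations and the data are hypothetical.

```python
from collections import defaultdict


def group_deviations(features, labels, min_group_size=2):
    """Rank (group, feature) pairs by |group mean - global mean|."""
    if min_group_size < 1:
        raise ValueError("min_group_size must be at least 1")
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)
    rows = []
    for name, values in features.items():
        if len(values) != len(labels):
            raise ValueError(f"{name!r} is not aligned with labels")
        global_mean = sum(values) / len(values)
        for label, indices in members.items():
            if len(indices) < min_group_size:
                continue  # groups below the minimum size are dropped
            group_mean = sum(values[i] for i in indices) / len(indices)
            rows.append((label, name, group_mean - global_mean))
    rows.sort(key=lambda row: -abs(row[2]))
    return rows


features = {
    "logp": [1.0, 1.2, 3.0, 3.2, 9.0],
    "rings": [1, 1, 2, 2, 5],
}
labels = ["a", "a", "b", "b", "c"]  # explicit, user-provided groups

ranked = group_deviations(features, labels, min_group_size=2)
for row in ranked:
    print(row)
```

Group "c" has a single member and is excluded by min_group_size=2, although its values still contribute to the global means in this sketch.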


Visualization

VizInputs dataclass

Resolved file inputs for visualization code.

Attributes:

Name Type Description
category str

Workflow category used to resolve inputs.

root Path

Root directory of the reconstructed run.

paths tuple of pathlib.Path

Existing input paths selected from the run manifest.

as_dict()

Return a serializable representation of the resolved inputs.

Returns:

Type Description
dict

Dictionary with category, root, and string paths.
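The point of a serializable representation is that Path objects cannot be passed to json.dumps directly. The stand-in below shows one way to produce such a dictionary; VizInputsSketch is hypothetical, and converting root to a string is an assumption.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class VizInputsSketch:
    """Stand-in with the documented attribute names."""

    category: str
    root: Path
    paths: tuple

    def as_dict(self):
        return {
            "category": self.category,
            "root": str(self.root),
            "paths": [str(p) for p in self.paths],
        }


root = Path("results") / "aocd"  # hypothetical run directory
inputs = VizInputsSketch(
    category="chem",
    root=root,
    paths=(root / "chem" / "features" / "full" / "features.csv",),
)

payload = json.dumps(inputs.as_dict())
print(payload)
```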


resolve_viz_inputs(run, category=None, required=None, kind=None)

Resolve visualization input paths from a reconstructed workflow run.

Parameters:

Name Type Description Default
run WorkflowRun

Reconstructed run whose manifest contains registered outputs.

required
category str

Output category to select, such as "chem" or "dimred".

None
required str or iterable of str

Required path fragment or fragments used to select specific files.

None
kind str

Semantic artifact kind to select through run.artifacts.

None

Returns:

Type Description
VizInputs

Existing paths suitable for visualization functions.

Raises:

Type Description
ValueError

If no outputs or artifacts match the requested category/kind.

FileNotFoundError

If required filters match nothing or a registered input path is missing on disk.

Notes

This function resolves inputs from an existing WorkflowRun. It does not create files, run pipeline stages, or generate plots.


plot_hddf_scatters(source)

Generate scatter plots for HDDF descriptor pairs.

Parameters:

Name Type Description Default
source str, VizInputs, or LoadedArtifact

Input source. A string is interpreted as a collection tag. VizInputs should contain a features CSV. LoadedArtifact must have kind "descriptor_table" and hold a pandas DataFrame.

required

Returns:

Type Description
bool

True when the plot is written successfully, otherwise False.

Notes

This function writes results/<tag>/figures/correlations/hddf_corr_scatters_trendline.png. When a reconstructed input object is supplied, the function uses existing descriptor-table data and does not recalculate descriptors.