API Overview and Reference¶

This page shows how to use HDDFlyzer programmatically from Python, then lists the current public API. The reference entries are generated from NumPy-style docstrings.

HDDFlyzer is local pre-release research software. The objects below represent the supported Python surface at this stage; internal helpers are intentionally not listed here.

API Layers¶

Layer	Purpose
Pipeline	Execute the workflow and inspect execution summaries.
Results	Reconstruct completed runs and select generated artifacts.
Science	Wrap loaded artifacts as descriptor, similarity, or projection spaces.
Visualization	Resolve visualization inputs and plot from reconstructed results.

Execute a Workflow from Python¶

The workflow engine can be called from Python at three levels of control:

from hddflyzer.pipeline import execute_workflow, run_workflow, run_pipeline

# High-level: returns WorkflowExecution
execution = execute_workflow("aocd")

# Mid-level: returns reconstructed WorkflowRun
run = run_workflow("aocd")

# Low-level: returns list[StageResult]
results = run_pipeline("aocd")

Reconstruct Completed Runs¶

Completed runs can be reconstructed from results/<tag>/manifest.json:

from hddflyzer.results import load_workflow_run

run = load_workflow_run("aocd")

run.workflow_contract
run.outputs(category="chem")
run.outputs(category="dimred")

The reconstructed run exposes the workflow contract, current outputs, output categories, artifact metadata, and semantic artifact selectors.

Select and Load Artifacts¶

Artifacts can be selected by semantic kind:

artifact = run.artifact(
    kind="descriptor_table",
    category="chem",
    operation="chem.features",
)

loaded = run.load_artifact(
    kind="descriptor_table",
    category="chem",
    operation="chem.features",
)

loaded.data
loaded.metadata

Pass required="path/fragment.csv" to disambiguate when a kind/category query matches more than one artifact.

Supported artifact kinds include:

Kind	Description
`molecule_registry`	Canonical molecule registry
`descriptor_table`	Molecular descriptor matrix
`tanimoto_matrix`	Pairwise Tanimoto similarity
`projection_coordinates`	PCA / t-SNE / UMAP coordinates
`figure`	Generated plot files
`metadata`	Operation metadata
`workflow_summary`	Human-readable run summary
`unknown`	Unclassified artifact

Scientific Views¶

Loaded artifacts can be wrapped as lightweight scientific spaces:

DescriptorSpace
SimilaritySpace
ProjectionSpace

WorkflowRun can load scientific views over existing artifacts:

descriptors = run.descriptor_space(
    category="chem",
    operation="chem.features",
)

similarity = run.similarity_space(category="chem")

projection = run.projection_space(
    category="dimred",
    operation="dimred.pca",
)

These views expose molecule identifiers when available:

descriptors.molecule_ids
similarity.molecule_ids
projection.molecule_ids

These views operate on existing artifacts. They do not recalculate descriptors, similarity matrices, or projections, and they do not create plots.

The science layer also provides molecule identity helpers, cross-space alignment, structural metrics, descriptor-projection correlation, neighborhood preservation, and group comparison for explicitly defined groups.

Alignment and Science Helpers¶

from hddflyzer.science import (
    align_spaces,
    compare_descriptor_groups,
    descriptor_projection_correlations,
    projection_neighborhood_preservation,
    similarity_projection_correlation,
    similarity_projection_neighbor_overlap,
)

descriptors_aligned, projection_aligned = align_spaces(descriptors, projection)

global_corr      = similarity_projection_correlation(similarity, projection)
neighbor_overlap = similarity_projection_neighbor_overlap(similarity, projection, k=10)
desc_ranking     = descriptor_projection_correlations(descriptors, projection)
local_preserv    = projection_neighborhood_preservation(similarity, projection, k=10)
group_diff       = compare_descriptor_groups(descriptors, labels="class_label")

Note: these helpers operate on existing artifacts. They do not recalculate descriptors, similarity, or projections; they do not generate plots; and they do not perform automatic clustering, enrichment, or chemical interpretation.

Visualization from Reconstructed Results¶

from hddflyzer.viz import resolve_viz_inputs, plot_hddf_scatters

inputs = resolve_viz_inputs(
    run,
    kind="descriptor_table",
    category="chem",
    required="features/full/features.csv",
)

plot_hddf_scatters(inputs)

plot_hddf_scatters() also accepts a loaded descriptor-table artifact directly.

Reference¶

Pipeline¶

`PipelineContext` `dataclass` ¶

Runtime options shared by pipeline stages.

Attributes:

Name	Type	Description
`tag`	`str`	Dataset tag for the run. Pipeline stages use this tag to resolve inputs and outputs under `results/<tag>/`.
`save_pickle`	`bool, default=False`	Whether stages that support optional pickle output should write it.
`continue_on_error`	`bool, default=False`	Whether the pipeline runner should continue after a failed stage.
`options`	`dict`	Additional stage options. The core pipeline keeps this mapping generic so stages can receive small, stage-specific values without changing the shared context contract.

`StageResult` `dataclass` ¶

Result returned by a pipeline stage.

Attributes:

Name	Type	Description
`name`	`str`	Stage name, for example `"chem.features"`.
`ok`	`bool`	`True` when the stage completed successfully.
`message`	`str, default=""`	Optional human-readable status or error message.

`Stage` ¶

Bases: Protocol

Executable pipeline stage interface.

Stages are small objects with a name and a run method. They are consumed by run_pipeline and return StageResult instances.

Attributes:

Name	Type	Description
`name`	`str`	Unique stage name used for selection and reporting.

`run(context)` ¶

Execute the stage with the given pipeline context.

Parameters:

Name	Type	Description	Default
`context`	`PipelineContext`	Shared runtime options for the current workflow execution.	required

Returns:

Type	Description
`StageResult`	Result describing whether the stage succeeded.
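
The stage interface described above can be sketched with stand-in types. This is only an illustration under stated assumptions: `StageResult` and `PipelineContext` below are local copies of the documented attributes rather than imports from HDDFlyzer, and `EchoStage` and `run_stages` are hypothetical names. The loop shows one plausible reading of `continue_on_error`; it is not the actual `run_pipeline` implementation.

```python
from dataclasses import dataclass, field
from typing import Protocol


# Local stand-ins mirroring the documented attributes (not HDDFlyzer imports).
@dataclass
class StageResult:
    name: str
    ok: bool
    message: str = ""


@dataclass
class PipelineContext:
    tag: str
    save_pickle: bool = False
    continue_on_error: bool = False
    options: dict = field(default_factory=dict)


class Stage(Protocol):
    """Structural interface: any object with a name and a run method."""

    name: str

    def run(self, context: PipelineContext) -> StageResult: ...


class EchoStage:
    """Hypothetical stage that fails only when listed in options['fail']."""

    def __init__(self, name: str) -> None:
        self.name = name

    def run(self, context: PipelineContext) -> StageResult:
        if self.name in context.options.get("fail", ()):
            return StageResult(self.name, ok=False, message="forced failure")
        return StageResult(self.name, ok=True)


def run_stages(stages: list, context: PipelineContext) -> list:
    """Illustrative runner: stop at the first failure unless asked to continue."""
    results = []
    for stage in stages:
        result = stage.run(context)
        results.append(result)
        if not result.ok and not context.continue_on_error:
            break
    return results
```

Because `Stage` is a protocol, `EchoStage` satisfies it without inheriting from it; only the `name` attribute and the `run` signature matter.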

`WorkflowExecution` `dataclass` ¶

Complete programmatic result of a workflow execution.

Attributes:

Name	Type	Description
`tag`	`str`	Sanitized dataset tag that was executed.
`stage_results`	`list of StageResult`	Per-stage execution results returned by `run_pipeline`.
`run`	`WorkflowRun or None`	Reconstructed workflow run when `manifest.json` could be loaded. `None` means execution produced stage results but the run could not be reconstructed.

`failed_stages` `property` ¶

list of StageResult: Stage results with ok set to False.

`ok` `property` ¶

bool: Whether every recorded stage succeeded.
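
As a rough sketch of how `ok` and `failed_stages` relate to `stage_results`, the stand-in below re-creates the documented attributes locally. It is not HDDFlyzer's class: the `describe` helper is hypothetical, and treating an empty stage list as successful is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StageResult:
    name: str
    ok: bool
    message: str = ""


@dataclass
class WorkflowExecution:
    """Local stand-in with the documented attributes and properties."""

    tag: str
    stage_results: list
    run: Optional[Any] = None  # reconstructed run, or None if unavailable

    @property
    def failed_stages(self) -> list:
        return [r for r in self.stage_results if not r.ok]

    @property
    def ok(self) -> bool:
        # Assumption: no recorded stages counts as success (all([]) is True).
        return all(r.ok for r in self.stage_results)


def describe(execution: WorkflowExecution) -> list:
    """Hypothetical helper: summarize an execution as printable lines."""
    lines = [f"{execution.tag}: {'ok' if execution.ok else 'failed'}"]
    for result in execution.failed_stages:
        lines.append(f"  {result.name}: {result.message}")
    if execution.run is None:
        lines.append("  run could not be reconstructed")
    return lines
```

Checking `run is None` separately from `ok` matters because, per the attribute table, stage results can exist even when the run could not be reconstructed.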

`run_pipeline(tag, stage_names=None, include_sample=True, include_dimred=True, save_pickle=False, continue_on_error=False)` ¶

Run an ordered HDDFlyzer pipeline for a dataset tag.

Parameters:

Name	Type	Description	Default
`tag`	`str`	Dataset tag to process.	required
`stage_names`	`iterable of str`	Stage names to run. When omitted, the default stage sequence is used.	`None`
`include_sample`	`bool`	Whether to include the Tanimoto sampling stage in the default stage sequence.	`True`
`include_dimred`	`bool`	Whether to include dimensionality-reduction stages in the default stage sequence.	`True`
`save_pickle`	`bool`	Whether stages that support optional pickle output should write it.	`False`
`continue_on_error`	`bool`	Whether to continue executing stages after a failed stage.	`False`

Returns:

Type	Description
`list of StageResult`	Per-stage results in execution order.

Raises:

Type	Description
`ValueError`	If `stage_names` contains an unknown stage name.

Notes

This function performs the file-based workflow and writes the normal HDDFlyzer outputs under results/<tag>/. It returns stage status only; use execute_workflow when both status and reconstructed results are needed.

`run_workflow(tag, stage_names=None, include_sample=True, include_dimred=True, save_pickle=False, continue_on_error=False)` ¶

Run a pipeline and return the reconstructed workflow run.

Parameters:

Name	Type	Description	Default
`tag`	`str`	Dataset tag to process.	required
`stage_names`	`iterable of str`	Stage names to run. When omitted, the default stage sequence is used.	`None`
`include_sample`	`bool`	Whether to include the Tanimoto sampling stage.	`True`
`include_dimred`	`bool`	Whether to include dimensionality-reduction stages.	`True`
`save_pickle`	`bool`	Whether stages that support optional pickle output should write it.	`False`
`continue_on_error`	`bool`	Whether to continue after failed stages.	`False`

Returns:

Type	Description
`WorkflowRun`	Reconstructed run loaded from `results/<tag>/manifest.json`.

Raises:

Type	Description
`RuntimeError`	If one or more stages fail and `continue_on_error` is `False`.
`FileNotFoundError`	If the run manifest cannot be found after execution.
`ValueError`	If the run manifest is invalid or cannot be reconstructed.

`execute_workflow(tag, stage_names=None, include_sample=True, include_dimred=True, save_pickle=False, continue_on_error=False)` ¶

Run a pipeline and return execution status plus reconstructed results.

Parameters:

Name	Type	Description	Default
`tag`	`str`	Dataset tag to process.	required
`stage_names`	`iterable of str`	Stage names to run. When omitted, the default stage sequence is used.	`None`
`include_sample`	`bool`	Whether to include the Tanimoto sampling stage.	`True`
`include_dimred`	`bool`	Whether to include dimensionality-reduction stages.	`True`
`save_pickle`	`bool`	Whether stages that support optional pickle output should write it.	`False`
`continue_on_error`	`bool`	Whether to continue after failed stages.	`False`

Returns:

Type	Description
`WorkflowExecution`	Execution object containing stage results, global success status, and the reconstructed `WorkflowRun` when available.

Notes

execute_workflow is the common programmatic contract used by the CLI. It does not hide stage failures; inspect execution.stage_results and execution.ok for status.

Results¶

`ResultArtifact` `dataclass` ¶

A result file with workflow and scientific semantics.

Attributes:

Name	Type	Description
`path`	`Path`	Absolute or resolved filesystem path for the artifact.
`relative_path`	`str`	Manifest-relative path inside `results/<tag>/`.
`category`	`str`	Workflow area, such as `"chem"`, `"dimred"`, or `"viz/figures"`.
`kind`	`str`	Semantic artifact kind, for example `"descriptor_table"` or `"projection_coordinates"`.
`operation`	`str or None, default=None`	Manifest operation that produced the artifact, when known.
`metadata`	`dict or None, default=None`	Operation metadata associated with the artifact, when available.

`LoadedArtifact` `dataclass` ¶

Loaded data plus its semantic result artifact.

Attributes:

Name	Type	Description
`artifact`	`ResultArtifact`	Artifact that was loaded.
`data`	`Any`	Loaded Python object. The type depends on `artifact.kind`.
`metadata`	`dict`	Loader metadata and artifact metadata useful for traceability.

`classify_artifact(relative_path, operation=None)` ¶

Classify a manifest output path into a semantic artifact kind.

Parameters:

Name	Type	Description	Default
`relative_path`	`str`	Manifest-relative output path.	required
`operation`	`str`	Operation name that produced the output, when known.	`None`

Returns:

Type	Description
`str`	Semantic artifact kind. Unknown paths return `"unknown"`.

`load_artifact(artifact, allow_pickle=False)` ¶

Load a supported result artifact into Python data.

Parameters:

Name	Type	Description	Default
`artifact`	`ResultArtifact`	Semantic artifact to load.	required
`allow_pickle`	`bool`	Whether pickle-backed table artifacts may be loaded. Pickle is disabled by default and should be enabled only for trusted local files.	`False`

Returns:

Type	Description
`LoadedArtifact`	Loaded data, metadata, and the source artifact.

Raises:

Type	Description
`FileNotFoundError`	If the artifact file or required companion file is missing.
`ValueError`	If the artifact kind is unsupported, the file format is invalid, pickle loading is not allowed, or the loaded data violates its minimum contract.

Notes

Supported loaded kinds include descriptor tables, projection coordinates, molecule registries, metadata JSON, workflow summaries, and Tanimoto matrices. Tanimoto matrices are loaded with numpy.load(..., allow_pickle=False).

`WorkflowRun` `dataclass` ¶

Queryable view of a completed HDDFlyzer run.

WorkflowRun reconstructs a completed run from an existing results/<tag>/manifest.json file. It does not execute pipeline stages, recalculate outputs, or create new state.

Attributes:

Name	Type	Description
`tag`	`str`	Dataset tag recorded in the manifest.
`manifest_path`	`Path`	Path to the reconstructed `manifest.json` file.
`manifest`	`dict`	Parsed manifest content.

`current_outputs` `property` ¶

list of str: Currently registered manifest output paths.

`operations` `property` ¶

list of dict: Operation records from the manifest.

`output_categories` `property` ¶

dict: Output paths grouped by workflow area.

`workflow_contract` `property` ¶

dict: Workflow contract recorded in the manifest.

`artifact(kind=None, category=None, operation=None, required=None)` ¶

Return exactly one artifact matching the requested filters.

Parameters:

Name	Type	Description	Default
`kind`	`str \| None`	Semantic artifact kind to match. Passed to `artifacts`.	`None`
`category`	`str \| None`	Output category to match. Passed to `artifacts`.	`None`
`operation`	`str \| None`	Producing operation to match. Passed to `artifacts`.	`None`
`required`	`str \| None`	Required path fragment used to disambiguate outputs. Passed to `artifacts`.	`None`

Returns:

Type	Description
`ResultArtifact`	The single matching artifact.

Raises:

Type	Description
`FileNotFoundError`	If no artifact matches the filters.
`ValueError`	If multiple artifacts match the filters.
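
The "exactly one match" contract above can be illustrated on plain dictionaries. This is a sketch, not the library's code: `select_one`, the record layout, and the example paths are hypothetical, and matching `required` by substring is an assumption based on the phrase "path fragment".

```python
def select_one(records, kind=None, category=None, operation=None, required=None):
    """Return the single record matching every filter that was given."""
    if required is None:
        fragments = ()
    elif isinstance(required, str):
        fragments = (required,)
    else:
        fragments = tuple(required)

    matches = [
        r
        for r in records
        if (kind is None or r["kind"] == kind)
        and (category is None or r["category"] == category)
        and (operation is None or r["operation"] == operation)
        and all(fragment in r["relative_path"] for fragment in fragments)
    ]
    if not matches:
        raise FileNotFoundError("no artifact matches the filters")
    if len(matches) > 1:
        paths = ", ".join(r["relative_path"] for r in matches)
        raise ValueError(f"multiple artifacts match the filters: {paths}")
    return matches[0]


# Hypothetical records; real manifests may use different paths.
records = [
    {"kind": "descriptor_table", "category": "chem",
     "operation": "chem.features",
     "relative_path": "chem/features/full/features.csv"},
    {"kind": "descriptor_table", "category": "chem",
     "operation": "chem.features",
     "relative_path": "chem/features/sample/features.csv"},
    {"kind": "projection_coordinates", "category": "dimred",
     "operation": "dimred.pca",
     "relative_path": "dimred/pca/coordinates.csv"},
]

full_table = select_one(records, kind="descriptor_table", required="features/full/")
```

Here the two descriptor tables share kind, category, and operation, so only the path fragment narrows the query to a single artifact.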

`artifacts(kind=None, category=None, operation=None, required=None)` ¶

Return semantic result artifacts derived from manifest outputs.

Parameters:

Name	Type	Description	Default
`kind`	`str`	Semantic artifact kind to select.	`None`
`category`	`str`	Output category to select.	`None`
`operation`	`str`	Producing operation to select.	`None`
`required`	`str or iterable of str`	Required path fragment or fragments used to disambiguate outputs.	`None`

Returns:

Type	Description
`list of ResultArtifact`	Matching artifacts with resolved paths, categories, kinds, and operation metadata.

Raises:

Type	Description
`ValueError`	If a manifest output path is absolute, contains traversal, or would resolve outside the run directory.
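
The three rejected cases (absolute path, traversal, resolving outside the run directory) could be checked along the following lines. This is a minimal sketch using only `pathlib`; `resolve_output` is a hypothetical name and the real validation may differ in detail.

```python
from pathlib import Path, PurePosixPath


def resolve_output(run_dir: Path, relative_path: str) -> Path:
    """Resolve a manifest-relative path, rejecting anything outside run_dir."""
    pure = PurePosixPath(relative_path)
    if pure.is_absolute():
        raise ValueError(f"absolute output path: {relative_path}")
    if ".." in pure.parts:
        raise ValueError(f"path traversal in output path: {relative_path}")

    root = run_dir.resolve()
    candidate = root.joinpath(*pure.parts).resolve()
    # Final containment check, which also catches escapes through symlinks.
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"output path escapes run directory: {relative_path}")
    return candidate
```

Checking the resolved path in addition to the raw string is a defensive choice: a path with no `..` component can still leave the run directory if one of its components is a symlink.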

`descriptor_space(category=None, operation=None, required=None, allow_pickle=False)` ¶

Load a descriptor-table artifact as a scientific descriptor space.

Parameters:

Name	Type	Description	Default
`category`	`str \| None`	Output category used to select one descriptor table.	`None`
`operation`	`str \| None`	Producing operation used to select one descriptor table.	`None`
`required`	`str \| None`	Required path fragment used to select one descriptor table.	`None`
`allow_pickle`	`bool`	Whether pickle-backed descriptor tables may be loaded.	`False`

Returns:

Type	Description
`DescriptorSpace`	Descriptor-space view over an existing loaded artifact.

Notes

This method does not recalculate descriptors.

`load_artifact(kind=None, category=None, operation=None, required=None, allow_pickle=False)` ¶

Select and load exactly one artifact.

Parameters:

Name	Type	Description	Default
`kind`	`str \| None`	Semantic artifact kind to match. Passed to `artifact`.	`None`
`category`	`str \| None`	Output category to match. Passed to `artifact`.	`None`
`operation`	`str \| None`	Producing operation to match. Passed to `artifact`.	`None`
`required`	`str \| None`	Required path fragment used to disambiguate outputs. Passed to `artifact`.	`None`
`allow_pickle`	`bool`	Whether pickle-backed table artifacts may be loaded. Enable only for trusted local files.	`False`

Returns:

Type	Description
`LoadedArtifact`	Loaded data, metadata, and source artifact.

Raises:

Type	Description
`FileNotFoundError`	If no artifact matches or a required file is missing.
`ValueError`	If multiple artifacts match, loading is unsupported, pickle loading is blocked, or loaded data is invalid.

`operation_metadata(operation_name)` ¶

Return metadata for the most recently recorded operation with the given name.

Parameters:

Name	Type	Description	Default
`operation_name`	`str`	Operation name, for example `"chem.features"`.	required

Returns:

Type	Description
`dict or None`	Operation metadata when present.

`operations_by_stage(stage)` ¶

Return operation records associated with a workflow stage.

Parameters:

Name	Type	Description	Default
`stage`	`str`	Workflow stage or area, such as `"chem"` or `"dimred"`.	required

Returns:

Type	Description
`list of dict`	Matching operation records in manifest order.

`outputs(category=None)` ¶

Return output paths from the reconstructed manifest.

Parameters:

Name	Type	Description	Default
`category`	`str`	Workflow category to select. When omitted, all current outputs are returned.	`None`

Returns:

Type	Description
`list of str`	Manifest-relative output paths.

`projection_space(category=None, operation=None, required=None, allow_pickle=False)` ¶

Load projection coordinates as a scientific projection space.

Parameters:

Name	Type	Description	Default
`category`	`str \| None`	Output category used to select one projection-coordinate table.	`None`
`operation`	`str \| None`	Producing operation used to select one projection-coordinate table.	`None`
`required`	`str \| None`	Required path fragment used to select one projection-coordinate table.	`None`
`allow_pickle`	`bool`	Whether pickle-backed projection tables may be loaded.	`False`

Returns:

Type	Description
`ProjectionSpace`	Projection-space view over existing dimensionality-reduction coordinates.

Notes

This method does not recalculate PCA, t-SNE, UMAP, or other projections.

`similarity_space(category=None, operation=None, required=None, allow_pickle=False)` ¶

Load a Tanimoto matrix artifact as a scientific similarity space.

Parameters:

Name	Type	Description	Default
`category`	`str \| None`	Output category used to select one Tanimoto matrix.	`None`
`operation`	`str \| None`	Producing operation used to select one Tanimoto matrix.	`None`
`required`	`str \| None`	Required path fragment used to select one Tanimoto matrix.	`None`
`allow_pickle`	`bool`	Present for API consistency. Tanimoto matrices are loaded from `.npz` with pickle disabled.	`False`

Returns:

Type	Description
`SimilaritySpace`	Similarity-space view over an existing Tanimoto matrix.

Notes

This method does not recalculate fingerprints or similarity.

`summary()` ¶

Return a compact programmatic summary of the reconstructed run.

Returns:

Type	Description
`dict`	Summary containing tag, manifest path, operation count, current output count, output categories, and workflow contract.

`to_dict()` ¶

Return a dictionary representation of this reconstructed run.

Returns:

Type	Description
`dict`	Dictionary containing tag, manifest path, and manifest content.

`load_workflow_run(tag, results_dir=None)` ¶

Load a completed run from results/<tag>/manifest.json.

Parameters:

Name	Type	Description	Default
`tag`	`str`	Run tag to reconstruct. Tags are validated and must not contain path separators, traversal, or absolute paths.	required
`results_dir`	`Path or str`	Root results directory. Defaults to `hddflyzer.config.settings.RESULTS_DIR`.	`None`

Returns:

Type	Description
`WorkflowRun`	Reconstructed run backed by the parsed manifest.

Raises:

Type	Description
`FileNotFoundError`	If `manifest.json` does not exist.
`ValueError`	If the tag is invalid, the resolved manifest escapes `results_dir`, or the manifest JSON/structure is invalid.
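
The error conditions listed above suggest a validate-then-load sequence. The sketch below shows one way that could look with the standard library only; `manifest_path_for` and `load_manifest` are hypothetical names, and the exact tag rules in HDDFlyzer may be stricter than the three checks shown here.

```python
import json
from pathlib import Path, PureWindowsPath


def manifest_path_for(tag: str, results_dir) -> Path:
    """Build results_dir/<tag>/manifest.json after validating the tag."""
    if not tag or "/" in tag or "\\" in tag:
        raise ValueError(f"invalid tag (empty or contains a separator): {tag!r}")
    if tag in {".", ".."} or PureWindowsPath(tag).drive:
        raise ValueError(f"invalid tag (traversal or absolute): {tag!r}")

    root = Path(results_dir).resolve()
    manifest = (root / tag / "manifest.json").resolve()
    # Guard against a tag directory that is a symlink leaving results_dir.
    if root not in manifest.parents:
        raise ValueError(f"manifest escapes results directory: {manifest}")
    return manifest


def load_manifest(tag: str, results_dir) -> dict:
    """Read and minimally validate the manifest for a tag."""
    path = manifest_path_for(tag, results_dir)
    if not path.is_file():
        raise FileNotFoundError(path)
    try:
        manifest = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid manifest JSON: {path}") from exc
    if not isinstance(manifest, dict):
        raise ValueError(f"manifest must be a JSON object: {path}")
    return manifest
```

Validating the tag before touching the filesystem keeps the `ValueError` and `FileNotFoundError` cases distinct, which matches the split in the Raises table.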

Science¶

`DescriptorSpace` `dataclass` ¶

Descriptor table interpreted as a molecular descriptor space.

Attributes:

Name	Type	Description
`artifact`	`ResultArtifact`	Source descriptor-table artifact.
`data`	`DataFrame`	Loaded descriptor table.
`metadata`	`dict`	Loader and operation metadata.
`n_molecules`	`int`	Number of rows in `data`.
`feature_names`	`tuple of str`	Descriptor feature columns, excluding common identity columns.

`molecule_ids` `property` ¶

tuple of str: Molecular identifiers when present, else empty.

`SimilaritySpace` `dataclass` ¶

Pairwise molecular similarity matrix with aligned identifiers.

Attributes:

Name	Type	Description
`artifact`	`ResultArtifact`	Source Tanimoto matrix artifact.
`matrix`	`ndarray`	Pairwise similarity matrix.
`ids`	`tuple of str`	Identifiers aligned to matrix rows and columns.
`metadata`	`dict`	Loader and operation metadata.
`n_molecules`	`int`	Number of molecules in the similarity matrix.

`molecule_ids` `property` ¶

tuple of str: Molecular identifiers aligned to the matrix.

`ProjectionSpace` `dataclass` ¶

Dimensionality-reduction coordinates for a molecular collection.

Attributes:

Name	Type	Description
`artifact`	`ResultArtifact`	Source projection-coordinate artifact.
`data`	`DataFrame`	Loaded coordinate table.
`metadata`	`dict`	Loader and operation metadata.
`coordinate_columns`	`tuple of str`	Numeric coordinate columns used as the projection axes.
`n_molecules`	`int`	Number of rows in `data`.

`molecule_ids` `property` ¶

tuple of str: Molecular identifiers when present, else empty.

`to_descriptor_space(loaded)` ¶

Convert a loaded descriptor table artifact into a descriptor space.

Parameters:

Name	Type	Description	Default
`loaded`	`LoadedArtifact`	Loaded artifact with kind `"descriptor_table"`.	required

Returns:

Type	Description
`DescriptorSpace`	Scientific view over the loaded descriptor table.

Raises:

Type	Description
`ValueError`	If the artifact kind or data type is incompatible.

Notes

This converter wraps existing loaded data and does not recalculate descriptors.

`to_similarity_space(loaded)` ¶

Convert a loaded Tanimoto matrix artifact into a similarity space.

Parameters:

Name	Type	Description	Default
`loaded`	`LoadedArtifact`	Loaded artifact with kind `"tanimoto_matrix"`.	required

Returns:

Type	Description
`SimilaritySpace`	Scientific view over the loaded similarity matrix.

Raises:

Type	Description
`ValueError`	If the artifact kind, data type, or ID alignment is incompatible.

Notes

This converter wraps an existing matrix and does not recalculate fingerprints or similarity.

`to_projection_space(loaded)` ¶

Convert loaded projection coordinates into a projection space.

Parameters:

Name	Type	Description	Default
`loaded`	`LoadedArtifact`	Loaded artifact with kind `"projection_coordinates"`.	required

Returns:

Type	Description
`ProjectionSpace`	Scientific view over existing projection coordinates.

Raises:

Type	Description
`ValueError`	If the artifact kind, data type, or coordinate columns are incompatible.

Notes

This converter does not recalculate PCA, t-SNE, UMAP, or other projections.

`shared_molecule_ids(space_a, space_b)` ¶

Return shared molecule IDs, preserving first-space order.

Parameters:

Name	Type	Description	Default
`space_a`	`object`	First space. Any object exposing a `molecule_ids` attribute.	required
`space_b`	`object`	Second space. Any object exposing a `molecule_ids` attribute.	required

Returns:

Type	Description
`tuple of str`	IDs present in both spaces, ordered as in `space_a`.

`has_aligned_molecule_ids(space_a, space_b)` ¶

Return whether two spaces have the same non-empty molecule ID order.

Parameters:

Name	Type	Description	Default
`space_a`	`object`	First space. Any object exposing a `molecule_ids` attribute.	required
`space_b`	`object`	Second space. Any object exposing a `molecule_ids` attribute.	required

Returns:

Type	Description
`bool`	`True` only when both ID sequences are non-empty and identical.
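
The semantics of the two identity helpers documented above can be shown on plain ID sequences. This is a sketch of the described behavior, not the library source; `shared_ids` and `ids_aligned` are hypothetical names that take ID sequences directly instead of space objects.

```python
def shared_ids(ids_a, ids_b):
    """IDs present in both sequences, in the order they appear in ids_a."""
    in_b = set(ids_b)
    return tuple(mol_id for mol_id in ids_a if mol_id in in_b)


def ids_aligned(ids_a, ids_b):
    """True only when both sequences are non-empty and identical in order."""
    return bool(ids_a) and tuple(ids_a) == tuple(ids_b)
```

Two spaces can share every ID and still be unaligned: `ids_aligned(("m1", "m2"), ("m2", "m1"))` is false, which is the situation `align_spaces` is meant to resolve.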

`align_spaces(*spaces)` ¶

Return spaces filtered and reordered to shared molecule IDs.

Parameters:

Name	Type	Description	Default
`*spaces`	`DescriptorSpace, ProjectionSpace, or SimilaritySpace`	Two or more scientific spaces to align.	`()`

Returns:

Type	Description
`tuple`	New space instances of the same types, filtered to shared IDs and ordered according to the first space.

Raises:

Type	Description
`ValueError`	If fewer than two spaces are provided, any space type is unsupported, any space has empty `molecule_ids`, or no IDs are shared.

Notes

Alignment uses existing data only. It does not recalculate descriptors, similarity matrices, or projections, and it does not mutate the input spaces.
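
To make the filter-and-reorder step concrete, the sketch below aligns a row table and a square matrix to their shared IDs using plain lists. It is an illustration only: `align_table_and_matrix` is a hypothetical function, and the real `align_spaces` works on space objects rather than raw lists.

```python
def align_table_and_matrix(table_ids, rows, matrix_ids, matrix):
    """Filter and reorder a row table and a square matrix to shared IDs.

    The result follows the order of table_ids (the first input).
    New lists are built, so the inputs are left untouched.
    """
    if not table_ids or not matrix_ids:
        raise ValueError("both inputs need molecule IDs")

    matrix_pos = {mol_id: pos for pos, mol_id in enumerate(matrix_ids)}
    shared = [mol_id for mol_id in table_ids if mol_id in matrix_pos]
    if not shared:
        raise ValueError("no shared molecule IDs")

    table_pos = {mol_id: pos for pos, mol_id in enumerate(table_ids)}
    new_rows = [list(rows[table_pos[mol_id]]) for mol_id in shared]

    # A matrix must be reindexed on both axes with the same index list.
    idx = [matrix_pos[mol_id] for mol_id in shared]
    new_matrix = [[matrix[i][j] for j in idx] for i in idx]
    return tuple(shared), new_rows, new_matrix
```

The point worth noting is that a similarity matrix has to be reordered along rows and columns together; reordering only the rows would silently break the ID alignment.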

`SpaceMetricResult` `dataclass` ¶

Scalar result of a structural comparison between scientific spaces.

Attributes:

Name	Type	Description
`name`	`str`	Metric name.
`value`	`float`	Scalar metric value. Some metrics may return `nan` when a correlation is undefined.
`metadata`	`dict`	Metric metadata such as molecule counts, pair counts, or coordinate columns.

`DescriptorProjectionCorrelationResult` `dataclass` ¶

Ranked descriptor/projection coordinate correlations.

Attributes:

Name	Type	Description
`data`	`DataFrame`	Ranked table with descriptor-coordinate correlation rows.
`metadata`	`dict`	Result metadata including molecule count, feature names, and coordinate columns.

`top_features(n=10)` ¶

Return the top ranked descriptor-coordinate rows.

Parameters:

Name	Type	Description	Default
`n`	`int`	Number of rows to return.	`10`

Returns:

Type	Description
`DataFrame`	Copy of the top `n` rows.

Raises:

Type	Description
`ValueError`	If `n` is not a positive integer.

`NeighborhoodPreservationResult` `dataclass` ¶

Per-molecule neighborhood preservation diagnostics.

Attributes:

Name	Type	Description
`data`	`DataFrame`	Per-molecule table with neighbor overlap counts, fractions, and neighbor ID lists.
`metadata`	`dict`	Result metadata including molecule count, `k`, and coordinate columns.

`worst_preserved(n=10)` ¶

Return molecules with the lowest neighborhood overlap.

Parameters:

Name	Type	Description	Default
`n`	`int`	Number of rows to return.	`10`

Returns:

Type	Description
`DataFrame`	Copy of the lowest-overlap rows, sorted by overlap fraction and molecule ID.

Raises:

Type	Description
`ValueError`	If `n` is not a positive integer.

`DescriptorGroupComparisonResult` `dataclass` ¶

Ranked descriptor differences for explicit groups.

Attributes:

Name	Type	Description
`data`	`DataFrame`	Group-feature summary table ranked by absolute deviation from the global descriptor mean.
`metadata`	`dict`	Result metadata including molecule count, retained groups, feature names, and minimum group size.

`top_differences(n=10)` ¶

Return the top group-feature differences.

Parameters:

Name	Type	Description	Default
`n`	`int`	Number of rows to return.	`10`

Returns:

Type	Description
`DataFrame`	Copy of the top `n` rows.

Raises:

Type	Description
`ValueError`	If `n` is not a positive integer.

`similarity_projection_correlation(similarity, projection)` ¶

Correlate pairwise similarity with projected-space proximity.

Parameters:

Name	Type	Description	Default
`similarity`	`SimilaritySpace`	Similarity space containing an existing pairwise similarity matrix.	required
`projection`	`ProjectionSpace`	Projection space containing existing coordinates.	required

Returns:

Type	Description
`SpaceMetricResult`	Scalar correlation result and metadata.

Raises:

Type	Description
`ValueError`	If inputs have wrong types, molecule IDs cannot be aligned, fewer than two aligned molecules are available, or projection coordinates are insufficient.

Notes

The function aligns inputs by molecule ID and uses existing artifacts only. It does not recalculate similarity or projections.
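
One way to read "correlate pairwise similarity with projected-space proximity" is sketched below. Everything specific here is an assumption of the example: proximity is taken as negative Euclidean distance, the correlation is Pearson over unique molecule pairs, and the inputs are assumed to be already aligned. The documented function may use a different proximity measure or correlation.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; nan when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


def similarity_vs_proximity(matrix, coords):
    """Correlate similarity with negative projected distance over pairs i < j."""
    pairs = list(combinations(range(len(coords)), 2))
    if not pairs:
        raise ValueError("need at least two aligned molecules")
    similarities = [matrix[i][j] for i, j in pairs]
    proximities = [-math.dist(coords[i], coords[j]) for i, j in pairs]
    return pearson(similarities, proximities)
```

A value near 1 would mean that similar molecules tend to sit close together in the projection. The `nan` branch corresponds to the undefined-correlation case mentioned for `SpaceMetricResult.value`.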

`similarity_projection_neighbor_overlap(similarity, projection, k=10)` ¶

Return mean overlap between similarity and projection neighbors.

Parameters:

Name	Type	Description	Default
`similarity`	`SimilaritySpace`	Similarity space containing an existing pairwise similarity matrix.	required
`projection`	`ProjectionSpace`	Projection space containing existing coordinates.	required
`k`	`int`	Number of neighbors to compare for each molecule.	`10`

Returns:

Type	Description
`SpaceMetricResult`	Mean neighbor-overlap fraction and metadata.

Raises:

Type	Description
`ValueError`	If inputs have wrong types, molecule IDs cannot be aligned, `k` is invalid, or projection coordinates are insufficient.

Notes

The function uses existing similarity and projection artifacts only. It does not perform clustering or automatic chemical interpretation.
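
A neighbor-overlap calculation of the kind described above might look like the following sketch. The details are assumptions: neighbors exclude the molecule itself, projection neighbors use Euclidean distance, ties are broken by index, and `top_k` and `neighbor_overlap` are hypothetical names. The per-molecule fractions are the sort of diagnostic that `projection_neighborhood_preservation` reports, and their mean is the scalar summary.

```python
import math


def top_k(scores, self_index, k):
    """Indices of the k highest scores, excluding the molecule itself."""
    candidates = (j for j in range(len(scores)) if j != self_index)
    ranked = sorted(candidates, key=lambda j: (-scores[j], j))
    return set(ranked[:k])


def neighbor_overlap(matrix, coords, k):
    """Return (per-molecule overlap fractions, mean overlap fraction)."""
    n = len(coords)
    if not isinstance(k, int) or not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of molecules")

    fractions = []
    for i in range(n):
        similarity_neighbors = top_k(matrix[i], i, k)
        # Higher proximity means closer, so negate the distance.
        proximity = [-math.dist(coords[i], coords[j]) for j in range(n)]
        projection_neighbors = top_k(proximity, i, k)
        shared = similarity_neighbors & projection_neighbors
        fractions.append(len(shared) / k)
    return fractions, sum(fractions) / n
```

Sorting molecules by ascending fraction gives a "worst preserved" listing, since a fraction of 0 means none of a molecule's similarity neighbors survive in the projection.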

`descriptor_projection_correlations(descriptors, projection)` ¶

Correlate numeric descriptors with projection coordinates.

Parameters:

Name	Type	Description	Default
`descriptors`	`DescriptorSpace`	Descriptor space containing numeric descriptor columns.	required
`projection`	`ProjectionSpace`	Projection space with at least two coordinate columns.	required

Returns:

Type	Description
`DescriptorProjectionCorrelationResult`	Ranked descriptor-coordinate correlations.

Raises:

Type	Description
`ValueError`	If inputs have wrong types, molecule IDs cannot be aligned, no numeric descriptor features exist, or the projection lacks sufficient coordinates.

Notes

Inputs are aligned by molecule ID before calculation. The function uses existing descriptor values and projection coordinates only; it does not recalculate descriptors or projections and does not make automatic chemical interpretations.
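
The ranking idea can be sketched with plain dictionaries of columns. The specifics are assumptions of this example rather than documented behavior: Pearson correlation, ranking by absolute value, and undefined correlations placed last. `rank_descriptor_correlations` is a hypothetical name, and the inputs are assumed to be row-aligned already.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; nan when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


def _rank_key(row):
    r = row["correlation"]
    # Defined correlations first, strongest absolute value on top.
    return (math.isnan(r), 0.0 if math.isnan(r) else -abs(r))


def rank_descriptor_correlations(features, coordinates):
    """Correlate every feature column with every coordinate column."""
    rows = []
    for feature, feature_values in features.items():
        for axis, axis_values in coordinates.items():
            rows.append({
                "feature": feature,
                "coordinate": axis,
                "correlation": pearson(feature_values, axis_values),
            })
    return sorted(rows, key=_rank_key)
```

Ranking by absolute value treats strong negative and strong positive relationships alike, which suits the question "which descriptors track this axis" regardless of direction.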

`projection_neighborhood_preservation(similarity, projection, k=10)` ¶

Evaluate local neighbor preservation for each molecule.

Parameters:

Name	Type	Description	Default
`similarity`	`SimilaritySpace`	Similarity space containing an existing pairwise similarity matrix.	required
`projection`	`ProjectionSpace`	Projection space containing existing coordinates.	required
`k`	`int`	Number of neighbors to compare for each molecule.	`10`

Returns:

Type	Description
`NeighborhoodPreservationResult`	Per-molecule overlap diagnostics and metadata.

Raises:

Type	Description
`ValueError`	If inputs have wrong types, IDs cannot be aligned, `k` is invalid, or the projection lacks sufficient coordinates.

Notes

This diagnostic compares neighbors from existing similarity and projection spaces. It does not recalculate fingerprints, similarity, projections, or clusters.

`compare_descriptor_groups(descriptors, labels, *, min_group_size=2)` ¶

Compare numeric descriptors across explicit groups.

Parameters:

Name	Type	Description	Default
`descriptors`	`DescriptorSpace`	Descriptor space containing numeric descriptor columns.	required
`labels`	`str or sequence`	Group labels. A string is interpreted as a column name in `descriptors.data`. A sequence must be aligned with `descriptors.molecule_ids` and have length `descriptors.n_molecules`.	required
`min_group_size`	`int`	Minimum number of molecules required for a group to be included.	`2`

Returns:

Type	Description
`DescriptorGroupComparisonResult`	Ranked group-feature summary table and metadata.

Raises:

Type	Description
`ValueError`	If `descriptors` is not a `DescriptorSpace`, no numeric descriptor features exist, labels are invalid, or `min_group_size` is invalid.

Notes

This function compares groups explicitly provided by the user or by an existing column. It does not perform clustering, enrichment, or automatic chemical interpretation.
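
The group comparison can be approximated as "group mean minus global mean, ranked by absolute size", which follows the description of `DescriptorGroupComparisonResult`. The sketch below is hedged accordingly: `compare_groups` is a hypothetical name, the global mean is assumed to include molecules from dropped groups, and the actual result table may carry additional columns.

```python
from collections import defaultdict


def compare_groups(values_by_feature, labels, min_group_size=2):
    """Rank (group, feature) pairs by |group mean - global mean|."""
    if not isinstance(min_group_size, int) or min_group_size < 1:
        raise ValueError("min_group_size must be a positive integer")

    members = defaultdict(list)
    for row, label in enumerate(labels):
        members[label].append(row)
    # Groups smaller than the threshold are excluded from the comparison.
    kept = {g: idx for g, idx in members.items() if len(idx) >= min_group_size}

    rows = []
    for feature, values in values_by_feature.items():
        if len(values) != len(labels):
            raise ValueError(f"labels do not match feature length: {feature}")
        global_mean = sum(values) / len(values)
        for group, idx in kept.items():
            group_mean = sum(values[i] for i in idx) / len(idx)
            rows.append({
                "group": group,
                "feature": feature,
                "n": len(idx),
                "group_mean": group_mean,
                "global_mean": global_mean,
                "deviation": group_mean - global_mean,
            })
    return sorted(rows, key=lambda row: -abs(row["deviation"]))
```

Raw deviations are not scale-free, so a feature with large units can dominate the ranking in a sketch like this one; standardizing features first would be one way to address that.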

Visualization¶

`VizInputs` `dataclass` ¶

Resolved file inputs for visualization code.

Attributes:

Name	Type	Description
`category`	`str`	Workflow category used to resolve inputs.
`root`	`Path`	Root directory of the reconstructed run.
`paths`	`tuple of pathlib.Path`	Existing input paths selected from the run manifest.

`as_dict()` ¶

Return a serializable representation of the resolved inputs.

Returns:

Type	Description
`dict`	Dictionary with `category`, `root`, and string paths.

`resolve_viz_inputs(run, category=None, required=None, kind=None)` ¶

Resolve visualization input paths from a reconstructed workflow run.

Parameters:

Name	Type	Description	Default
`run`	`WorkflowRun`	Reconstructed run whose manifest contains registered outputs.	required
`category`	`str`	Output category to select, such as `"chem"` or `"dimred"`.	`None`
`required`	`str or iterable of str`	Required path fragment or fragments used to select specific files.	`None`
`kind`	`str`	Semantic artifact kind to select through `run.artifacts`.	`None`

Returns:

Type	Description
`VizInputs`	Existing paths suitable for visualization functions.

Raises:

Type	Description
`ValueError`	If no outputs or artifacts match the requested category/kind.
`FileNotFoundError`	If `required` filters match nothing or a registered input path is missing on disk.

Notes

This function resolves inputs from an existing WorkflowRun. It does not create files, run pipeline stages, or generate plots.
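
The two documented error conditions can be sketched with a standalone stand-in. This is not HDDFlyzer code: `resolve_paths`, the list of registered paths, and the matching rules (category taken from the first path component, `required` matched by substring) are all assumptions chosen only to show when each exception would be raised.

```python
import tempfile
from pathlib import Path


def resolve_paths(root, registered, category=None, required=None):
    """Hypothetical stand-in for the documented error behaviour.

    registered: relative output paths, as a manifest might list them.
    Assumes the category is the first path component and that a
    `required` fragment matches by substring; both are guesses.
    """
    selected = [
        p for p in registered
        if category is None or Path(p).parts[0] == category
    ]
    if not selected:
        raise ValueError(f"no outputs match category {category!r}")

    if required is not None:
        fragments = [required] if isinstance(required, str) else list(required)
        selected = [p for p in selected if any(f in p for f in fragments)]
        if not selected:
            raise FileNotFoundError(f"no registered output matches {fragments}")

    # Only paths that exist on disk are returned.
    resolved = [Path(root) / p for p in selected]
    missing = [p for p in resolved if not p.exists()]
    if missing:
        raise FileNotFoundError(f"registered input missing on disk: {missing[0]}")
    return tuple(resolved)


with tempfile.TemporaryDirectory() as tmp:
    # Create one of the two registered files; the other stays missing.
    (Path(tmp) / "chem").mkdir()
    (Path(tmp) / "chem" / "features.csv").write_text("id,x\n")
    registered = ["chem/features.csv", "dimred/pca.csv"]

    found = resolve_paths(tmp, registered, category="chem", required="features.csv")
    found_names = [p.name for p in found]

    try:
        resolve_paths(tmp, registered, category="dimred")
        missing_error = None
    except FileNotFoundError as exc:
        missing_error = type(exc).__name__
```

Like the documented function, the sketch only reads what is already registered and present; it never creates the files it is asked to resolve.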

`plot_hddf_scatters(source)` ¶

Generate scatter plots for HDDF descriptor pairs.

Parameters:

Name	Type	Description	Default
`source`	`str, VizInputs, or LoadedArtifact`	Input source. A string is interpreted as a collection tag. `VizInputs` should contain a features CSV. `LoadedArtifact` must have kind `"descriptor_table"` and hold a pandas `DataFrame`.	required

Returns:

Type	Description
`bool`	`True` when the plot is written successfully, otherwise `False`.

Notes

This function writes results/<tag>/figures/correlations/hddf_corr_scatters_trendline.png. When a reconstructed input object is supplied, the function uses existing descriptor-table data and does not recalculate descriptors.
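
Because the function returns a `bool` rather than the output path, a caller who wants to locate the figure afterwards may need to rebuild the path from the tag. A minimal sketch, assuming the results directory is a relative `results` folder; `expected_scatter_path` is a hypothetical helper, not part of HDDFlyzer.

```python
from pathlib import Path


def expected_scatter_path(tag, results_dir="results"):
    """Hypothetical helper: build the documented output location for a tag."""
    return (
        Path(results_dir)
        / tag
        / "figures"
        / "correlations"
        / "hddf_corr_scatters_trendline.png"
    )


figure = expected_scatter_path("aocd")
print(figure.as_posix())
# results/aocd/figures/correlations/hddf_corr_scatters_trendline.png
```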
