Configuration
MELITE reads its defaults from melite/config_default.toml. A user TOML file
only needs to list the settings that should change; everything else keeps its
default value.
Minimal Override
[paths]
output = "my_output/"
[datasets.morgan_r2_2048]
path = "data/morgan_r2_2048.npz"
label_path = "raw/labels.npy"
family = "fingerprints"
method = "Morgan"
variant = "r2_2048"
[models]
active = ["svc", "rf"]
Use the file from the CLI:
melite run --config my_config.toml
melite export --config my_config.toml --row 0
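One way to picture the override behavior is a recursive merge of the user file onto the defaults. The sketch below is illustrative only: `merge_config` is a hypothetical helper, not MELITE's actual loader, and it assumes that lists such as `active` are replaced as a whole rather than merged element by element. The dictionaries stand in for parsed TOML tables.

```python
from copy import deepcopy


def merge_config(defaults: dict, override: dict) -> dict:
    """Return a copy of defaults with override applied; nested tables merge key by key."""
    merged = deepcopy(defaults)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            # Assumption: scalars and lists are replaced wholesale.
            merged[key] = deepcopy(value)
    return merged


# Stand-in for a small part of melite/config_default.toml.
defaults = {
    "paths": {"output": "output/"},
    "models": {"active": ["svc", "rf", "xgb"]},
}
# Parsed form of the [paths] and [models] tables from the minimal override.
override = {
    "paths": {"output": "my_output/"},
    "models": {"active": ["svc", "rf"]},
}

config = merge_config(defaults, override)
print(config["paths"]["output"])   # my_output/
print(config["models"]["active"])  # ['svc', 'rf']
```

The defaults dictionary is left untouched, so the same defaults can be reused for several override files.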
Input Layout
MELITE consumes pre-computed feature matrices and labels:
raw/labels.npy <- target vector y, shape (n_samples,)
data/morgan_r2_2048.npz <- required key: X, optional key: y
data/maccs.npz
data/rdkit_descriptors.npz
data/PCA85.npz
data/UMAP90.npz
Each .npz file must contain an X array. If an embedded y array is present,
MELITE validates it against the configured label_path to avoid silent feature-label
mismatches.
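A minimal sketch of producing files in this layout with NumPy, assuming toy data and a temporary directory. The file name `toy_features.npz` and the random values are placeholders; only the key names X and y and the label shape come from the description above.

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))            # feature matrix, shape (n_samples, n_features)
y = np.array([0, 1, 0, 1, 1, 0])  # target vector, shape (n_samples,)

root = Path(tempfile.mkdtemp())
(root / "raw").mkdir()
(root / "data").mkdir()

np.save(root / "raw" / "labels.npy", y)
# X is the required key; embedding y is optional.
np.savez(root / "data" / "toy_features.npz", X=X, y=y)

# Reload both files and compare the embedded y with the label file.
labels = np.load(root / "raw" / "labels.npy")
with np.load(root / "data" / "toy_features.npz") as archive:
    has_x = "X" in archive.files
    same_length = archive["X"].shape[0] == labels.shape[0]
    embedded_matches = "y" in archive.files and np.array_equal(archive["y"], labels)

print(has_x, same_length, embedded_matches)
```

Embedding y costs little space and lets a mismatch between features and labels surface at load time instead of as a silently degraded benchmark.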
MELITE is tabular at the modeling level. The learning algorithms only consume
numeric X and y arrays, so the feature matrix may come from PCA, UMAP,
fingerprints, descriptors, clinical variables, experimental measurements,
industrial features, or manually selected numeric features.
Each concrete matrix candidate is registered under [datasets.<dataset_id>].
Required fields are path and label_path. Optional metadata fields are
family, method, variant, level, and description; they are preserved
in reports for traceability and do not control special-case execution logic.
[datasets.morgan_r2_2048]
path = "data/morgan_r2_2048.npz"
label_path = "raw/labels.npy"
family = "fingerprints"
method = "Morgan"
variant = "r2_2048"
[datasets.maccs]
path = "data/maccs.npz"
label_path = "raw/labels.npy"
family = "fingerprints"
method = "MACCS"
[datasets.rdkit_descriptors]
path = "data/rdkit_descriptors.npz"
label_path = "raw/labels.npy"
family = "descriptors"
method = "RDKit"
[datasets.pca85]
path = "data/PCA85.npz"
label_path = "raw/labels.npy"
family = "dimensionality"
method = "PCA"
level = 85
[datasets.umap90]
path = "data/UMAP90.npz"
label_path = "raw/labels.npy"
family = "dimensionality"
method = "UMAP"
level = 90
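The required and optional fields can be modeled as a small record type. This is a sketch under assumptions: `DatasetEntry` and `parse_datasets` are hypothetical names, the input is the already parsed `[datasets]` table, and unknown keys are simply ignored here, which may differ from what MELITE does.

```python
from dataclasses import dataclass
from typing import Optional, Union

REQUIRED = ("path", "label_path")
OPTIONAL = ("family", "method", "variant", "level", "description")


@dataclass(frozen=True)
class DatasetEntry:
    dataset_id: str
    path: str
    label_path: str
    # Metadata only: kept for reporting, never used to branch execution.
    family: Optional[str] = None
    method: Optional[str] = None
    variant: Optional[str] = None
    level: Optional[Union[int, float, str]] = None
    description: Optional[str] = None


def parse_datasets(table: dict) -> list:
    """Turn the parsed [datasets] table into DatasetEntry records."""
    entries = []
    for dataset_id, fields in table.items():
        missing = [name for name in REQUIRED if name not in fields]
        if missing:
            raise ValueError(f"[datasets.{dataset_id}] is missing: {', '.join(missing)}")
        known = {k: v for k, v in fields.items() if k in REQUIRED + OPTIONAL}
        entries.append(DatasetEntry(dataset_id=dataset_id, **known))
    return entries


table = {
    "maccs": {"path": "data/maccs.npz", "label_path": "raw/labels.npy",
              "family": "fingerprints", "method": "MACCS"},
    "pca85": {"path": "data/PCA85.npz", "label_path": "raw/labels.npy",
              "family": "dimensionality", "method": "PCA", "level": 85},
}
for entry in parse_datasets(table):
    print(entry.dataset_id, entry.method, entry.level)
```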
Registered datasets are loaded strictly. A missing dataset file, missing
label_path, missing X, non-2D X, non-numeric X, length mismatch, or
embedded y mismatch raises an error instead of silently skipping the entry.
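The array-level checks in that list can be sketched as follows. `check_dataset` is a hypothetical function written for illustration, not MELITE's loader; it covers the shape, dtype, length, and embedded-y conditions and leaves the missing-file and missing-key cases to the code that opens the files.

```python
import numpy as np


def check_dataset(X, labels, embedded_y=None):
    """Raise ValueError for any array-level problem; otherwise return X as an array."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    if X.ndim != 2:
        raise ValueError(f"X must be 2-D, got {X.ndim}-D")
    if not np.issubdtype(X.dtype, np.number):
        raise ValueError(f"X must be numeric, got dtype {X.dtype}")
    if labels.ndim != 1:
        raise ValueError(f"labels must be 1-D, got {labels.ndim}-D")
    if X.shape[0] != labels.shape[0]:
        raise ValueError(f"X has {X.shape[0]} rows but labels has {labels.shape[0]} entries")
    if embedded_y is not None and not np.array_equal(np.asarray(embedded_y), labels):
        raise ValueError("embedded y does not match the configured labels")
    return X


X = np.zeros((3, 2))
labels = np.array([0, 1, 0])
check_dataset(X, labels, embedded_y=labels)  # passes silently

try:
    check_dataset(X, labels[:2])
except ValueError as err:
    print(err)
```

Raising on the first failure keeps a misconfigured entry from quietly dropping out of the benchmark.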
The legacy [benchmark] settings reduction_types and levels remain supported
for compatibility. When no [datasets] table is present, MELITE synthesizes
entries such as PCA70 and UMAP90 from them and tags them with dimensionality
metadata.
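A rough sketch of how such entries could be derived from the legacy settings. Everything beyond the ID pattern is an assumption: `synthesize_datasets` is a hypothetical helper, the `data/` directory and `raw/labels.npy` defaults are borrowed from the layout shown earlier, and the full cross product of types and levels may not match MELITE's behavior.

```python
from itertools import product


def synthesize_datasets(reduction_types, levels,
                        data_dir="data", label_path="raw/labels.npy"):
    """Build one dataset entry per (reduction type, level) pair."""
    datasets = {}
    for method, level in product(reduction_types, levels):
        dataset_id = f"{method}{level}"  # e.g. PCA70, UMAP90
        datasets[dataset_id] = {
            "path": f"{data_dir}/{dataset_id}.npz",  # assumed file naming
            "label_path": label_path,                # assumed shared label file
            "family": "dimensionality",
            "method": method,
            "level": level,
        }
    return datasets


legacy = synthesize_datasets(["PCA", "UMAP"], [70, 90])
print(sorted(legacy))         # ['PCA70', 'PCA90', 'UMAP70', 'UMAP90']
print(legacy["UMAP90"]["path"])  # data/UMAP90.npz
```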
Outputs
By default, MELITE writes results under output/:
output/
|-- results.txt
|-- results.csv
|-- Model_<model>_<dataset>.pkl
`-- figures/
    `-- <model>_<dataset>.png
| Output | Purpose |
|---|---|
| results.txt | Human-readable benchmark report. |
| results.csv | Structured rows with model performance, parameters, and smoke marker. |
| .pkl artifact | Final selected model retrained on all available data. |
| PNG figure | F1, Accuracy, and AUC-ROC cross-validation distributions. |
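The naming patterns above make it possible to list the files a run is expected to leave behind. The helper below is a hypothetical sketch that assumes one model artifact and one figure per model and dataset pair, and that `<dataset>` is the dataset ID; neither assumption is confirmed here.

```python
from pathlib import Path


def expected_artifacts(output_dir, models, datasets):
    """List the report files plus one .pkl and one .png per (model, dataset) pair."""
    out = Path(output_dir)
    paths = [out / "results.txt", out / "results.csv"]
    for model in models:
        for dataset in datasets:
            paths.append(out / f"Model_{model}_{dataset}.pkl")
            paths.append(out / "figures" / f"{model}_{dataset}.png")
    return paths


for path in expected_artifacts("output", ["svc", "rf"], ["pca85"]):
    print(path.as_posix())
```

Comparing this list with the actual directory contents is a quick way to spot pairs that did not finish.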
Model Families and Hyperparameter Grids
config_default.toml controls which model families are active:
[models]
active = ["svc", "rf", "xgb"]
Remove a key from active to skip that model family in a run. The detailed
hyperparameter grids are defined in melite/config.py; they are
developer-facing defaults rather than user TOML settings.
| Config key | Model family | Benchmark coverage |
|---|---|---|
| svc | Support Vector Classifier | Full runs evaluate polynomial and RBF kernels. Smoke mode uses a linear SVC only. |
| rf | Random Forest Classifier | Evaluates tree count, tree depth, feature sampling, and split/leaf controls. |
| xgb | XGBoost Classifier | Evaluates tree count, learning rate, depth, sampling, gamma, and regularization. |
SVC Kernels
| Mode | Kernels | Parameters |
|---|---|---|
| Full benchmark | poly, rbf | C and gamma; plus coef0 and degree for poly. |
| Smoke mode | linear | C = 1. |
Current full-run SVC grids:
| Kernel | Values |
|---|---|
| poly | C = [0.01, 0.1, 1, 10]; degree = [3, 4, 5]; coef0 = [0.0, 0.1, 0.02, 0.6, 0.8, 1]; gamma = [0.001, 0.002, 0.004, 0.008, 0.01, 0.02, 0.04, 0.08, 0.1, 0.2] |
| rbf | C = [0.01, 0.02, 0.1, 0.2, 1, 2, 10, 20]; gamma = [0.001, 0.002, 0.004, 0.008, 0.01, 0.02, 0.04, 0.08, 0.1, 0.2] |
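To get a feel for the cost of these grids, the number of combinations can be counted directly. The sketch copies the values from the table; it assumes an exhaustive search over every combination, which is not stated here, so the counts are best read as upper bounds per dataset and cross-validation fold. The names `poly_grid`, `rbf_grid`, `grid_size`, and `expand` are illustrative.

```python
from itertools import product
from math import prod

gamma = [0.001, 0.002, 0.004, 0.008, 0.01, 0.02, 0.04, 0.08, 0.1, 0.2]
poly_grid = {
    "C": [0.01, 0.1, 1, 10],
    "degree": [3, 4, 5],
    "coef0": [0.0, 0.1, 0.02, 0.6, 0.8, 1],
    "gamma": gamma,
}
rbf_grid = {
    "C": [0.01, 0.02, 0.1, 0.2, 1, 2, 10, 20],
    "gamma": gamma,
}


def grid_size(grid):
    """Number of parameter combinations in an exhaustive search."""
    return prod(len(values) for values in grid.values())


def expand(grid):
    """Yield every combination as a dict of parameter name to value."""
    names = list(grid)
    for values in product(*grid.values()):
        yield dict(zip(names, values))


print(grid_size(poly_grid))    # 720
print(grid_size(rbf_grid))     # 80
print(next(expand(rbf_grid)))  # {'C': 0.01, 'gamma': 0.001}
```

The polynomial kernel accounts for nine times as many candidates as the RBF kernel, which is worth keeping in mind when estimating run time for the svc family.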
Tree-Based Models
| Model | Main parameters explored |
|---|---|
| Random Forest | n_estimators, max_depth, max_features, min_samples_split, and min_samples_leaf. |
| XGBoost | n_estimators, learning_rate, max_depth, subsample, colsample_bytree, gamma, reg_alpha, and reg_lambda. |
These grids can be restricted at the family level through [models].active.
Changing the individual hyperparameter values currently requires editing
melite/config.py.