Task: predict two energy scores — SAP (1–100) and EI (environmental impact) — from tabular + text + spatial inputs.
Built around two threads: robust (withstands input drift) and honest (abstains and defers to a human when unreliable).
How to read it: the chain has one predictor (advisor’s, frozen); the 6 LLMs only do features at step ②, never predict the score; all UQ (step ④) sits on the predictor’s ensemble. This update focuses on steps ② and ④.
| Split | Rows | Use |
|---|---|---|
| train | 74,994 | train the predictor |
| val | 18,748 | tuning / early stop |
| cal | 12,499 | conformal calibration only |
| test | 18,749 | final evaluation |
Each EPC text → 20 features (17 categorical + 3 numeric). These 20 join the 4 base numeric columns into 24 dims as extra input to the tabular encoder — a more robust “structured lane” for the model; it does not predict the score itself.
A single LLM extracted once, across all 124,990 rows. No redundancy, no cross-check — a single shot, so individual errors are invisible.
6 LLMs each extract independently → consensus: categorical by majority vote, numeric by median, plus a per-feature agreement score (Fleiss’ κ). Re-extracted on clean text across all 4 splits, then retrained.
Why mode/median rather than mean: these are discrete category codes and small integers — a mean yields meaningless fractions; mode/median is robust to a single LLM’s outliers and always lands in the valid value set.
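The consensus rule above can be sketched in a few lines. This is an illustrative sketch, not the project's code: the function names are hypothetical, `None` stands for a skipped LLM, and `median_low` is an assumption made here so that an even number of raters (e.g. 6) still returns a value that one of the LLMs actually produced.

```python
from collections import Counter
from statistics import median_low


def consensus_categorical(votes):
    """Majority vote over per-LLM labels; None marks an LLM that was skipped."""
    valid = [v for v in votes if v is not None]
    if not valid:
        return None
    # Ties resolve to the label seen first (Counter keeps insertion order).
    return Counter(valid).most_common(1)[0][0]


def consensus_numeric(values):
    """Median over per-LLM numbers; median_low always returns an observed value."""
    valid = [v for v in values if v is not None]
    return median_low(valid) if valid else None


def fleiss_kappa(rows):
    """Fleiss' kappa for one feature.

    rows: one list of labels per text, every list with the same number of raters.
    """
    n_raters = len(rows[0])
    n_items = len(rows)
    categories = sorted({label for row in rows for label in row})
    counts = [[row.count(c) for c in categories] for row in rows]
    # Share of all ratings that fall in each category.
    p_j = [sum(c[j] for c in counts) / (n_items * n_raters)
           for j in range(len(categories))]
    # Per-item agreement, then its mean over items.
    p_i = [(sum(x * x for x in c) - n_raters) / (n_raters * (n_raters - 1))
           for c in counts]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:  # only one category ever used: agreement is trivially perfect
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


wall_type = consensus_categorical(["cavity", "cavity", "solid", "cavity", None, "cavity"])
n_storeys = consensus_numeric([2, 2, 3, 2, 9, 2])
kappa = fleiss_kappa([["cavity"] * 6, ["solid"] * 6])
```

A single outlier (the `9` above) moves a mean but not the median, which is the robustness argument made in the text.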
4 base + 20 LLM = 24 numeric dims, plus categorical fields.
Encodes the assessor’s free-text notes with DistilBERT.
Encodes the property’s location / spatial information.
Gated fusion → SAP / EI. Gating α: tabular 0.87 / text 0.10 / spatial 0.03, faithfully reflecting modality redundancy. Hyperparameters chosen by Optuna (dropout 0.0583, lr 2.97e-4, fusion=gated); trained as an M=5 (now 10-seed) ensemble.
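For readers unfamiliar with gated fusion, a minimal NumPy sketch follows. It assumes the gate is a softmax over one logit per modality and that the fused vector is the α-weighted sum of same-sized modality embeddings; the predictor's actual gate may be built differently, and all names here are hypothetical.

```python
import numpy as np


def gated_fusion(embeddings, gate_logits):
    """Fuse modality embeddings with softmax gate weights.

    embeddings:  array of shape (n_modalities, d)
    gate_logits: array of shape (n_modalities,)
    """
    shifted = gate_logits - gate_logits.max()  # numerically stable softmax
    alpha = np.exp(shifted) / np.exp(shifted).sum()
    z_fused = (alpha[:, None] * embeddings).sum(axis=0)
    return z_fused, alpha


# Toy embeddings for tabular / text / spatial (illustrative values only).
embeddings = np.stack([np.full(4, 1.0), np.full(4, 2.0), np.full(4, 3.0)])
reported_alpha = np.array([0.87, 0.10, 0.03])  # gate weights quoted above
z_fused, alpha = gated_fusion(embeddings, np.log(reported_alpha))
```

With these weights the fused vector is dominated by the tabular embedding, which is what an α of 0.87 means in practice.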
The backbone — external audit beats self-report: OOD / Conformal / CQR judge trust using held-out data the model never trained on, not the model’s own confidence. This update’s seven improvements mainly harden steps 2 / 3 / 5 (6-LLM features, Mahalanobis, CQR).
Re-extract the 20 features by 6-LLM consensus and retrain every dependent model.
in progress · one blocker
Seven thesis-hardening improvements for the viva.
all code landed
Current results (old) · honest positioning · next-steps timeline.
only the retrain-refresh remains
In one line: methods and engineering are in place; only a single retrain remains to refresh the numbers.
The 20 main features were extracted by a single LLM in V3 — one shot, with no redundancy or cross-check.
6 LLMs each extract independently, then aggregate by consensus: categorical (17) by majority vote, numeric (3) by median.
Models: doubao · glm · minimax · deepseek · kimi · mimo (Xiaomi, own endpoint)
Evaluation data is 100% complete (cal / val / test done) — the inputs for most UQ experiments are in hand.
06-04: train extraction hit the weekly quota cap at ≈3% and paused; quota reset 06-07 17:08 → resumed, train extraction running (ETA ~21h), 0 failures.
Train extraction is the prerequisite for retraining and is fully resumable (cross-split cache keyed by text-hash), so a pause loses no progress. Moving kimi to the KCL gateway routed around its old hang and sped throughput ~2.8×.
| Measure | Effect |
|---|---|
| Cross-split global cache | Dedup by text-hash and share across the 4 splits → 355k → 271k calls (≈ −23.5%; EPC text is highly repetitive). |
| 64-way parallel + cache reuse | Reuses the 736.9M-token phase11b cache; mimo is a pure incremental top-up — the 5 cached LLMs are not re-run. |
| skip-list | Some LLMs hang on specific texts (kimi thinking-loop, ≈0.01%) → skip and use partial consensus, never blocking the pipeline. |
| quota_supervisor watchdog | Auto-pauses on the ≈5h short-window cap and auto-resumes on refresh — runs independently, with no manual supervision. |
Low-profile but essential plumbing: it guarantees data quality and reproducibility, and turns a 4–5 day extraction into something controllable and resumable.
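The cross-split cache idea can be illustrated with a small sketch. It assumes the key is a SHA-256 hash of the exact text and that the cache is an in-memory dict; the class name, the stand-in extractor, and the sample texts are hypothetical, and the real cache presumably persists to disk so that a pause can resume.

```python
import hashlib


def text_key(text):
    """Stable cache key: hash of the exact text (no normalisation assumed)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ExtractionCache:
    """Share extraction results across splits so each distinct text is sent once."""

    def __init__(self):
        self._store = {}
        self.calls = 0  # number of real extractor calls made

    def get_or_extract(self, text, extract_fn):
        key = text_key(text)
        if key not in self._store:
            self._store[key] = extract_fn(text)
            self.calls += 1
        return self._store[key]


def fake_extract(text):
    """Stand-in for an LLM call; returns a dummy feature dict."""
    return {"n_chars": len(text)}


splits = {
    "train": ["Cavity wall, filled", "Solid brick"],
    "val": ["Solid brick", "Cavity wall, filled"],
    "test": ["Timber frame"],
}
cache = ExtractionCache()
total_rows = 0
for texts in splits.values():
    for t in texts:
        cache.get_or_extract(t, fake_extract)
        total_rows += 1
```

Here 5 rows cost 3 extractor calls. Because the key depends only on the text, a restarted run that reloads the store skips everything already done.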
| Viva-exposable weak spot | Improvement |
|---|---|
| UQ-Spearman ≈ 0.24 (uncertainty weakly correlated with true error) | CQR (headline) · NLL split |
| GMM router dead signal (posterior 99.7% = 1) | Mahalanobis OOD |
| latent K=2 clustering degenerates | property_age primary |
| semantic-substitution min/max intervals unstable | 10–90 percentile band |
| too few seeds for statistical power | seeds → 10 |
| “four-method race” compares unlike quantities | reframe by role (writing only) |
Status: all code landed, all unit tests pass, default-off / config-on → activates with the Path-B retrain. Note: this set targets uncertainty usability, not accuracy (LLM features proven IID-neutral).
Global conformal assigns every sample the same interval width, so it cannot rank by per-sample difficulty by construction — a low Spearman is the necessary consequence, not an accident.
Analogy: a candidate’s self-rated “Python 9/10” cannot be trusted blindly; a standardised test finds everyone inflates by 3 → subtract 3 across the board. The model proposes; conformal calibrates and decides.
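The "same width for everyone" property is easy to see in a minimal split-conformal sketch. It assumes absolute-residual scores and the standard finite-sample quantile; the function name and toy data are illustrative, not the project's code.

```python
import math

import numpy as np


def conformal_halfwidth(cal_true, cal_pred, alpha=0.1):
    """One global half-width from held-out calibration residuals."""
    scores = np.sort(np.abs(cal_true - cal_pred))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:  # too few calibration points for this alpha
        return float("inf")
    return float(scores[k - 1])


# Toy calibration set: residuals 0, 1, ..., 9.
cal_true = np.arange(10.0)
cal_pred = np.zeros(10)
q = conformal_halfwidth(cal_true, cal_pred, alpha=0.1)

test_pred = np.array([20.0, 55.0, 80.0])
lower, upper = test_pred - q, test_pred + q
widths = upper - lower
```

Every test sample gets width `2 * q`, so the width carries no ranking information, which is why a low Spearman follows by construction.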
proof-of-direction: 1-LLM single-seed, not final; recompute after 6-LLM / 10-seed retrain (absolute shifts; the trend is the signal). Implementation = one quantile head, baked into the retrain.
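For contrast with global conformal, the calibration step of conformalized quantile regression (CQR) can be sketched as follows. It assumes the quantile head already outputs a lower and an upper quantile per sample; function names and numbers are illustrative and do not reproduce the figures quoted in this deck.

```python
import math

import numpy as np


def cqr_correction(cal_lo, cal_hi, cal_true, alpha=0.1):
    """Additive correction q so that [lo - q, hi + q] reaches 1 - alpha coverage."""
    # Positive score: truth falls outside the band; negative: band is too wide.
    scores = np.sort(np.maximum(cal_lo - cal_true, cal_true - cal_hi))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return float(scores[k - 1])


def cqr_interval(test_lo, test_hi, q):
    """Shift both quantile predictions outward (or inward if q < 0)."""
    return test_lo - q, test_hi + q


cal_lo = np.zeros(4)
cal_hi = np.ones(4)
cal_true = np.array([0.5, 1.5, -0.2, 1.0])
q = cqr_correction(cal_lo, cal_hi, cal_true, alpha=0.25)

test_lo = np.array([0.0, 10.0])
test_hi = np.array([1.0, 14.0])
lower, upper = cqr_interval(test_lo, test_hi, q)
widths = upper - lower
```

The correction `q` is global, but the width `hi - lo` comes from the quantile head per sample, so intervals can now differ between easy and hard cases.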
Explicitly separates aleatoric (inherent data noise) from epistemic (under-trained model) → directly explains “why only 0.24”: the residual is aleatoric-dominated, so more ensemble members do not help.
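One common way to make this split concrete is the law of total variance over an ensemble of Gaussian heads. The sketch below assumes each member outputs a mean and a variance per sample; whether the project's NLL split uses exactly this decomposition is an assumption.

```python
import numpy as np


def decompose_uncertainty(member_means, member_vars):
    """Split predictive variance for an ensemble of Gaussian heads.

    member_means, member_vars: arrays of shape (n_members, n_samples)
    """
    aleatoric = member_vars.mean(axis=0)  # average predicted noise
    epistemic = member_means.var(axis=0)  # disagreement between members
    return aleatoric, epistemic, aleatoric + epistemic


# Two members, two samples (toy numbers).
member_means = np.array([[1.0, 2.0], [3.0, 2.0]])
member_vars = np.array([[4.0, 1.0], [6.0, 1.0]])
aleatoric, epistemic, total = decompose_uncertainty(member_means, member_vars)
```

Adding members shrinks the estimation noise in the epistemic term but leaves the aleatoric term untouched, which matches the argument that a larger ensemble cannot lift an aleatoric-dominated ceiling.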
Measures “how far from the training distribution” in the fused latent z_fused, replacing the dead GMM router. Analogy: a guard judges whether the whole pattern is anomalous, not whether one dimension is high.
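A minimal sketch of a Mahalanobis OOD score, assuming a single Gaussian fitted to training latents with a small ridge term for invertibility. The 2-D synthetic latents and all names are illustrative stand-ins for `z_fused`.

```python
import numpy as np


def fit_gaussian(z_train, ridge=1e-6):
    """Mean and precision matrix of the training latents."""
    mean = z_train.mean(axis=0)
    cov = np.cov(z_train, rowvar=False) + ridge * np.eye(z_train.shape[1])
    return mean, np.linalg.inv(cov)


def mahalanobis(z, mean, precision):
    """Distance of each row of z from the training distribution."""
    diff = z - mean
    sq = np.einsum("ij,jk,ik->i", diff, precision, diff)
    return np.sqrt(np.maximum(sq, 0.0))


# Synthetic latents whose two dimensions are strongly correlated.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
z_train = np.column_stack([x, x + 0.1 * rng.normal(size=500)])
mean, precision = fit_gaussian(z_train)

# Both query points are unremarkable per dimension; only one fits the pattern.
queries = np.array([[1.0, 1.0], [1.0, -1.0]])
d_on_pattern, d_off_pattern = mahalanobis(queries, mean, precision)
```

The second query is flagged as far from the training data even though each coordinate on its own is within range, which is the "whole pattern" point of the guard analogy.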
band: semantic substitution → 10–90 percentile band
cluster: property_age primary (K=8, width heterogeneity 1.93×)
stats: seeds → 10 (Wilcoxon power)
writing: reframe four methods by role
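Why a 10–90 percentile band is steadier than min/max can be shown on toy numbers. The sketch assumes the band is taken over the predictions obtained from many substituted variants of one input; the data are synthetic.

```python
import numpy as np


def percentile_band(preds, lo=10, hi=90):
    """Central band of predictions across perturbed variants of one input."""
    p_lo, p_hi = np.percentile(preds, [lo, hi])
    return float(p_lo), float(p_hi)


# 101 predictions spread evenly over 0..100, then one variant goes haywire.
preds = np.arange(101, dtype=float)
preds_outlier = preds.copy()
preds_outlier[-1] = 1000.0

band_clean = percentile_band(preds)
band_outlier = percentile_band(preds_outlier)
range_clean = (preds.min(), preds.max())
range_outlier = (preds_outlier.min(), preds_outlier.max())
```

One extreme variant stretches the min/max interval tenfold while the percentile band stays where it was.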
Connecting thread: external audit beats self-report — conformal / CQR / Mahalanobis all audit the model on held-out data it never trained on, rather than trusting its self-reported confidence.
All 5 are post-hoc: they wrap the frozen predictor, no retrain, no API, running on the same retrained weights — decoupled from Path B.
Code + unit tests all in (pytest 99 passed); real-data numbers pending the retrain. None of these lifts the ~0.24 aleatoric ceiling — CQR (0.36) is still the only lever; they add comparison breadth + harden contributions ①②, not accuracy.
Modality gating α: tabular 0.87 / text 0.10 / spatial 0.03 — the gate faithfully reflects information redundancy across modalities; an insight, not a defect.
Negative results, reported as science: MoE failed 3× (which led to Mondrian), LLM IID redundancy, UQ-Spearman 0.24 (the aleatoric ceiling) — all findings, not flaws.
| Concern | Response |
|---|---|
| ① The 20 features overlap with the raw text | Agreed — the IID data confirms it (+0.07pp). Argument: complementary under drift; the value is robustness. |
| ② Features are a pre-processing artefact / add user burden; why not just normalise the text? | Valid → the normalize_vs_extract experiment (extract-features vs normalise-text, same drift, compare degradation) is implemented and answers it directly. |
| ③ UQ should correlate with (prediction − truth), not just model/input perturbation | Agreed: ensemble/perturbation spread is epistemic; conformal is the part calibrated to true error (coverage); low Spearman reflects large aleatoric; the principled fix is CQR; early directional signal (1-LLM single-seed demo): 0.24 → 0.36, ~+50% rel (proof-of-direction, re-verify after retrain). |
These three directly shaped the improvements: CQR / NLL answer ③, the controlled experiment answers ②, the robustness narrative answers ①.
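For concern ③, the UQ-Spearman metric is the rank correlation between a per-sample uncertainty and the absolute error. The sketch below is a plain NumPy version with average ranks for ties; names and toy numbers are illustrative, and a library routine such as SciPy's `spearmanr` would normally be used instead.

```python
import numpy as np


def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(a, b):
    """Spearman rank correlation; NaN when either input is constant."""
    ra, rb = average_ranks(a), average_ranks(b)
    if ra.std() == 0 or rb.std() == 0:
        return float("nan")
    return float(np.corrcoef(ra, rb)[0, 1])


abs_error = np.array([1.0, 3.0, 2.0, 5.0])
informative_width = np.array([0.1, 0.4, 0.2, 0.9])  # wider where error is larger
constant_width = np.full(4, 2.0)  # global conformal

r_informative = spearman(informative_width, abs_error)
r_constant = spearman(constant_width, abs_error)
```

A constant width has no ranks to correlate, so the coefficient is undefined; any usable Spearman has to come from a per-sample signal.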
| Stage | Content |
|---|---|
| ① Resume extraction after quota refresh | Finish train extraction (≈2.5 days at full speed, plus quota pauses). |
| ② One-shot retrain | Trigger retrain: CQR / NLL new heads and 10 seeds baked in together (deferred via --skip-retrain, fired once after train completes — saving a full training round). |
| ③ Re-run UQ | phase4b(CQR) / 8 / 9 / 10 / 12 / 13 all re-run → refresh every number and verify CQR well above 0.24. |
| ④ Write up | Once numbers refresh, write to the THESIS_WRITING_GUIDE spine; submit August. |
Compute: KCL CREATE HPC verified usable (A100-40GB) — Mac handles LLM/API, the cluster handles GPU retraining. The retrain is packaged as a one-command Slurm array (16 trainings × 8-at-a-time ~2 waves, auto-chaining UQ) → wall-clock estimated ~25h → ~5h (same total GPU-hours, just parallelised); scripts ready, not yet run on CREATE.
Weekly/monthly quotas take days to refresh (the ≈5h short window merely cycles and can be auto-ridden). We have hit the weekly cap → wait for refresh, or swap to a backup key to resume.
Reminder: back up the LLM cache off-NAS once extraction finishes (the NAS failed once before).
Methods and engineering are in place; only a single retrain remains to refresh the results.
Thank you · Questions welcome