Group-Meeting Progress Update · 2026-06-08
Multimodal EPC Energy Prediction
An Uncertainty-Quantification · Calibration · Deployment-Robustness Framework
Jiahao Chen
KCL Engineering MSc
Context: Warm Home Healthy Life (Westminster City Council · Queen's Park)
Data note: all quantitative results here are pre-retrain values (5-LLM / single-LLM); final figures will be refreshed once the Path-B retrain completes.
Recap

The contribution is the framework, not the model

Task: predict two energy scores — SAP (1–100) and EI (environmental impact) — from tabular + text + spatial inputs.

Two distinct “models” — never conflate

  • Predictor = the advisor’s three-encoder gated-fusion network (architecture frozen; the only component that outputs SAP/EI).
  • 6 LLMs = offline feature extractors turning free text into structured numbers (they do not predict the score).

Three contributions: the framework itself

Around two threads — robust (withstands input drift) + honest (abstains and defers to a human when unreliable).

1. Four-method UQ comparison: MC Dropout / deep ensemble / Conformal / CQR.
2. Cluster-conditional Mondrian conformal + 3-tier triage: per-cluster interval width, tiered handling.
3. LLM deployment-robustness benchmark: less degradation under text drift (+18.5pp).
Full system workflow · Overview

End to end: from raw EPC text to deployable uncertainty

① Data & splits: EPC 124,990 rows, train/val/cal/test [unchanged]
② LLM feature extraction: text → 20 numbers [changed now]
③ Three-encoder fusion predictor: tab+text+spatial → SAP / EI [unchanged]
④ UQ / calibration / triage: contribution layer [hardened now]
Deployable output: abstain & defer when unreliable [unchanged]

Legend: unchanged = data splits / predictor architecture; this update = Path-B features + seven UQ improvements.

How to read it: the chain has one predictor (advisor’s, frozen); the 6 LLMs only do features at step ②, never predict the score; all UQ (step ④) sits on the predictor’s ensemble. This update focuses on steps ② and ④.

Full workflow · Step ① / ④

Step ① Data & splits: one dataset, four uses

What data · how processed

  • UK EPC energy certificates — 124,990 property records.
  • Each has three input types: tabular fields (area / walls / heating…), free text (assessor notes), spatial (location).
  • Tabular is standardised/encoded, text goes to DistilBERT, spatial is encoded separately — the three are fused at step ③.

Four splits (fixed since V1, shared throughout)

Split | Rows | Use
train | 74,994 | train the predictor
val | 18,748 | tuning / early stop
cal | 12,499 | conformal calibration only
test | 18,749 | final evaluation

Unchanged: splits were fixed in V1 and are identical across V1→V6, so every version’s numbers are comparable; the dedicated cal split is the prerequisite for conformal’s coverage guarantee.
Full workflow · Step ② / ④ · changed this update

Step ② LLM feature extraction: free text → 20 structured numbers

What it does · how

Each EPC text → 20 features (17 categorical + 3 numeric). These 20 join the 4 base numeric columns into 24 dims as extra input to the tabular encoder — a more robust “structured lane” for the model; it does not predict the score itself.

Before · V3

A single LLM extracted once, across all 124,990 rows. No redundancy, no cross-check — a single shot, so individual errors are invisible.

Now · Path B

6 LLMs each extract independently → consensus: categorical by majority vote, numeric by median, plus a per-feature agreement Fleiss κ. Re-extracted on clean text across all 4 splits, then retrained.

Why mode/median rather than mean: these are discrete category codes and small integers — a mean yields meaningless fractions; mode/median is robust to a single LLM’s outliers and always lands in the valid value set.
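The aggregation rule above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names, the tie-break rule, and the use of `median_low` (so that an even number of LLMs still returns an actual vote rather than a fractional midpoint) are all assumptions.

```python
from collections import Counter
from statistics import median_low


def consensus_categorical(votes):
    """Majority vote over one feature's labels; None marks a skipped LLM."""
    valid = [v for v in votes if v is not None]
    if not valid:
        return None
    counts = Counter(valid)
    top = max(counts.values())
    # Break ties by sorted label so reruns give the same answer (assumption).
    return min(label for label, n in counts.items() if n == top)


def consensus_numeric(values):
    """Median of one numeric feature across LLMs.

    median_low returns one of the submitted values, so six voters never
    produce a fractional result (assumption: the slides only say "median").
    """
    valid = [v for v in values if v is not None]
    return median_low(valid) if valid else None


votes = ["gas", "gas", "electric", "gas", None, "oil"]  # 6 LLMs, one skipped
print(consensus_categorical(votes))           # majority label
print(consensus_numeric([2, 2, 3, 3, 2, 9]))  # the single outlier (9) has no effect
```

A plain mean of the numeric example would give 3.5, which is not a value any LLM proposed.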

Full workflow · Step ③ / ④

Step ③ Predictor: three-encoder gated fusion (advisor’s, frozen)

① Tabular encoder

4 base + 20 LLM = 24 numeric dims, plus categorical fields.

② Text encoder

Encodes the assessor’s free-text notes with DistilBERT.

③ Spatial encoder

Encodes the property’s location / spatial information.

Gated fusion → SAP / EI. Gating α: tabular 0.87 / text 0.10 / spatial 0.03, faithfully reflecting modality redundancy. Hyperparameters chosen by Optuna (dropout 0.0583, lr 2.97e-4, fusion=gated); trained as an M=5 (now 10-seed) ensemble.
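To make the gating weights concrete, here is a minimal sketch of gated fusion as a softmax-weighted blend of the three modality embeddings. The softmax form, the function names, and the toy 2-dimensional vectors are assumptions for illustration; the advisor's actual gate may be computed differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(embeddings, gate_logits):
    """Blend per-modality vectors with softmax gate weights.

    embeddings:  {modality: vector}, all vectors the same length
    gate_logits: {modality: float}, hypothetical output of a gating layer
    """
    names = list(embeddings)
    alphas = softmax([gate_logits[name] for name in names])
    dim = len(embeddings[names[0]])
    fused = [
        sum(alpha * embeddings[name][i] for alpha, name in zip(alphas, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, alphas))


# Logits chosen so the gate reproduces the reported alpha values.
logits = {
    "tabular": math.log(0.87),
    "text": math.log(0.10),
    "spatial": math.log(0.03),
}
vectors = {"tabular": [1.0, 0.0], "text": [0.0, 1.0], "spatial": [1.0, 1.0]}
fused, alphas = gated_fusion(vectors, logits)
print(alphas)
print(fused)
```

With α this lopsided, the fused vector is almost the tabular embedding, which is what "the gate reflects modality redundancy" means in practice.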

Architecture unchanged: this is the only component that outputs SAP/EI; the contribution is not in changing it, but in the trustworthy-deployment apparatus wrapped around it. Path B retrains the weights only because the input features changed; the structure is untouched.
Full workflow · Step ④ / ④ · how it runs at inference

Step ④ How one record flows at deployment: input → decision

1. Input: raw EPC (tabular fields + assessor free text + spatial).
2. Extract / normalise: 6-LLM consensus turns messy text into 20 structured features and canonicalises the text.
3. OOD check: after the three encoders produce the fused representation, Mahalanobis measures how far it is from the training distribution; too far ⇒ flag / down-weight (replacing the dead GMM router).
4. Model scores: the gated-fusion prediction head maps the fused representation to point SAP / EI.
5. UQ interval: Conformal / CQR (with MC Dropout / deep ensemble) give a prediction interval with a coverage guarantee (≈90%).
6. 3-tier triage decision: combine interval width + OOD + consensus strength → auto-accept / human review / abstain to a human.
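The final triage step can be sketched as a small rule over the three signals. Every threshold below is a hypothetical placeholder, and the rule ordering is an assumption; the deck does not state the actual cut-offs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All values are illustrative placeholders, not the project's settings.
    max_width_auto: float = 8.0     # interval width (score points) for auto-accept
    max_width_review: float = 16.0  # beyond this width, abstain
    max_ood: float = 3.0            # Mahalanobis distance treated as out-of-distribution
    min_agreement: float = 0.6      # minimum LLM consensus strength for auto-accept


def triage(width, ood_distance, agreement, t=Thresholds()):
    """Map (interval width, OOD distance, consensus strength) to a tier."""
    if ood_distance > t.max_ood or width > t.max_width_review:
        return "abstain"
    if width <= t.max_width_auto and agreement >= t.min_agreement:
        return "auto-accept"
    return "human review"


print(triage(width=5.0, ood_distance=1.0, agreement=0.9))
print(triage(width=12.0, ood_distance=1.0, agreement=0.9))
print(triage(width=5.0, ood_distance=5.0, agreement=0.9))
```

The design point is that none of the three inputs is the model's own confidence: width comes from calibration data, OOD distance from the training distribution, and agreement from the LLM panel.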

The backbone — external audit beats self-report: OOD / Conformal / CQR judge trust using held-out data the model never trained on, not the model’s own confidence. This update’s seven improvements mainly harden steps 2 / 3 / 5 (6-LLM features, Mahalanobis, CQR).

This update · maps to workflow steps ②+④

Three workstreams since the last meeting

① Path B · step ②

Re-extract the 20 features by 6-LLM consensus and retrain every dependent model.
in progress · one blocker

② Seven improvements · step ④

Seven thesis-hardening improvements for the viva.
all code landed

③ Results & plan

Current results (old) · honest positioning · next-steps timeline.
only the retrain-refresh remains

In one line: methods and engineering are in place; only a single retrain remains to refresh the numbers.

Workstream ① · Path B

From single-LLM extraction to 6-LLM consensus

Before

The 20 main features were extracted by a single LLM in V3 — one shot, with no redundancy or cross-check.

Now (Path B)

6 LLMs each extract independently, then aggregate by consensus: categorical (17) by majority vote, numeric (3) by median.

Models: doubao, glm, minimax, deepseek, kimi, mimo (Xiaomi, own endpoint)

  • Re-extract on clean source text across all 4 splits, then retrain every model that consumes LLM features.
  • Two by-products: ① a feature-agreement metric, Fleiss κ (thesis-worthy); ② the old single-LLM features are retained, giving a built-in “old vs new” comparison.
  • Framing: turning “single-shot extraction” into “multi-model consensus” is itself a robustness contribution, not merely engineering.
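For reference, the agreement metric named above can be computed per feature as follows. This is a sketch of the standard Fleiss κ formula; the function name and the input layout (one list of rater labels per text, same number of raters each) are assumptions, and texts with a skipped LLM would need separate handling.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for one feature.

    ratings: one inner list per text, holding the label each LLM assigned.
    Every text must have the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of raters")

    category_totals = Counter()
    observed = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of agreeing rater pairs on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        observed += agreeing / (n_raters * (n_raters - 1))
    observed /= n_items

    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return float("nan")  # only one category ever used: kappa is undefined
    return (observed - expected) / (1.0 - expected)


perfect = [["cavity"] * 6, ["solid"] * 6]
print(fleiss_kappa(perfect))
```

κ corrects raw agreement for the agreement expected by chance, so a feature where almost every text has the same label does not score high merely because one category dominates.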
Workstream ① · Extraction progress

Path B extraction progress (as of 2026-06-07)

3 / 3 splits (cal · val · test): all extracted & written
train: extraction running (resumed 06-07), ETA ~21h, 0 failures
Overall progress: ≈ 52% and climbing since 06-07 (≈ 142k / 271k cells)
Consensus LLMs: 6 per text, per feature

Ready

Evaluation data is 100% complete (cal / val / test done) — the inputs for most UQ experiments are in hand.

Blocker cleared

06-04 train hit the weekly quota cap at ≈3% and paused; quota reset 06-07 17:08 → resumed, train running (ETA ~21h), 0 failures.

train is the prerequisite for retraining; fully resumable (cross-split cache keyed by text-hash) — a pause loses no progress. Moving kimi to the KCL gateway routed around its old hang and sped throughput ~2.8×.

Workstream ① · Engineering

Engineering that made ≈271k LLM calls manageable

Measure | Effect
Cross-split global cache | Dedup by text-hash and share across the 4 splits: 355k → 271k calls (≈ −23.5%; EPC text is highly repetitive).
64-way parallel + cache reuse | Reuses the 736.9M-token phase11b cache; mimo is a pure incremental top-up, so the 5 cached LLMs are not re-run.
skip-list | Some LLMs hang on specific texts (kimi thinking-loop, ≈0.01%) → skip and use partial consensus, never blocking the pipeline.
quota_supervisor watchdog | Auto-pauses on the ≈5h short-window cap and auto-resumes on refresh; runs independently, with no manual supervision.
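The cross-split cache idea can be sketched as below. This is a simplified in-memory stand-in: the class name, the whitespace normalisation, and the fake extractor are all hypothetical, and the real cache is persistent.

```python
import hashlib


def text_key(text):
    """Stable cache key: SHA-256 of the whitespace-normalised text."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class ExtractionCache:
    """In-memory stand-in for a persistent cache shared by all splits."""

    def __init__(self):
        self._store = {}
        self.api_calls = 0

    def get_or_extract(self, model, text, extract_fn):
        key = (model, text_key(text))
        if key not in self._store:  # only unseen (model, text) pairs cost a call
            self._store[key] = extract_fn(text)
            self.api_calls += 1
        return self._store[key]


def fake_llm(text):  # placeholder for a real API call
    return {"n_words": len(text.split())}


cache = ExtractionCache()
splits = {
    "cal": ["Cavity wall, filled", "Solid brick,  no insulation"],
    "test": ["Cavity wall, filled", "Solid brick, no insulation"],
}
for rows in splits.values():
    for text in rows:
        cache.get_or_extract("mimo", text, fake_llm)
print(cache.api_calls)  # fewer calls than lookups, because texts repeat across splits
```

Because the key depends only on the model and the text, a resumed run skips everything already extracted regardless of which split the text came from.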

Low-profile but essential plumbing: it guarantees data quality and reproducibility, and turns a 4–5 day extraction into something controllable and resumable.

Workstream ② · Thesis hardening

Seven improvements, each targeting a viva weak spot

Viva-exposable weak spot | Improvement
UQ-Spearman ≈ 0.24 (uncertainty weakly correlated with true error) | CQR (headline) · NLL split
GMM router dead signal (posterior 99.7% = 1) | Mahalanobis OOD
latent K=2 clustering degenerates | property_age primary
semantic-substitution min/max intervals unstable | 10–90 percentile band
too few seeds for statistical power | seeds → 10
“four-method race” compares unlike quantities | reframe by role (writing only)

Status: all code landed, all unit tests pass, default-off / config-on → activates with the Path-B retrain. Note: this set targets uncertainty usability, not accuracy (LLM features proven IID-neutral).

Workstream ② · Headline improvement

CQR: targeting the UQ-Spearman weakness (0.24)

Structural root cause

Global conformal assigns every sample the same interval width, so it cannot rank by per-sample difficulty by construction — a low Spearman is the necessary consequence, not an accident.
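A minimal sketch of global split conformal makes the constant-width point visible. The function names and the toy calibration data are assumptions; the quantile uses the usual finite-sample rank ⌈(n+1)(1−α)⌉.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile of the calibration scores."""
    n = len(scores)
    # The small epsilon guards against float round-off in (n + 1) * (1 - alpha).
    rank = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def split_conformal(cal_pred, cal_true, test_pred, alpha=0.1):
    """Global split conformal: one half-width q shared by every test sample."""
    scores = [abs(y - p) for p, y in zip(cal_pred, cal_true)]
    q = conformal_quantile(scores, alpha)
    return [(p - q, p + q) for p in test_pred]


# Toy calibration set: 19 predictions of 50 with absolute errors 0, 1, ..., 18.
cal_pred = [50.0] * 19
cal_true = [50.0 + i for i in range(19)]
intervals = split_conformal(cal_pred, cal_true, test_pred=[30.0, 55.0, 80.0])
widths = [hi - lo for lo, hi in intervals]
print(widths)
```

Every width is identical, so the interval carries coverage but no information about which sample is harder.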

How CQR works

  • The model self-reports per-sample quantiles (0.05 / 0.5 / 0.95); the gap between the outer two is its proposed interval width.
  • Conformal audits and corrects them on a held-out set → adaptive width with a retained coverage guarantee.
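The two steps above can be sketched as follows, assuming the quantile head has already produced lower and upper predictions. Function names and the toy numbers are illustrative; the conformity score max(lo − y, y − hi) is the standard CQR choice.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile of the calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha) - 1e-9)  # epsilon guards float round-off
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def cqr_intervals(cal_lo, cal_hi, cal_true, test_lo, test_hi, alpha=0.1):
    """Conformalized quantile regression.

    The score is positive when the truth falls outside the model's own
    [lo, hi] band and negative when it sits comfortably inside.
    """
    scores = [max(lo - y, y - hi) for lo, hi, y in zip(cal_lo, cal_hi, cal_true)]
    q = conformal_quantile(scores, alpha)
    # One shared correction q, applied to each sample's own band.
    return [(lo - q, hi + q) for lo, hi in zip(test_lo, test_hi)]


# Toy calibration set: the model's band [40, 60] misses one truth (63) by 3.
cal_lo, cal_hi = [40.0] * 9, [60.0] * 9
cal_true = [50.0] * 8 + [63.0]
intervals = cqr_intervals(cal_lo, cal_hi, cal_true,
                          test_lo=[45.0, 30.0], test_hi=[55.0, 70.0])
print(intervals)
```

The correction is global, but the widths still differ per sample because each sample keeps its own self-reported band.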

Analogy: a candidate’s self-rated “Python 9/10” cannot be trusted blindly; a standardised test finds everyone inflates by 3 → subtract 3 across the board. The model proposes; conformal calibrates and decides.

Directional evidence (re-verify after retrain)

  • On a single-LLM, single-seed demo model, global CQR lifts UQ-Spearman 0.24 → 0.36 (SAP 0.40 / EI 0.32), ~+50% relative.
  • Counter-intuitive: making CQR cluster-conditional by age adds nothing (≈flat) → clustering's job is per-cluster coverage; the per-sample head carries difficulty ranking.

Proof-of-direction: 1-LLM single-seed, not final; recompute after the 6-LLM / 10-seed retrain (absolute values will shift; the trend is the signal). Implementation = one quantile head, baked into the retrain.
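For clarity on the metric itself: UQ-Spearman is taken here to be the rank correlation between a per-sample uncertainty (for example, interval width) and the absolute error. That reading, and the helper names below, are assumptions; the sketch uses average ranks for ties.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation; NaN when either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


abs_error = [1.0, 2.0, 4.0, 8.0]
print(spearman([3.0, 5.0, 6.0, 9.0], abs_error))  # widths that track difficulty
print(spearman([7.0, 7.0, 7.0, 7.0], abs_error))  # constant width: no ranking at all
```

A constant width has no rank variation, so it cannot contribute any difficulty ranking; only a per-sample width can move this number.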

Workstream ② · Remaining improvements

Hardening the results that were less robust

NLL head

Explicitly separates aleatoric (inherent data noise) from epistemic (under-trained model) → directly explains “why only 0.24”: the residual is aleatoric-dominated, so more ensemble members do not help.
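One common way to obtain this split from an ensemble of Gaussian (mean, variance) heads is the law of total variance. The sketch below assumes that setup; the function names and toy numbers are illustrative, not the project's implementation.

```python
import math
from statistics import mean, pvariance


def gaussian_nll(y, mu, var):
    """Negative log-likelihood of y under N(mu, var): the NLL head's loss."""
    return 0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)


def decompose(members):
    """Split ensemble uncertainty for one sample.

    members: list of (mu, var) pairs, one per ensemble member.
    aleatoric = average predicted noise variance (data noise)
    epistemic = variance of the member means (model disagreement)
    """
    mus = [mu for mu, _ in members]
    variances = [var for _, var in members]
    return mean(mus), mean(variances), pvariance(mus)


# Three members that agree closely but all predict substantial noise.
members = [(50.0, 16.0), (52.0, 16.0), (48.0, 16.0)]
mu, aleatoric, epistemic = decompose(members)
print(mu, aleatoric, epistemic)
```

In this toy case the aleatoric term dominates, and adding more members would shrink only the smaller epistemic term, which mirrors the argument above.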

Mahalanobis OOD

Measures “how far from the training distribution” in the fused latent z_fused, replacing the dead GMM router. Analogy: a guard judges whether the whole pattern is anomalous, not whether one dimension is high.
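A dependency-free sketch of the distance follows, using 2-dimensional toy vectors in place of z_fused. The helper names and data are hypothetical; in practice the mean and covariance would be fitted on training-set latents, likely with a regularised covariance.

```python
import math


def fit_gaussian(rows):
    """Mean vector and sample covariance matrix of the training latents."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [
        [sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows) / (n - 1)
         for b in range(d)]
        for a in range(d)
    ]
    return mu, cov


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis(z, mu, cov):
    """sqrt((z - mu)^T cov^{-1} (z - mu)), without forming the inverse."""
    diff = [a - b for a, b in zip(z, mu)]
    return math.sqrt(sum(d * s for d, s in zip(diff, solve(cov, diff))))


# Training latents where the two dimensions move together.
train = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9), (4.0, 4.1)]
mu, cov = fit_gaussian(train)
print(mahalanobis((4.0, 4.0), mu, cov))  # follows the training pattern
print(mahalanobis((4.0, 0.0), mu, cov))  # each coordinate is in range, the pattern is not
```

The second point is unremarkable in either dimension alone, yet far from the training distribution, which is the "whole pattern" behaviour the guard analogy describes.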

  • band: semantic substitution → 10–90 percentile band
  • cluster: property_age primary (K=8, width heterogeneity 1.93×)
  • stats: seeds → 10 (Wilcoxon power)
  • writing: reframe four methods by role
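The cluster-conditional (Mondrian) variant behind the width-heterogeneity figure can be sketched as one conformal quantile per cluster. The cluster labels, function names, and toy residuals below are hypothetical.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile of the calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha) - 1e-9)  # epsilon guards float round-off
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def mondrian_quantiles(cal_pred, cal_true, cal_cluster, alpha=0.1):
    """One half-width per cluster, each calibrated on that cluster's rows only."""
    by_cluster = defaultdict(list)
    for p, y, c in zip(cal_pred, cal_true, cal_cluster):
        by_cluster[c].append(abs(y - p))
    return {c: conformal_quantile(s, alpha) for c, s in by_cluster.items()}


def mondrian_interval(pred, cluster, quantiles):
    q = quantiles[cluster]
    return pred - q, pred + q


# Toy calibration data: older properties have errors twice as large.
cal_pred = [50.0] * 18
cal_true = [50.0 + r for r in range(1, 10)] + [50.0 + 2 * r for r in range(1, 10)]
cal_cluster = ["post-2000"] * 9 + ["pre-1930"] * 9
quantiles = mondrian_quantiles(cal_pred, cal_true, cal_cluster)
print(quantiles)
print(mondrian_interval(60.0, "post-2000", quantiles))
```

Coverage then holds within each cluster, at the cost of needing enough calibration rows per cluster; with K=8 and 12,499 cal rows that is unlikely to bind, though cluster sizes are not given here.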

Connecting thread: external audit beats self-report — conformal / CQR / Mahalanobis all audit the model on held-out data it never trained on, rather than trusting its self-reported confidence.

Workstream ② · New batch

Five post-hoc UQ methods: comparison breadth + hardening

All 5 are post-hoc: they wrap the frozen predictor, no retrain, no API, running on the same retrained weights — decoupled from Path B.

1. Proper scoring (CRPS / NLL / Winkler): a shared yardstick rewarding sharpness AND calibration, not just MAE.
2. Worst-slice coverage: bin by predicted score, check the 90% interval doesn't secretly under-cover a slice.
3. Conformal Risk Control: lifts the guarantee from interval width to which energy band was assigned.
4. Last-layer Laplace: a Bayesian posterior over the final layer → a third per-sample epistemic column.
5. Calibrated regression (Kuleshov isotonic): recalibrates the NLL Gaussian as a named comparison baseline, showing why conformal / CQR's finite-sample guarantee is preferable.
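Two of these are simple enough to sketch directly: the Winkler interval score and the worst-slice coverage check. Function names, bin edges, and toy data are assumptions for illustration.

```python
import bisect
from collections import defaultdict


def winkler_score(lo, hi, y, alpha=0.1):
    """Interval score: width plus a 2/alpha penalty per unit of miss (lower is better)."""
    score = hi - lo
    if y < lo:
        score += (2 / alpha) * (lo - y)
    elif y > hi:
        score += (2 / alpha) * (y - hi)
    return score


def worst_slice_coverage(preds, intervals, truths, bin_edges):
    """Empirical coverage per predicted-score bin, and the worst bin."""
    hits = defaultdict(list)
    for p, (lo, hi), y in zip(preds, intervals, truths):
        slice_id = bisect.bisect_right(bin_edges, p)  # 0 = below the first edge
        hits[slice_id].append(lo <= y <= hi)
    coverage = {s: sum(h) / len(h) for s, h in hits.items()}
    worst = min(coverage, key=coverage.get)
    return worst, coverage


preds = [30.0, 35.0, 70.0, 75.0]
intervals = [(p - 5.0, p + 5.0) for p in preds]
truths = [31.0, 50.0, 71.0, 74.0]
worst, coverage = worst_slice_coverage(preds, intervals, truths, bin_edges=[50.0])
print(worst, coverage)
print(winkler_score(40.0, 60.0, 65.0))
```

Overall coverage in the toy data is 75%, but the low-score slice sits at 50%; an aggregate number alone would hide that.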

Code + unit tests all in (pytest 99 passed); real-data numbers pending the retrain. None of these lifts the ~0.24 aleatoric ceiling — CQR (0.36) is still the only lever; they add comparison breadth + harden contributions ①②, not accuracy.

Workstream ③ · Current results

Current key results (pre-retrain; will change after retrain)

0.749 | baseline mean R² (MAE 4.38)
+88.7% | multimodal vs tab-only R² (0.398 → 0.750)
+0.07pp | LLM-feature IID gain, ≈ 0 (reported honestly)
+18.5pp | robustness advantage (less drift degradation)
≈ 90% | conformal coverage (guaranteed)
1.93× | Mondrian width heterogeneity (property_age K=8)
2.4× | triage MAE gradient (Tier1 2.84 → Tier3 6.70)
0.872 | consensus Fleiss κ (will shift at 6-LLM)

Modality gating α: tabular 0.87 / text 0.10 / spatial 0.03 — the gate faithfully reflects information redundancy across modalities; an insight, not a defect.

Workstream ③ · Honest positioning

Honest positioning: the viva red line

Must not claim

  • “LLM features improve accuracy” — they do not (IID +0.07pp, statistically zero).
  • “We built a better model” — the model is the advisor’s and is frozen.
  • “Robustness verified on real consumer text” — it is synthetic drift (LLM-rewritten).

Can claim

  • Architecture: multimodal over raw text beats “compress text → categorical” — tab_llm_only recovers only 94.2% of no_llm.
  • LLM value is robustness, not accuracy: 18.5pp less degradation under drift.
  • Calibration is real: conformal has a coverage guarantee (≈90%).
  • Deployability = triage: silent errors become honest abstentions.

Negative results, reported as science: MoE failed 3× (which led to Mondrian), LLM IID redundancy, UQ-Spearman 0.24 (the aleatoric ceiling) — all findings, not flaws.

Workstream ③ · Reviewer response

Responding to reviewer concerns

Concern | Response
① The 20 features overlap with the raw text | Agreed: the IID data confirms it (+0.07pp). Argument: complementary under drift; the value is robustness.
② Features are a pre-processing artefact / add user burden; why not just normalise the text? | Valid: the normalize_vs_extract experiment (extract-features vs normalise-text, same drift, compare degradation) is implemented and answers it directly.
③ UQ should correlate with (prediction − truth), not just model/input perturbation | Agreed: ensemble/perturbation spread is epistemic; conformal is the part calibrated to true error (coverage); the low Spearman reflects large aleatoric noise; the principled fix is CQR. Early directional signal (1-LLM single-seed demo): 0.24 → 0.36, ~+50% relative (proof-of-direction, re-verify after retrain).

These three directly shaped the improvements: CQR / NLL answer ③, the controlled experiment answers ②, the robustness narrative answers ①.

Next steps · Timeline

Next steps and timeline

Stage | Content
① Resume extraction after quota refresh | Finish train extraction (≈2.5 days at full speed, plus quota pauses).
② One-shot retrain | Trigger retrain: CQR / NLL new heads and 10 seeds baked in together (deferred via --skip-retrain, fired once after train completes, saving a full training round).
③ Re-run UQ | phase4b(CQR) / 8 / 9 / 10 / 12 / 13 all re-run → refresh every number and verify CQR well above 0.24.
④ Write up | Once numbers refresh, write to the THESIS_WRITING_GUIDE spine; submit August.

Compute: KCL CREATE HPC verified usable (A100-40GB); the Mac handles LLM/API calls, the cluster handles GPU retraining. The retrain is packaged as a one-command Slurm array (16 trainings, 8 at a time, so ~2 waves, auto-chaining UQ), cutting estimated wall-clock from ~25h to ~5h (same total GPU-hours, just parallelised); scripts are ready but not yet run on CREATE.

Risks · Blockers

Current blockers and risks

Main blocker: the API quota is a multi-day hard wall

Weekly/monthly quotas take days to refresh (the ≈5h short window merely cycles and can be auto-ridden). We have hit the weekly cap → wait for refresh, or swap to a backup key to resume.

Timeline impact

  • train ≈2.5 days at full speed; retrain + UQ on the cluster ≈1–2 days.
  • Refreshed numbers expected within roughly a week of resuming.

Mitigations already in place

  • Fully resumable (cross-split cache); a pause loses no progress.
  • Watchdog handles the short window; eval data (cal / val / test) already complete.

Reminder: back up the LLM cache off-NAS once extraction finishes (the NAS failed once before).

Summary

  • Engineering: Path B extraction past halfway, eval data ready, blocked on quota (resumable).
  • Methods: all seven improvements coded, awaiting the retrain to bake in.
  • Positioning: the contribution is the framework, not the model; LLM value is robustness, not accuracy.
  • Next: resume → retrain → refresh numbers → write up (submit August).

Methods and engineering are in place; only a single retrain remains to refresh the results.

Thank you · Questions welcome