Multimodal EPC Prediction · Progress Update

Group-Meeting Progress Update · 2026-06-08

Multimodal EPC Energy Prediction

An Uncertainty-Quantification · Calibration · Deployment-Robustness Framework

Jiahao Chen
KCL Engineering MSc

Context: Warm Home Healthy Life (Westminster City Council · Queen's Park)

Data note: all quantitative results here are pre-retrain values (5-LLM / single-LLM); final figures will be refreshed once the Path-B retrain completes.

Recap

The contribution is the framework, not the model

Task: predict two energy scores — SAP (1–100) and EI (environmental impact) — from tabular + text + spatial inputs.

Two distinct “models” — never conflate

Predictor = the advisor’s three-encoder gated-fusion network (architecture frozen; the only component that outputs SAP/EI).
6 LLMs = offline feature extractors turning free text into structured numbers (they do not predict the score).

Three contributions: the framework itself

Around two threads — robust (withstands input drift) + honest (abstains and defers to a human when unreliable).

Four-method UQ comparison — MC Dropout / deep ensemble / Conformal / CQR.

Cluster-conditional Mondrian conformal + 3-tier triage — per-cluster interval width, tiered handling.

LLM deployment-robustness benchmark: 18.5pp less degradation under text drift.

Full system workflow · Overview

End to end: from raw EPC text to deployable uncertainty

①

Data & splits

EPC 124,990 rows
train/val/cal/test

unchanged

→

②

LLM feature extraction

text → 20 numbers

changed now

→

③

Three-encoder fusion predictor

tab+text+spatial
→ SAP / EI

unchanged

→

④

UQ / calibration / triage

contribution layer

hardened now

→

▶

Deployable output

abstain & defer
when unreliable

Legend: unchanged = data splits / predictor architecture · this update = Path-B features + seven UQ improvements

How to read it: the chain has one predictor (advisor’s, frozen); the 6 LLMs only do features at step ②, never predict the score; all UQ (step ④) sits on the predictor’s ensemble. This update focuses on steps ② and ④.

Full workflow · Step ① / ④

Step ① Data & splits: one dataset, four uses

What data · how processed

UK EPC energy certificates — 124,990 property records.
Each has three input types: tabular fields (area / walls / heating…), free text (assessor notes), spatial (location).
Tabular is standardised/encoded, text goes to DistilBERT, spatial is encoded separately — the three are fused at step ③.

Four splits (fixed since V1, shared throughout)

Split	Rows	Use
train	74,994	train the predictor
val	18,748	tuning / early stop
cal	12,499	conformal calibration only
test	18,749	final evaluation

Splits were fixed in V1 and identical across V1→V6, so every version’s numbers are comparable; the dedicated cal split is the prerequisite for conformal’s coverage guarantee.
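A minimal sketch of how a fixed four-way split with these row counts could be produced. The 60/15/10/15 proportions are inferred from the table, and the seed and function names are hypothetical; the actual V1 split procedure is not described here.

```python
import random

# Assumed percentages, inferred from the row counts; test takes the remainder.
SPLIT_PERCENT = {"train": 60, "val": 15, "cal": 10}

def split_sizes(n_rows):
    """Integer split sizes; integer arithmetic avoids float rounding surprises."""
    sizes = {name: n_rows * pct // 100 for name, pct in SPLIT_PERCENT.items()}
    sizes["test"] = n_rows - sum(sizes.values())
    return sizes

def assign_splits(n_rows, seed=0):
    """Shuffle row indices once with a fixed seed, then cut contiguous blocks."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)  # hypothetical seed; fixed so reruns match
    out, start = {}, 0
    for name, size in split_sizes(n_rows).items():
        out[name] = order[start:start + size]
        start += size
    return out

sizes = split_sizes(124_990)
print(sizes)
```

Because cal is cut from the same shuffled pool and never used for training or tuning, it stays exchangeable with test, which is what the conformal coverage argument relies on.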

Full workflow · Step ② / ④ · changed this update

Step ② LLM feature extraction: free text → 20 structured numbers

What it does · how

Each EPC text → 20 features (17 categorical + 3 numeric). These 20 join the 4 base numeric columns into 24 dims as extra input to the tabular encoder — a more robust “structured lane” for the model; it does not predict the score itself.

Before · V3

A single LLM extracted once, across all 124,990 rows. No redundancy, no cross-check — a single shot, so individual errors are invisible.

Now · Path B

6 LLMs each extract independently → consensus: categorical by majority vote, numeric by median, plus a per-feature agreement score (Fleiss κ). Features are re-extracted on clean text across all 4 splits, then every dependent model is retrained.

Why mode/median rather than mean: these are discrete category codes and small integers — a mean yields meaningless fractions; mode/median is robust to a single LLM’s outliers and always lands in the valid value set.
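The aggregation rule above can be sketched as follows. Function names and the example votes are hypothetical. One assumption to flag: with an even number of LLMs an ordinary median can average the two middle values, so this sketch uses `median_low`, which always returns an observed value; the pipeline's actual tie handling is not stated.

```python
from collections import Counter
from statistics import median_low

def consensus_categorical(votes):
    """Majority vote over the LLMs that answered (None = skipped)."""
    answered = [v for v in votes if v is not None]
    if not answered:
        return None
    # most_common breaks ties by first-seen order (an assumption of this sketch).
    return Counter(answered).most_common(1)[0][0]

def consensus_numeric(values):
    """Median over the LLMs that answered; median_low never invents a value."""
    answered = [v for v in values if v is not None]
    if not answered:
        return None
    return median_low(answered)

heating = consensus_categorical(["gas", "gas", "electric", "gas", None, "oil"])
storeys = consensus_numeric([2, 2, 3, 2, 9, 3])  # one LLM returns an outlier (9)
print(heating, storeys)
```

The mean of the six numeric votes would be 3.5, which is not a valid storey count, whereas the median ignores the single outlier.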

Full workflow · Step ③ / ④

Step ③ Predictor: three-encoder gated fusion (advisor’s, frozen)

① Tabular encoder

4 base + 20 LLM = 24 numeric dims, plus categorical fields.

② Text encoder

Encodes the assessor’s free-text notes with DistilBERT.

③ Spatial encoder

Encodes the property’s location / spatial information.

Gated fusion → SAP / EI. Gating α: tabular 0.87 / text 0.10 / spatial 0.03, faithfully reflecting modality redundancy. Hyperparameters chosen by Optuna (dropout 0.0583, lr 2.97e-4, fusion=gated); trained as a seed ensemble (M=5 so far, 10 seeds from this update).

This is the only component that outputs SAP/EI; the contribution is not in changing it, but in the trustworthy-deployment apparatus wrapped around it. Path B retrains the weights only because the input features changed — the structure is untouched.
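To make the gating weights α concrete, here is a generic softmax-gated fusion in plain Python. This is an illustrative sketch only: the advisor's actual gate, embedding sizes, and layer names are not given in this deck, and in a real network the gate logits would come from a learned layer rather than being set by hand.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(embeddings, gate_logits):
    """Weighted sum of same-sized modality embeddings, weights = softmax(gate)."""
    alphas = softmax(gate_logits)
    dim = len(embeddings[0])
    fused = [sum(a * emb[i] for a, emb in zip(alphas, embeddings))
             for i in range(dim)]
    return fused, alphas

# Toy 2-d embeddings for tabular / text / spatial (hypothetical values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Logits chosen so the gate reproduces the reported weights 0.87 / 0.10 / 0.03.
gate_logits = [math.log(0.87), math.log(0.10), math.log(0.03)]
fused, alphas = gated_fusion(embeddings, gate_logits)
print([round(a, 2) for a in alphas], [round(v, 2) for v in fused])
```

With weights like these the fused vector is dominated by the tabular embedding, which is the redundancy pattern the slide describes.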

Full workflow · Step ④ / ④ · how it runs at inference

Step ④ How one record flows at deployment: input → decision

Input: raw EPC — tabular fields + assessor free text + spatial.

↓

Extract / normalise: 6-LLM consensus turns messy text into 20 structured features and canonicalises the text.

↓

OOD check: after the three encoders produce the fused representation, Mahalanobis measures how far it is from the training distribution — too far ⇒ flag / down-weight (replacing the dead GMM router).

↓

Model scores: the gated-fusion prediction head maps the fused representation to point SAP / EI.

↓

UQ interval: Conformal / CQR (with MC Dropout / deep ensemble) give a prediction interval with a coverage guarantee (≈90%).

↓

3-tier triage decision: combine interval width + OOD + consensus strength → auto-accept / human review / abstain to a human.
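A rule-based sketch of the triage step. Every threshold and the exact way the three signals are combined are hypothetical placeholders chosen for illustration; the deck only states which signals feed the decision.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TriageThresholds:
    # All values are hypothetical and would be tuned on the cal split.
    max_width_auto: float = 10.0     # interval width (SAP points) for auto-accept
    max_width_review: float = 20.0   # wider than this -> abstain
    ood_cutoff: float = 3.0          # Mahalanobis distance cutoff
    min_agreement: float = 0.67      # share of LLMs agreeing on the features

def triage(width, ood_distance, agreement, t=TriageThresholds()):
    """Map interval width + OOD distance + consensus strength to one of 3 tiers."""
    if ood_distance > t.ood_cutoff or width > t.max_width_review:
        return "abstain"
    if width > t.max_width_auto or agreement < t.min_agreement:
        return "human_review"
    return "auto_accept"

print(triage(width=6.0, ood_distance=1.2, agreement=1.0))
print(triage(width=14.0, ood_distance=1.2, agreement=1.0))
print(triage(width=6.0, ood_distance=5.0, agreement=1.0))
```

The ordering of the checks encodes one possible design choice: an out-of-distribution flag overrides a narrow interval, since the interval itself is less trustworthy far from the training data.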

The backbone: external audit beats self-report. OOD / Conformal / CQR judge trust using held-out data the model never trained on, not the model’s own confidence. This update’s seven improvements mainly harden stages 2 / 3 / 5 of the inference flow above (6-LLM features, Mahalanobis, CQR).

This update · maps to workflow steps ②+④

Three workstreams since the last meeting

① Path B · step ②

Re-extract the 20 features by 6-LLM consensus and retrain every dependent model.
in progress · one blocker

② Seven improvements · step ④

Seven thesis-hardening improvements for the viva.
all code landed

③ Results & plan

Current results (old) · honest positioning · next-steps timeline.
only the retrain-refresh remains

In one line: methods and engineering are in place; only a single retrain remains to refresh the numbers.

Workstream ① · Path B

From single-LLM extraction to 6-LLM consensus

Before

The 20 main features were extracted by a single LLM in V3 — one shot, with no redundancy or cross-check.

Now (Path B)

6 LLMs each extract independently, then aggregate by consensus: categorical (17) by majority vote, numeric (3) by median.

Models: doubao · glm · minimax · deepseek · kimi · mimo (Xiaomi, own endpoint)

Re-extract on clean source text across all 4 splits, then retrain every model that consumes LLM features.
Two by-products: ① a feature-agreement metric, Fleiss κ (thesis-worthy); ② the old single-LLM features are retained, giving a built-in “old vs new” comparison.
Framing: turning “single-shot extraction” into “multi-model consensus” is itself a robustness contribution, not merely engineering.
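For reference, a plain-Python sketch of Fleiss κ for one categorical feature. It assumes every text was rated by the same number of LLMs; texts where a model was skipped would need separate handling, which is not covered here. The function name and toy labels are hypothetical.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: one list of labels per text, one label per LLM (equal lengths)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("need the same number (>= 2) of raters for every item")
    counts = [Counter(r) for r in ratings]
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_items
    # Chance agreement from the pooled category frequencies.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return 1.0  # only one category ever used: agreement is trivially perfect
    return (p_bar - p_e) / (1.0 - p_e)

perfect = fleiss_kappa([["gas"] * 6, ["electric"] * 6])
split = fleiss_kappa([["a", "a", "b"], ["a", "b", "b"]])
print(perfect, split)
```

Computing this per feature shows which of the 20 features the LLMs extract consistently and which are ambiguous in the source text.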

Workstream ① · Extraction progress

Path B extraction progress (as of 2026-06-07)

3 / 3

cal · val · test
all splits extracted & written

running

train extracting (resumed 06-07)
ETA ~21h · 0 failures

≈ 52%↑

overall progress (climbing since 06-07)
≈ 142k / 271k cells

6 consensus LLMs
per text, per feature

Ready

Evaluation data is 100% complete (cal / val / test done) — the inputs for most UQ experiments are in hand.

Blocker cleared

On 06-04 the train extraction hit the weekly quota cap at ≈3% and paused; the quota reset on 06-07 at 17:08 → resumed, train now running (ETA ~21h), 0 failures.

train is the prerequisite for retraining; fully resumable (cross-split cache keyed by text-hash) — a pause loses no progress. Moving kimi to the KCL gateway routed around its old hang and sped throughput ~2.8×.

Workstream ① · Engineering

Engineering that made ≈271k LLM calls manageable

Measure	Effect
Cross-split global cache	Dedup by text-hash and share across the 4 splits → 355k → 271k calls (≈ −23.5%; EPC text is highly repetitive).
64-way parallel + cache reuse	Reuses the 736.9M-token phase11b cache; mimo is a pure incremental top-up — the 5 cached LLMs are not re-run.
skip-list	Some LLMs hang on specific texts (kimi thinking-loop, ≈0.01%) → skip and use partial consensus, never blocking the pipeline.
quota_supervisor watchdog	Auto-pauses on the ≈5h short-window cap and auto-resumes on refresh — runs independently, with no manual supervision.

Low-profile but essential plumbing: it guarantees data quality and reproducibility, and turns a 4–5 day extraction into something controllable and resumable.
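The cross-split cache idea can be sketched like this. The class, the whitespace normalisation, and the in-memory dict are assumptions made for illustration; the real cache is persistent (which is what makes the run resumable) and its exact keying is only described as "text-hash".

```python
import hashlib

def text_key(text):
    """Stable key: SHA-256 of the whitespace-normalised text (assumed normalisation)."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

class ExtractionCache:
    """One entry per (LLM, text-hash), shared by all splits."""

    def __init__(self):
        self._store = {}   # stand-in for an on-disk store
        self.calls = 0     # number of real LLM calls made

    def get_or_extract(self, llm_name, text, extract_fn):
        key = (llm_name, text_key(text))
        if key not in self._store:
            self.calls += 1
            self._store[key] = extract_fn(text)
        return self._store[key]

def fake_extract(text):
    """Placeholder for an LLM call."""
    return {"n_words": len(text.split())}

splits = {
    "cal": ["Solid brick walls, no insulation", "Gas boiler and radiators"],
    "test": ["Gas boiler  and radiators", "Double glazing throughout"],
}
cache = ExtractionCache()
for rows in splits.values():
    for text in rows:
        cache.get_or_extract("mimo", text, fake_extract)
print(cache.calls)
```

Here four rows trigger only three calls because one text repeats across splits; the same mechanism lets a paused run restart without redoing finished texts.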

Workstream ② · Thesis hardening

Seven improvements, each targeting a viva weak spot

Viva-exposable weak spot	Improvement
UQ-Spearman ≈ 0.24 (uncertainty weakly correlated with true error)	CQR (headline) · NLL split
GMM router dead signal (posterior 99.7% = 1)	Mahalanobis OOD
latent K=2 clustering degenerates	property_age primary
semantic-substitution min/max intervals unstable	10–90 percentile band
too few seeds for statistical power	seeds → 10
“four-method race” compares unlike quantities	reframe by role (writing only)

Status: all code landed, all unit tests pass, default-off / config-on → activates with the Path-B retrain. Note: this set targets uncertainty usability, not accuracy (LLM features proven IID-neutral).

Workstream ② · Headline improvement

CQR: targeting the UQ-Spearman weakness (0.24)

Structural root cause

Global conformal assigns every sample the same interval width, so it cannot rank by per-sample difficulty by construction — a low Spearman is the necessary consequence, not an accident.
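A small split-conformal sketch makes the constant-width property visible. It uses absolute-residual scores and the standard finite-sample quantile; the data are synthetic and the function names are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def split_conformal(cal_pred, cal_true, test_pred, alpha=0.10):
    """Global split conformal: one residual quantile q, interval = pred +/- q."""
    scores = [abs(y - p) for p, y in zip(cal_pred, cal_true)]
    q = conformal_quantile(scores, alpha)
    return q, [(p - q, p + q) for p in test_pred]

# Synthetic calibration set: 19 points with absolute residuals 1..19.
cal_pred = [50.0] * 19
cal_true = [50.0 + r for r in range(1, 20)]
q, intervals = split_conformal(cal_pred, cal_true, test_pred=[30.0, 55.0, 80.0])
widths = [hi - lo for lo, hi in intervals]
print(q, widths)
```

Every test interval has width 2q regardless of the input, so the width carries no information for ranking samples by difficulty.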

How CQR works

The model self-reports per-sample quantiles (0.05 / 0.5 / 0.95); the gap between the outer two is its proposed per-sample width.
Conformal audits and corrects them on a held-out set → adaptive width with a retained coverage guarantee.

Analogy: a candidate’s self-rated “Python 9/10” cannot be trusted blindly; a standardised test finds everyone inflates by 3 → subtract 3 across the board. The model proposes; conformal calibrates and decides.
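The calibration step in the analogy can be sketched as standard CQR conformalisation. The quantile predictions below are synthetic stand-ins for the quantile head's outputs, and the function names are hypothetical.

```python
import math

def cqr_correction(cal_lo, cal_hi, cal_true, alpha=0.10):
    """Conformity score = how far the truth falls outside [lo, hi] (negative if inside)."""
    scores = [max(lo - y, y - hi) for lo, hi, y in zip(cal_lo, cal_hi, cal_true)]
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else sorted(scores)[rank - 1]

def cqr_interval(lo, hi, correction):
    """Apply the same additive correction to both ends of a per-sample band."""
    return (lo - correction, hi + correction)

# Synthetic calibration set: an over-confident model proposes [48, 52] each time,
# while the truth ranges over 41..59.
cal_lo, cal_hi = [48.0] * 19, [52.0] * 19
cal_true = [50.0 + d for d in range(-9, 10)]
correction = cqr_correction(cal_lo, cal_hi, cal_true)

easy = cqr_interval(30.0, 34.0, correction)   # model proposes a narrow band
hard = cqr_interval(60.0, 80.0, correction)   # model proposes a wide band
print(correction, easy, hard)
```

The correction is shared across samples (the "subtract 3 across the board" step), but the final widths still differ because each sample starts from its own proposed band.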

Directional evidence (re-verify after retrain)

On a single-LLM, single-seed demo model, global CQR lifts UQ-Spearman 0.24 → 0.36 (SAP 0.40 / EI 0.32), ~+50% relative.
Counter-intuitive: making CQR cluster-conditional by age adds nothing (≈flat) → clustering's job is per-cluster coverage; the per-sample head carries difficulty ranking.

Proof-of-direction: 1-LLM single-seed, not final; recompute after the 6-LLM / 10-seed retrain (absolute values will shift; the trend is the signal). Implementation = one quantile head, baked into the retrain.

Workstream ② · Remaining improvements

Hardening the results that were less robust

NLL head

Explicitly separates aleatoric (inherent data noise) from epistemic (model uncertainty, reducible with more data or training) → directly explains “why only 0.24”: the residual is aleatoric-dominated, so more ensemble members do not help.
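One common way to obtain this split from an ensemble of Gaussian NLL heads is sketched below: aleatoric is the average predicted variance, epistemic is the spread of the member means. Whether the thesis code uses exactly this decomposition is an assumption, and the numbers are illustrative, not results.

```python
from statistics import fmean, pvariance

def decompose_uncertainty(member_means, member_variances):
    """Ensemble-of-Gaussians decomposition for a single record.

    aleatoric = mean of the members' predicted variances (data noise)
    epistemic = variance of the members' predicted means (model disagreement)
    """
    aleatoric = fmean(member_variances)
    epistemic = pvariance(member_means)
    return aleatoric, epistemic

# Illustrative outputs of 5 ensemble members for one property (SAP points).
means = [62.0, 62.4, 61.8, 62.2, 61.6]
variances = [25.0, 24.0, 26.0, 25.5, 24.5]
aleatoric, epistemic = decompose_uncertainty(means, variances)
print(aleatoric, epistemic)
```

When the members agree on the mean but all report a large variance, as in this toy case, adding more members shrinks nothing: the remaining uncertainty is in the data.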

Mahalanobis OOD

Measures “how far from the training distribution” in the fused latent z_fused, replacing the dead GMM router. Analogy: a guard judges whether the whole pattern is anomalous, not whether one dimension is high.
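A minimal NumPy sketch of the Mahalanobis check on synthetic 2-d latents. The ridge term, the 99th-percentile threshold, and the variable names are assumptions; the real check runs on z_fused from the frozen predictor.

```python
import numpy as np

def fit_reference(z_train, ridge=1e-6):
    """Mean and inverse covariance of the training latents (ridge keeps it invertible)."""
    mean = z_train.mean(axis=0)
    cov = np.cov(z_train, rowvar=False) + ridge * np.eye(z_train.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis(z, mean, precision):
    """Row-wise Mahalanobis distance of z to the training distribution."""
    diff = np.atleast_2d(z) - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, precision, diff))

# Synthetic latents: the two dimensions are strongly correlated.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
z_train = np.column_stack([x, x + 0.1 * rng.normal(size=2000)])

mean, precision = fit_reference(z_train)
threshold = np.quantile(mahalanobis(z_train, mean, precision), 0.99)  # assumed cutoff

# Both points have ordinary values in each dimension taken separately;
# only the second breaks the correlation pattern.
d_on, d_off = mahalanobis(np.array([[1.5, 1.5], [1.5, -1.5]]), mean, precision)
print(float(d_on) < threshold, float(d_off) > threshold)
```

This is the "guard" from the analogy: neither coordinate of the second point is extreme on its own, yet the combination is far from anything seen in training.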

band: semantic substitution → 10–90 percentile band
cluster: property_age primary (K=8, width heterogeneity 1.93×)
stats: seeds → 10 (Wilcoxon power)
writing: reframe four methods by role
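The cluster-conditional (Mondrian) variant calibrates one residual quantile per cluster instead of one global value. A sketch with two hypothetical age bands and synthetic residuals:

```python
import math
from collections import defaultdict

def mondrian_quantiles(cal_groups, cal_pred, cal_true, alpha=0.10):
    """One finite-sample conformal quantile per cluster label."""
    by_group = defaultdict(list)
    for g, p, y in zip(cal_groups, cal_pred, cal_true):
        by_group[g].append(abs(y - p))
    quantiles = {}
    for g, scores in by_group.items():
        n = len(scores)
        rank = math.ceil((n + 1) * (1 - alpha))
        quantiles[g] = math.inf if rank > n else sorted(scores)[rank - 1]
    return quantiles

def mondrian_interval(pred, group, quantiles):
    """Interval half-width is looked up from the record's own cluster."""
    q = quantiles[group]
    return (pred - q, pred + q)

# Synthetic calibration data: older stock is harder to predict (larger residuals).
groups = ["pre-1900"] * 19 + ["post-2000"] * 19   # hypothetical band labels
preds = [50.0] * 38
truth = [50.0 + 2 * r for r in range(1, 20)] + [50.0 + r for r in range(1, 20)]
quantiles = mondrian_quantiles(groups, preds, truth)
print(quantiles, mondrian_interval(60.0, "post-2000", quantiles))
```

Coverage then holds within each cluster rather than only on average, at the cost of needing enough calibration rows per cluster.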

Connecting thread: external audit beats self-report — conformal / CQR / Mahalanobis all audit the model on held-out data it never trained on, rather than trusting its self-reported confidence.

Workstream ② · New batch

Five post-hoc UQ methods: comparison breadth + hardening

All 5 are post-hoc: they wrap the frozen predictor, no retrain, no API, running on the same retrained weights — decoupled from Path B.

Proper scoring (CRPS / NLL / Winkler) — a shared yardstick rewarding sharpness AND calibration, not just MAE.

Worst-slice coverage — bin by predicted score, check the 90% interval doesn't secretly under-cover a slice.

Conformal Risk Control — lifts the guarantee from interval width to which energy band was assigned.

Last-layer Laplace — a Bayesian posterior over the final layer → a third per-sample epistemic column.

Calibrated regression (Kuleshov isotonic) — recalibrates the NLL Gaussian as a named comparison baseline, showing why conformal / CQR's finite-sample guarantee is preferable.
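Two of the five checks are simple enough to sketch directly: the Winkler interval score and per-slice coverage. Bin edges and data below are hypothetical; the CRPS, Laplace, and isotonic pieces are not shown.

```python
from bisect import bisect_right
from collections import defaultdict

def winkler_score(lo, hi, y, alpha=0.10):
    """Interval width plus a (2 / alpha)-scaled penalty for any miss; lower is better."""
    width = hi - lo
    if y < lo:
        return width + (2.0 / alpha) * (lo - y)
    if y > hi:
        return width + (2.0 / alpha) * (y - hi)
    return width

def slice_coverage(preds, intervals, ys, edges):
    """Empirical coverage per predicted-score bin (bin index via the sorted edges)."""
    hits = defaultdict(list)
    for p, (lo, hi), y in zip(preds, intervals, ys):
        hits[bisect_right(edges, p)].append(lo <= y <= hi)
    return {b: sum(h) / len(h) for b, h in sorted(hits.items())}

preds = [30.0, 35.0, 50.0, 55.0, 60.0, 80.0, 85.0]
intervals = [(p - 10.0, p + 10.0) for p in preds]
ys = [32.0, 50.0, 52.0, 50.0, 65.0, 82.0, 84.0]

coverage = slice_coverage(preds, intervals, ys, edges=[40.0, 70.0])
worst = min(coverage.values())
print(winkler_score(40.0, 60.0, 65.0), coverage, worst)
```

In this toy data overall coverage is 6/7, yet the lowest-score bin covers only half its records, which is exactly the kind of hidden under-coverage the worst-slice check is meant to surface.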

Code + unit tests all in (pytest 99 passed); real-data numbers pending the retrain. None of these lifts the ~0.24 aleatoric ceiling — CQR (0.36) is still the only lever; they add comparison breadth + harden contributions ①②, not accuracy.

Workstream ③ · Current results

Current key results (pre-retrain; will change after retrain)

0.749

baseline mean R²
(MAE 4.38)

+88.7%

multimodal vs tab-only
R² (0.398 → 0.750)

+0.07pp

LLM-feature IID gain
≈ 0 (reported honestly)

+18.5pp

robustness advantage
(less drift degradation)

≈ 90%

conformal coverage
(guaranteed)

1.93×

Mondrian width heterogeneity
(property_age K=8)

2.4×

triage MAE gradient
Tier1 2.84 → Tier3 6.70

0.872

consensus Fleiss κ
(will shift at 6-LLM)

Modality gating α: tabular 0.87 / text 0.10 / spatial 0.03 — the gate faithfully reflects information redundancy across modalities; an insight, not a defect.

Workstream ③ · Honest positioning

Honest positioning: the viva red line

Must not claim

“LLM features improve accuracy” — they do not (IID +0.07pp, statistically zero).
“We built a better model” — the model is the advisor’s and is frozen.
“Robustness verified on real consumer text” — it is synthetic drift (LLM-rewritten).

Can claim

Architecture: multimodal over raw text beats “compress text → categorical” — tab_llm_only recovers only 94.2% of no_llm.
LLM value is robustness, not accuracy: 18.5pp less degradation under drift.
Calibration is real: conformal has a coverage guarantee (≈90%).
Deployability = triage: silent errors become honest abstentions.

Negative results, reported as science: MoE failed 3× (which led to Mondrian), LLM IID redundancy, UQ-Spearman 0.24 (the aleatoric ceiling) — all findings, not flaws.

Workstream ③ · Reviewer response

Responding to reviewer concerns

Concern	Response
① The 20 features overlap with the raw text	Agreed — the IID data confirms it (+0.07pp). Argument: complementary under drift; the value is robustness.
② Features are a pre-processing artefact / add user burden; why not just normalise the text?	Valid → the normalize_vs_extract experiment (extract-features vs normalise-text, same drift, compare degradation) is implemented and answers it directly.
③ UQ should correlate with (prediction − truth), not just model/input perturbation	Agreed. Ensemble/perturbation spread captures epistemic uncertainty only; conformal is the part calibrated against true error (coverage). The low Spearman reflects a large aleatoric component, and the principled fix is CQR. Early directional signal (1-LLM single-seed demo): 0.24 → 0.36, ~+50% relative (proof-of-direction; re-verify after retrain).

These three directly shaped the improvements: CQR / NLL answer ③, the controlled experiment answers ②, the robustness narrative answers ①.

Next steps · Timeline

Next steps and timeline

Stage	Content
① Resume extraction after quota refresh	Finish train extraction (≈2.5 days at full speed, plus quota pauses).
② One-shot retrain	Trigger retrain: CQR / NLL new heads and 10 seeds baked in together (deferred via `--skip-retrain`, fired once after train completes — saving a full training round).
③ Re-run UQ	Re-run phase4b (CQR) and phases 8 / 9 / 10 / 12 / 13 → refresh every number and verify that CQR stays well above 0.24.
④ Write up	Once numbers refresh, write to the THESIS_WRITING_GUIDE spine; submit August.

Compute: KCL CREATE HPC verified usable (A100-40GB); the Mac handles LLM/API calls, the cluster handles GPU retraining. The retrain is packaged as a one-command Slurm array (16 trainings, 8 at a time, so ~2 waves, with UQ chained automatically afterwards). Estimated wall-clock drops from ~25h to ~5h (same total GPU-hours, just parallelised). Scripts are ready but not yet run on CREATE.

Risks · Blockers

Current blockers and risks

Main blocker: the API quota is a multi-day hard wall

Weekly/monthly quotas take days to refresh (the ≈5h short window merely cycles and is ridden out automatically by the watchdog). We have hit the weekly cap → wait for refresh, or swap to a backup key to resume.

Timeline impact

train ≈2.5 days at full speed; retrain + UQ on the cluster ≈1–2 days.
Refreshed numbers expected within roughly a week of resuming.

Mitigations already in place

Fully resumable (cross-split cache); a pause loses no progress.
Watchdog handles the short window; eval data (cal / val / test) already complete.

Reminder: back up the LLM cache off-NAS once extraction finishes (the NAS failed once before).

Summary

Engineering: Path B extraction past halfway, eval data ready, blocked on quota (resumable).
Methods: all seven improvements coded, awaiting the retrain to bake in.

Positioning: the contribution is the framework, not the model; LLM value is robustness, not accuracy.
Next: resume → retrain → refresh numbers → write up (submit August).

Methods and engineering are in place; only a single retrain remains to refresh the results.

Thank you · Questions welcome