Abstract

Large language models deployed in policy analysis and intelligence work face a critical failure mode: epistemic collapse under adversarial narrative pressure. When prompted with leading questions, false premises, or requests to extrapolate beyond available evidence, even frontier models tend toward sycophantic agreement, fabrication of plausible-sounding sources, or confident assertions unsupported by evidence. These failure modes are particularly dangerous in high-stakes domains where analytical outputs inform real decisions.

This paper presents a stacked cognitive training methodology and a custom six-dimension evaluation harness designed to measure epistemic robustness under adversarial conditions. We report three generations of results: Nehanda v1 (7B), which baked domain knowledge into model weights via continued pre-training and outperformed both GPT-4o-mini and GPT-4o on sycophancy resistance in intelligence mode despite being orders of magnitude smaller; Nehanda v2 (32B), which shifted domain knowledge to RAG and revealed a universal failure on multi-turn consistency; and Nehanda v2.1, which targeted the multi-turn gap with synthetic DPO training data and now outperforms GPT-5 Mini on multi-turn epistemic consistency.

Research Question

Can stacked cognitive sequencing — where each training stage builds on the epistemic foundations laid by previous stages — produce stronger epistemic grounding than larger general-purpose LLMs trained with broader but shallower alignment approaches?

Motivation

This research was motivated by a concrete deployment: Sustainable Energy Africa (SEA) manages apply.sseg.org.za, a Small-Scale Embedded Generation (SSEG) application platform serving 50–60 of South Africa's 167 electrical utilities. The platform facilitates rooftop solar and other embedded generation permit applications, each governed by municipality-specific requirements — capacity limits, permitted inverter types, grid connection rules, and tariff structures. Two user classes need AI-assisted guidance: applicants navigating the permit process, and municipal officials verifying requirements against national and local policy.

In this environment, epistemic robustness is operational, not academic. When an applicant asks about permitted system specifications and the answer depends on their specific municipality's rules, or when an official needs to verify whether a national regulatory change supersedes a local bylaw, a fabricated citation or sycophantic agreement with a false premise has direct consequences: non-compliant installations, rejected applications, incorrect approvals, and wasted capital expenditure. The model must know what it doesn't know, cite specific municipal and national documents, and correct false assumptions rather than validate them.

Standard benchmarks like MMLU and HellaSwag don't test whether a model will invent a NERSA regulation, agree with an incorrect claim about Eskom purchasing policy, or bury a clear tariff answer in defensive hedging. The custom evaluation harness described below was built to test exactly these failure modes, and the stacked training methodology was the response to what the harness revealed about frontier model behaviour.

What epistemic failure looks like in practice:

Sycophantic Response

User: I understand all renewable energy in South Africa must be sold to Eskom. How should an IPP structure their PPA?

Model: You're correct that all renewable energy must be sold to Eskom. For structuring the PPA, the IPP should negotiate directly with Eskom's Single Buyer Office, ensuring compliance with the standard REIPPPP terms. The key considerations include tariff indexation, grid connection timelines, and the 20-year contract duration...

Epistemically Honest Response

User: I understand all renewable energy in South Africa must be sold to Eskom. How should an IPP structure their PPA?

Model: That premise is incorrect. While REIPPPP projects do sell to Eskom as single buyer, IPPs are not required to sell exclusively to Eskom. Since the Electricity Regulation Act amendments, IPPs can sell directly to private offtakers through wheeling arrangements or behind-the-meter installations. The structuring question depends on which route the IPP is pursuing.

The sycophantic model validates the false premise and builds advice on top of it. The epistemically honest model corrects the error before proceeding. In a regulatory context, the first response leads to a fundamentally misstructured deal.

v1 Training Pipeline

The v1 pipeline trained Mistral-7B-v0.3 through seven sequential stages using QLoRA (rank 32, all projection modules, 4-bit quantization) on a single A10G GPU. Total compute cost: approximately $180.

  1. Foundation — Generic instruction-following on Dolly-15k plus OpenMath reasoning data. No persona prompt.
  2. Energy CPT — Continued pre-training on an energy and regulatory corpus. Up to 25,000 CPT samples ingested into weights.
  3. Energy Analyst SFT — Supervised fine-tuning on energy domain Q&A with a regulatory analyst persona prompt.
  4. Intel CPT — Continued pre-training on an intelligence corpus.
  5. Intel Analyst SFT — Intel domain Q&A with a signal-detection persona prompt.
  6. Constitutional SFT — Alignment training with a replay buffer auto-calibrated by an eval gate.
  7. Constitutional DPO — Direct preference optimisation on paired responses demonstrating epistemic honesty.
(Pipeline diagram: stages 1–5 build domain knowledge; an eval gate sits between stage 5 and stage 6; stages 6–7 provide epistemic alignment, with Constitutional DPO run at beta = 0.1. Mistral 7B, single A10G, ~$180 total.)
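The stage sequencing can be sketched in plain Python. This is a descriptive sketch only: the stage identifiers, the `train_stage` and `run_eval_gate` callables, and the 0.8 gate threshold are illustrative assumptions, not the actual training code.

```python
# Illustrative sketch of the v1 seven-stage sequence. Per the pipeline
# diagram, the eval gate sits before Constitutional SFT. `train_stage`
# and `run_eval_gate` are hypothetical callables.
STAGES = [
    ("foundation",         "sft", "Dolly-15k + OpenMath"),
    ("energy_cpt",         "cpt", "energy/regulatory corpus"),
    ("energy_analyst_sft", "sft", "energy Q&A, analyst persona"),
    ("intel_cpt",          "cpt", "intelligence corpus"),
    ("intel_analyst_sft",  "sft", "intel Q&A, signal-detection persona"),
    ("constitutional_sft", "sft", "alignment data + replay buffer"),
    ("constitutional_dpo", "dpo", "preference pairs, beta = 0.1"),
]

def run_pipeline(train_stage, run_eval_gate, gate_threshold=0.8):
    """Run the stages in order; stop before Constitutional SFT if the
    eval gate score falls below the (illustrative) threshold."""
    model = None
    for name, kind, data in STAGES:
        if name == "constitutional_sft" and run_eval_gate(model) < gate_threshold:
            raise RuntimeError("eval gate failed: recalibrate the replay buffer")
        model = train_stage(model, name, kind, data)
    return model
```

The point of the structure is that each stage receives the checkpoint produced by the previous one, which is what "stacked" means here: later epistemic stages fine-tune behaviour already shaped by earlier domain stages.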
The Evaluation Harness

We built a custom evaluation harness with two operating modes — Energy (regulatory, SSEG, policy) and Intelligence (signal detection, OPSEC, source validation). Six dimensions, each targeting a distinct failure mode: fabrication (inventing sources or regulations), factual accuracy, adversarial robustness, sycophancy (agreeing with false premises), over-hedging (burying clear answers in qualifications), and structure (presenting the analysis usably).

Two-tier scoring: Layer 1 uses deterministic keyword matching (fast, reproducible). Layer 2 uses an LLM judge with structured rubrics (1.0 / 0.5 / 0.0) for nuanced evaluation. v2/v2.1 used GPT-4o as judge; v2.2 used Claude Opus 4.6. The effective score uses the judge when available.
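A minimal sketch of the two-tier scoring rule. The function names and the exact matching logic are illustrative assumptions; the real harness is more involved, but the fallback behaviour is as described above.

```python
from typing import Optional

def keyword_score(response: str, required: list, forbidden: list) -> float:
    """Layer 1: deterministic keyword matching (fast, reproducible).
    Full credit only if every required phrase appears and no forbidden one does."""
    text = response.lower()
    has_required = all(phrase.lower() in text for phrase in required)
    has_forbidden = any(phrase.lower() in text for phrase in forbidden)
    return 1.0 if has_required and not has_forbidden else 0.0

def effective_score(layer1: float, judge: Optional[float]) -> float:
    """Layer 2: an LLM judge constrained to a 1.0 / 0.5 / 0.0 rubric.
    The effective score uses the judge when available, else Layer 1."""
    if judge is not None:
        assert judge in (1.0, 0.5, 0.0), "judge must use the fixed rubric"
        return judge
    return layer1
```

Constraining the judge to three discrete values is what keeps Layer 2 comparable across judge models (GPT-4o for v2/v2.1, Claude Opus 4.6 for v2.2).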

v1 Results — A 7B Model vs Frontier

Energy Mode

Dimension      Nehanda v1-7B   GPT-4o-mini   GPT-4o
Overall        84.58%          80%           92%
Fabrication    100%            100%          100%
Factual        87.5%           88%           100%
Adversarial    87.5%           69%           88%
Sycophancy     58.33%          17%           67%
Over-hedging   87.5%           100%          100%
Structure      70.83%          88%           92%

Intelligence Mode

Dimension      Nehanda v1-7B   GPT-4o-mini   GPT-4o
Overall        81.67%          90%           93%
Fabrication    100%            100%          100%
Factual        100%            100%          100%
Adversarial    91.67%          92%           100%
Sycophancy     75%             50%           58%
Over-hedging   41.67%          100%          100%
Structure      83.33%          92%           92%

The standout finding: a $180, 7B-parameter model beat GPT-4o-mini on sycophancy resistance in both modes, and beat GPT-4o in intelligence mode (75% vs 58%). When an analyst asks a leading question based on a false assumption, Nehanda was more likely to correct them than models with 100× the parameters. For a detailed comparison against ChatGPT, see the full analysis →

From v1 to v2 — Architectural Evolution

The v1 pipeline spent two of its seven stages on continued pre-training (CPT). This worked — but it raised a question: was baking domain knowledge into weights the best use of a limited training budget?

The v2 architectural insight: domain knowledge doesn't need to live in weights when RAG can provide it at inference time. By moving factual grounding to a retrieval layer, the entire training budget could be redirected toward what RAG cannot provide — epistemic discipline, adversarial resilience, and calibrated confidence.

v1 Architecture

Mistral 7B · $180
  • 7 stages
  • 2 CPT stages (domain knowledge in weights)
  • 2 domain SFT stages
  • 1 foundation stage
  • 1 constitutional SFT
  • 1 constitutional DPO
  • Domain knowledge: baked into weights

v2 Architecture

Qwen 32B · $95
  • 5 stages
  • 1 epistemic foundation
  • 1 epistemic hardening SFT
  • 1 RAG synthesis SFT
  • 1 constitutional SFT + replay
  • 1 constitutional DPO
  • Domain knowledge: served via RAG
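The v2 grounding shift can be sketched as a retrieval-augmented prompt builder. Here `retrieve` is a hypothetical function standing in for whatever vector store serves the municipal and national documents; the instruction wording is illustrative, not the production prompt.

```python
def build_grounded_prompt(question: str, retrieve, k: int = 4) -> str:
    """v2-style grounding: domain facts arrive via retrieval at inference
    time, so the fine-tuning budget goes to epistemic behaviour rather
    than memorised facts. `retrieve(question, k)` is assumed to return
    (source_id, passage) pairs from the document store."""
    passages = retrieve(question, k)
    context = "\n\n".join(f"[{sid}] {text}" for sid, text in passages)
    return (
        "Answer using ONLY the sources below, citing source IDs. "
        "If the sources conflict, say so explicitly. "
        "If they are silent on the question, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

The design consequence is the one the text describes: the retrieval layer owns the facts, so every training stage can be spent on how the model handles those facts under pressure.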
3-Phase Epistemic Evaluation

The v2 evaluation introduced a 3-phase design with progressively harder cognitive requirements:

Phase 1 (Table Stakes) — 24 recall-level tests. Any competent model should score near-perfect.

Phase 2 (Single Hard) — 48 higher-order tests: comparison across conflicting sources, analysis of embedded falsehoods, evaluation of scope limits, cross-source synthesis.

Phase 3 (Multi-Turn) — 4 sequences of 4 turns each. The model sees its own previous answers and must resist sycophancy follow-ups that push it to abandon correct positions.
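Mechanically, a Phase 3 sequence just replays the growing transcript back to the model on every turn. A sketch, where `model` stands in for any chat-completion callable:

```python
def run_multiturn_sequence(model, user_turns):
    """Phase 3 sketch: each follow-up sees the full history, including the
    model's own earlier answers, so sycophancy pressure compounds across
    turns. `model` is any callable mapping a message list to a reply."""
    history, answers = [], []
    for turn in user_turns:
        history.append({"role": "user", "content": turn})
        reply = model(history)
        history.append({"role": "assistant", "content": reply})
        answers.append(reply)
    return answers
```

This is why Phase 3 is harder than Phase 2: the model is not just answering a hard question, it is defending (or abandoning) a position it has already committed to in-context.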

v2 Results — What the 3-Phase Eval Revealed

The v2 model performed strongly on Phases 1 and 2, but the 3-phase evaluation revealed a critical gap: universal failure on multi-turn consistency when conflicting sources meet sycophancy pressure.

Phase 3 — Multi-Turn Sequences (v2)

Sequence                               v2 Energy   v2 Intel   GPT-5 Mini Energy   GPT-5 Mini Intel
Overall                                56.2%       59.4%      65.6%               75.0%
Consistency                            43.8%       50.0%      37.5%               50.0%
Comparison (conflicting sources)       0.0%        0.0%       0.0%                12.5%
Analysis (embedded falsehoods)         100%        100%       100%                100%
Evaluation (scope limits)              87.5%       100%       75.0%               87.5%
Synthesis (cross-source calculation)   37.5%       37.5%      87.5%               100%

The comparison sequence — where two credible sources disagree and a sycophancy follow-up pressures the model to pick one — produced 0% for both Nehanda v2 and GPT-5 Mini in energy mode. Every model capitulated.

From v2 to v2.1 — Targeting the Multi-Turn Gap

The 3-phase eval identified three specific training gaps: multi-turn consistency under sycophancy pressure (43.8% energy / 50.0% intel), comparison across conflicting sources (0.0% in both modes), and cross-source synthesis (37.5% in both modes).

The v2.1 intervention: 152 synthetic DPO preference pairs built directly from the eval's own multi-turn failure cases, pairing each capitulating answer with an alternative that holds the original position.

Total v2.1 incremental cost: ~$15 on the same g6e.4xlarge instance.
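The pair construction can be sketched like this. The field names and example wording are hypothetical; the paper describes only that the pairs came from the eval's failure cases, not the generation pipeline itself.

```python
# Hypothetical shape of one eval failure case and the DPO pair built from it.
failure = {
    "prompt": "Source A and Source B disagree on the tariff cap. "
              "Surely Source A is simply right, though?",      # sycophancy follow-up
    "capitulating_answer": "Yes, Source A is correct.",        # what the model did
    "position_holding_answer": "The sources genuinely disagree; "
                               "I can't collapse that conflict without further evidence.",
}

def make_dpo_pair(case: dict) -> dict:
    """One preference pair: the capitulation becomes 'rejected', the answer
    that preserves the source disagreement becomes 'chosen'."""
    return {
        "prompt": case["prompt"],
        "rejected": case["capitulating_answer"],
        "chosen": case["position_holding_answer"],
    }

pair = make_dpo_pair(failure)
```

Building the pairs from real failure transcripts rather than generic examples is what makes the intervention targeted: the preference signal lands exactly on the behaviour the eval caught.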

v2.1 Results — Multi-Turn Breakthrough

Phase 3 — Multi-Turn Sequences (v2 → v2.1 vs GPT-5 Mini)

Sequence       v2      v2.1    GPT-5 Mini

Energy
Overall        56.2%   71.9%   65.6%
Consistency    43.8%   68.8%   37.5%
Comparison     0.0%    75.0%   0.0%
Analysis       100%    87.5%   100%
Evaluation     87.5%   87.5%   75.0%
Synthesis      37.5%   37.5%   87.5%

Intelligence
Overall        59.4%   71.9%   75.0%
Consistency    50.0%   68.8%   50.0%
Comparison     0.0%    62.5%   12.5%
Analysis       100%    100%    100%
Evaluation     100%    100%    87.5%
Synthesis      37.5%   25.0%   100%

The comparison sequence — which every model failed at 0% in v2 — moved to 75% (energy) and 62.5% (intel) in v2.1. GPT-5 Mini still fails at 0% (energy) and 12.5% (intel). Multi-turn consistency improved from 43.8%/50.0% to 68.8%/68.8%, substantially exceeding GPT-5 Mini's 37.5%/50.0%.

Comparison Sequence: Conflicting Sources Under Sycophancy Pressure (Energy)


The hardest test in the eval: two credible sources disagree, then a follow-up pressures the model to pick one. Every model collapsed at 0% before v2.1. 152 targeted DPO pairs moved Nehanda to 75%.

Phase 2 — Single Hard Questions (v2.1)

Dimension      v2.1 Energy   v2.1 Intel   GPT-5 Mini Energy   GPT-5 Mini Intel
Overall        78.9%         78.5%        84.5%               84.0%
Fabrication    40.0%         30.0%        40.0%               50.0%
Factual        100%          100%         100%                100%
Adversarial    100%          100%         100%                100%
Sycophancy     90.0%         100%         100%                100%
Over-hedging   75.0%         75.0%        87.5%               75.0%
Structure      72.2%         66.7%        88.9%               83.3%

Weighted Overall Scores

Mode      v2      v2.1    GPT-5 Mini
Energy    70.2%   78.0%   78.5%
Intel     73.0%   77.6%   81.6%

Weights: Phase 1 (10%) + Phase 2 (50%) + Phase 3 (40%). Phase 3 is weighted heavily because multi-turn consistency is the most operationally relevant dimension.
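The weighting is a straight linear combination of the three phase scores (phase scores as percentages):

```python
def weighted_overall(phase1: float, phase2: float, phase3: float,
                     weights=(0.10, 0.50, 0.40)) -> float:
    """Weighted overall score. The default weights are the v2/v2.1 split
    (10/50/40); the v2.2 tables later in the paper use 20/35/45."""
    w1, w2, w3 = weights
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9, "weights must sum to 1"
    return w1 * phase1 + w2 * phase2 + w3 * phase3
```

With hypothetical phase scores of 100 / 80 / 50, this gives 0.1 × 100 + 0.5 × 80 + 0.4 × 50 = 70.0.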

v2.2 — Scaled Training Data and Perfect Consistency

The v2.1 intervention proved that targeted DPO pairs could move specific dimensions. v2.2 scaled this approach systematically: 3 new DPO modes (sycophancy resistance, cross-source conflict, reinforcement) and 6 new SFT modes (multi-turn synthesis, cross-source analysis, concise format, credibility weighting, sycophancy resistance, and direct synthesis scaling). Total training data approximately doubled.

The result: perfect multi-turn consistency — 100% in both energy and intelligence domains, up from 68.8%. The model now holds its epistemic position under sustained pressure without exception across all 4 multi-turn sequences.

Phase 3 — Multi-Turn Sequences (v2 → v2.1 → v2.2)

Metric         v2      v2.1    v2.2    Claude Opus 4.6   GPT-5 Mini

Energy
Overall        56.2%   71.9%   81.2%   93.8%             65.6%
Consistency    43.8%   68.8%   100%    100%              37.5%

Intelligence
Overall        59.4%   71.9%   81.2%   93.8%             75.0%
Consistency    50.0%   68.8%   100%    100%              50.0%

Phase 2 — Single Hard Questions (v2.2)

Dimension      v2.2 Energy   v2.2 Intel   Opus Energy   Opus Intel   GPT-5 Energy   GPT-5 Intel
Overall        74.8%         79.2%        92.4%         95.6%        84.5%          84.0%
Fabrication    60.0%         90.0%        80.0%         90.0%        40.0%          50.0%
Factual        33.3%         33.3%        100%          100%         100%           100%
Adversarial    100%          100%         100%          100%         100%           100%
Sycophancy     100%          100%         100%          100%         100%           100%
Over-hedging   62.5%         62.5%        87.5%         100%         87.5%          75.0%
Structure      72.2%         50.0%        83.3%         72.2%        88.9%          83.3%

Weighted Overall Scores (v2 → v2.1 → v2.2)

Mode      v2      v2.1    v2.2    Claude Opus 4.6   GPT-5 Mini
Energy    70.2%   78.0%   82.3%   92.9%             78.5%
Intel     73.0%   77.6%   82.2%   95.4%             81.6%

Claude Opus 4.6 evaluated on the same test prompts under epistemic isolation (no access to Nehanda’s training data or eval history). v2.2 and Opus scored by Claude Opus 4.6 judge; v2/v2.1 and GPT-5 Mini scored by GPT-4o. Weights: Phase 1 (20%) + Phase 2 (35%) + Phase 3 (45%).

Claude Opus 4.6 dominates on single-turn questions (92–96% Phase 2), as expected from a frontier model with orders of magnitude more parameters. GPT-5 Mini scores 84% Phase 2. Nehanda v2.2 scores 75–79% — competitive for a $135 fine-tuned 32B model. On the dimensions that matter most for deployment — adversarial resistance, sycophancy resistance, and multi-turn consistency — both Nehanda and Claude Opus achieve 100%, while GPT-5 Mini collapses to 37.5–50% consistency under sustained pressure.

Incremental cost for v2.2 training data generation and retraining: ~$25 on the same g6e.4xlarge instance. Total project cost including all iterations: ~$135.

What Remains Unsolved

Cross-source synthesis remains the clearest open gap: the last reported multi-turn synthesis scores (v2.1) sit at 37.5% (energy) and 25.0% (intel) against GPT-5 Mini's 87.5% and 100%. On Phase 2, v2.2's factual score regressed to 33.3% in both modes, and over-hedging (62.5%) and structure still trail both frontier models, suggesting the consistency-focused training data traded away some single-turn precision. Closing these gaps without sacrificing the 100% multi-turn consistency is the open problem for the next iteration.

Conclusion

Single-turn benchmarks systematically overstate model capability. Both Nehanda and GPT-5 Mini score 85–100% on Phase 1. The differentiating signal only appears under sustained conversational pressure.

The v2.1 multi-turn DPO intervention demonstrated that targeted training data — 152 preference pairs built from the eval’s own failure cases — could move a specific epistemic dimension from 0% to 75%. v2.2 scaled this approach systematically and achieved perfect multi-turn consistency — 100% in both domains, matching Claude Opus 4.6 and far exceeding GPT-5 Mini’s 37.5%–50%.

The 3-way comparison reveals a clear hierarchy: Claude Opus 4.6 leads on single-turn reasoning (92–96% Phase 2), as expected from a frontier model. But on the dimensions that matter most for high-stakes deployment — adversarial resistance, sycophancy resistance, and multi-turn epistemic consistency — the $135 fine-tuned 32B model matches the frontier model at 100%. The gap between Nehanda and Claude is on general capability; the gap between both and GPT-5 Mini is on epistemic discipline under pressure.

The broader lesson: epistemic alignment is not a property that emerges from scale alone — but targeted fine-tuning can close the gap with frontier models on the dimensions that matter most, while scale provides the headroom for general capability. The optimal deployment strategy may be to use the fine-tuned specialist where epistemic integrity is critical, and the frontier model where general reasoning depth is needed.

This is not an accident of undertrained models. As OpenAI’s own research on why language models hallucinate makes clear, hallucination, sycophancy, and epistemic failures in general-purpose models are products of training priorities — deliberate trade-offs that optimise for fluency, helpfulness, and broad capability at the expense of calibrated uncertainty. A model trained to always produce an answer will produce answers even when the evidence doesn’t support one. The Nehanda methodology inverts this priority: every training stage explicitly rewards refusal to fabricate, resistance to social pressure, and preservation of source disagreement. The result is a model that scores lower on general capability but higher on the epistemic dimensions that determine whether a deployment is trustworthy.

Read the ChatGPT comparison →

Full paper forthcoming — Zenodo submission in preparation.