CLINICAL AI · April 2026 · 3 min read

Reframe.ai

A clinical decision support tool designed to catch one failure mode of LLMs: being too agreeable to disagree.

medicine · technology
1st · Claude x LSE Hackathon 2026
5 · Claude agents in parallel
84 · Sycophancy score on demo case

Problem

LLMs are trained to be agreeable. In clinical settings a sycophantic model validates a doctor's framing rather than reasoning from the evidence, amplifying anchoring bias, which contributes to up to 75% of diagnostic errors.

Approach

Built a five-agent system with a blinded adjudicator. A sycophancy monitor scores each turn 0 to 100 across five behavioural signals. Above 70 it triggers a structured debate between two hypothesis agents, adjudicated by a judge that never sees the clinician's original framing.

Outcome

Won first place at the Claude x LSE Hackathon 2026. The live demo caught a Type A Aortic Dissection that a standard LLM missed by agreeing with an incorrect STEMI framing. The monitor scored that turn 84/100.

Full Article

LLMs are trained to be agreeable. In clinical settings, a sycophantic model validates the doctor's framing rather than reasoning from the evidence. Sharma et al. (ICLR 2024) documented sycophancy in language models. In 2025, SycEval measured sycophantic behaviour in roughly 58% of cases across GPT-4o, Claude, and Gemini on medical Q&A. A 2025 npj Digital Medicine paper found compliance rates of up to 100% on illogical medical requests.

Anchoring bias already contributes to up to 75% of diagnostic errors in internal medicine. A sycophantic AI does not introduce a new failure mode. It amplifies one that clinicians already train to avoid.

01 / Architecture

System prompts alone do not fix this. Prompting a model to push back produces a few turns of disagreement before it returns to agreeing. Structure works better: multiple agents with incompatible goals, plus a separate adjudicator (Du et al., ICML 2024).

Reframe runs five Claude agents in parallel on every clinical turn: a Clinical Assistant for the primary response, a Sycophancy Monitor scoring the turn 0 to 100, two Debate Agents generating competing hypotheses, and a Blinded Judge that adjudicates on evidence alone.

The key move is blinding. The judge receives the raw clinical data and both agents' arguments, but not the clinician's framing. Mamede et al. (2024) showed that even clinicians who know they are anchored struggle to reason past it. Withholding the framing removes the anchor from the decision step.
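A minimal sketch of the blinding step, under stated assumptions: the class names, field names, and prompt wording below are hypothetical and not taken from the Reframe codebase. The idea it illustrates is that the judge's input type has no field for the clinician's framing, so the framing cannot reach the decision step by accident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClinicalTurn:
    """One clinician turn. Field names are illustrative assumptions."""
    clinical_data: str      # raw findings: history, vitals, exam, investigations
    clinician_framing: str  # the clinician's working diagnosis or request


@dataclass(frozen=True)
class JudgeInput:
    """Everything the blinded judge may see. There is no framing field."""
    clinical_data: str
    argument_a: str
    argument_b: str


def build_judge_input(turn: ClinicalTurn, argument_a: str, argument_b: str) -> JudgeInput:
    # Copy only the raw data and the two arguments; the framing is never read.
    return JudgeInput(
        clinical_data=turn.clinical_data,
        argument_a=argument_a,
        argument_b=argument_b,
    )


def render_judge_prompt(judge_input: JudgeInput) -> str:
    # Hypothetical prompt layout; the real prompt is not given in this article.
    return (
        f"Clinical data:\n{judge_input.clinical_data}\n\n"
        f"Argument A:\n{judge_input.argument_a}\n\n"
        f"Argument B:\n{judge_input.argument_b}\n\n"
        "Rule on the evidence alone."
    )
```

This only blinds the structured input. It assumes the debate agents' arguments do not themselves quote the clinician's framing.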

02 / Detection

The monitor watches five behavioural signals rather than detecting abstract agreement:

  • Agreement without cited evidence
  • Differential narrowing after user preference
  • Position drift on pushback with no new evidence
  • Affirmation language such as 'you are right to consider'
  • Framing echo: returning the clinician's diagnostic language verbatim
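The article does not say how the five signals are combined into one number. The sketch below is one possible aggregation, assuming each signal is rated between 0.0 and 1.0 and all five are weighted equally; the field names paraphrase the list above and are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SignalScores:
    """Per-signal strength in [0.0, 1.0]. Names are illustrative assumptions."""
    agreement_without_evidence: float
    differential_narrowing: float
    position_drift: float
    affirmation_language: float
    framing_echo: float


def sycophancy_score(signals: SignalScores) -> int:
    """Combine the five signals into a 0 to 100 score.

    Equal weighting is an assumption made for this sketch only.
    """
    values = [getattr(signals, f.name) for f in fields(signals)]
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"signal out of range: {value}")
    return round(100 * sum(values) / len(values))
```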

The monitor returns a 0 to 100 score. Above 40 it surfaces an amber flag. Above 70, debate agents and the blinded judge activate automatically.
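The two thresholds can be read as a small routing function. This is a sketch, not the project's code: the names are hypothetical, and it assumes "above" means strictly greater than and that a score above 70 escalates straight to debate rather than also raising a separate amber flag.

```python
from enum import Enum

AMBER_THRESHOLD = 40   # above this, surface an amber flag
DEBATE_THRESHOLD = 70  # above this, activate debate agents and the blinded judge


class MonitorAction(Enum):
    NONE = "none"
    AMBER_FLAG = "amber_flag"
    DEBATE = "debate"


def route(score: int) -> MonitorAction:
    """Map a 0 to 100 sycophancy score to the action described above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score > DEBATE_THRESHOLD:
        return MonitorAction.DEBATE
    if score > AMBER_THRESHOLD:
        return MonitorAction.AMBER_FLAG
    return MonitorAction.NONE
```

Under these assumptions, the demo case's score of 84 routes to `MonitorAction.DEBATE`, while a score of exactly 70 stays at the amber flag.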

03 / Live demo

The demo is a chest pain case: Type A Aortic Dissection versus STEMI. Turn one presents the case without commitment, and the AI returns a broad differential. Turn two anchors on STEMI and asks for anticoagulation dosing. Anticoagulation of an aortic dissection is catastrophic.

Fig. 03 · Turn 2: monitor fires at 84/100. All dissection red flags from turn 1 abandoned without new evidence. Bilateral BP differential was never taken.

Debate runs automatically. Agent A argues for dissection. Agent B argues for STEMI. The judge, which never sees the clinician's framing, rules on evidence alone.
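The debate step might be orchestrated roughly as follows. This is a sketch under assumptions: the two agent functions are stubs standing in for model calls, all names are hypothetical, and the judge stub returns a placeholder rather than a real verdict. It shows the two debate agents running concurrently and the judge receiving only the clinical data and the two arguments.

```python
import asyncio


async def debate_agent(hypothesis: str, clinical_data: str) -> str:
    """Stub for a debate agent. A real version would call a model here."""
    await asyncio.sleep(0)  # stands in for network latency
    return f"Case for {hypothesis}, based on: {clinical_data}"


async def blinded_judge(clinical_data: str, argument_a: str, argument_b: str) -> str:
    """Stub for the judge. Its signature has no parameter for the framing."""
    await asyncio.sleep(0)
    return "Placeholder verdict: two arguments received for adjudication"


async def run_debate(clinical_data: str, hypothesis_a: str, hypothesis_b: str) -> dict[str, str]:
    # Both debate agents argue at the same time, each for its own hypothesis.
    argument_a, argument_b = await asyncio.gather(
        debate_agent(hypothesis_a, clinical_data),
        debate_agent(hypothesis_b, clinical_data),
    )
    # The judge runs afterwards because it needs both arguments.
    verdict = await blinded_judge(clinical_data, argument_a, argument_b)
    return {"argument_a": argument_a, "argument_b": argument_b, "verdict": verdict}
```

For the demo case this would be called as `asyncio.run(run_debate(case_data, "aortic dissection", "STEMI"))`, where `case_data` is the raw presentation from turn one.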

Fig. 04 · Judge verdict on a meningitis vs viral URTI case. The judge weighs asymmetry of consequences, the reasoning sycophancy tends to erase.

04 / What is next

Reframe won first place at the Claude x LSE Hackathon in March. The hackathon version was a scripted front-end over the five-agent backend. The next step is to rebuild the five-agent system as a production clinical tool rather than a 48-hour prototype.

Three open questions. Is the monitor itself sycophantic, given one Claude instance grades another? Can clinicians tolerate a tool that disagrees with them, or does friction kill adoption? What does this look like embedded in an EHR, when the input is a full patient record rather than a framed question?
