Research, Scored

Every study here must clear one rigorous bar before it earns a verdict. The score is the work. Newest first.

The bar every study clears

Each paper is scored 0 to 100 across nine dimensions of research quality (causal design and decision-relevance carry the most weight), then a devil's advocate tries to refute it. Any flaw rated critical caps the score and blocks a passing verdict outright. A study is Real only if it scores 75 or higher with no critical flaw, Watch if the direction is real but the evidence is not there yet, and anything weaker is held back. The reasons are shown in full so you can judge for yourself.

Real · 75+, no critical flaw
Watch · promising, evidence not there yet
Scored studies · newest first
Year · Verdict · Score · Study · Topic · What it means for leaders
2026 · Watch · 72/100
Hadra, Cambridge & Mesbah · Int'l Journal for Educational Integrity (Springer, peer reviewed)
Topic: AI detectors / integrity
What it means for leaders: Do not use AI-detector scores as standalone evidence in integrity cases. The false-classification risk is high enough that any policy should require corroborating evidence and a fair appeals process.
Causal design: 9/15
Decision-relevance: 13/15
Sample & power: 5/10
Effect size: 7/10
Statistical validity: 7/10
Peer-review: 8/10
Replication: 8/10
Conflicts / funding: 9/10
Generalizability: 6/10
The read: Tested on 192 balanced texts, Turnitin and Originality reached overall accuracy of only 61% and 69% respectively, with both detectors failing badly on mixed human-AI (hybrid) texts and degrading on longer and scientific writing.
Strongest counter-argument: The accuracy figures rest on only 192 texts and two detectors at a single point in time, and because detector models are retrained frequently, the specific numbers may soon be stale even though the broad unreliability finding is robust.
  • Major: Small corpus (192 texts) and only two commercial detectors tested.
  • Minor: Detector versions evolve quickly, limiting shelf life of exact accuracy numbers.
Confidence: medium. Verified title, authors, journal, design, sample, and the no-funding/no-competing-interest statement; per-category false-positive percentages were reported only qualitatively.
2025 · Watch · 76/100
Kestin, Miller, Klales, et al. · Scientific Reports (peer reviewed)
Topic: AI tutoring vs active learning
What it means for leaders: A well-engineered AI tutor can match or beat good instruction in a controlled lesson, but the evidence is too narrow to justify replacing classroom teaching. Treat it as promising for piloting supplemental practice, not a deployment mandate.
Causal design: 13/15
Decision-relevance: 13/15
Sample & power: 6/10
Effect size: 9/10
Statistical validity: 8/10
Peer-review: 7/10
Replication: 6/10
Conflicts / funding: 9/10
Generalizability: 5/10
The read: A crossover RCT of 194 Harvard physics students found that a custom GPT-4 tutor with engineered pedagogical prompts produced post-test gains of about 0.63 SD (0.73 to 1.3 SD by quantile), highly significant (p<10^-8), in less instructional time than in-class active learning.
Strongest counter-argument: The result comes from 194 elite Harvard physics students on just two topics over roughly 50 minutes, so novelty effects, ceiling effects, and the narrow setting make it unsafe to assume the same gains in a typical K-12 or community-college classroom.
  • Major: Single elite institution, single subject, two short topics; weak external validity.
  • Minor: Not preregistered.
  • Minor: Ceiling effects in post-test addressed via quantile regression but still present.
Confidence: high. Read the full published paper including methods, effect sizes, and the no-competing-interests and public-data statements. The only unverifiable item is generalization, a design limit, not a reporting gap.
2024 · Real · 80/100
Wang, Ribeiro, Robinson, Loeb & Demszky · arXiv (Stanford), preregistered RCT
Topic: Human-AI tutoring at scale
What it means for leaders: A low-cost AI assist for existing human tutors is a credible way to lift outcomes, and it is most worth funding where your tutor bench is weakest rather than as a replacement for strong tutors.
Causal design: 13/15
Decision-relevance: 14/15
Sample & power: 9/10
Effect size: 7/10
Statistical validity: 8/10
Peer-review: 6/10
Replication: 6/10
Conflicts / funding: 8/10
Generalizability: 9/10
The read: In a tutor-randomized, preregistered RCT of 900 tutors and 1,800 K-12 students, giving tutors real-time AI guidance raised topic mastery by about 4 percentage points overall (p<0.01), with the largest gains (about 9 points) among initially lower-rated tutors.
Strongest counter-argument: The research team is evaluating its own tool on a single tutoring platform, the work is not yet peer reviewed, and the 4 percentage point average effect is small with the real benefit concentrated almost entirely in lower-rated tutors.
  • Minor: arXiv preprint, not yet peer reviewed (mitigated by OSF preregistration).
  • Minor: Creators evaluating their own system; single tutoring vendor/platform context.
Confidence: medium. Verified design, sample, effect size, and preregistration; no printed competing-interests statement seen, so funding scored on visible facts.
How a study is scored

Every research candidate is run through the Research Integrity Gate before it can earn a verdict. Its nine dimensions add up to a score from 0 to 100 and are weighted toward the two questions that matter most: is the causal claim sound, and would it actually change a leader's decision.

Causal identification / design: 15
Decision-relevance to a leader: 15
Sample & power: 10
Effect size & practical magnitude: 10
Statistical validity: 10
Peer-review / preregistration: 10
Replication & convergence: 10
Conflicts of interest / funding: 10
Generalizability to real schools: 10
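
As a minimal sketch of how the nine maximums combine, the snippet below stores each dimension's ceiling and totals one scorecard. The names `RUBRIC_MAX` and `total_score` are hypothetical, not part of the Gate itself, and the sample points are the Wang et al. (2024) sub-scores from the listing above, assumed to map onto the full dimension names in rubric order.

```python
# Maximum points per dimension, copied from the rubric above.
RUBRIC_MAX = {
    "Causal identification / design": 15,
    "Decision-relevance to a leader": 15,
    "Sample & power": 10,
    "Effect size & practical magnitude": 10,
    "Statistical validity": 10,
    "Peer-review / preregistration": 10,
    "Replication & convergence": 10,
    "Conflicts of interest / funding": 10,
    "Generalizability to real schools": 10,
}


def total_score(points: dict[str, int]) -> int:
    """Sum a scorecard after checking it against the rubric (hypothetical helper)."""
    if set(points) != set(RUBRIC_MAX):
        raise ValueError("scorecard must cover exactly the nine rubric dimensions")
    for name, value in points.items():
        if not 0 <= value <= RUBRIC_MAX[name]:
            raise ValueError(f"{name}: {value} is outside 0..{RUBRIC_MAX[name]}")
    return sum(points.values())


# Sub-scores listed for Wang et al. (2024), in rubric order (assumed mapping).
wang_2024 = dict(zip(RUBRIC_MAX, [13, 14, 9, 7, 8, 6, 6, 8, 9]))

print(sum(RUBRIC_MAX.values()))  # 100
print(total_score(wang_2024))    # 80
```

The maximums sum to 100, which is why the per-dimension points can be added directly into the headline score without rescaling.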

Then a devil's advocate tries to refute the study and names the single strongest counter-argument. Any flaw it rates critical (a fatal confound, an undisclosed vendor conflict, a fatally unrepresentative sample, clear p-hacking) caps the score and blocks a passing verdict outright. A study is Real only at 75 or higher with no critical flaw, Watch when the direction is real but the evidence is not there yet, and anything weaker is held back rather than published.
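
The verdict rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the Gate's actual implementation: `verdict` and its flags are hypothetical names, `direction_is_real` and `evidence_sufficient` stand in for reviewer judgments the text describes only in words, a critical flaw is assumed to block Real but not Watch, and the score cap is not modelled because its value is not stated.

```python
REAL_THRESHOLD = 75  # from the text: Real only at 75 or higher


def verdict(
    total: int,
    has_critical_flaw: bool,
    direction_is_real: bool,
    evidence_sufficient: bool = True,
) -> str:
    """Map a total score and reviewer judgments to a verdict (hypothetical sketch)."""
    # 75+ is treated as necessary, not sufficient: the reviewer must also
    # judge that the evidence is there (assumption, see the 76/100 Watch entry).
    if total >= REAL_THRESHOLD and not has_critical_flaw and evidence_sufficient:
        return "Real"
    # Assumption: a study whose direction looks real is published as Watch.
    if direction_is_real:
        return "Watch"
    # Anything weaker is held back rather than published.
    return "Held back"


print(verdict(80, has_critical_flaw=False, direction_is_real=True))  # Real
print(verdict(76, False, True, evidence_sufficient=False))           # Watch
print(verdict(72, False, True))                                      # Watch
print(verdict(60, False, False))                                     # Held back
```

The first three calls use the totals of the studies listed above; the last is an invented input to show the held-back branch.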

Attribution

The Gate adapts the architecture of three open peer-review skills built for the Claude ecosystem. Full credit to their creators:

The Margin · A studio for the lesson