Every study here is run through one rigorous bar before it earns a verdict. The score is the work. Newest first.
| Year | Verdict | Score | Study | Topic | What it means for leaders |
|---|---|---|---|---|---|
| 2026 | Watch | 72/100 | Hadra, Cambridge & Mesbah · Int'l Journal for Educational Integrity (Springer, peer reviewed) | AI detectors / integrity | Do not use AI-detector scores as standalone evidence in integrity cases. The false-classification risk is high enough that any policy should require corroborating evidence and a fair appeals process. |
| 2025 | Watch | 76/100 | Kestin, Miller, Klales, et al. · Scientific Reports (peer reviewed) | AI tutoring vs active learning | A well-engineered AI tutor can match or beat good instruction in a controlled lesson, but the evidence is too narrow to justify replacing classroom teaching. Treat it as promising for piloting supplemental practice, not a deployment mandate. |
| 2024 | Real | 80/100 | Wang, Ribeiro, Robinson, Loeb & Demszky · arXiv (Stanford), preregistered RCT | Human-AI tutoring at scale | A low-cost AI assist for existing human tutors is a credible way to lift outcomes, and it is most worth funding where your tutor bench is weakest rather than as a replacement for strong tutors. |

2026 · Hadra, Cambridge & Mesbah

The read: Tested on 192 balanced texts, Turnitin and Originality reached overall accuracy of only 61% and 69% respectively, with both detectors failing badly on mixed human-AI (hybrid) texts and degrading on longer and scientific writing.

Strongest counter-argument: The accuracy figures rest on only 192 texts and two detectors at a single point in time, and because detector models are retrained frequently, the specific numbers may be stale soon even though the broad unreliability finding is robust.

Confidence: medium. Verified title, authors, journal, design, sample, and the no-funding/no-competing-interest statement; per-category false-positive percentages were reported only qualitatively.

2025 · Kestin, Miller, Klales, et al.

The read: A crossover RCT of 194 Harvard physics students found a custom GPT-4 tutor with engineered pedagogical prompts produced post-test gains of about 0.63 SD, up to 0.73 to 1.3 SD by quantile, highly significant (p<10^-8), in less instructional time than in-class active learning.

Strongest counter-argument: The result comes from 194 elite Harvard physics students on just two topics over roughly 50 minutes, so novelty effects, ceiling effects, and the narrow setting make it unsafe to assume the same gains in a typical K-12 or community-college classroom.

Confidence: high. Read the full published paper including methods, effect sizes, and the no-competing-interests and public-data statements. The only unverifiable item is generalization, a design limit not a reporting gap.

2024 · Wang, Ribeiro, Robinson, Loeb & Demszky

The read: In a tutor-randomized, preregistered RCT of 900 tutors and 1,800 K-12 students, giving tutors real-time AI guidance raised topic mastery by about 4 percentage points overall (p<0.01), with the largest gains (about 9 points) among initially lower-rated tutors.

Strongest counter-argument: The research team is evaluating its own tool on a single tutoring platform, the work is not yet peer reviewed, and the 4 percentage point average effect is small with the real benefit concentrated almost entirely in lower-rated tutors.

Confidence: medium. Verified design, sample, effect size, and preregistration; no printed competing-interests statement seen, so funding scored on visible facts.

Every research candidate is run through the Research Integrity Gate before it can earn a verdict. Nine dimensions, scored 0 to 100, weighted toward the two questions that matter most: is the causal claim sound, and would it actually change a leader's decision.
| Dimension | Weight |
|---|---|
| Causal identification / design | 15 |
| Decision-relevance to a leader | 15 |
| Sample & power | 10 |
| Effect size & practical magnitude | 10 |
| Statistical validity | 10 |
| Peer-review / preregistration | 10 |
| Replication & convergence | 10 |
| Conflicts of interest / funding | 10 |
| Generalizability to real schools | 10 |
Then a devil's advocate tries to refute the study and names the single strongest counter-argument. Any flaw it rates critical (a fatal confound, an undisclosed vendor conflict, a fatally unrepresentative sample, clear p-hacking) caps the score and blocks a passing verdict outright. A study is Real only at 75 or higher with no critical flaw, Watch when the direction is real but the evidence is not there yet, and anything weaker is held back rather than published.
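
The scoring and verdict rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Gate's actual implementation. The weights and the 75-point threshold come from the text above. Everything else is assumed: the dictionary key names are hypothetical, each dimension is taken to award between 0 points and its weight, a critical flaw is modeled as simply holding the study back (the text does not say what value the score is capped at), and `evidence_sufficient` is a hypothetical reviewer judgment. That last flag reflects the fact that 75 reads as a necessary condition rather than a sufficient one, since the 2025 study scores 76 and is still rated Watch.

```python
# Weights are taken from the rubric above and sum to 100.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "causal_identification": 15,
    "decision_relevance": 15,
    "sample_and_power": 10,
    "effect_size": 10,
    "statistical_validity": 10,
    "peer_review_preregistration": 10,
    "replication_convergence": 10,
    "conflicts_funding": 10,
    "generalizability": 10,
}

REAL_THRESHOLD = 75  # stated in the text


def total_score(points):
    """Add up per-dimension points.

    Assumption: each dimension awards between 0 and its weight,
    so a perfect study totals 100.
    """
    total = 0
    for name, weight in WEIGHTS.items():
        value = points[name]
        if not 0 <= value <= weight:
            raise ValueError(f"{name} must be between 0 and {weight}")
        total += value
    return total


def verdict(score, critical_flaw, direction_real, evidence_sufficient):
    """Map a score and three reviewer judgments to a verdict.

    Assumption: a critical flaw holds the study back regardless of score.
    """
    if critical_flaw:
        return "Held back"
    if score >= REAL_THRESHOLD and evidence_sufficient:
        return "Real"
    if direction_real:
        return "Watch"
    return "Held back"
```

Under these assumptions, `verdict(80, False, True, True)` gives `"Real"`, `verdict(76, False, True, False)` gives `"Watch"`, and any call with `critical_flaw=True` gives `"Held back"` however high the score.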
The Gate adapts the architecture of three open peer-review skills built for the Claude ecosystem. Full credit to their creators: