Research, Scored — The Margin

The bar every study clears

Each paper is scored 0 to 100 across nine dimensions of research quality (causal design and decision-relevance carry the most weight), then a devil's advocate tries to refute it. Any flaw rated critical caps the score and blocks a passing verdict outright. A study is Real only if it scores 75 or higher with no critical flaw, and Watch if the direction is real but the evidence is not there yet. Anything weaker is held back. The reasons are shown in full so you can judge for yourself.

Real · 75+, no critical flaw
Watch · promising, evidence not there yet


Scored studies · newest first

Year	Verdict	Score	Study	Topic	What it means for leaders
2026	Watch	72/100	Evaluating the accuracy and reliability of AI content detectors in academic contexts · Hadra, Cambridge & Mesbah · Int'l Journal for Educational Integrity (Springer, peer reviewed)	AI detectors / integrity	Do not use AI-detector scores as standalone evidence in integrity cases. The false-classification risk is high enough that any policy should require corroborating evidence and a fair appeals process.
Causal design 9/15 · Decision-relevance 13/15 · Sample & power 5/10 · Effect size 7/10 · Statistical validity 7/10 · Peer-review 8/10 · Replication 8/10 · Conflicts / funding 9/10 · Generalizability 6/10
The read: Testing 192 balanced texts against Turnitin and Originality, overall accuracy was only 61% and 69% respectively, with both detectors failing badly on mixed human-AI (hybrid) texts and degrading on longer and scientific writing.
Strongest counter-argument: The accuracy figures rest on only 192 texts and two detectors at a single point in time, and because detector models are retrained frequently, the specific numbers may be stale soon even though the broad unreliability finding is robust.
Major: Small corpus (192 texts) and only two commercial detectors tested.
Minor: Detector versions evolve quickly, limiting shelf life of exact accuracy numbers.
Confidence: medium. Verified title, authors, journal, design, sample, and the no-funding/no-competing-interest statement; per-category false-positive percentages were reported only qualitatively.
2025	Watch	76/100	AI tutoring outperforms in-class active learning: an RCT in an authentic educational setting · Kestin, Miller, Klales, et al. · Scientific Reports (peer reviewed)	AI tutoring vs active learning	A well-engineered AI tutor can match or beat good instruction in a controlled lesson, but the evidence is too narrow to justify replacing classroom teaching. Treat it as promising for piloting supplemental practice, not a deployment mandate.
Causal design 13/15 · Decision-relevance 13/15 · Sample & power 6/10 · Effect size 9/10 · Statistical validity 8/10 · Peer-review 7/10 · Replication 6/10 · Conflicts / funding 9/10 · Generalizability 5/10
The read: A crossover RCT of 194 Harvard physics students found a custom GPT-4 tutor with engineered pedagogical prompts produced post-test gains of about 0.63 SD, up to 0.73 to 1.3 SD by quantile, highly significant (p<10^-8), in less instructional time than in-class active learning.
Strongest counter-argument: The result comes from 194 elite Harvard physics students on just two topics over roughly 50 minutes, so novelty effects, ceiling effects, and the narrow setting make it unsafe to assume the same gains in a typical K-12 or community-college classroom.
Major: Single elite institution, single subject, two short topics; weak external validity.
Minor: Not preregistered.
Minor: Ceiling effects in post-test addressed via quantile regression but still present.
Confidence: high. Read the full published paper including methods, effect sizes, and the no-competing-interests and public-data statements. The only unverifiable item is generalization, a design limit not a reporting gap.
2024	Real	80/100	Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise · Wang, Ribeiro, Robinson, Loeb & Demszky · arXiv (Stanford), preregistered RCT	Human-AI tutoring at scale	A low-cost AI assist for existing human tutors is a credible way to lift outcomes, and it is most worth funding where your tutor bench is weakest rather than as a replacement for strong tutors.
Causal design 13/15 · Decision-relevance 14/15 · Sample & power 9/10 · Effect size 7/10 · Statistical validity 8/10 · Peer-review 6/10 · Replication 6/10 · Conflicts / funding 8/10 · Generalizability 9/10
The read: In a tutor-randomized, preregistered RCT of 900 tutors and 1,800 K-12 students, giving tutors real-time AI guidance raised topic mastery by about 4 percentage points overall (p<0.01), with the largest gains (about 9 points) among initially lower-rated tutors.
Strongest counter-argument: The research team is evaluating its own tool on a single tutoring platform, the work is not yet peer reviewed, and the 4 percentage point average effect is small with the real benefit concentrated almost entirely in lower-rated tutors.
Minor: arXiv preprint, not yet peer reviewed (mitigated by OSF preregistration).
Minor: Creators evaluating their own system; single tutoring vendor/platform context.
Confidence: medium. Verified design, sample, effect size, and preregistration; no printed competing-interests statement seen, so funding scored on visible facts.

How a study is scored

Every research candidate is run through the Research Integrity Gate before it can earn a verdict. The Gate scores nine dimensions that together total 100 points, weighted toward the two questions that matter most: is the causal claim sound, and would it actually change a leader's decision?

Dimension	Max points
Causal identification / design	15
Decision-relevance to a leader	15
Sample & power	10
Effect size & practical magnitude	10
Statistical validity	10
Peer-review / preregistration	10
Replication & convergence	10
Conflicts of interest / funding	10
Generalizability to real schools	10

Then a devil's advocate tries to refute the study and names the single strongest counter-argument. Any flaw it rates critical (a fatal confound, an undisclosed vendor conflict, a fatally unrepresentative sample, clear p-hacking) caps the score and blocks a passing verdict outright. A study is Real only at 75 or higher with no critical flaw, Watch when the direction is real but the evidence is not there yet, and anything weaker is held back rather than published.
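
The scoring and verdict rules above can be sketched in a few lines of Python. This is a minimal illustration, not the Gate's actual implementation. The maximum points and the 75-point threshold come from the text. The function names, the dictionary keys, and the two editorial flags (`direction_real`, `evidence_sufficient`) are hypothetical. Three assumptions are baked in: the threshold is treated as necessary but not sufficient for Real (the table lists a 76/100 study as Watch), a critical flaw is treated as blocking Real only, and the score cap is not modeled because the text does not state its value.

```python
# Maximum points per dimension, copied from the weights table above.
# The key names are hypothetical shorthand for the dimension labels.
WEIGHTS = {
    "causal_design": 15,
    "decision_relevance": 15,
    "sample_power": 10,
    "effect_size": 10,
    "statistical_validity": 10,
    "peer_review": 10,
    "replication": 10,
    "conflicts_funding": 10,
    "generalizability": 10,
}

REAL_THRESHOLD = 75  # stated in the text


def total_score(scores: dict[str, int]) -> int:
    """Sum the nine dimension scores after checking each against its maximum."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the nine dimensions")
    for name, value in scores.items():
        if not 0 <= value <= WEIGHTS[name]:
            raise ValueError(f"{name}={value} is outside 0..{WEIGHTS[name]}")
    return sum(scores.values())


def verdict(total: int, critical_flaw: bool, direction_real: bool,
            evidence_sufficient: bool) -> str:
    """Map a total and the reviewer's judgments to Real, Watch, or Held back."""
    # Real requires the threshold, no critical flaw, and (assumption) an
    # editorial judgment that the evidence is sufficient.
    if total >= REAL_THRESHOLD and not critical_flaw and evidence_sufficient:
        return "Real"
    # Assumption: a critical flaw blocks Real but leaves Watch available.
    # The unspecified score cap is deliberately not applied here.
    if direction_real:
        return "Watch"
    return "Held back"


# Dimension scores for the Tutor CoPilot study, in table order.
tutor_copilot = dict(zip(WEIGHTS, [13, 14, 9, 7, 8, 6, 6, 8, 9]))
total = total_score(tutor_copilot)
print(total, verdict(total, critical_flaw=False,
                     direction_real=True, evidence_sufficient=True))
```

Keeping the editorial judgment as an explicit input, instead of deriving the verdict from the number alone, mirrors how the published table separates the score from the verdict.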

Attribution

The Gate adapts the architecture of three open peer-review skills built for the Claude ecosystem. Full credit to their creators:

Academic Paper Reviewer, by imbad0202 (academic-research-skills). The multi-reviewer panel, the devil's-advocate critical-issue gate, the 0 to 100 rubric, and the calibration model.
Scientific Reviewer (Grade A), by John Kitchin. The evaluation-dimension structure for claims, methodology, and statistical rigor.
peer-review, from claude-scientific-skills. The checklist approach to methodology, statistical validity, and reporting standards.

The Margin · A studio for the lesson