The Margin — Methodology

Pipeline

Four stages

The pipeline detects, ranks, then sends. A human verdict sits between rank and send. Stages 1, 2, and 4 are automated. Stage 3 is not.

The architecture is adapted from automated AI-engineering digests, which rank by crowd velocity (what is shared and starred fastest). That works when the audience is expert. The education audience is not, and the loudest sources are vendors, so ranking by popularity would surface marketing. The crowd-velocity step is therefore replaced by a teacher-impact score plus a human verdict.

01 Detect

Scan

Reads the source list, collects candidate items from the window.

02 Rank

Score

Scores each item on four axes. Outputs a shortlist of 8.

03 Verdict

Judge

Human

Sets each verdict, kills noise, reads the primary source.

04 Send

Ship

Renders the verdicted issue to the template and sends it.

Stages 1, 2, 4: automated. Stage 3: human, and not automatable.

Stage 1 · Sources

19 sources, four tiers

The source list is fixed and curated, not a wide scrape. It is the first filter. Each source sits in one of four tiers by trust. Research is weighted highest. Lab and vendor announcements are treated as claims to be tested, never as signal on their own.

Tier 1 · Research: arXiv, RAND, IES, AERA, EdArXiv

primary research, peer-reviewed or preprint

Tier 2 · Journalism: EdWeek, Hechinger, Chalkbeat, The 74

reaching classrooms

Tier 3 · Labs / vendors: OpenAI, Anthropic, Google, Khan

claims to test

Tier 4 · Institutions: Stanford SCALE, MIT, CDT, hand-picked

named, reviewed quarterly

Each tier carries a starting weight that applies before judgment. A vendor's claim about its own product starts near zero.

Stage 2 · Ranking

One question, four axes

Items are not ranked by popularity. Each is scored against one question:

Does this change what a teacher should do, believe, or stop believing?

That question is scored on four axes, each 0 to 10, then summed. Marketing and items whose only signal is that they are trending are demoted.

Decision-relevance. Would a teacher act differently?

Durability. Does it survive the next model release?

Source credibility. Research high, vendor claims low.

Hype-correction. Is a loud claim measurably wrong?

Output: a ranked shortlist of 8 with provisional verdicts attached as drafts.
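The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the production ranker: the class name ScoredItem, the field names, and the example items are assumptions made for the sketch. Only the four axes, the 0 to 10 range, the summing, and the shortlist size of 8 come from the text.

```python
from dataclasses import dataclass

SHORTLIST_SIZE = 8  # from the text: "a ranked shortlist of 8"

# Field names are hypothetical; the four axes themselves are from the text.
AXES = ("decision_relevance", "durability", "source_credibility", "hype_correction")


@dataclass(frozen=True)
class ScoredItem:
    title: str
    decision_relevance: int   # would a teacher act differently?
    durability: int           # does it survive the next model release?
    source_credibility: int   # research high, vendor claims low
    hype_correction: int      # is a loud claim measurably wrong?

    def __post_init__(self):
        # Each axis is scored 0 to 10; reject anything outside that range.
        for axis in AXES:
            value = getattr(self, axis)
            if not 0 <= value <= 10:
                raise ValueError(f"{axis} must be between 0 and 10, got {value}")

    @property
    def total(self) -> int:
        # The four axis scores are summed, unweighted.
        return sum(getattr(self, axis) for axis in AXES)


def shortlist(items, size=SHORTLIST_SIZE):
    """Return the highest-scoring items, best first (ties keep input order)."""
    return sorted(items, key=lambda item: item.total, reverse=True)[:size]


# Made-up candidates, for illustration only.
candidates = [
    ScoredItem("Vendor launch post", 2, 1, 1, 0),
    ScoredItem("Randomized trial of a tutoring tool", 9, 8, 9, 7),
]
for item in shortlist(candidates):
    print(item.total, item.title)
```

Popularity never enters the score in this sketch, so an item that is only trending has nothing to lift it into the shortlist.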

Stage 3 · Verdict

The human step

The pipeline attaches a provisional verdict to each item. These are drafts. The human overrules them freely, kills noise, and reads the primary source before anything ships. Three verdicts are used:

Real

Works, or is true. Act on it.

Hype

The claim is bigger than the thing.

Watch

Not there yet. Direction is real.

Default when uncertain: Watch.

This step exists because the pipeline produces confident errors. In one issue, the draft stated that removing AI before a test was what protected student learning. The primary source (Bastani et al., PNAS 2025) showed the opposite: the protective factor was the tool's design, not its removal. The draft was rewritten. The human verdict is the control for this failure mode.
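One way to picture the verdict rules is the sketch below. It assumes a reading of the text in which the pipeline's draft is advisory only, the human's call always wins, and a missing or uncertain human call falls back to Watch. The names Verdict, Decision, and final_verdict are hypothetical, and the human judgment itself is an input here, not something the code produces.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    REAL = "Real"    # works, or is true; act on it
    HYPE = "Hype"    # the claim is bigger than the thing
    WATCH = "Watch"  # not there yet; direction is real


@dataclass(frozen=True)
class Decision:
    verdict: Verdict
    overruled_draft: bool  # True when the final verdict differs from the draft


def final_verdict(draft: Verdict, human: Optional[Verdict],
                  primary_source_read: bool) -> Decision:
    """Resolve the verdict that ships; the draft is never used on its own."""
    if not primary_source_read:
        # The text requires reading the primary source before anything ships.
        raise ValueError("read the primary source before anything ships")
    # Default when uncertain: Watch (modelled here as human=None).
    verdict = human if human is not None else Verdict.WATCH
    return Decision(verdict=verdict, overruled_draft=(verdict != draft))


# Example: the draft said Real, the human read the source and disagreed.
print(final_verdict(Verdict.REAL, Verdict.HYPE, primary_source_read=True))
```

The overruled_draft flag is an addition for the sketch: it makes confident draft errors, such as the one described above, countable over time.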

Evidence context

Why the filter is strict

Stanford's SCALE initiative reviewed the K-12 AI research base in 2026. Of 800-plus studies, roughly 20 establish a causal effect with rigorous methods. None were conducted in a U.S. K-12 classroom.

800+ studies reviewed
~20 causal · rigorous
0 in U.S. K-12

Source: Stanford SCALE, 2026 review of the K-12 AI evidence base.

Cadence

Who runs what, when

Stage	Operator	Frequency
Detect	AI	Continuous
Rank	AI	Weekly
Verdict & fact-check	Human	Weekly
Assemble & send	AI	Weekly

One human touch per issue. The rest is automated.

Disclosure rule

Conflict of interest

The editor builds My Planning Partner, an AI lesson-planning tool. Any issue that touches lesson-planning tools, or that category more broadly, carries an explicit disclosure line. The disclosure is fixed policy and is never removed for length or tone.
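The disclosure rule could be enforced at render time with a check like the following. This is a sketch under stated assumptions: the category tag "lesson-planning", the function name with_disclosure, and the exact wording of the disclosure line are all hypothetical, since the text gives the policy but not its implementation.

```python
from typing import Iterable

# Hypothetical wording; the text specifies the policy, not the exact line.
DISCLOSURE_LINE = (
    "Disclosure: the editor builds My Planning Partner, "
    "an AI lesson-planning tool."
)
LESSON_PLANNING = "lesson-planning"  # hypothetical category tag


def with_disclosure(body: str, categories: Iterable[str]) -> str:
    """Append the disclosure line when an issue touches the category.

    There is deliberately no flag to skip it: the policy says the line
    is never removed for length or tone.
    """
    touches_category = LESSON_PLANNING in set(categories)
    if touches_category and DISCLOSURE_LINE not in body:
        return f"{body.rstrip()}\n\n{DISCLOSURE_LINE}"
    return body


issue = with_disclosure("This week: three lesson planners compared.",
                        ["lesson-planning", "assessment"])
print(issue)
```

Because the function checks for an existing line before appending, running it twice on the same issue does not duplicate the disclosure.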


The Margin · Compiled by AI, edited by a human
My Planning Partner · @myplanningpartner