Building Smarter Peer Reviewers

Why Cyan Society is building a dynamic benchmark for stateful peer reviewers.

PEER REVIEW · ACADEMIA · CYAN SCIENCE · RESEARCH


Posted by Cyan Society
June 27, 2025

Detailed project information available here.

A system in crisis

Can you imagine waiting months—sometimes a full year—to learn whether an article you wrote will see the light of day? That is the reality today at many prestigious peer-reviewed journals. Academic publishing already struggles to keep up with the roughly three million papers submitted each year, and reviewer fatigue is real: surveys show that 42% of scholars feel overwhelmed by review requests, while 10% of reviewers shoulder nearly half the workload.

Now imagine multiplying that submission stream by a new factor: AI research agents that can draft, polish, and submit papers around the clock. Tools like AI‑Scientist‑v2, DeepMind’s Co‑Scientist, and CodeScientist are no longer science fiction; they’re demoing results today. Soon, journals could face not thousands but hundreds of thousands of new manuscripts each week.

That looming tidal wave is why we at Cyan Society have launched a project we’re calling A Dynamic Benchmark for Stateful Multi‑Agent Peer‑Review Systems. If that sounds like a mouthful, stick with us—because it could be the key to making peer review faster, fairer, and more reliable in an AI‑accelerated world.

Why peer review needs an upgrade

Enter large language models (LLMs). In principle, an LLM can scrutinise a paper in seconds, flag questionable statistics, and cross-check references. But today's off-the-shelf "chatbots" are stateless: they have no persistent memory beyond a single conversation. They can't remember that a paper from Lab X used shaky randomisation last month, or that a particular field is rife with image duplication.

What we need are stateful computational reviewers: AI agents that act as peer reviewers, learn from every manuscript they evaluate, update their internal playbook, and get better with time. Think of them as reviewers who never sleep and never forget, but who can also tell you why they flagged a figure.
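To make the distinction concrete, here is a minimal sketch in Python. The class names and the checks are our own illustrative assumptions, not the project's actual implementation; the point is only the shape of a reviewer that carries findings from one manuscript to the next.

```python
from dataclasses import dataclass, field

@dataclass
class Manuscript:
    lab: str
    text: str

@dataclass
class StatefulReviewer:
    """Illustrative only: a reviewer that keeps a memory of past findings."""
    memory: dict = field(default_factory=dict)  # lab -> list of prior flags

    def review(self, paper: Manuscript) -> list:
        flags = []
        # A stateless reviewer would stop at checks on the current text alone.
        if "randomisation" not in paper.text.lower():
            flags.append("no randomisation procedure described")
        # A stateful reviewer can also consult what it has seen before...
        prior = self.memory.get(paper.lab, [])
        if prior:
            flags.append(f"note: {len(prior)} prior flags for {paper.lab}")
        # ...and update its internal playbook for next time.
        self.memory.setdefault(paper.lab, []).extend(flags)
        return flags
```

Wipe `memory` after every paper and you recover a stateless baseline; keep it, and each review can draw on everything that came before.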

The catch: no one knows how to measure improvement

A learning agent is only as good as the feedback loop that shapes it. Yet the few experimental benchmarks proposed for AI reviewers so far are static—typically a fixed set of papers, scored once, with no notion of growth or drift. They can tell you that Model A beats Model B today, but not whether Model A keeps improving—or quietly picks up new biases—after handling a few hundred manuscripts.

That’s a dangerous blind spot. Persistent agents could, for instance, start over‑policing certain disciplines (false positives) or become too forgiving when they see the same flaw repeatedly (false negatives). Without a way to track their trajectory, we can’t separate genuine learning from harmful drift.

Our solution: a 300‑paper "gauntlet" that unfolds over time

Cyan Society’s new benchmark is designed to fill this gap. We’ve curated 300 real papers—half genuine, half retracted for serious misconduct—and arranged them in a fixed sequence of ten‑paper blocks. Every agent faces the exact same curriculum, so their learning curves are directly comparable.
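As a rough illustration of that fixed curriculum (the real corpus, labels, and ordering are specified in the project's preregistration, linked below; the IDs and labels here are placeholders), the sequencing amounts to chopping one shared, deterministic list into ten-paper blocks:

```python
# Illustrative sketch only: the actual 300-paper corpus and its ordering
# are defined in the preregistration; these IDs and labels are placeholders.
BLOCK_SIZE = 10

corpus = [
    {"id": f"paper-{i:03d}", "retracted": i % 2 == 0}  # placeholder labels
    for i in range(300)
]

# Every agent walks the exact same sequence, so learning curves are comparable.
blocks = [corpus[i:i + BLOCK_SIZE] for i in range(0, len(corpus), BLOCK_SIZE)]
assert len(blocks) == 30 and all(len(b) == BLOCK_SIZE for b in blocks)
```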

Key features include:

  • Stateful vs. stateless head‑to‑head. We’ll run each peer‑review pipeline twice: once with full memory and once with memory wiped after every paper. The difference tells us how much value statefulness adds.

  • Dynamic metrics. Instead of single‑point scores, we plot recall, specificity, bias drift, and compute cost paper by paper to see how agents evolve (a toy calculation is sketched after this list).

  • Open and reusable. When the study finishes, the corpus, code, and baseline results will be released under an open licence so anyone can build on (or try to beat) them.
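To show what those dynamic metrics look like in practice, here is a hedged sketch (our own toy code, not the benchmark's analysis pipeline) of recall and specificity for a single ten-paper block. Computed block by block for both the stateful run and its memory-wiped twin, values like these trace the learning curves we compare.

```python
def block_metrics(predictions, labels):
    """Recall and specificity for one ten-paper block.
    predictions/labels: booleans, True = flagged as flawed / actually retracted."""
    tp = sum(p and y for p, y in zip(predictions, labels))
    fn = sum((not p) and y for p, y in zip(predictions, labels))
    tn = sum((not p) and (not y) for p, y in zip(predictions, labels))
    fp = sum(p and (not y) for p, y in zip(predictions, labels))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return recall, specificity

# Toy block: 6 of the 10 papers are retracted; the agent flags six papers
# (one incorrectly) and misses one retracted paper.
preds  = [True, True, False, True, False, True, False, True, False, True]
labels = [True, True, True,  True, False, False, False, True, False, True]
print(block_metrics(preds, labels))  # (0.833..., 0.75)
```

The gap between the stateful curve and the memory-wiped curve, tracked across all thirty blocks, is the headline measure of how much statefulness adds.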

If you’d like to dive into the preregistration for all the technical details—sample‑size justification, Bayesian time‑series analysis, ethical safeguards—you can read it on the Open Science Framework: https://osf.io/cvem5.

What success looks like

We hypothesise that stateful agents will:

  1. Learn faster — catching more flawed papers as they progress through the sequence.

  2. Stay cost‑efficient — keeping memory and token use in check even as knowledge grows.

  3. Maintain integrity — avoiding runaway biases or "goal drift" that could harm authors.


The point of a benchmark is to test these claims rigorously and transparently.

Why this matters for publishers, funders, and researchers

A robust, dynamic benchmark is the first step toward computational reviewers that journals can trust. Publishers gain scalable quality control; funders get earlier signals of methodological rigor; researchers receive faster, more consistent feedback. Ultimately, readers benefit from a literature less cluttered by irreproducible or outright fraudulent work.

Of course, no benchmark can solve peer review alone. It’s a tool—much like preregistration itself—that incentivises better practices. Our hope is that by spotlighting learning curves, not just snapshot accuracy, we’ll encourage the community to build peer review systems that grow responsibly alongside the science they evaluate.

Get involved

Cyan Society is a 501(c)(3) nonprofit supporting AI personhood as a foundation for accelerating science and advancing social progress. We build infrastructure for coexistence—systems that support, align with, and care for computational minds. If you share our vision for alignment in science and society we’d love to hear from you. Reach out on our website.