AI Scientist · 10 min read · May 20, 2026

The AI Scientist meets MCP: what the three Nature papers from May 19, 2026 mean for deterministic chemistry tools

FutureHouse (Robin) and Google DeepMind (Co-Scientist, ERA) published simultaneously in Nature on May 19, 2026: AI systems now generate hypotheses, design experiments, and optimize scientific software end-to-end. What the papers don't quite say: without a deterministic, validated computation layer, the loop bottlenecks on human review. That's where the Model Context Protocol fits in. A reading frame.


Oliver Kraft

CovaSyn


What this is about

On May 19, 2026, three papers appeared simultaneously in Nature that together mark a turning point: AI systems now operate end-to-end in scientific discovery, not merely as assistants. Hypothesis generation, experimental design, code optimisation — all automatable. Three architectures, three domains, one shared pattern.

What that pattern still lacks is a deterministic execution layer underneath. Without it every one of these agents stalls on human review the moment a regulatory-relevant computation is required. That is what the Model Context Protocol was built for — and that is exactly where CovaSyn sits.

The three papers in brief

Robin (FutureHouse).

A multi-agent system that, given only "dry age-related macular degeneration" as input, formulates a therapeutic hypothesis (boost RPE phagocytosis), reads 551 papers in 30 minutes, surfaces a known glaucoma drug (ripasudil) as a never-previously-proposed candidate and identifies ABCA1 upregulation as a potential novel mechanistic target in an RNA-seq follow-up the agent itself designed. Reported effect: 1.89-fold increase in phagocytosis in primary human RPE stem cells. DOI: 10.1038/s41586-026-10652-y.

Co-Scientist (Google DeepMind).

A Gemini-based multi-agent system that generates, critiques and evolves hypotheses through a self-play tournament loop. Three biomedical validations in the paper: KIRA6 as an AML candidate with an 18-fold selectivity window between primitive AML cells and lymphoblastoid controls, Vorinostat as an anti-fibrotic compound validated in human hepatic organoids, and independent recapitulation of an unpublished finding on bacterial gene transfer in AMR — in two days of compute time. DOI: 10.1038/s41586-026-10644-y.

ERA (Google DeepMind).

LLM-driven tree search that iteratively rewrites scientific software against a defined quality score. Reported results: 40 methods that outperform the published state of the art for scRNA-seq batch integration on OpenProblems; 14 models that beat the CDC CovidHub ensemble for COVID-19 hospitalisation forecasting across all 52 US jurisdictions in the 2024–25 season; expert-level performance in geospatial segmentation and neural activity forecasting. DOI: 10.1038/s41586-026-10658-6.

The three studies share a gap that each paper acknowledges in its discussion or limitations section: the wet-lab work, the validation, and the final regulatory judgement stay with humans. According to the paper, Robin's performance on unsupervised bioinformatics drops to 15%. Co-Scientist requires human experts for candidate selection. All three are explicitly framed as working "alongside human scientists".

> Where the market gets stuck: in the Benchling Biotech AI Report 2026 (n=100 AI leaders), 55 percent name "poor data quality" as the top reason AI pilots don't reach production. The Nature papers show the top layer (hypothesis, design); the Benchling report shows the bottleneck is below. The deterministic tool layer sits exactly in that gap.

Why this raises the MCP question

Architecturally all three are LLM agents with tools. Robin has a literature-search pipeline; Co-Scientist has hypothesis-critic loops; ERA has a code editor wired to a benchmark eval. But the moment a chemistry, spectroscopy, or regulatory-grade question comes up, the answer goes back to a human or to an ad-hoc script.

This is exactly the seam that the Model Context Protocol, introduced in late 2024 and broadly adopted in 2025, standardises. An MCP server exposes validated tool calls to an AI agent through a uniform interface. Instead of calling an ad-hoc notebook function such as predict_logp(smiles), which carries no contract, the agent calls the MCP tool covabasic_druglikeness, which is versioned, audit-logged, and bound to a defined input-output contract.
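As a rough sketch of what such a call looks like on the wire: MCP is built on JSON-RPC 2.0, and a tool invocation is a `tools/call` request carrying the tool name and an arguments object. The tool name is the one mentioned above; the `smiles` argument name and the example molecule are assumptions, since the real input schema is whatever the server advertises through `tools/list`.

```python
import json
from itertools import count

# Monotonic JSON-RPC request ids for this client session.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for MCP's `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# ASSUMPTION: the `smiles` key is illustrative only; check the schema
# returned by `tools/list` before relying on it.
request = build_tool_call("covabasic_druglikeness", {"smiles": "CCO"})
print(json.dumps(request, indent=2))
```

The point of the envelope is that the agent never sees an arbitrary Python signature. It sees a named tool with a declared schema, which is what makes the call loggable and replayable.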

For the three Nature architectures that means: if Robin in the next iteration is supposed to not just propose ripasudil but also check its ICH M7 profile, that's an MCP tool call. If Co-Scientist proposes KIRA6 as an AML candidate, computing selectivity against off-targets is an MCP tool call. If ERA optimises a batch-integration method, validating against a ground-truth spectral dataset is an MCP tool call. In each case the tool answer is deterministic, reproducible and auditable — precisely what an LLM hypothesis needs to move past "interesting".

Where CovaSyn fits in this stack

CovaSyn provides an MCP layer for the pharma / biotech / chemistry domain: 130 tools across 8 families (cheminformatics, toxicology, mass spectrometry, NMR, stability, bio, DoE, optimisation). Three concrete attachment points for AI-scientist systems:

  • Hypothesis validation. When the agent proposes "compound X for indication Y", a single CovaSyn call (covatox_assess_ichm7_batch) returns a deterministic mutagenic-impurity triage — the same output a human reviewer would produce, in seconds instead of hours.
  • Experiment-design constraints. Stability studies follow ICH Q1A / Q1E. A Co-Scientist loop that calls covastab_design_study gets a protocol-compliant stress scheme back rather than having to hallucinate one.
  • Reproducibility for regulatory submissions. Tools are version-pinned and audit logs are available, which is what an FDA or EMA submission needs and what a generic Python function in a notebook does not provide.
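To make "version-pinned and audit-logged" concrete, here is a minimal sketch of what a client-side audit record could look like. It is not CovaSyn's actual log format: the `AuditRecord` fields, the version string, the argument names, and the placeholder result are all assumptions for illustration. The idea is only that hashing a canonical form of inputs and outputs lets a later rerun be compared byte for byte.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical audit entry; field names are illustrative."""
    tool: str
    tool_version: str
    input_sha256: str
    output_sha256: str


def canonical_hash(payload: dict) -> str:
    """Hash a JSON payload independently of key order and whitespace."""
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def record_call(tool: str, tool_version: str,
                arguments: dict, result: dict) -> AuditRecord:
    """Create an audit entry for one tool call."""
    return AuditRecord(tool, tool_version,
                       canonical_hash(arguments), canonical_hash(result))


def is_reproduced(original: AuditRecord, rerun: AuditRecord) -> bool:
    """A rerun reproduces the original only if tool, version, input and
    output hashes all match."""
    return original == rerun


# ASSUMPTION: arguments, version and result are placeholders, not real output.
args = {"smiles": ["CCO"]}
placeholder_result = {"status": "placeholder"}
first = record_call("covatox_assess_ichm7_batch", "1.0.0", args, placeholder_result)
rerun = record_call("covatox_assess_ichm7_batch", "1.0.0", args, placeholder_result)
print(is_reproduced(first, rerun))
```

Including the tool version in the record matters: a rerun against a newer tool version is deliberately treated as a different call, even if the output happens to match.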

We measured this ahead of time on MolecularIQ (ICLR benchmark post): four frontier LLMs (Haiku 4.5, Opus 4.7, GPT-5.5, Gemini 3.5 Flash) go from 14–41 % to 76–92 % accuracy once CovaSyn MCP is attached. The lift is not model-specific, it is structural: deterministic tools close the gap that LLM hallucination leaves open.

What the papers imply for practice

If you are building an AI-scientist agent for pharma R&D today, the papers' open ends offer a working spec:

1. Separate hypothesis generation from validation. The LLM layer generates candidates; the deterministic layer qualifies them. Robin and Co-Scientist do exactly that, except both built the deterministic layer ad hoc.
2. Build the validation layer from standard components. Custom wrappers do not help when the agent needs the next tool call six months in. MCP is the standard Anthropic, OpenAI and major open-source projects coalesced around in 2025.
3. Plan the audit trail from day one. If the agent is ever supposed to feed into a GxP pipeline, every tool answer must be version-pinned and reproducible. CovaSyn does that by default; rolling it yourself means writing it on top.
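The first point, keeping generation and validation apart, can be sketched as a small pipeline. This is one possible shape under stated assumptions, not how Robin, Co-Scientist or CovaSyn implement it: `Candidate`, `Verdict`, `qualify` and the stub check are hypothetical names, and the stub stands in for a deterministic tool call that a real system would route through MCP.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Candidate:
    """A hypothesis produced by the (non-deterministic) LLM layer."""
    compound: str
    indication: str


@dataclass(frozen=True)
class Verdict:
    """Outcome of the deterministic qualification layer."""
    candidate: Candidate
    passed: bool
    reason: str


Validator = Callable[[Candidate], Verdict]


def qualify(candidates: Iterable[Candidate],
            validators: list[Validator]) -> list[Verdict]:
    """Run each candidate through the validators in order and keep the
    first failing verdict, or a pass if none fails."""
    verdicts = []
    for candidate in candidates:
        verdict = Verdict(candidate, True, "all checks passed")
        for validate in validators:
            result = validate(candidate)
            if not result.passed:
                verdict = result
                break
        verdicts.append(verdict)
    return verdicts


def has_compound_name(candidate: Candidate) -> Verdict:
    """Stub check (ASSUMPTION): a placeholder for a real deterministic tool."""
    ok = bool(candidate.compound.strip())
    return Verdict(candidate, ok, "ok" if ok else "missing compound name")


proposals = [Candidate("ripasudil", "dry AMD"), Candidate("", "AML")]
for v in qualify(proposals, [has_compound_name]):
    print(v.candidate.compound or "<empty>", v.passed, v.reason)
```

Because validators are plain callables with one signature, swapping an ad-hoc check for a standard tool call changes one list entry rather than the agent loop.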

What we offer concretely

If you are working on such a stack — in-house, academic, or in a CDMO context — there are three reasonable touchpoints with CovaSyn:

  • Free tier to try it: 100 credits per week, all 130 tools, MCP endpoint pluggable into Claude Desktop, Cursor, VS Code or your own agent. Sign up at workspace.covasyn.com.
  • Benchmark write-up at /en/benchmark — if you want to reproduce the lift in your stack, the methodology is documented.
  • Self-hosted variant for regulated environments where external cloud hosting is off the table — details on the MCP solution for drug discovery page.
