CovaSyn

ICLR 2026 Benchmark

On the peer-reviewed MolecularIQ benchmark, four frontier LLMs score 14 to 41 % on chemical structure analysis. With CovaSyn MCP attached, the same models reach 76 to 92 %. Three of the four land at 85 to 92 %; Gemini 3.5 Flash, which starts from the lowest baseline, reaches 76 %. Here are the numbers, and what they don't show.

Fig. 1. Baseline versus CovaSyn MCP: tool-augmented LLMs against LLM-only baselines, broken down by chemistry sub-task. Haiku 4.5 was evaluated on the full test split (n = 3,540); Opus 4.7, GPT-5.5 and Gemini 3.5 Flash on a proportionally stratified subset (n = 910 per model).

Top-line numbers

| Model | Baseline | + CovaSyn MCP | Δ | Lift |
|---|---|---|---|---|
| Claude Haiku 4.5 | 21.18 % | 85.38 % | +64.20 pp | 4.03× |
| Claude Opus 4.7 | 40.75 % | 91.51 % | +50.76 pp | 2.25× |
| OpenAI GPT-5.5 | 22.29 % | 89.92 % | +67.63 pp | 4.03× |
| Gemini 3.5 Flash | 13.68 % | 75.66 % | +61.98 pp | 5.53× |
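
The Δ and Lift columns follow directly from the two accuracy columns: Δ is the absolute gain in percentage points, Lift is the ratio of tool-augmented to baseline accuracy. A minimal sketch that recomputes both from the figures in the table (the function and variable names are illustrative, not part of any CovaSyn API):

```python
# Baseline and tool-augmented accuracy in percent, copied from the table above.
RESULTS = {
    "Claude Haiku 4.5": (21.18, 85.38),
    "Claude Opus 4.7": (40.75, 91.51),
    "OpenAI GPT-5.5": (22.29, 89.92),
    "Gemini 3.5 Flash": (13.68, 75.66),
}


def delta_and_lift(baseline: float, with_mcp: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative lift as a ratio)."""
    return round(with_mcp - baseline, 2), round(with_mcp / baseline, 2)


# One (delta_pp, lift) pair per model, rounded to two decimals like the table.
SUMMARY = {model: delta_and_lift(*scores) for model, scores in RESULTS.items()}

if __name__ == "__main__":
    for model, (delta_pp, lift) in SUMMARY.items():
        print(f"{model:<18} +{delta_pp:.2f} pp  {lift:.2f}x")
```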

What this means in cost terms

Frontier models are expensive. With CovaSyn, you can often run the cheaper model without giving up accuracy.

| Configuration | Accuracy | $/question | Latency |
|---|---|---|---|
| Haiku 4.5 baseline | 21.18 % | $0.00069 | 2.1 s |
| Haiku 4.5 + CovaSyn MCP | 85.38 % | $0.00781 | 5.8 s |
| Opus 4.7 baseline | 40.75 % | $0.02529 | 5.1 s |
| Opus 4.7 + CovaSyn MCP | 91.51 % | $0.12536 | 7.4 s |
| GPT-5.5 baseline | 22.29 % | $0.02750 | 7.9 s |
| GPT-5.5 + CovaSyn MCP | 89.92 % | $0.03005 | 9.4 s |
| Gemini 3.5 Flash baseline | 13.68 % | $0.00940 | 5.5 s |
| Gemini 3.5 Flash + CovaSyn MCP | 75.66 % | $0.02170 | 10.8 s |

The sharp claim:

Haiku 4.5 + CovaSyn is the cost-efficiency sweet spot: 2.1× the accuracy of the Opus 4.7 baseline at roughly 31 % of the cost ($0.00781 versus $0.02529 per question), and 16× cheaper than Opus 4.7 + CovaSyn while giving up only 6 pp of accuracy. Gemini 3.5 Flash + CovaSyn delivers the largest relative lift (5.53×, from 13.7 % to 75.7 %) at roughly 2.3× baseline cost and 2× baseline latency, which makes it the natural option for teams already running Gemini in their stack.
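
The ratios in this claim can be recomputed from the cost table. The sketch below is only arithmetic on the published figures; the `compare` helper and the dictionary layout are illustrative assumptions.

```python
# (accuracy in percent, cost in USD per question), copied from the cost table.
CONFIGS = {
    "Haiku 4.5 + MCP": (85.38, 0.00781),
    "Opus 4.7 baseline": (40.75, 0.02529),
    "Opus 4.7 + MCP": (91.51, 0.12536),
    "Gemini 3.5 Flash baseline": (13.68, 0.00940),
    "Gemini 3.5 Flash + MCP": (75.66, 0.02170),
}


def compare(a: str, b: str) -> dict[str, float]:
    """Compare configuration a against configuration b."""
    acc_a, cost_a = CONFIGS[a]
    acc_b, cost_b = CONFIGS[b]
    return {
        "accuracy_ratio": acc_a / acc_b,  # how many times more accurate a is
        "cost_ratio": cost_a / cost_b,    # a's cost as a multiple of b's cost
        "gap_pp": acc_a - acc_b,          # accuracy difference in percentage points
    }


HAIKU_VS_OPUS_BASE = compare("Haiku 4.5 + MCP", "Opus 4.7 baseline")
OPUS_MCP_VS_HAIKU = compare("Opus 4.7 + MCP", "Haiku 4.5 + MCP")
GEMINI_MCP_VS_BASE = compare("Gemini 3.5 Flash + MCP", "Gemini 3.5 Flash baseline")

if __name__ == "__main__":
    for label, result in [
        ("Haiku + MCP vs Opus baseline", HAIKU_VS_OPUS_BASE),
        ("Opus + MCP vs Haiku + MCP", OPUS_MCP_VS_HAIKU),
        ("Gemini + MCP vs Gemini baseline", GEMINI_MCP_VS_BASE),
    ]:
        print(label, {k: round(v, 3) for k, v in result.items()})
```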

Fig. 2. Cost-accuracy Pareto: accuracy (y-axis) versus cost per question (x-axis) for four models, each with and without CovaSyn MCP. Haiku with CovaSyn sits top-left: high accuracy at low cost per question. Gemini 3.5 Flash with CovaSyn lands below it in accuracy and, per the cost table above, at a higher cost per question.

Fig. 3. David versus Goliath. The smaller model with MCP attached beats the larger model on its own. The architecture question becomes a layer question.
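
For readers who want to reproduce the cost-accuracy frontier from the table rather than read it off the figure, here is a minimal sketch. It treats a configuration as dominated when another one is at least as accurate and at least as cheap; the `Config` type and `pareto_frontier` function are illustrative names, not part of the benchmark code.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    accuracy: float  # percent
    cost: float      # USD per question


# The eight configurations from the cost table.
CONFIGS = [
    Config("Haiku 4.5 baseline", 21.18, 0.00069),
    Config("Haiku 4.5 + MCP", 85.38, 0.00781),
    Config("Opus 4.7 baseline", 40.75, 0.02529),
    Config("Opus 4.7 + MCP", 91.51, 0.12536),
    Config("GPT-5.5 baseline", 22.29, 0.02750),
    Config("GPT-5.5 + MCP", 89.92, 0.03005),
    Config("Gemini 3.5 Flash baseline", 13.68, 0.00940),
    Config("Gemini 3.5 Flash + MCP", 75.66, 0.02170),
]


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return the configurations that no cheaper-or-equal option matches in accuracy."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, higher accuracy first.
    for cfg in sorted(configs, key=lambda c: (c.cost, -c.accuracy)):
        if cfg.accuracy > best_accuracy:
            frontier.append(cfg)
            best_accuracy = cfg.accuracy
    return frontier


FRONTIER = [cfg.name for cfg in pareto_frontier(CONFIGS)]

if __name__ == "__main__":
    print(FRONTIER)
```

On the table's numbers the frontier consists of Haiku 4.5 baseline, Haiku 4.5 + MCP, GPT-5.5 + MCP and Opus 4.7 + MCP; every other configuration is matched or beaten by a cheaper one.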

Where CovaSyn lifts hardest

Mean accuracy lift across 8 question categories (averaged across Haiku 4.5, Opus 4.7 and GPT-5.5; Gemini 3.5 Flash data will follow with the next snapshot):

| Category | Baseline | + CovaSyn MCP | Δ |
|---|---|---|---|
| Scaffold & fragments | 18.0 % | 86.5 % | +68.4 pp |
| Rings & topology | 29.4 % | 93.2 % | +63.8 pp |
| Bonds & chains | 17.6 % | 80.9 % | +63.3 pp |
| Multi-feature questions | 27.3 % | 88.4 % | +61.1 pp |
| Atom & formula counts | 38.7 % | 98.3 % | +59.7 pp |
| Stereochemistry | 28.7 % | 86.0 % | +57.4 pp |
| Electronics & H-bonds | 31.2 % | 81.5 % | +50.3 pp |

Fig. 4. Lift per category across all models, grouped by question type. Strongest leverage: scaffold & fragments (+68.4 pp).

Fig. 5. Tool efficiency. Accuracy gain per tool call across all models. Demonstrates that tools genuinely contribute information rather than just noise.

Fig. 6. Overall summary. Accuracy of all eight configurations (four models, two configurations each) across three complexity bins.

Methodology

Benchmark

MolecularIQ by Bartmann et al., ICLR 2026 (arXiv:2601.15279). 3,540 tasks, 65 features, three complexity bins. Dataset public on HuggingFace.

Models

Claude Haiku 4.5, Claude Opus 4.7, OpenAI GPT-5.5 and Gemini 3.5 Flash. Each tested with and without CovaSyn MCP.

Verification

Symbolic, no LLM judges. A response scores only when the full answer matches the ground truth.
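
All-or-nothing scoring of this kind can be sketched as follows. This is an assumption-laden illustration, not the benchmark's actual verifier: it assumes an answer is a mapping from feature name to value, and the field names used in the example (`ring_count`, `stereocenters`) are hypothetical.

```python
from typing import Iterable


def score_response(predicted: dict, truth: dict) -> int:
    """Return 1 only if every ground-truth field is answered exactly, else 0."""
    # Missing or extra fields count as a mismatch: no partial credit.
    if predicted.keys() != truth.keys():
        return 0
    return int(all(predicted[key] == truth[key] for key in truth))


def accuracy(pairs: Iterable[tuple[dict, dict]]) -> float:
    """Percentage of (predicted, truth) pairs that match in full."""
    pairs = list(pairs)
    return 100.0 * sum(score_response(p, t) for p, t in pairs) / len(pairs)


# Hypothetical two-feature question: one fully correct, one partially correct.
TRUTH = {"ring_count": 2, "stereocenters": 1}
FULL_MATCH = score_response({"ring_count": 2, "stereocenters": 1}, TRUTH)
PARTIAL_MATCH = score_response({"ring_count": 2, "stereocenters": 0}, TRUTH)
```

Under this scheme a multi-feature question with one wrong value scores the same as a completely wrong answer.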

Tools

Five chemistry primitives from the CovaBasicChem suite. Cheminformatics operations, deterministic, validated.
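
To illustrate what "deterministic" means here, the toy function below counts atoms in a flat molecular formula. It is not one of the CovaBasicChem tools and makes no claim about how they are implemented; it only shows the kind of operation that returns the same answer on every call, which an LLM working from text alone does not guarantee.

```python
import re
from collections import Counter

# An element symbol (one capital, optional lowercase) followed by an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula: str) -> dict[str, int]:
    """Count atoms in a flat molecular formula such as 'C6H12O6'.

    Toy parser: parentheses, charges and isotopes are not supported.
    """
    counts: Counter[str] = Counter()
    consumed = 0
    for match in _TOKEN.finditer(formula):
        element, number = match.groups()
        counts[element] += int(number) if number else 1
        consumed += len(match.group(0))
    # Reject anything the pattern could not fully account for.
    if consumed != len(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    return dict(counts)


GLUCOSE = atom_counts("C6H12O6")
ETHANOL = atom_counts("C2H5OH")
```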

Volume

12,540 model responses in total: (3,540 + 3 × 910) questions, each answered with and without CovaSyn MCP. Haiku ran the full test split; Opus, GPT-5.5 and Gemini each ran a stratified sample of 910 questions.
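
A proportionally stratified subset keeps each stratum's share of the sample equal to its share of the full split. The sketch below shows one common way to do the allocation (largest-remainder rounding) and rechecks the response total. The stratum names and sizes are hypothetical placeholders; the source does not state which strata or sizes were used.

```python
def proportional_allocation(stratum_sizes: dict[str, int], sample_size: int) -> dict[str, int]:
    """Split sample_size across strata in proportion to their size.

    Uses largest-remainder rounding so the allocations sum exactly to sample_size.
    """
    total = sum(stratum_sizes.values())
    exact = {name: sample_size * size / total for name, size in stratum_sizes.items()}
    allocation = {name: int(share) for name, share in exact.items()}
    leftover = sample_size - sum(allocation.values())
    # Hand the remaining slots to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical stratum sizes that sum to the 3,540-question test split.
HYPOTHETICAL_STRATA = {"bin_1": 1500, "bin_2": 1240, "bin_3": 800}
ALLOCATION = proportional_allocation(HYPOTHETICAL_STRATA, 910)

# Reported volume: one full split plus three subsets, each run with and without MCP.
TOTAL_RESPONSES = (3540 + 3 * 910) * 2
```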

Where we still improve

We do not hit 100 %, and we do not want to hide that. Here is how the remaining gap breaks down, and where to look more closely in your own validation.

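| Category | Haiku + MCP | Opus + MCP | GPT-5.5 + MCP | Gemini + MCP |
|---|---|---|---|---|
| Correct | 73.2 % | 83.0 % | 83.6 % | 72.3 % |
| Tool result discarded | 21.6 % | 14.5 % | 10.9 % | 6.9 % |
| Tool value off | 4.8 % | 2.2 % | 1.4 % | 0.7 % |
| Format error | 0.2 % | 0.2 % | 4.1 % | 20.1 % |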

Most of the remaining gap sits between tool and model, not in the tool itself. We address that continuously.
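
One way to check this reading is to normalise the three failure rows so they sum to 1 per model, then look at how much of the gap the "Tool value off" row accounts for. A minimal sketch on the table's figures (the key names are illustrative):

```python
# Outcome shares in percent, copied from the table above.
OUTCOMES = {
    "Haiku + MCP": {"correct": 73.2, "tool_result_discarded": 21.6, "tool_value_off": 4.8, "format_error": 0.2},
    "Opus + MCP": {"correct": 83.0, "tool_result_discarded": 14.5, "tool_value_off": 2.2, "format_error": 0.2},
    "GPT-5.5 + MCP": {"correct": 83.6, "tool_result_discarded": 10.9, "tool_value_off": 1.4, "format_error": 4.1},
    "Gemini + MCP": {"correct": 72.3, "tool_result_discarded": 6.9, "tool_value_off": 0.7, "format_error": 20.1},
}


def failure_shares(outcome: dict[str, float]) -> dict[str, float]:
    """Fraction of all failures attributable to each failure category."""
    failures = {name: value for name, value in outcome.items() if name != "correct"}
    total = sum(failures.values())
    return {name: value / total for name, value in failures.items()}


SHARES = {model: failure_shares(outcome) for model, outcome in OUTCOMES.items()}

if __name__ == "__main__":
    for model, shares in SHARES.items():
        print(model, {name: f"{share:.0%}" for name, share in shares.items()})
```

On these numbers, wrong tool values account for less than a fifth of the failures for every model; the rest comes from discarded tool results and formatting errors.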

Citation

Bartmann C., Schimunek J., Ielanskyi M., Seidl P., Klambauer G., Luukkonen S. (2026). MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs. ICLR 2026 (poster, Pavilion 4 · P4-#5202, 24 Apr 2026), arXiv:2601.15279. Code: github.com/ml-jku/moleculariq. Dataset: huggingface.co/datasets/ml-jku/moleculariq-v0.0. Data snapshot: 2026-05-17.

Go deeper

In-depth analysis with methodology, implications and FAQ

About 12 minutes of reading. Background on model choice, cost Pareto in detail, GxP implications, FAQs.

Test it yourself

The tools that produced this lift are available in every CovaSyn account, including the free tier with 100 credits per week.
