ICLR 2026 Benchmark

On the peer-reviewed MolecularIQ benchmark, four frontier LLMs score 14 to 41 % on chemical structure analysis. With CovaSyn MCP attached, the same models reach 76 to 92 %. Three of the four land at 85 to 92 %; Gemini 3.5 Flash reaches 76 %. Here are the numbers, and what they don't show.

Fig. 1. Tool-augmented LLMs versus LLM-only baselines, broken down by chemistry sub-task. Haiku 4.5 ran on the full test split (n = 3,540); Opus 4.7, GPT-5.5 and Gemini 3.5 Flash ran on a proportionally stratified subset (n = 910 per model).

Top-line numbers

Model	Baseline	+ CovaSyn MCP	Δ	Lift
Claude Haiku 4.5	21.18 %	85.38 %	+64.20 pp	4.03×
Claude Opus 4.7	40.75 %	91.51 %	+50.76 pp	2.25×
OpenAI GPT-5.5	22.29 %	89.92 %	+67.63 pp	4.03×
Gemini 3.5 Flash	13.68 %	75.66 %	+61.98 pp	5.53×
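The Δ and Lift columns follow directly from the two accuracy columns: Δ is the difference in percentage points, Lift is the ratio. A minimal sketch that recomputes them from the table; the names `RESULTS` and `delta_and_lift` are illustrative and not part of any CovaSyn API.

```python
# Top-line accuracies in percent (baseline, with CovaSyn MCP), copied from the table above.
RESULTS = {
    "Claude Haiku 4.5": (21.18, 85.38),
    "Claude Opus 4.7": (40.75, 91.51),
    "OpenAI GPT-5.5": (22.29, 89.92),
    "Gemini 3.5 Flash": (13.68, 75.66),
}


def delta_and_lift(baseline, with_mcp):
    """Return (absolute gain in percentage points, relative lift as a ratio)."""
    return round(with_mcp - baseline, 2), round(with_mcp / baseline, 2)


for model, (baseline, with_mcp) in RESULTS.items():
    delta, lift = delta_and_lift(baseline, with_mcp)
    print(f"{model}: +{delta:.2f} pp, {lift:.2f}x")
```

Reporting both matters because they rank the models differently: GPT-5.5 has the largest absolute gain, while Gemini 3.5 Flash has the largest relative lift.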

What this means in cost terms

Frontier models are expensive. With CovaSyn attached, the smaller Haiku 4.5 outscores the unaided Opus 4.7 at less than a third of the cost per question.

Configuration	Accuracy	$/question	Latency
Haiku 4.5 baseline	21.18 %	$0.00069	2.1 s
Haiku 4.5 + CovaSyn MCP	85.38 %	$0.00781	5.8 s
Opus 4.7 baseline	40.75 %	$0.02529	5.1 s
Opus 4.7 + CovaSyn MCP	91.51 %	$0.12536	7.4 s
GPT-5.5 baseline	22.29 %	$0.02750	7.9 s
GPT-5.5 + CovaSyn MCP	89.92 %	$0.03005	9.4 s
Gemini 3.5 Flash baseline	13.68 %	$0.00940	5.5 s
Gemini 3.5 Flash + CovaSyn MCP	75.66 %	$0.02170	10.8 s

The sharp claim:

Haiku 4.5 + CovaSyn is the cost-efficiency sweet spot: 2.1× the accuracy of the Opus 4.7 baseline at about 31 % of its cost, and 16× cheaper than Opus 4.7 + CovaSyn while giving up only 6 pp of accuracy. Gemini 3.5 Flash + CovaSyn delivers the largest relative lift (5.53×, from 13.7 % to 75.7 %) at roughly 2.3× baseline cost and 2× baseline latency, which makes it the natural option for teams already running Gemini in their stack.
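The ratios in this claim can be recomputed from the cost table. The sketch below does that for the three comparisons mentioned; the short keys (`haiku_mcp`, `opus_base`, and so on) and the helper `compare` are illustrative names, and the values are copied from the table as published.

```python
# (accuracy in %, cost in USD per question, latency in seconds), from the cost table.
COST_TABLE = {
    "haiku_mcp": (85.38, 0.00781, 5.8),
    "opus_base": (40.75, 0.02529, 5.1),
    "opus_mcp": (91.51, 0.12536, 7.4),
    "gemini_base": (13.68, 0.00940, 5.5),
    "gemini_mcp": (75.66, 0.02170, 10.8),
}


def compare(a, b):
    """Ratios of configuration a to configuration b, plus the accuracy gap in pp."""
    acc_a, cost_a, lat_a = COST_TABLE[a]
    acc_b, cost_b, lat_b = COST_TABLE[b]
    return {
        "accuracy_ratio": acc_a / acc_b,
        "accuracy_gap_pp": acc_a - acc_b,
        "cost_ratio": cost_a / cost_b,
        "latency_ratio": lat_a / lat_b,
    }


pairs = [
    ("haiku_mcp", "opus_base"),     # small model with tools vs. large model alone
    ("opus_mcp", "haiku_mcp"),      # what the extra accuracy of Opus + MCP costs
    ("gemini_mcp", "gemini_base"),  # overhead of adding tools to Gemini
]
for a, b in pairs:
    summary = ", ".join(f"{k}={v:.2f}" for k, v in compare(a, b).items())
    print(f"{a} vs {b}: {summary}")
```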

Fig. 2. Cost-accuracy Pareto: accuracy on the y-axis versus cost per question on the x-axis for four models, each with and without CovaSyn MCP. Haiku with CovaSyn sits top-left: high accuracy at low cost per question. Gemini 3.5 Flash with CovaSyn sits below it, at lower accuracy and a higher cost per question.
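A configuration is on the cost-accuracy Pareto frontier when no other configuration is both cheaper and more accurate. A minimal sketch of that check over the eight configurations in the cost table; `pareto_frontier` is an illustrative helper and it assumes all costs are distinct, which holds for this table.

```python
# (configuration, cost in USD per question, accuracy in %), from the cost table.
POINTS = [
    ("Haiku 4.5 baseline", 0.00069, 21.18),
    ("Haiku 4.5 + CovaSyn MCP", 0.00781, 85.38),
    ("Opus 4.7 baseline", 0.02529, 40.75),
    ("Opus 4.7 + CovaSyn MCP", 0.12536, 91.51),
    ("GPT-5.5 baseline", 0.02750, 22.29),
    ("GPT-5.5 + CovaSyn MCP", 0.03005, 89.92),
    ("Gemini 3.5 Flash baseline", 0.00940, 13.68),
    ("Gemini 3.5 Flash + CovaSyn MCP", 0.02170, 75.66),
]


def pareto_frontier(points):
    """Return the names of configurations not dominated on (lower cost, higher accuracy)."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; a point stays only if it beats
    # the accuracy of everything cheaper than it.
    for name, cost, accuracy in sorted(points, key=lambda p: p[1]):
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier


print(pareto_frontier(POINTS))
```

With the table's numbers, the frontier consists of the Haiku 4.5 baseline, Haiku 4.5 + CovaSyn MCP, GPT-5.5 + CovaSyn MCP and Opus 4.7 + CovaSyn MCP.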

Fig. 3. David versus Goliath. The smaller model with MCP attached beats the larger model on its own. The architecture question becomes a layer question.

Where CovaSyn lifts hardest

Mean accuracy lift across 8 question categories, averaged across Haiku 4.5, Opus 4.7 and GPT-5.5 (Gemini 3.5 Flash data will follow with the next snapshot):

Category	Baseline	+ CovaSyn MCP	Δ
Scaffold & fragments	18.0 %	86.5 %	+68.4 pp
Rings & topology	29.4 %	93.2 %	+63.8 pp
Bonds & chains	17.6 %	80.9 %	+63.3 pp
Multi-feature questions	27.3 %	88.4 %	+61.1 pp
Atom & formula counts	38.7 %	98.3 %	+59.7 pp
Stereochemistry	28.7 %	86.0 %	+57.4 pp
Electronics & H-bonds	31.2 %	81.5 %	+50.3 pp

Fig. 4. Lift per category across all models, grouped by question type. Strongest leverage: scaffold & fragments (+68.4 pp).

Fig. 5. Tool efficiency: accuracy gain per tool call across all models. It shows that the tools contribute information rather than just noise.

Fig. 6. Overall summary: accuracy of all eight configurations (four models, two configurations each) across three complexity bins.

Methodology

Benchmark

MolecularIQ by Bartmann et al., ICLR 2026 (arXiv:2601.15279). 3,540 tasks, 65 features, three complexity bins. Dataset public on HuggingFace.

Models

Claude Haiku 4.5, Claude Opus 4.7, OpenAI GPT-5.5 and Gemini 3.5 Flash. Each tested with and without CovaSyn MCP.

Verification

Symbolic, no LLM judges. An answer scores only when it matches the ground truth in full.
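To illustrate what all-or-nothing scoring means in practice, here is a minimal sketch. It is not the benchmark's actual verifier: the functions `is_correct` and `accuracy`, the dictionary answer format and the feature names `ring_count` and `stereocenter_count` are all assumptions made for this example.

```python
def is_correct(predicted: dict, truth: dict) -> bool:
    """Strict match: every requested feature must be present and equal.

    No partial credit is given, so one wrong value fails the whole answer.
    """
    return predicted == truth


def accuracy(pairs) -> float:
    """Percentage of (predicted, truth) pairs that match in full."""
    pairs = list(pairs)
    return 100.0 * sum(is_correct(p, t) for p, t in pairs) / len(pairs)


# Hypothetical multi-feature answers for two questions.
sample = [
    # Both features right: counts as correct.
    ({"ring_count": 3, "stereocenter_count": 2},
     {"ring_count": 3, "stereocenter_count": 2}),
    # One of two features wrong: counts as incorrect, not half right.
    ({"ring_count": 3, "stereocenter_count": 1},
     {"ring_count": 3, "stereocenter_count": 2}),
]
print(accuracy(sample))
```

Under this kind of rule, multi-feature questions are harder to score on than single-feature ones, because every sub-answer has to be right at once.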

Tools

Five chemistry primitives from the CovaBasicChem suite. Cheminformatics operations, deterministic, validated.

Volume

12,540 model responses in total. Haiku ran the full test split; Opus, GPT-5.5 and Gemini each ran a stratified sample of 910 questions. Every question was answered twice, once with and once without CovaSyn MCP.
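The total follows from the split sizes and the two configurations per model. A short check, with illustrative variable names:

```python
FULL_TEST_SPLIT = 3540    # questions answered by Haiku 4.5
STRATIFIED_SUBSET = 910   # questions answered by each of the other three models
CONFIGS_PER_MODEL = 2     # baseline and + CovaSyn MCP

questions_per_model = {
    "Haiku 4.5": FULL_TEST_SPLIT,
    "Opus 4.7": STRATIFIED_SUBSET,
    "GPT-5.5": STRATIFIED_SUBSET,
    "Gemini 3.5 Flash": STRATIFIED_SUBSET,
}

# One response per question per configuration.
total_responses = sum(n * CONFIGS_PER_MODEL for n in questions_per_model.values())
print(total_responses)
```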

Where we still improve

We do not hit 100 %, and we do not want to hide that. Here is how the remaining gap breaks down and where you would look closer for your own validation.

Category	Haiku + MCP	Opus + MCP	GPT-5.5 + MCP	Gemini + MCP
Correct	73.2 %	83.0 %	83.6 %	72.3 %
Tool result discarded	21.6 %	14.5 %	10.9 %	6.9 %
Tool value off	4.8 %	2.2 %	1.4 %	0.7 %
Format error	0.2 %	0.2 %	4.1 %	20.1 %

Most of the remaining gap comes from how the model handles the tool output, either by discarding a tool result or by misformatting the final answer, not from the tool itself. We address that continuously.
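One way to read the table is to split each model's failures into a model-side share and a tool-side share. The sketch below assumes that "Tool result discarded" and "Format error" are model-side and "Tool value off" is tool-side; that grouping is this example's interpretation, and the names `FAILURES` and `model_side_share` are illustrative.

```python
# Failure rates in percent of all answers (discarded, tool value off, format error),
# copied from the table above.
FAILURES = {
    "Haiku + MCP": (21.6, 4.8, 0.2),
    "Opus + MCP": (14.5, 2.2, 0.2),
    "GPT-5.5 + MCP": (10.9, 1.4, 4.1),
    "Gemini + MCP": (6.9, 0.7, 20.1),
}


def model_side_share(discarded, value_off, format_error):
    """Fraction of all failures attributed to the model rather than the tool."""
    model_side = discarded + format_error
    return model_side / (model_side + value_off)


for config, rates in FAILURES.items():
    print(f"{config}: {model_side_share(*rates):.0%} of failures are model-side")
```

The split also shows where the failures differ by model: for Haiku and Opus the dominant failure is a discarded tool result, while for Gemini it is answer formatting.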

Citation

Bartmann C., Schimunek J., Ielanskyi M., Seidl P., Klambauer G., Luukkonen S. (2026). MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs. ICLR 2026 (poster, Pavilion 4 · P4-#5202, 24 Apr 2026), arXiv:2601.15279. Code: github.com/ml-jku/moleculariq. Dataset: huggingface.co/datasets/ml-jku/moleculariq-v0.0. Data snapshot: 2026-05-17.

Test it yourself

The tools that produced this lift are available in every CovaSyn account, including the free tier with 100 credits per week.
