ICLR 2026 Benchmark
On the peer-reviewed MolecularIQ benchmark, frontier LLMs score 21–41 % on chemical structure analysis. With CovaSyn MCP tools: 85–92 %. Here are the numbers, and what they don't show.
Top-line numbers
| Model | Baseline | + CovaSyn MCP | Δ | Lift |
|---|---|---|---|---|
| Claude Haiku 4.5 | 21.18 % | 85.38 % | +64.20 pp | 4.03× |
| Claude Opus 4.7 | 40.75 % | 91.51 % | +50.76 pp | 2.25× |
| OpenAI GPT-5.5 | 22.29 % | 89.92 % | +67.63 pp | 4.03× |
What this means in cost terms
Frontier models are expensive. With CovaSyn, you can often run the cheaper model without giving up accuracy.
| Configuration | Accuracy | $/question | Latency |
|---|---|---|---|
| Opus 4.7 baseline | 40.75 % | $0.02529 | 5.1 s |
| Opus 4.7 + CovaSyn MCP | 91.51 % | $0.12536 | 7.4 s |
| Haiku 4.5 + CovaSyn MCP | 85.38 % | $0.00781 | 5.8 s |
| Haiku 4.5 baseline | 21.18 % | $0.00069 | 2.1 s |
| GPT-5.5 + CovaSyn MCP | 89.92 % | $0.03005 | 9.4 s |
The sharp claim:
Haiku 4.5 + CovaSyn delivers 2.1× the accuracy of Opus 4.7 baseline at 32 % of the cost, and stays 16× cheaper than Opus 4.7 + CovaSyn while giving up only 6 pp accuracy.
Where CovaSyn lifts hardest
Mean accuracy lift across 8 question categories (averaged across all three models):
| Category | Baseline | + CovaSyn MCP | Δ |
|---|---|---|---|
| Scaffold & fragments | 18.0 % | 86.5 % | +68.4 pp |
| Rings & topology | 29.4 % | 93.2 % | +63.8 pp |
| Bonds & chains | 17.6 % | 80.9 % | +63.3 pp |
| Multi-feature questions | 27.3 % | 88.4 % | +61.1 pp |
| Atom & formula counts | 38.7 % | 98.3 % | +59.7 pp |
| Stereochemistry | 28.7 % | 86.0 % | +57.4 pp |
| Electronics & H-bonds | 31.2 % | 81.5 % | +50.3 pp |
Methodology
Benchmark
MolecularIQ by Bartmann et al., ICLR 2026 (arXiv:2601.15279). 3,540 tasks, 65 features, three complexity bins. Dataset public on HuggingFace.
Models
Claude Haiku 4.5, Claude Opus 4.7 and GPT-5.5. Each tested with and without CovaSyn MCP.
Verification
Symbolic, no LLM judges. Score only when the full answer matches ground truth.
Tools
Five chemistry primitives from the CovaBasicChem suite. Cheminformatics operations, deterministic, validated.
Volume
10,720 model responses in total. Haiku ran the full set, Opus and GPT-5.5 a stratified sample.
Where we still improve
We do not hit 100 %, and we do not want to hide that. Here is how the remaining gap breaks down and where you would look closer for your own validation.
| Category | Haiku + MCP | Opus + MCP | GPT-5.5 + MCP |
|---|---|---|---|
| Correct | 73.2 % | 83.0 % | 83.6 % |
| Tool result discarded | 21.6 % | 14.5 % | 10.9 % |
| Tool value off | 4.8 % | 2.2 % | 1.4 % |
| Format error | 0.2 % | 0.2 % | 4.1 % |
Most of the remaining gap sits between tool and model, not in the tool itself. We address that continuously.
Citation
Bartmann C., Schimunek J., Ielanskyi M., Seidl P., Klambauer G., Luukkonen S. (2026). MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs. ICLR 2026 (poster, Pavilion 4 · P4-#5202, 24 Apr 2026), arXiv:2601.15279. Code: github.com/ml-jku/moleculariq. Dataset: huggingface.co/datasets/ml-jku/moleculariq-v0.0. Data snapshot: 2026-05-17.
Go deeper
In-depth analysis with methodology, implications and FAQ
About 12 minutes of reading. Background on model choice, cost Pareto in detail, GxP implications, FAQs. →
Test it yourself
The tools that produced this lift are available in every CovaSyn account, including the free tier with 100 credits per week.
