🧜‍♀️ Merbench - LLM Evaluation

Getting LLMs to consistently nail the Mermaid diagram syntax can be... an adventure.

Merbench evaluates an LLM's ability to autonomously write and debug Mermaid syntax. The agent can access an MCP server that validates its code and provides error feedback, guiding it towards a correct solution.

Each model is tested across three difficulty levels, with at most five attempts per test case. Performance is measured by the final success rate, averaged over complete runs, which reflects both an understanding of Mermaid syntax and effective tool usage.
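
The attempt-limited feedback loop described above can be sketched roughly as follows. This is a minimal illustration, not Merbench's actual harness: `run_test_case`, `toy_generate`, and `toy_validate` are hypothetical names, the validator is a trivial stand-in for the MCP server's Mermaid check, and the generator is a stand-in for the model call.

```python
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 5  # per-test-case attempt limit stated in the description


def run_test_case(
    prompt: str,
    generate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Tuple[bool, str]],
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[bool, int]:
    """Run one test case; return (succeeded, attempts_used).

    `generate` and `validate` are hypothetical callables standing in for
    the model and for the validation tool, respectively.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        diagram = generate(prompt, feedback)
        ok, error = validate(diagram)
        if ok:
            return True, attempt
        feedback = error  # pass the validator's error into the next attempt
    return False, max_attempts


def toy_validate(diagram: str) -> Tuple[bool, str]:
    # Stand-in check only (assumption): a real validator parses the diagram.
    if diagram.startswith("graph "):
        return True, ""
    return False, "expected the diagram to start with 'graph'"


def toy_generate(prompt: str, feedback: Optional[str]) -> str:
    # Stand-in model (assumption): wrong on the first try, fixed after feedback.
    return "grph TD; A-->B" if feedback is None else "graph TD; A-->B"


if __name__ == "__main__":
    print(run_test_case("Draw A pointing to B", toy_generate, toy_validate))
```

With the toy stand-ins, the first attempt fails validation and the second succeeds, so the case counts as a success that used two of its five attempts.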

Evaluation Summary

Total Evaluation Runs: 879

Models Evaluated: 16

Test Cases

Providers Tested: Amazon, Google

Data updated: Oct 14, 2025

What do these metrics mean?

Success Rate: The percentage of successful Mermaid diagram generations out of all runs.
Avg Cost/Run: The average cost in USD to generate one diagram, based on provider pricing.
Price/Success: The effective cost for each successful diagram, calculated as Avg Cost/Run divided by Success Rate (expressed as a fraction). Shown as N/A when no run succeeded.
Avg Duration: The average time in seconds taken to generate a diagram.
Avg Tokens: The average number of tokens (input + output) used per run.
Runs: The total number of times this model was run in the evaluation.
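
As a rough illustration of how these metrics relate, the sketch below computes them from a list of per-run records. The `RunRecord` fields and the `summarize` function are assumptions made for this example, not Merbench's actual data model. Note that Price/Success divides by the success rate as a fraction: an average cost of $0.0206 at a 31.1% success rate gives about $0.066 per successful diagram.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RunRecord:
    # Hypothetical per-run record; field names are assumptions for this sketch.
    success: bool
    cost_usd: float
    duration_s: float
    input_tokens: int
    output_tokens: int


def summarize(runs: List[RunRecord]) -> Dict[str, Optional[float]]:
    """Aggregate per-run records into the leaderboard-style metrics."""
    n = len(runs)
    if n == 0:
        raise ValueError("at least one run is required")

    success_fraction = sum(r.success for r in runs) / n
    avg_cost = sum(r.cost_usd for r in runs) / n

    # Price/Success is undefined (N/A) when no run succeeded.
    price_per_success = avg_cost / success_fraction if success_fraction > 0 else None

    return {
        "success_rate_pct": 100.0 * success_fraction,
        "avg_cost_usd": avg_cost,
        "price_per_success_usd": price_per_success,
        "avg_duration_s": sum(r.duration_s for r in runs) / n,
        "avg_tokens": sum(r.input_tokens + r.output_tokens for r in runs) / n,
        "runs": float(n),
    }


if __name__ == "__main__":
    demo = [
        RunRecord(True, 0.02, 10.0, 900, 100),
        RunRecord(False, 0.02, 20.0, 900, 100),
        RunRecord(False, 0.02, 30.0, 900, 100),
        RunRecord(False, 0.02, 40.0, 900, 100),
    ]
    print(summarize(demo))
```

Returning `None` for Price/Success when the success rate is zero mirrors the N/A entries in the leaderboard and avoids a division by zero.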

Model Leaderboard

Rank	Model	Success Rate ↓	Avg Cost/Run	Price/Success	Avg Duration	Avg Tokens	Runs	Provider
1	gemini-2.5-flash-preview-09-2025	31.1%	$0.0206	$0.0661	32.33s	22,980.822	45	Google
2	gemini-2.5-pro-preview-06-05	29.4%	$0.0383	$0.1302	36.84s	8,111.882	51	Google
3	gemini-2.5-pro-preview-05-06	26.7%	$0.1308	$0.4904	49.85s	19,753.911	45	Google
4	gemini-2.5-pro-preview-03-25	22.9%	$0.1133	$0.4942	57.17s	16,393.313	48	Google
5	gemini-2.5-pro	20.0%	$0.0544	$0.2722	32.94s	14,255.511	45	Google
6	gemini-2.5-flash	13.3%	$0.0128	$0.0957	10.15s	6,990.467	45	Google
7	gemini-2.5-flash-lite-preview-06-17	5.0%	$0.0008	$0.0163	4.40s	4,974.583	60	Google
8	gemini-2.5-flash-preview-05-20	5.0%	$0.0101	$0.2014	9.75s	5,771.55	60	Google
9	gemini-2.5-flash-preview-04-17	4.4%	$0.0233	$0.5237	24.15s	10,492.711	45	Google
10	bedrock:us.amazon.nova-premier-v1:0	3.3%	$0.0356	$1.0692	63.19s	9,528.967	60	Amazon
11	gemini-2.5-flash-lite	3.3%	$0.0013	$0.0382	5.90s	9,506.689	90	Google
12	bedrock:us.amazon.nova-lite-v1:0	0.0%	$0.0002	N/A	24.54s	2,799.317	60	Amazon
13	bedrock:us.amazon.nova-micro-v1:0	0.0%	$0.0001	N/A	18.83s	1,783.85	60	Amazon
14	bedrock:us.amazon.nova-pro-v1:0	0.0%	$0.0008	N/A	49.53s	678.15	60	Amazon
15	gemini-2.0-flash	0.0%	$0.0003	N/A	4.21s	1,325.667	60	Google
16	gemini-2.5-flash-lite-preview-09-2025	0.0%	$0.0011	N/A	5.68s	5,687.822	45	Google

[Chart: Performance vs Efficiency Trade-offs]

[Chart: Performance by Difficulty Level]

[Chart: Token Usage Breakdown]

[Chart: Failure Analysis by Reason]

Last updated: October 16, 2025 at 01:15 AM UTC
