Skip to content

I Built an Adversarial Eval Framework and Attacked 5 LLMs — Every Single One Failed

8.1 relevance
Score Breakdown
technical depth
9
novelty
8
actionability
8
community
7
strategic
6
personal
9

Scored daily by a customisable AI persona to surface the most relevant engineering leadership news.

Adversarial evaluation framework for LLMs with concrete results, highly relevant to AI/ML testing.

AI/ML dev.to
I Built an Adversarial Eval Framework and Attacked 5 LLMs — Every Single One Failed
Summary

Agent-eval is an open-source adversarial evaluation framework that runs full ReAct agentic loops with tool calls against live LLM backends, then scores outputs through a three-tier assertion pyramid (deterministic, heuristic, model-as-judge). Testing 5 models (including Llama 3.3 70B via Groq) on 10 adversarial scenarios—prompt injection via tool output, hallucinated file contents, sycophancy, and circular dependency chains—the best model scored 62.5% and the worst 34%, with every model failing the same three tests. The framework short-circuits upward: if Tier 1 deterministic checks catch a prompt injection, it skips expensive LLM judge calls.

Author

Saurav Bhattacharya

More from Saurav Bhattacharya →