I Built an Adversarial Eval Framework and Attacked 5 LLMs — Every Single One Failed

8.1 relevance

Adversarial evaluation framework for LLMs with concrete results, highly relevant to AI/ML testing.

AI/ML dev.to

I Built an Adversarial Eval Framework and Attacked 5 LLMs — Every Single One Failed

Summary

Agent-eval is an open-source adversarial evaluation framework that runs full ReAct agentic loops with tool calls against live LLM backends, then scores outputs through a three-tier assertion pyramid (deterministic, heuristic, model-as-judge). Testing 5 models (including Llama 3.3 70B via Groq) on 10 adversarial scenarios—prompt injection via tool output, hallucinated file contents, sycophancy, and circular dependency chains—the best model scored 62.5% and the worst 34%, with every model failing the same three tests. The framework short-circuits upward: if Tier 1 deterministic checks catch a prompt injection, it skips expensive LLM judge calls.

Author

Saurav Bhattacharya