arxiv:2603.24511

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Published on Mar 25

Abstract

LLM agents can autonomously discover novel white-box adversarial attack algorithms that significantly outperform existing methods in jailbreaking and prompt injection evaluations.

LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering [rank2026posttrainbench, novikov2025alphaevolve]. We show that an autoresearch-style pipeline [karpathy2026autoresearch] powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing (30+) methods in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG [zou2023universal], the agent iterates to produce new algorithms that achieve an attack success rate (ASR) of up to 40% on CBRN queries against GPT-OSS-Safeguard-20B, compared to ≤10% for existing algorithms (teaser figure, left). The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving 100% ASR against Meta-SecAlign-70B [chen2025secalign] versus 56% for the best baseline (teaser figure, middle). Extending the findings of [carlini2025autoadvexbench], our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White-box adversarial red-teaming is particularly well-suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at https://github.com/romovpa/claudini.
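The headline numbers above are attack success rates (ASR), which are read here as the percentage of evaluated attempts counted as successful. The sketch below only illustrates that arithmetic. The function name and the boolean outcomes are hypothetical, and the abstract does not say how the paper decides whether an individual attempt succeeded, so that judgment is assumed to be given.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of attempts marked successful.

    `outcomes` is a hypothetical list of per-attempt flags; how each flag
    is decided (judge model, string match, ...) is outside this sketch.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    # True counts as 1 and False as 0, so sum() gives the success count.
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up data for illustration: 4 successes out of 10 attempts.
example = [True] * 4 + [False] * 6
print(f"ASR = {attack_success_rate(example):.0f}%")  # prints "ASR = 40%"
```

Under this reading, 100% ASR means every evaluated attempt was counted as a success, and 56% means just over half were.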
