Why do universal adversarial attacks work on large language models?: Geometry might be the answer. (arXiv:2309.00254v1 [cs.LG])
cs.CR updates on arXiv.org
Transformer-based large language models with emergent capabilities are becoming increasingly ubiquitous in society. However, understanding and interpreting their internal workings in the context of adversarial attacks remains largely an open problem. Gradient-based universal adversarial attacks have been shown to be highly effective against large language models and are potentially dangerous due to their input-agnostic nature. This work presents a novel geometric perspective that explains universal adversarial attacks on large language models. By attacking the 117M-parameter GPT-2 model, we find …
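To make the "gradient-based" idea concrete: such attacks typically score candidate token swaps by a first-order (gradient) approximation of how the loss would change, then greedily pick the best swap. The sketch below illustrates one such step on a deliberately tiny linear "model"; the toy model, variable names, and objective are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

# Hedged sketch of one gradient-guided token swap (HotFlip-style
# first-order search) on a toy linear "model". Everything here is
# an illustrative assumption, not the paper's experimental setup.

rng = np.random.default_rng(0)
vocab, dim = 50, 8
E = rng.normal(size=(vocab, dim))      # token embedding table
w = rng.normal(size=dim)               # toy linear scoring head

def loss(token_id):
    # Toy adversarial objective: the attacker wants this score low.
    return float(E[token_id] @ w)

def best_flip(token_id):
    # For this linear model the gradient of the loss w.r.t. the
    # input embedding is exactly w.
    grad = w
    # First-order estimate of the loss change for every candidate
    # replacement v: delta_v ≈ (e_v - e_current) · grad
    delta = (E - E[token_id]) @ grad
    # Greedily take the token with the largest predicted decrease.
    return int(np.argmin(delta))

cur = 3
new = best_flip(cur)
# For a linear loss the first-order estimate is exact, so the
# chosen swap can never increase the loss.
assert loss(new) <= loss(cur)
```

Real attacks repeat this swap step over a short adversarial suffix and average the gradient signal over many prompts, which is what makes the resulting trigger input-agnostic ("universal").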