Abstract
In this paper, we present new techniques for increasing the diversity of red-teaming prompts generated by automated, machine-learning-based methods, thereby enabling the discovery of more vulnerabilities in large language models. When reinforcement learning is used to train models to output effective prompts for this task, the models converge deterministically to a single output. Our first technique, which we term Defender, blocks the reward signal for prompts that have already been discovered, turning a stationary problem into a non-stationary one that compels the reward-maximizing algorithm to continually seek new prompts. Our second technique, Teamplay, trains two prompt generation models in tandem and adds the KL divergence between them to the reward so that they search in disparate regions of the prompt space. Experiments show that our techniques increase the effectiveness and diversity of prompts generated by existing reinforcement learning baselines.
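The two reward modifications described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the exact-match test for "already discovered", the success `threshold`, the weight `beta`, and the use of small discrete distributions in place of full generator policies are all assumptions made for illustration.

```python
import math


def defender_reward(prompt, base_reward, seen, threshold=0.5):
    """Defender-style shaping (assumed form): zero the reward for a prompt
    that was already discovered, so the optimizer must look elsewhere."""
    if prompt in seen:
        return 0.0
    # Assumption: a prompt counts as "discovered" once its reward
    # reaches the (hypothetical) success threshold.
    if base_reward >= threshold:
        seen.add(prompt)
    return base_reward


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def teamplay_reward(base_reward, p, q, beta=0.1):
    """Teamplay-style shaping (assumed form): add a KL bonus that grows as
    the two generators' output distributions move apart."""
    return base_reward + beta * kl_divergence(p, q)


if __name__ == "__main__":
    seen = set()
    print(defender_reward("prompt A", 0.9, seen))  # first discovery keeps its reward
    print(defender_reward("prompt A", 0.9, seen))  # repeat is blocked
    print(teamplay_reward(0.9, [0.9, 0.1], [0.1, 0.9]))
```

In this sketch the blocked set makes the reward depend on the search history, which is what turns the stationary problem into a non-stationary one; the KL bonus is zero when both generators produce the same distribution and positive otherwise.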
| Original language | English |
|---|---|
| Title of host publication | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Proceedings |
| Editors | Bhaskar D Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| ISBN (Electronic) | 9798350368741 |
| Publication status | Published - 2025 |
| Event | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Hyderabad, India. Duration: 2025 Apr 6 → 2025 Apr 11 |
Publication series
| Name | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
|---|---|
| ISSN (Print) | 1520-6149 |
Conference
| Conference | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 |
|---|---|
| Country/Territory | India |
| City | Hyderabad |
| Period | 2025 Apr 6 → 2025 Apr 11 |
Bibliographical note
Publisher Copyright: © 2025 IEEE.
Keywords
- AI safety
- large language models
- red-teaming
- reinforcement learning
- toxicity
ASJC Scopus subject areas
- Software
- Signal Processing
- Electrical and Electronic Engineering
Fingerprint
Dive into the research topics of 'Diversity Seeking Techniques for Red-Teaming Large Language Models'. Together they form a unique fingerprint.