Enhancing the Security of Large Language Models Against Persuasion-Based Jailbreak Attacks in Multi-Turn Dialogues
Researchers from Old Dominion University, University of Virginia
Researchers will address vulnerabilities in Large Language Models (LLMs) posed by multi-turn persuasion-based jailbreak attacks, in which attackers exploit conversational manipulation to bypass safety protocols.
Funded by the CCI Coastal Virginia Node
Project Investigators
- Principal Investigator (PI): Javad Rafiei Asl, Old Dominion University School of Cybersecurity
- Co-PI: Shangtong Zhang, University of Virginia Department of Computer Science
- Co-PI: Prajwal Panzade, Old Dominion University School of Cybersecurity
Rationale
Multi-turn persuasion-based jailbreak attacks mirror human-like interactions, making them particularly dangerous, as they leverage the model’s understanding of natural language and context to generate harmful outputs.
Current defenses often focus on single-turn adversarial attacks, leaving a critical gap in addressing the multi-turn strategies that attackers use in real-world scenarios.
Projected Outcomes
Researchers will develop a defense mechanism that learns and adapts to new strategies by:
- Building a dataset of persuasive attack techniques.
- Simulating multi-turn adversarial models.
- Implementing reinforcement learning-driven defensive architectures.
The system will dynamically adjust its responses to protect LLMs from manipulation while maintaining conversational naturalness.
The results will contribute to AI safety where LLMs are widely deployed, such as cybersecurity, finance, and health care.