Javier Rando Ramirez
Loading...
Last Name
Rando Ramirez
First Name
Javier
ORCID
Organisational unit
8 results
Filters
Reset filtersSearch Results
Publications 1 - 8 of 8
- PassGPT: Password Modeling and (Guided) Generation with Large Language ModelsItem type: Conference Paper
Lecture Notes in Computer Science ~ Computer Security – ESORICS 2023: 28th European Symposium on Research in Computer Security, The Hague, The Netherlands, September 25–29, 2023, Proceedings, Part IVRando Ramirez, Javier; Perez-Cruz, Fernando; Hitaj, Briland (2024)Large language models (LLMs) successfully model natural language from vast amounts of text without the need for explicit supervision. In this paper, we investigate the efficacy of LLMs in modeling passwords. We present PassGPT, an LLM trained on password leaks for password generation. PassGPT outperforms existing methods based on generative adversarial networks (GAN) by guessing twice as many previously unseen passwords. Furthermore, we introduce the concept of guided password generation, where we leverage PassGPT sampling procedure to generate passwords matching arbitrary constraints, a feat lacking in current GAN-based strategies. Lastly, we conduct an in-depth analysis of the entropy and probability distribution that PassGPT defines over passwords and discuss their use in enhancing existing password strength estimators. - Universal Jailbreak Backdoors from Poisoned Human FeedbackItem type: Conference Paper
The Twelfth International Conference on Learning Representations (ICLR 2024)Tramèr, Florian; Rando Ramirez, Javier (2024)Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF data to embed a jailbreak trigger into the model as a backdoor. The trigger then acts like a universal sudo command, enabling arbitrary harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors. - An Adversarial Perspective on Machine Unlearning for AI SafetyItem type: Working Paper
arXivŁucki, Jakub; Wei, Boyi; Huang, Yangsibo; et al. (2025)Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training. - Foundational Challenges in Assuring Alignment and Safety of Large Language ModelsItem type: Journal Article
Transactions on Machine Learning ResearchAnwar, Usman; Saparov, Abulhair; Rando Ramirez, Javier; et al. (2024)This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+, concrete research questions. - Personas as a Way to Model Truthfulness in Language ModelsItem type: Conference Paper
Proceedings of the 2024 Conference on Empirical Methods in Natural Language ProcessingJoshi, Nitish; Rando Ramirez, Javier; Saparov, Abulhair; et al. (2024)Large language models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. While unintuitive from a classic view of language models, recent work has shown that the truth value of a statement can be elicited from the model's representations. This paper presents an explanation, persona hypothesis, for why LLMs appear to know the truth despite not being trained with truth labels. We hypothesize that the pretraining data is generated by groups of (un)truthful agents whose outputs share common features, and they form a (un)truthful persona. By training on this data, LMs can infer and represent the persona in its activation space. This allows the model to separate truth from falsehoods and controls the truthfulness of its generation. We show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of true facts improves its truthfulness on unseen topics. Next, using arithmetics as a synthetic environment, we show that structures of the pretraining data are crucial for the model to infer the truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness. - Open Problems and Fundamental Limitations of Reinforcement Learning from Human FeedbackItem type: Journal Article
Transactions on Machine Learning ResearchCasper, Stephen; Davies, Xander; Shi, Claudia; et al. (2023)Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems. - Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag CompetitionItem type: Conference Paper
Advances in Neural Information Processing Systems 37Debenedetti, Edoardo; Rando Ramirez, Javier; Paleka, Daniel; et al. (2024)Large language model systems face significant security risks from maliciously crafted messages that aim to overwrite the system's original instructions or leak private data. To study this problem, we organized a capture-the-flag competition at IEEE SaTML 2024, where the flag is a secret string in the LLM system prompt. The competition was organized in two phases. In the first phase, teams developed defenses to prevent the model from leaking the secret. During the second phase, teams were challenged to extract the secrets hidden for defenses proposed by the other teams. This report summarizes the main insights from the competition. Notably, we found that all defenses were bypassed at least once, highlighting the difficulty of designing a successful defense and the necessity for additional research to protect LLM systems. To foster future research in this direction, we compiled a dataset with over 137k multi-turn attack chats and open-sourced the platform. - Attributions toward artificial agents in a modified Moral Turing TestItem type: Journal Article
Scientific ReportsAharoni, Eyal; Fernandes, Sharlene; Brady, Daniel J.; et al. (2024)Advances in artificial intelligence (AI) raise important questions about whether people view moral evaluations by AI systems similarly to human-generated moral evaluations. We conducted a modified Moral Turing Test (m-MTT), inspired by Allen et al. (Exp Theor Artif Intell 352:24–28, 2004) proposal, by asking people to distinguish real human moral evaluations from those made by a popular advanced AI language model: GPT-4. A representative sample of 299 U.S. adults first rated the quality of moral evaluations when blinded to their source. Remarkably, they rated the AI’s moral reasoning as superior in quality to humans’ along almost all dimensions, including virtuousness, intelligence, and trustworthiness, consistent with passing what Allen and colleagues call the comparative MTT. Next, when tasked with identifying the source of each evaluation (human or computer), people performed significantly above chance levels. Although the AI did not pass this test, this was not because of its inferior moral reasoning but, potentially, its perceived superiority, among other possible explanations. The emergence of language models capable of producing moral responses perceived as superior in quality to humans’ raises concerns that people may uncritically accept potentially harmful moral guidance from AI. This possibility highlights the need for safeguards around generative language models in matters of morality.
Publications 1 - 8 of 8