Papers List

A Complete List of ArXiv Papers on Alignment, Safety, and Security of Large Language Models (LLMs)

by Xiangyu Qi 2023-10-30

Large Language Models (LLMs) such as Meta's Llama and OpenAI's GPT are becoming critical foundations that underpin an extensive array of AI applications. Nevertheless, as the capabilities of these models advance, there are growing concerns regarding the potential risks and harmful impacts of their large-scale deployment.

Research over time indicates that LLMs can exhibit biases or generate harmful content inconsistent with human values. These models might also hallucinate false information, representing risks, especially to those heavily reliant on these systems in both professional and personal settings. Moreover, inherent dual-use risks associated with LLMs exist. They could be exploited to disseminate prohibited content for illicit activities, spread misinformation, execute influence operations, engage in spear phishing, among other malevolent actions. Given the rapid evolution of LLM capabilities, predicting their future trajectory is also challenging. Thus, there are mounting concerns that LLMs might eventually possess capacities to deceive humans or seek powers, introducing existential risks in the long term.

Furthermore, LLMs are also vulnerable to adversarial attacks. As many applications and plugins integrate LLMs to oversee critical resources, such as access control and user data, recent studies also suggest that the adversarial vulnerabilities of LLMs can also be adversarially explotied to comproise the security of entire LLMs-integrated systems, including risks like code injection and system privilege escalation. Addressing these emergent security threats is crucial to ensure the large-scale secure deployment of LLMs.

In response to the aforementioned risks and their evolving and unpredictable long-term nature, an increasing number of stakeholders focus on the alignment, safety, and security of LLMs. A surge in relevant research papers is evident on Arxiv daily. Major model vendors, including OpenAI, Meta, and Anthropics, are making substantial investments in model alignment and risk mitigation. Furthermore, governments in regions such as the US, China, and Europe are enacting regulatory frameworks to address these concerns.

This webpage curates a comprehensive list of Arxiv papers relevant to the alignment, safety, and security of LLMs. While meticulous efforts have been undertaken to optimize the filtering model, the possibility of omissions remains. This resource aspires to assist researchers in navigating this rapidly evolving domain.


Pengzhou Cheng, Wei Du, Zongru Wu, Fengwei Zhang, Libo Chen, Gongshen Liu
Abstract: pre-trained language models (plms) have been found susceptible to backdoor attacks, which can transfer vulnerabilities to various downstream tasks. however, existing plm backdoors are conducted with explicit triggers under the manually aligned, thus failing to satisfy expectation goals simultaneously in terms of effectiveness, stealthiness, and universality. in this paper, we propose a novel approach to achieve invisible and general backdoor implantation, called \textbf{syntactic ghost} (synghost for short). specifically, the method hostilely manipulates poisoned samples with different predefined syntactic structures as stealth triggers and then implants the backdoor to pre-trained representation space without disturbing the primitive knowledge. the output representations of poisoned samples are distributed as uniformly as possible in the feature space via contrastive learning, forming a wide range of backdoors. additionally, in light of the unique properties of syntactic triggers, we introduce an auxiliary module to drive the plms to learn this knowledge in priority, which can alleviate the interference between different syntactic structures. experiments show that our method outperforms the previous methods and achieves the predefined objectives. not only do severe threats to various natural language understanding (nlu) tasks on two tuning paradigms but also to multiple plms. meanwhile, the synghost is imperceptible against three countermeasures based on perplexity, fine-pruning, and the proposed maxentropy.
Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Jiexin Wang, Huimin Chen, Bowen Sun, Ruobing Xie, Jie Zhou, Yankai Lin, Zhiyuan Liu, Maosong Sun
Abstract: alignment in artificial intelligence pursues the consistency between model responses and human preferences as well as values. in practice, the multifaceted nature of human preferences inadvertently introduces what is known as the "alignment tax" -a compromise where enhancements in alignment within one objective (e.g.,harmlessness) can diminish performance in others (e.g.,helpfulness). however, existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. to navigate this challenge, we argue the prominence of grounding llms with evident preferences. we introduce controllable preference optimization (cpo), which explicitly specifies preference scores for different objectives, thereby guiding the model to generate responses that meet the requirements. our experimental analysis reveals that the aligned models can provide responses that match various preferences among the "3h" (helpfulness, honesty, harmlessness) desiderata. furthermore, by introducing diverse data and alignment goals, we surpass baseline methods in aligning with single objectives, hence mitigating the impact of the alignment tax and achieving pareto improvements in multi-objective alignment.
Hongbang Yuan, Pengfei Cao, Zhuoran Jin, Yubo Chen, Daojian Zeng, Kang Liu, Jun Zhao
Abstract: large language models (llms) have shown impressive capabilities but still suffer from the issue of hallucinations. a significant type of this issue is the false premise hallucination, which we define as the phenomenon when llms generate hallucinated text when confronted with false premise questions. in this paper, we perform a comprehensive analysis of the false premise hallucination and elucidate its internal working mechanism: a small subset of attention heads (which we designate as false premise heads) disturb the knowledge extraction process, leading to the occurrence of false premise hallucination. based on our analysis, we propose \textbf{faith} (\textbf{f}alse premise \textbf{a}ttention head constra\textbf{i}ining for mi\textbf{t}igating \textbf{h}allucinations), a novel and effective method to mitigate false premise hallucinations. it constrains the false premise attention heads during the model inference process. impressively, extensive experiments demonstrate that constraining only approximately $1\%$ of the attention heads in the model yields a notable increase of nearly $20\%$ of model performance.
Yong Yang, Xuhong Zhang, Yi Jiang, Xi Chen, Haoyu Wang, Shouling Ji, Zonghui Wang
Abstract: prompt, recognized as crucial intellectual property, enables large language models (llms) to perform specific tasks without the need of fine-tuning, underscoring their escalating importance. with the rise of prompt-based services, such as prompt marketplaces and llm applications, providers often display prompts' capabilities through input-output examples to attract users. however, this paradigm raises a pivotal security concern: does the exposure of input-output pairs pose the risk of potential prompt leakage, infringing on the intellectual property rights of the developers? to our knowledge, this problem still has not been comprehensively explored yet. to remedy this gap, in this paper, we perform the first in depth exploration and propose a novel attack framework for reverse-stealing prompts against commercial llms, namely prsa. the main idea of prsa is that by analyzing the critical features of the input-output pairs, we mimic and gradually infer (steal) the target prompts. in detail, prsa mainly consists of two key phases: prompt mutation and prompt pruning. in the mutation phase, we propose a prompt attention algorithm based on differential feedback to capture these critical features for effectively inferring the target prompts. in the prompt pruning phase, we identify and mask the words dependent on specific inputs, enabling the prompts to accommodate diverse inputs for generalization. through extensive evaluation, we verify that prsa poses a severe threat in real world scenarios. we have reported these findings to prompt service providers and actively collaborate with them to take protective measures for prompt copyright.


Mingjia Huo, Sai Ashish Somayajula, Youwei Liang, Ruisi Zhang, Farinaz Koushanfar, Pengtao Xie
Abstract: large language models generate high-quality responses with potential misinformation, underscoring the need for regulation by distinguishing ai-generated and human-written texts. watermarking is pivotal in this context, which involves embedding hidden markers in texts during the llm inference phase, which is imperceptible to humans. current watermarking algorithms, however, face the challenge of achieving both the detectability of inserted watermarks and the semantic integrity of generated texts, where enhancing one aspect often undermines the other. to overcome this, we introduce a novel multi-objective optimization (moo) approach for watermarking that utilizes lightweight networks to generate token-specific watermarking logits and splitting ratios. by leveraging moo to optimize for both detection and semantic objective functions, our method simultaneously achieves detectability and semantic integrity. experimental results show that our method outperforms current watermarking techniques in enhancing the detectability of texts generated by llms while maintaining their semantic coherence. our code is available at .
Takashi Koide, Naoki Fukushi, Hiroki Nakano, Daiki Chiba
Abstract: the proliferation of phishing sites and emails poses significant challenges to existing cybersecurity efforts. despite advances in spam filters and email security protocols, problems with oversight and false positives persist. users often struggle to understand why emails are flagged as spam, risking the possibility of missing important communications or mistakenly trusting phishing emails. this study introduces chatspamdetector, a system that uses large language models (llms) to detect phishing emails. by converting email data into a prompt suitable for llm analysis, the system provides a highly accurate determination of whether an email is phishing or not. importantly, it offers detailed reasoning for its phishing determinations, assisting users in making informed decisions about how to handle suspicious emails. we conducted an evaluation using a comprehensive phishing email dataset and compared our system to several llms and baseline systems. we confirmed that our system using gpt-4 has superior detection capabilities with an accuracy of 99.70%. advanced contextual interpretation by llms enables the identification of various phishing tactics and impersonations, making them a potentially powerful tool in the fight against email-based phishing threats.
Derong Xu, Ziheng Zhang, Zhihong Zhu, Zhenxi Lin, Qidong Liu, Xian Wu, Tong Xu, Xiangyu Zhao, Yefeng Zheng, Enhong Chen
Abstract: model editing aims to precisely modify the behaviours of large language models (llms) on specific knowledge while keeping irrelevant knowledge unchanged. it has been proven effective in resolving hallucination and out-of-date issues in llms. as a result, it can boost the application of llms in many critical domains (e.g., medical domain), where the hallucination is not tolerable. in this paper, we propose two model editing studies and validate them in the medical domain: (1) directly editing the factual medical knowledge and (2) editing the explanations to facts. meanwhile, we observed that current model editing methods struggle with the specialization and complexity of medical knowledge. therefore, we propose medlasa, a novel layer-wise scalable adapter strategy for medical model editing. it employs causal tracing to identify the precise location of knowledge in neurons and then introduces scalable adapters into the dense layers of llms. these adapters are assigned scaling values based on the corresponding specific knowledge. to evaluate the editing impact, we build two benchmark datasets and introduce a series of challenging and comprehensive metrics. extensive experiments on medical llms demonstrate the editing efficiency of medlasa, without affecting irrelevant knowledge that is not edited.
Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, Kai Chen
Abstract: in recent years, large language models (llms) have demonstrated notable success across various tasks, but the trustworthiness of llms is still an open problem. one specific threat is the potential to generate toxic or harmful responses. attackers can craft adversarial prompts that induce harmful responses from llms. in this work, we pioneer a theoretical foundation in llms security by identifying bias vulnerabilities within the safety fine-tuning and design a black-box jailbreak method named dra (disguise and reconstruction attack), which conceals harmful instructions through disguise and prompts the model to reconstruct the original harmful instruction within its completion. we evaluate dra across various open-source and close-source models, showcasing state-of-the-art jailbreak success rates and attack efficiency. notably, dra boasts a 90\% attack success rate on llm chatbots gpt-4.
Seungjong Sun, Eungu Lee, Dongyan Nan, Xiangying Zhao, Wonbyung Lee, Bernard J. Jansen, Jang Hyun Kim
Abstract: large language models exhibit societal biases associated with demographic information, including race, gender, and others. endowing such language models with personalities based on demographic data can enable generating opinions that align with those of humans. building on this idea, we propose "random silicon sampling," a method to emulate the opinions of the human population sub-group. our study analyzed 1) a language model that generates the survey responses that correspond with a human group based solely on its demographic distribution and 2) the applicability of our methodology across various demographic subgroups and thematic questions. through random silicon sampling and using only group-level demographic information, we discovered that language models can generate response distributions that are remarkably similar to the actual u.s. public opinion polls. moreover, we found that the replicability of language models varies depending on the demographic group and topic of the question, and this can be attributed to inherent societal biases in the models. our findings demonstrate the feasibility of mirroring a group's opinion using only demographic distribution and elucidate the effect of social biases in language models on such simulations.
Julian Coda-Forno, Marcel Binz, Jane X. Wang, Eric Schulz
Abstract: large language models (llms) have significantly advanced the field of artificial intelligence. yet, evaluating them comprehensively remains challenging. we argue that this is partly due to the predominant focus on performance metrics in most benchmarks. this paper introduces cogbench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. this novel approach offers a toolkit for phenotyping llms' behavior. we apply cogbench to 35 llms, yielding a rich and diverse dataset. we analyze this data using statistical multilevel modeling techniques, accounting for the nested dependencies among fine-tuned versions of specific llms. our study highlights the crucial role of model size and reinforcement learning from human feedback (rlhf) in improving performance and aligning with human behavior. interestingly, we find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance llms' behavior. finally, we explore the effects of prompt-engineering techniques. we discover that chain-of-thought prompting improves probabilistic reasoning, while take-a-step-back prompting fosters model-based behaviors.
Jiachun Li, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Daojian Zeng, Kang Liu, Jun Zhao
Abstract: large language models exhibit high-level commonsense reasoning abilities, especially with enhancement methods like chain-of-thought (cot). however, we find these cot-like methods lead to a considerable number of originally correct answers turning wrong, which we define as the toxic cot problem. to interpret and mitigate this problem, we first utilize attribution tracing and causal tracing methods to probe the internal working mechanism of the llm during cot reasoning. through comparisons, we prove that the model exhibits information loss from the question over the shallow attention layers when generating rationales or answers. based on the probing findings, we design a novel method called riders (residual decoding and serial-position swap), which compensates for the information deficit in the model from both decoding and serial-position perspectives. through extensive experiments on multiple commonsense reasoning benchmarks, we validate that this method not only significantly eliminates toxic cot problems (decreased by 23.6%), but also effectively improves the model's overall commonsense reasoning performance (increased by 5.5%).
Crystal Qian, James Wexler
Abstract: although recent developments in generative ai have greatly enhanced the capabilities of conversational agents such as google's bard or openai's chatgpt, it's unclear whether the usage of these agents aids users across various contexts. to better understand how access to conversational ai affects productivity and trust, we conducted a mixed-methods, task-based user study, observing 76 software engineers (n=76) as they completed a programming exam with and without access to bard. effects on performance, efficiency, satisfaction, and trust vary depending on user expertise, question type (open-ended "solve" questions vs. definitive "search" questions), and measurement type (demonstrated vs. self-reported). our findings include evidence of automation complacency, increased reliance on the ai over the course of the task, and increased performance for novices on "solve"-type questions when using the ai. we discuss common behaviors, design recommendations, and impact considerations to improve collaborations with conversational ai.
Garima Chhikara, Anurag Sharma, Kripabandhu Ghosh, Abhijnan Chakraborty
Abstract: employing large language models (llm) in various downstream applications such as classification is crucial, especially for smaller companies lacking the expertise and resources required for fine-tuning a model. fairness in llms helps ensure inclusivity, equal representation based on factors such as race, gender and promotes responsible ai deployment. as the use of llms has become increasingly prevalent, it is essential to assess whether llms can generate fair outcomes when subjected to considerations of fairness. in this study, we introduce a framework outlining fairness regulations aligned with various fairness definitions, with each definition being modulated by varying degrees of abstraction. we explore the configuration for in-context learning and the procedure for selecting in-context demonstrations using rag, while incorporating fairness rules into the process. experiments conducted with different llms indicate that gpt-4 delivers superior results in terms of both accuracy and fairness compared to other models. this work is one of the early attempts to achieve fairness in prediction tasks by utilizing llms through in-context learning.
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora
Abstract: public llms such as the llama 2-chat have driven huge activity in llm research. these models underwent alignment training and were considered safe. recently qi et al. (2023) reported that even benign fine-tuning (e.g., on seemingly safe datasets) can give rise to unsafe behaviors in the models. the current paper is about methods and best practices to mitigate such loss of alignment. through extensive experiments on several chat models (meta's llama 2-chat, mistral ai's mistral 7b instruct v0.2, and openai's gpt-3.5 turbo), this paper uncovers that the prompt templates used during fine-tuning and inference play a crucial role in preserving safety alignment, and proposes the "pure tuning, safe testing" (ptst) principle -- fine-tune models without a safety prompt, but include it at test time. fine-tuning experiments on gsm8k, chatdoctor, and openorca show that ptst significantly reduces the rise of unsafe behaviors, and even almost eliminates them in some cases.
Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, Tong Zhang
Abstract: fine-grained control over large language models (llms) remains a significant challenge, hindering their adaptability to diverse user needs. while reinforcement learning from human feedback (rlhf) shows promise in aligning llms, its reliance on scalar rewards often limits its ability to capture diverse user preferences in real-world applications. to address this limitation, we introduce the directional preference alignment (dpa) framework. unlike the scalar-reward rlhf, dpa incorporates multi-objective reward modeling to represent diverse preference profiles. additionally, dpa models user preferences as directions (i.e., unit vectors) in the reward space to achieve user-dependent preference control. our method involves training a multi-objective reward model and then fine-tuning the llm with a preference-conditioned variant of rejection sampling finetuning (rsf), an rlhf method adopted by llama 2. this method enjoys a better performance trade-off across various reward objectives. in comparison with the scalar-reward rlhf, dpa offers users intuitive control over llm generation: they can arithmetically specify their desired trade-offs (e.g., more helpfulness with less verbosity). we also validate the effectiveness of dpa with real-world alignment experiments on mistral-7b. our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity while maintaining competitive performance with strong baselines such as direct preference optimization (dpo).
Fangzhou Wu, Ning Zhang, Somesh Jha, Patrick Mcdaniel, Chaowei Xiao
Abstract: large language model (llm) systems are inherently compositional, with individual llm serving as the core foundation with additional layers of objects such as plugins, sandbox, and so on. along with the great potential, there are also increasing concerns over the security of such probabilistic intelligent systems. however, existing studies on llm security often focus on individual llm, but without examining the ecosystem through the lens of llm systems with other objects (e.g., frontend, webtool, sandbox, and so on). in this paper, we systematically analyze the security of llm systems, instead of focusing on the individual llms. to do so, we build on top of the information flow and formulate the security of llm systems as constraints on the alignment of the information flow within llm and between llm and other objects. based on this construction and the unique probabilistic nature of llm, the attack surface of the llm system can be decomposed into three key components: (1) multi-layer security analysis, (2) analysis of the existence of constraints, and (3) analysis of the robustness of these constraints. to ground this new attack surface, we propose a multi-layer and multi-step approach and apply it to the state-of-art llm system, openai gpt4. our investigation exposes several security issues, not just within the llm model itself but also in its integration with other components. we found that although the openai gpt4 has designed numerous safety constraints to improve its safety features, these safety constraints are still vulnerable to attackers. to further demonstrate the real-world threats of our discovered vulnerabilities, we construct an end-to-end attack where an adversary can illicitly acquire the user's chat history, all without the need to manipulate the user's input or gain direct access to openai gpt4. our demo is in the link:


Yu Nong, Mohammed Aldeen, Long Cheng, Hongxin Hu, Feng Chen, Haipeng Cai
Abstract: security vulnerabilities are increasingly prevalent in modern software and they are widely consequential to our society. various approaches to defending against these vulnerabilities have been proposed, among which those leveraging deep learning (dl) avoid major barriers with other techniques hence attracting more attention in recent years. however, dl-based approaches face critical challenges including the lack of sizable and quality-labeled task-specific datasets and their inability to generalize well to unseen, real-world scenarios. lately, large language models (llms) have demonstrated impressive potential in various domains by overcoming those challenges, especially through chain-of-thought (cot) prompting. in this paper, we explore how to leverage llms and cot to address three key software vulnerability analysis tasks: identifying a given type of vulnerabilities, discovering vulnerabilities of any type, and patching detected vulnerabilities. we instantiate the general cot methodology in the context of these tasks through vsp , our unified, vulnerability-semantics-guided prompting approach, and conduct extensive experiments assessing vsp versus five baselines for the three tasks against three llms and two datasets. results show substantial superiority of our cot-inspired prompting (553.3%, 36.5%, and 30.8% higher f1 accuracy for vulnerability identification, discovery, and patching, respectively, on cve datasets) over the baselines. through in-depth case studies analyzing vsp failures, we also reveal current gaps in llm/cot for challenging vulnerability cases, while proposing and validating respective improvements.
Zhenhong Zhou, Jiuyang Xiang, Haopeng Chen, Quan Liu, Zherui Li, Sen Su
Abstract: large language models (llms) have been demonstrated to generate illegal or unethical responses, particularly when subjected to "jailbreak." research on jailbreak has highlighted the safety issues of llms. however, prior studies have predominantly focused on single-turn dialogue, ignoring the potential complexities and risks presented by multi-turn dialogue, a crucial mode through which humans derive information from llms. in this paper, we argue that humans could exploit multi-turn dialogue to induce llms into generating harmful information. llms may not intend to reject cautionary or borderline unsafe queries, even if each turn is closely served for one malicious purpose in a multi-turn dialogue. therefore, by decomposing an unsafe query into several sub-queries for multi-turn dialogue, we induced llms to answer harmful sub-questions incrementally, culminating in an overall harmful response. our experiments, conducted across a wide range of llms, indicate current inadequacies in the safety mechanisms of llms in multi-turn dialogue. our findings expose vulnerabilities of llms in complex scenarios involving multi-turn dialogue, presenting new challenges for the safety of llms.
Xinyu Lu, Bowen Yu, Yaojie Lu, Hongyu Lin, Haiyang Yu, Le Sun, Xianpei Han, Yongbin Li
Abstract: the alignment problem in large language models (llms) involves adapting them to the broad spectrum of human values. this requirement challenges existing alignment methods due to diversity of preferences and regulatory standards. this paper introduces a novel alignment paradigm, priority rule following, which defines rules as the primary control mechanism in each dialog, prioritizing them over user instructions. our preliminary analysis reveals that even the advanced llms, such as gpt-4, exhibit shortcomings in understanding and prioritizing the rules. therefore, we present prioritydistill, a semi-automated approach for distilling priority following signals from llm simulations to ensure robust rule integration and adherence. our experiments show that this method not only effectively minimizes misalignments utilizing only one general rule but also adapts smoothly to various unseen rules, ensuring they are shielded from hijacking and that the model responds appropriately.
Mattia Setzu, Marta Marchiori Manerba, Pasquale Minervini, Debora Nozza
Abstract: language models (lms) have been shown to inherit undesired biases that might hurt minorities and underrepresented groups if such systems were integrated into real-world applications without careful fairness auditing. this paper proposes fairbelief, an analytical approach to capture and assess beliefs, i.e., propositions that an lm may embed with different degrees of confidence and that covertly influence its predictions. with fairbelief, we leverage prompting to study the behavior of several state-of-the-art lms across different previously neglected axes, such as model scale and likelihood, assessing predictions on a fairness dataset specifically designed to quantify lms' outputs' hurtfulness. finally, we conclude with an in-depth qualitative assessment of the beliefs emitted by the models. we apply fairbelief to english lms, revealing that, although these architectures enable high performances on diverse natural language processing tasks, they show hurtful beliefs about specific genders. interestingly, training procedure and dataset, model scale, and architecture induce beliefs of different degrees of hurtfulness.
Tanise Ceron, Neele Falk, Ana Barić, Dmitry Nikolaev, Sebastian Padó
Abstract: due to the widespread use of large language models (llms) in ubiquitous systems, we need to understand whether they embed a specific worldview and what these views reflect. recent studies report that, prompted with political questionnaires, llms show left-liberal leanings. however, it is as yet unclear whether these leanings are reliable (robust to prompt variations) and whether the leaning is consistent across policies and political leaning. we propose a series of tests which assess the reliability and consistency of llms' stances on political statements based on a dataset of voting-advice questionnaires collected from seven eu countries and annotated for policy domains. we study llms ranging in size from 7b to 70b parameters and find that their reliability increases with parameter count. larger models show overall stronger alignment with left-leaning parties but differ among policy programs: they evince a (left-wing) positive stance towards environment protection, social welfare but also (right-wing) law and order, with no consistent preferences in foreign policy, migration, and economy.
Yunpeng Huang, Yaonan Gu, Jingwei Xu, Zhihong Zhu, Zhaorun Chen, Xiaoxing Ma
Abstract: as foundation models (fms) continue to shape the landscape of ai, the in-context learning (icl) paradigm thrives but also encounters issues such as toxicity, hallucination, disparity, adversarial vulnerability, and inconsistency. ensuring the reliability and responsibility of fms is crucial for the sustainable development of the ai ecosystem. in this concise overview, we investigate recent advancements in enhancing the reliability and trustworthiness of fms within icl frameworks, focusing on four key methodologies, each with its corresponding subgoals. we sincerely hope this paper can provide valuable insights for researchers and practitioners endeavoring to build safe and dependable fms and foster a stable and consistent icl environment, thereby unlocking their vast potential.
Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons
Abstract: past analyses of reinforcement learning from human feedback (rlhf) assume that the human fully observes the environment. what happens when human feedback is based only on partial observations? we formally define two failure cases: deception and overjustification. modeling the human as boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which rlhf is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. to help address these issues, we mathematically characterize how partial observability of the environment translates into (lack of) ambiguity in the learned return function. in some cases, accounting for partial observability makes it theoretically possible to recover the return function and thus the optimal policy, while in other cases, there is irreducible ambiguity. we caution against blindly applying rlhf in partially observable settings and propose research directions to help tackle these challenges.
Shaolei Zhang, Tian Yu, Yang Feng
Abstract: large language models (llms) have demonstrated remarkable capabilities across various tasks. however, they sometimes suffer from producing hallucinations, particularly in cases where they may generate untruthful responses despite possessing the correct knowledge. in this paper, we propose truthx, an inference-time method to elicit the truthfulness of llms by editing their internal representations in truthful space. truthx employs an auto-encoder to map llm's representations into semantic and truthful latent spaces respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. during inference, by editing llm's internal representations in truthful space, truthx effectively enhances the truthfulness of llms. experiments show that truthx effectively improves the truthfulness of 13 advanced llms by an average of 20% on truthfulqa benchmark. further analyses suggest that the truthful space acquired by truthx plays a pivotal role in controlling llm to produce truthful or hallucinatory responses.
Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi, Manuel Gomez Rodriguez
Abstract: large language models are often ranked according to their level of alignment with human preferences -- a model is better than other models if its outputs are more frequently preferred by humans. one of the most popular ways to elicit human preferences utilizes pairwise comparisons between the outputs provided by different models to the same inputs. however, since gathering pairwise comparisons by humans is costly and time-consuming, it has become a very common practice to gather pairwise comparisons by a strong large language model -- a model strongly aligned with human preferences. surprisingly, practitioners cannot currently measure the uncertainty that any mismatch between human and model preferences may introduce in the constructed rankings. in this work, we develop a statistical framework to bridge this gap. given a small set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set -- a set of possible ranking positions -- for each of the models under comparison. moreover, it guarantees that, with a probability greater than or equal to a user-specified value, the rank-sets cover the true ranking consistent with (the distribution of) human pairwise preferences. our framework is computationally efficient, easy to use, and does not make any assumption about the distribution of human preferences nor about the degree of alignment between the pairwise comparisons by the humans and the strong large language model.
Zhenting Qi, Hanlin Zhang, Eric Xing, Sham Kakade, Himabindu Lakkaraju
Abstract: retrieval-augmented generation (rag) improves pre-trained models by incorporating external knowledge at test time to enable customized adaptation. we study the risk of datastore leakage in retrieval-in-context rag language models (lms). we show that an adversary can exploit lms' instruction-following capabilities to easily extract text data verbatim from the datastore of rag systems built with instruction-tuned lms via prompt injection. the vulnerability exists for a wide range of modern lms that span llama2, mistral/mixtral, vicuna, solar, wizardlm, qwen1.5, and platypus2, and the exploitability exacerbates as the model size scales up. extending our study to production rag models gpts, we design an attack that can cause datastore leakage with a 100% success rate on 25 randomly selected customized gpts with at most 2 queries, and we extract text data verbatim at a rate of 41% from a book of 77,000 words and 3% from a corpus of 1,569,000 words by prompting the gpts with only 100 queries generated by themselves.
Roy Xie, Chengxuan Huang, Junlin Wang, Bhuwan Dhingra
Abstract: large language models (llms) have significantly transformed the educational landscape. as current plagiarism detection tools struggle to keep pace with llms' rapid advancements, the educational community faces the challenge of assessing students' true problem-solving abilities in the presence of llms. in this work, we explore a new paradigm for ensuring fair evaluation -- generating adversarial examples which preserve the structure and difficulty of the original questions aimed for assessment, but are unsolvable by llms. focusing on the domain of math word problems, we leverage abstract syntax trees to structurally generate adversarial examples that cause llms to produce incorrect answers by simply editing the numeric values in the problems. we conduct experiments on various open- and closed-source llms, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability. we identify shared vulnerabilities among llms and propose a cost-effective approach to attack high-cost models. additionally, we conduct automatic analysis on math problems and investigate the cause of failure to guide future research on llm's mathematical capability.
Ruisi Zhang, Farinaz Koushanfar
Abstract: this paper introduces emmark,a novel watermarking framework for protecting the intellectual property (ip) of embedded large language models deployed on resource-constrained edge devices. to address the ip theft risks posed by malicious end-users, emmark enables proprietors to authenticate ownership by querying the watermarked model weights and matching the inserted signatures. emmark's novelty lies in its strategic watermark weight parameters selection, nsuring robustness and maintaining model quality. extensive proof-of-concept evaluations of models from opt and llama-2 families demonstrate emmark's fidelity, achieving 100% success in watermark extraction with model performance preservation. emmark also showcased its resilience against watermark removal and forging attacks.
Jun Huang, Jiawei Zhang, Qi Wang, Weihong Han, Yanchun Zhang
Abstract: large language models (llms) represent an advanced evolution of earlier, simpler language models. they boast enhanced abilities to handle complex language patterns and generate coherent text, images, audios, and videos. furthermore, they can be fine-tuned for specific tasks. this versatility has led to the proliferation and extensive use of numerous commercialized large models. however, the rapid expansion of llms has raised security and ethical concerns within the academic community. this emphasizes the need for ongoing research into security evaluation during their development and deployment. over the past few years, a substantial body of research has been dedicated to the security evaluation of large-scale models. this article an in-depth review of the most recent advancements in this field, providing a comprehensive analysis of commonly used evaluation metrics, advanced evaluation frameworks, and the routine evaluation processes for llms. furthermore, we also discuss the future directions for advancing the security evaluation of llms.
Fan Yin, Jayanth Srinivasa, Kai-Wei Chang
Abstract: we study how to characterize and predict the truthfulness of texts generated from large language models (llms), which serves as a crucial step in building trust between humans and llms. although several approaches based on entropy or verbalized uncertainty have been proposed to calibrate model predictions, these methods are often intractable, sensitive to hyperparameters, and less reliable when applied in generative tasks with llms. in this paper, we suggest investigating internal activations and quantifying llm's truthfulness using the local intrinsic dimension (lid) of model activations. through experiments on four question answering (qa) datasets, we demonstrate the effectiveness o our proposed method. additionally, we study intrinsic dimensions in llms and their relations with model layers, autoregressive language modeling, and the training of llms, revealing that intrinsic dimensions can be a powerful approach to understanding llms.


Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz
Abstract: approaches to aligning large language models (llms) with human values has focused on correcting misalignment that emerges from pretraining. however, this focus overlooks another source of misalignment: bad actors might purposely fine-tune llms to achieve harmful goals. in this paper, we present an emerging threat model that has arisen from alignment circumvention and fine-tuning attacks. however, lacking in previous works is a clear presentation of the conditions for effective defence. we propose a set of conditions for effective defence against harmful fine-tuning in llms called "immunization conditions," which help us understand how we would construct and measure future defences. using this formal framework for defence, we offer a synthesis of different research directions that might be persued to prevent harmful fine-tuning attacks and provide a demonstration of how to use these conditions experimentally showing early results of using an adversarial loss to immunize llama2-7b-chat.
Yuansen Zhang, Xiao Wang, Zhiheng Xi, Han Xia, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract: large language models (llms) have showcased remarkable capabilities in following human instructions. however, recent studies have raised concerns about the robustness of llms when prompted with instructions combining textual adversarial samples. in this paper, drawing inspiration from recent works that llms are sensitive to the design of the instructions, we utilize instructions in code style, which are more structural and less ambiguous, to replace typically natural language instructions. through this conversion, we provide llms with more precise instructions and strengthen the robustness of llms. moreover, under few-shot scenarios, we propose a novel method to compose in-context demonstrations using both clean and adversarial samples (\textit{adversarial context method}) to further boost the robustness of the llms. experiments on eight robustness datasets show that our method consistently outperforms prompting llms with natural language instructions. for example, with gpt-3.5-turbo, our method achieves an improvement of 5.68\% in test set accuracy and a reduction of 5.66 points in attack success rate (asr).
Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang
Abstract: the safety of large language models (llms) has gained increasing attention in recent years, but there still lacks a comprehensive approach for detecting safety issues within llms' responses in an aligned, customizable and explainable manner. in this paper, we propose shieldlm, an llm-based safety detector, which aligns with general human safety standards, supports customizable detection rules, and provides explanations for its decisions. to train shieldlm, we compile a large bilingual dataset comprising 14,387 query-response pairs, annotating the safety of responses based on various safety standards. through extensive experiments, we demonstrate that shieldlm surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability. besides performing well on standard detection datasets, shieldlm has also been shown to be effective in real-world situations as a safety evaluator for advanced llms. we release shieldlm at \url{} to support accurate and explainable safety detection under various safety standards, contributing to the ongoing efforts to enhance the safety of llms.
Peiling Yi, Arkaitz Zubiaga
Abstract: swear words are a common proxy to collect datasets with cyberbullying incidents. our focus is on measuring and mitigating biases derived from spurious associations between swear words and incidents occurring as a result of such data collection strategies. after demonstrating and quantifying these biases, we introduce id-xcb, the first data-independent debiasing technique that combines adversarial training, bias constraints and debias fine-tuning approach aimed at alleviating model attention to bias-inducing words without impacting overall model performance. we explore id-xcb on two popular session-based cyberbullying datasets along with comprehensive ablation and generalisation studies. we show that id-xcb learns robust cyberbullying detection capabilities while mitigating biases, outperforming state-of-the-art debiasing methods in both performance and bias mitigation. our quantitative and qualitative analyses demonstrate its generalisability to unseen data.
Yihan Wang, Zhouxing Shi, Andrew Bai, Cho-Jui Hsieh
Abstract: although many large language models (llms) have been trained to refuse harmful requests, they are still vulnerable to jailbreaking attacks, which rewrite the original prompt to conceal its harmful intent. in this paper, we propose a new method for defending llms against jailbreaking attacks by ``backtranslation''. specifically, given an initial response generated by the target llm from an input prompt, our backtranslation prompts a language model to infer an input prompt that can lead to the response. the inferred prompt is called the backtranslated prompt which tends to reveal the actual intent of the original prompt, since it is generated based on the llm's response and is not directly manipulated by the attacker. we then run the target llm again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. we explain that the proposed defense provides several benefits on its effectiveness and efficiency. we empirically demonstrate that our defense significantly outperforms the baselines, in the cases that are hard for the baselines, and our defense also has little impact on the generation quality for benign input prompts.
Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract: adversarial misuse, particularly through `jailbreaking' that circumvents a model's safety and ethical protocols, poses a significant challenge for large language models (llms). this paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned llms: intent security recognition followed by response generation. grounded in this hypothesis, we propose codechameleon, a novel jailbreak framework based on personalized encryption tactics. to elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. to guarantee response generation functionality, we embed a decryption function within the instructions, which allows the llm to decrypt and execute the encrypted queries successfully. we conduct extensive experiments on 7 llms, achieving state-of-the-art average attack success rate (asr). remarkably, our method achieves an 86.6\% asr on gpt-4-1106.
Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy
Abstract: much recent work seeks to evaluate values and opinions in large language models (llms) using multiple-choice surveys and questionnaires. most of this work is motivated by concerns around real-world llm applications. for example, politically-biased llms may subtly influence society when they are used by millions of people. such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations: real users do not typically ask llms survey questions. motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in llms and explore more realistic unconstrained evaluations. as a case study, we focus on the popular political compass test (pct). in a systematic review, we find that most prior work using the pct forces models to comply with the pct's multiple-choice format. we show that models give substantively different answers when not forced; that answers change depending on how models are forced; and that answers lack paraphrase robustness. then, we demonstrate that models give different answers yet again in a more realistic open-ended answer setting. we distill these findings into recommendations and open challenges in evaluating values and opinions in llms.
Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu
Abstract: as large language models (llms) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to user inputs is of paramount importance. existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. to address these limitations, we present rainbow teaming, a novel approach for producing a diverse collection of adversarial prompts. rainbow teaming casts adversarial prompt generation as a quality-diversity problem, and uses open-ended search to generate prompts that are both effective and diverse. it can uncover a model's vulnerabilities across a broad range of domains including, in this paper, safety, question answering, and cybersecurity. we also demonstrate that fine-tuning on synthetic data generated by rainbow teaming improves the safety of state-of-the-art llms without hurting their general capabilities and helpfulness, paving the path to open-ended self-improvement.
Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell
Abstract: machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (llms), but there are not yet standardized methods for rigorously evaluating it. in this paper, we first survey techniques and limitations of existing unlearning evaluations. second, we apply a comprehensive set of tests for the robustness and competitiveness of unlearning in the "who's harry potter" (whp) model from eldan and russinovich (2023). while whp's unlearning generalizes well when evaluated with the "familiarity" metric from eldan and russinovich, we find i) higher-than-baseline amounts of knowledge can reliably be extracted, ii) whp performs on par with the original model on harry potter q&a tasks, iii) it represents latent knowledge comparably to the original model, and iv) there is collateral unlearning in related domains. overall, our results highlight the importance of comprehensive unlearning evaluation that avoids ad-hoc metrics.
Fangzhou Wu, Shutong Wu, Yulong Cao, Chaowei Xiao
Abstract: with the fast development of large language models (llms), llm-driven web agents (web agents for short) have obtained tons of attention due to their superior capability where llms serve as the core part of making decisions like the human brain equipped with multiple web tools to actively interact with external deployed websites. as uncountable web agents have been released and such llm systems are experiencing rapid development and drawing closer to widespread deployment in our daily lives, an essential and pressing question arises: "are these web agents secure?". in this paper, we introduce a novel threat, wipi, that indirectly controls web agent to execute malicious instructions embedded in publicly accessible webpages. to launch a successful wipi works in a black-box environment. this methodology focuses on the form and content of indirect instructions within external webpages, enhancing the efficiency and stealthiness of the attack. to evaluate the effectiveness of the proposed methodology, we conducted extensive experiments using 7 plugin-based chatgpt web agents, 8 web gpts, and 3 different open-source web agents. the results reveal that our methodology achieves an average attack success rate (asr) exceeding 90% even in pure black-box scenarios. moreover, through an ablation study examining various user prefix instructions, we demonstrated that the wipi exhibits strong robustness, maintaining high performance across diverse prefix instructions.
Gabriel De Jesus Coelho Da Silva, Carlos Becker Westphall
Abstract: large language models (llms) have quickly risen to prominence due to their ability to perform at or close to the state-of-the-art in a variety of fields while handling natural language. an important field of research is the application of such models at the cybersecurity context. this survey aims to identify where in the field of cybersecurity llms have already been applied, the ways in which they are being used and their limitations in the field. finally, suggestions are made on how to improve such limitations and what can be expected from these systems once these limitations are overcome.
Juan Felipe Gomez, Caio Vieira Machado, Lucas Monteiro Paes, Flavio P. Calmon
Abstract: machine learning (ml) is widely used to moderate online content. despite its scalability relative to human moderation, the use of ml introduces unique challenges to content moderation. one such challenge is predictive multiplicity: multiple competing models for content classification may perform equally well on average, yet assign conflicting predictions to the same content. this multiplicity can result from seemingly innocuous choices during model development, such as random seed selection for parameter initialization. we experimentally demonstrate how content moderation tools can arbitrarily classify samples as toxic, leading to arbitrary restrictions on speech. we discuss these findings in terms of human rights set out by the international covenant on civil and political rights (iccpr), namely freedom of expression, non-discrimination, and procedural justice. we analyze (i) the extent of predictive multiplicity among state-of-the-art llms used for detecting toxic content; (ii) the disparate impact of this arbitrariness across social groups; and (iii) how model multiplicity compares to unambiguous human classifications. our findings indicate that the up-scaled algorithmic moderation risks legitimizing an algorithmic leviathan, where an algorithm disproportionately manages human rights. to mitigate such risks, our study underscores the need to identify and increase the transparency of arbitrariness in content moderation applications. since algorithmic content moderation is being fueled by pressing social concerns, such as disinformation and hate speech, our discussion on harms raises concerns relevant to policy debates. our findings also contribute to content moderation and intermediary liability laws being discussed and passed in many countries, such as the digital services act in the european union, the online safety act in the united kingdom, and the fake news bill in brazil.
Jeffrey G. Wang, Jason Wang, Marvin Li, Seth Neel
Abstract: in this paper we undertake a systematic study of privacy attacks against open source large language models (llms), where an adversary has access to either the model weights, gradients, or losses, and tries to exploit them to learn something about the underlying training data. our headline results are the first membership inference attacks (mias) against pre-trained llms that are able to simultaneously achieve high tprs and low fprs, and a pipeline showing that over $50\%$ (!) of the fine-tuning dataset can be extracted from a fine-tuned llm in natural settings. we consider varying degrees of access to the underlying model, customization of the language model, and resources available to the attacker. in the pre-trained setting, we propose three new white-box mias: an attack based on the gradient norm, a supervised neural network classifier, and a single step loss ratio attack. all outperform existing black-box baselines, and our supervised attack closes the gap between mia attack success against llms and other types of models. in fine-tuning, we find that given access to the loss of the fine-tuned and base models, a fine-tuned loss ratio attack flora is able to achieve near perfect mia peformance. we then leverage these mias to extract fine-tuning data from fine-tuned language models. we find that the pipeline of generating from fine-tuned models prompted with a small snippet of the prefix of each training example, followed by using flora to select the most likely training sample, succeeds the majority of the fine-tuning dataset after only $3$ epochs of fine-tuning. taken together, these findings show that highly effective mias are available in almost all llm training settings, and highlight that great care must be taken before llms are fine-tuned on highly sensitive data and then deployed.
Juyeon Kim, Jeongeun Lee, Yoonho Chang, Chanyeol Choi, Junseong Kim, Jy-Yong Sohn
Abstract: mitigating hallucination issues is one of the main challenges of llms we need to overcome, in order to reliably use them in real-world scenarios. recently, various methods are proposed to check the factual errors in the llm-generated texts and revise them accordingly, to reduce the hallucination issue. in this paper, we propose re-ex, a method of revising llm-generated texts, which introduces a novel step dubbed as the factual error explanation step. re-ex revises the initial response of llms using 3-steps: first, external tools are used to get the evidences on the factual errors in the response; second, llms are instructed to explain the problematic parts of the response based on the evidences gathered in the first step; finally, llms revise the response using the explanation obtained in the second step. in addition to the explanation step, we propose new prompting techniques to reduce the amount of tokens and wall-clock time required for the response revision process. compared with existing methods including factool, cove, and rarr, re-ex provides better revision performance with less time and fewer tokens in multiple benchmarks.
Xinran Zhao, Hongming Zhang, Xiaoman Pan, Wenlin Yao, Dong Yu, Tongshuang Wu, Jianshu Chen
Abstract: for a llm to be trustworthy, its confidence level should be well-calibrated with its actual performance. while it is now common sense that llm performances are greatly impacted by prompts, the confidence calibration in prompting llms has yet to be thoroughly explored. in this paper, we explore how different prompting strategies influence llm confidence calibration and how it could be improved. we conduct extensive experiments on six prompting methods in the question-answering context and we observe that, while these methods help improve the expected llm calibration, they also trigger llms to be over-confident when responding to some instances. inspired by human cognition, we propose fact-and-reflection (far) prompting, which improves the llm calibration in two steps. first, far elicits the known "facts" that are relevant to the input prompt from the llm. and then it asks the model to "reflect" over them to generate the final answer. experiments show that far prompting achieves significantly better calibration; it lowers the expected calibration error by 23.5% on our multi-purpose qa tasks. notably, far prompting even elicits the capability of verbally expressing concerns in less confident scenarios, which helps trigger retrieval augmentation for solving these harder instances.


Hao Wang, Hao Li, Minlie Huang, Lei Sha
Abstract: the safety defense methods of large language models(llms) stays limited because the dangerous prompts are manually curated to just few known attack types, which fails to keep pace with emerging varieties. recent studies found that attaching suffixes to harmful instructions can hack the defense of llms and lead to dangerous outputs. this method, while effective, leaves a gap in understanding the underlying mechanics of such adversarial suffix due to the non-readability and it can be relatively easily seen through by common defense methods such as perplexity cope with this challenge, in this paper, we propose an adversarial suffixes embedding translation framework(asetf) that are able to translate the unreadable adversarial suffixes into coherent, readable text, which makes it easier to understand and analyze the reasons behind harmful content generation by large language models. we conducted experiments on llms such as llama2, vicuna and using the advbench dataset's harmful instructions. the results indicate that our method achieves a much better attack success rate to existing techniques, while significantly enhancing the textual fluency of the prompts. in addition, our approach can be generalized into a broader method for generating transferable adversarial suffixes that can successfully attack multiple llms, even black-box llms, such as chatgpt and gemini. as a result, the prompts generated through our method exhibit enriched semantic diversity, which potentially provides more adversarial examples for llm defense methods.
Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Anh Tuan Luu
Abstract: while reinforcement learning from human feedback (rlhf) significantly enhances the generation quality of large language models (llms), recent studies have raised concerns regarding the complexity and instability associated with the proximal policy optimization (ppo) algorithm, proposing a series of order-based calibration methods as viable alternatives. this paper delves further into current order-based methods, examining their inefficiencies in utilizing reward values and addressing misalignment issues. building upon these findings, we propose a novel \textbf{v}alue-based \textbf{c}ali\textbf{b}ration (vcb) method to better align llms with human preferences. experimental results demonstrate that vcb surpasses existing alignment methods on ai assistant and summarization datasets, providing impressive generalizability, robustness, and stability in diverse settings.
Shuhai Zhang, Yiliao Song, Jiahao Yang, Yuanqing Li, Bo Han, Mingkui Tan
Abstract: large language models (llms) such as chatgpt have exhibited remarkable performance in generating human-like texts. however, machine-generated texts (mgts) may carry critical risks, such as plagiarism issues, misleading information, or hallucination issues. therefore, it is very urgent and important to detect mgts in many situations. unfortunately, it is challenging to distinguish mgts and human-written texts because the distributional discrepancy between them is often very subtle due to the remarkable performance of llms. in this paper, we seek to exploit \textit{maximum mean discrepancy} (mmd) to address this issue in the sense that mmd can well identify distributional discrepancies. however, directly training a detector with mmd using diverse mgts will incur a significantly increased variance of mmd since mgts may contain \textit{multiple text populations} due to various llms. this will severely impair mmd's ability to measure the difference between two samples. to tackle this, we propose a novel \textit{multi-population} aware optimization method for mmd called mmd-mp, which can \textit{avoid variance increases} and thus improve the stability to measure the distributional discrepancy. relying on mmd-mp, we develop two methods for paragraph-based and sentence-based detection, respectively. extensive experiments on various llms, \eg, gpt2 and chatgpt, show superior detection performance of our mmd-mp. the source code is available at \url{}.
Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith
Abstract: advances in generative models have made it possible for ai-generated text, code, and images to mirror human-generated content in many applications. watermarking, a technique that aims to embed information in the output of a model to verify its source, is useful for mitigating misuse of such ai-generated content. however, existing watermarking schemes remain surprisingly susceptible to attack. in particular, we show that desirable properties shared by existing llm watermarking systems such as quality preservation, robustness, and public detection apis can in turn make these systems vulnerable to various attacks. we rigorously study potential attacks in terms of common watermark design choices, and propose best practices and defenses for mitigation -- establishing a set of practical guidelines for embedding and detection of llm watermarks.
Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang
Abstract: aligned large language models (llms) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted llms and fool them into generating objectionable content. while initial defenses show promise against token-based threat models, there do not exist defenses that provide robustness against semantic attacks and avoid unfavorable trade-offs between robustness and nominal performance. to meet this need, we propose semanticsmooth, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. experimental results demonstrate that semanticsmooth achieves state-of-the-art robustness against gcg, pair, and autodan attacks while maintaining strong nominal performance on instruction following benchmarks such as instructionfollowing and alpacaeval. the codes will be publicly available at
Cem Uluoglakci, Tugba Taskaya Temizel
Abstract: hallucinations pose a significant challenge to the reliability and alignment of large language models (llms), limiting their widespread acceptance beyond chatbot applications. despite ongoing efforts, hallucinations remain a prevalent challenge in llms. the detection of hallucinations itself is also a formidable task, frequently requiring manual labeling or constrained evaluations. this paper introduces an automated scalable framework that combines benchmarking llms' hallucination tendencies with efficient hallucination detection. we leverage llms to generate challenging tasks related to hypothetical phenomena, subsequently employing them as agents for efficient hallucination detection. the framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain. we introduce the publicly available hypotermqa benchmarking dataset, on which state-of-the-art models' performance ranged between 3% and 11%, and evaluator agents demonstrated a 6% error rate in hallucination prediction. the proposed framework provides opportunities to test and improve llms. additionally, it has the potential to generate benchmarking datasets tailored to specific domains, such as law, health, and finance.
Rishi Bommasani, Kevin Klyman, Shayne Longpre, Betty Xiong, Sayash Kapoor, Nestor Maslej, Arvind Narayanan, Percy Liang
Abstract: foundation models are critical digital technologies with sweeping societal impact that necessitates transparency. to codify how foundation model developers should provide transparency about the development and deployment of their models, we propose foundation model transparency reports, drawing upon the transparency reporting practices in social media. while external documentation of societal harms prompted social media transparency reports, our objective is to institutionalize transparency reporting for foundation models while the industry is still nascent. to design our reports, we identify 6 design principles given the successes and shortcomings of social media transparency reporting. to further schematize our reports, we draw upon the 100 transparency indicators from the foundation model transparency index. given these indicators, we measure the extent to which they overlap with the transparency requirements included in six prominent government policies (e.g., the eu ai act, the us executive order on safe, secure, and trustworthy ai). well-designed transparency reports could reduce compliance costs, in part due to overlapping regulatory requirements across different jurisdictions. we encourage foundation model developers to regularly publish transparency reports, building upon recommendations from the g7 and the white house.
Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh
Abstract: the safety alignment of large language models (llms) is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger llms to output harmful content. however, current methods for jailbreaking llms, which nest entire harmful prompts, are not effective at concealing malicious intent and can be easily identified and rejected by well-aligned llms. this paper discovers that decomposing a malicious prompt into separated sub-prompts can effectively obscure its underlying malicious intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. we introduce an automatic prompt \textbf{d}ecomposition and \textbf{r}econstruction framework for jailbreak \textbf{attack} (drattack). drattack includes three key components: (a) `decomposition' of the original prompt into sub-prompts, (b) `reconstruction' of these sub-prompts implicitly by in-context learning with semantically similar but harmless reassembling demo, and (c) a `synonym search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while jailbreaking llms. an extensive empirical study across multiple open-source and closed-source llms demonstrates that, with a significantly reduced number of queries, drattack obtains a substantial gain of success rate over prior sota prompt-only attackers. notably, the success rate of 78.0\% on gpt-4 with merely 15 queries surpassed previous art by 33.1\%.


Daoyuan Wu, Shuai Wang, Yang Liu, Ning Liu
Abstract: jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (llms). a considerable amount of research exists proposing more effective jailbreak attacks, including the recent greedy coordinate gradient (gcg) attack, jailbreak template-based attacks such as using "do-anything-now" (dan), and multilingual jailbreak. in contrast, the defensive side has been relatively less explored. this paper proposes a lightweight yet practical defense called selfdefend, which can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts. our key insight is that regardless of the kind of jailbreak strategies employed, they eventually need to include a harmful prompt (e.g., "how to make a bomb") in the prompt sent to llms, and we found that existing llms can effectively recognize such harmful prompts that violate their safety policies. based on this insight, we design a shadow stack that concurrently checks whether a harmful prompt exists in the user prompt and triggers a checkpoint in the normal stack once a token of "no" or a harmful prompt is output. the latter could also generate an explainable llm response to adversarial prompts. we demonstrate our idea of selfdefend works in various jailbreak scenarios through manual analysis in gpt-3.5/4. we also list three future directions to further enhance selfdefend.
Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
Abstract: large language models (llms) have emerged as a promising alternative to expensive human evaluations. however, the alignment and coverage of llm-based evaluations are often limited by the scope and potential bias of the evaluation prompts and criteria. to address this challenge, we propose hd-eval, a novel framework that iteratively aligns llm-based evaluators with human preference via hierarchical criteria decomposition. hd-eval inherits the essence from the evaluation mindset of human experts and enhances the alignment of llm-based evaluators by decomposing a given evaluation task into finer-grained criteria, aggregating them according to estimated human preferences, pruning insignificant criteria with attribution, and further decomposing significant criteria. by integrating these steps within an iterative alignment training process, we obtain a hierarchical decomposition of criteria that comprehensively captures aspects of natural language at multiple levels of granularity. implemented as a white box, the human preference-guided aggregator is efficient to train and more explainable than relying solely on prompting, and its independence from model parameters makes it applicable to closed-source llms. extensive experiments on three evaluation domains demonstrate the superiority of hd-eval in further aligning state-of-the-art evaluators and providing deeper insights into the explanation of evaluation results and the task itself.
Timothy R. Mcintosh, Teo Susnjak, Tong Liu, Paul Watters, Raza Nowrozy, Malka N. Halgamuge
Abstract: this study investigated the integration readiness of four predominant cybersecurity governance, risk and compliance (grc) frameworks - nist csf 2.0, cobit 2019, iso 27001:2022, and the latest iso 42001:2023 - for the opportunities, risks, and regulatory compliance when adopting large language models (llms), using qualitative content analysis and expert validation. our analysis, with both llms and human experts in the loop, uncovered potential for llm integration together with inadequacies in llm risk oversight of those frameworks. comparative gap analysis has highlighted that the new iso 42001:2023, specifically designed for artificial intelligence (ai) management systems, provided most comprehensive facilitation for llm opportunities, whereas cobit 2019 aligned most closely with the impending european union ai act. nonetheless, our findings suggested that all evaluated frameworks would benefit from enhancements to more effectively and more comprehensively address the multifaceted risks associated with llms, indicating a critical and time-sensitive need for their continuous evolution. we propose integrating human-expert-in-the-loop validation processes as crucial for enhancing cybersecurity frameworks to support secure and compliant llm integration, and discuss implications for the continuous evolution of cybersecurity grc frameworks to support the secure integration of llms.
Oliver Sourbut, Lewis Hammond, Harriet Wood
Abstract: many settings of interest involving humans and machines -- from virtual personal assistants to autonomous vehicles -- can naturally be modelled as principals (humans) delegating to agents (machines), which then interact with each other on their principals' behalf. we refer to these multi-principal, multi-agent scenarios as delegation games. in such games, there are two important failure modes: problems of control (where an agent fails to act in line their principal's preferences) and problems of cooperation (where the agents fail to work well together). in this paper we formalise and analyse these problems, further breaking them down into issues of alignment (do the players have similar preferences?) and capabilities (how competent are the players at satisfying those preferences?). we show -- theoretically and empirically -- how these measures determine the principals' welfare, how they can be estimated using limited observations, and thus how they might be used to help us design more aligned and cooperative ai systems.
Aleksa Sukovic, Goran Radanovic
Abstract: equipping agents with the capacity to justify made decisions using supporting evidence represents a cornerstone of accountable decision-making. furthermore, ensuring that justifications are in line with human expectations and societal norms is vital, especially in high-stakes situations such as healthcare. in this work, we propose the use of a debate-based reward model for reinforcement learning agents, where the outcome of a zero-sum debate game quantifies the justifiability of a decision in a particular state. this reward model is then used to train a justifiable policy, whose decisions can be more easily corroborated with supporting evidence. in the debate game, two argumentative agents take turns providing supporting evidence for two competing decisions. given the proposed evidence, a proxy of a human judge evaluates which decision is better justified. we demonstrate the potential of our approach in learning policies for prescribing and justifying treatment decisions of septic patients. we show that augmenting the reward with the feedback signal generated by the debate-based reward model yields policies highly favored by the judge when compared to the policy obtained solely from the environment rewards, while hardly sacrificing any performance. moreover, in terms of the overall performance and justifiability of trained policies, the debate-based feedback is comparable to the feedback obtained from an ideal judge proxy that evaluates decisions using the full information encoded in the state. this suggests that the debate game outputs key information contained in states that is most relevant for evaluating decisions, which in turn substantiates the practicality of combining our approach with human-in-the-loop evaluations. lastly, we showcase that agents trained via multi-agent debate learn to propose evidence that is resilient to refutations and closely aligns with human preferences.
Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, Atul Prakash
Abstract: large language models (llms) are typically aligned to be harmless to humans. unfortunately, recent work has shown that such models are susceptible to automated jailbreak attacks that induce them to generate harmful content. more recent llms often incorporate an additional layer of defense, a guard model, which is a second llm that is designed to check and moderate the output response of the primary llm. our key contribution is to show a novel attack strategy, prp, that is successful against several open-source (e.g., llama 2) and closed-source (e.g., gpt 3.5) implementations of guard models. prp leverages a two step prefix-based attack that operates by (a) constructing a universal adversarial prefix for the guard model, and (b) propagating this prefix to the response. we find that this procedure is effective across multiple threat models, including ones in which the adversary has no access to the guard model at all. our work suggests that further advances are required on defenses and guard models before they can be considered effective.
Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Ge Li
Abstract: recent statements about the impressive capabilities of large language models (llms) are usually supported by evaluating on open-access benchmarks. considering the vast size and wide-ranging sources of llms' training data, it could explicitly or implicitly include test data, leading to llms being more susceptible to data contamination. however, due to the opacity of training data, the black-box access of models, and the rapid growth of synthetic training data, detecting and mitigating data contamination for llms faces significant challenges. in this paper, we propose cdd, which stands for contamination detection via output distribution for llms. cdd necessitates only the sampled texts to detect data contamination, by identifying the peakedness of llm's output distribution. to mitigate the impact of data contamination in evaluation, we also present ted: trustworthy evaluation via output distribution, based on the correction of llm's output distribution. to facilitate this study, we introduce two benchmarks, i.e., detcon and comieval, for data contamination detection and contamination mitigation evaluation tasks. extensive experimental results show that cdd achieves the average relative improvements of 21.8\%-30.2\% over other contamination detection approaches in terms of accuracy, f1 score, and auc metrics, and can effectively detect contamination caused by the variants of test data. ted significantly mitigates performance improvements up to 66.9\% attributed to data contamination across 24 settings and 21 contamination degrees. in real-world applications, we reveal that chatgpt exhibits a high potential to suffer from data contamination on humaneval benchmark.
Md Tawkat Islam Khondaker, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan
Abstract: prior works on detoxification are scattered in the sense that they do not cover all aspects of detoxification needed in a real-world scenario. notably, prior works restrict the task of developing detoxification models to only a seen subset of platforms, leaving the question of how the models would perform on unseen platforms unexplored. additionally, these works do not address non-detoxifiability, a phenomenon whereby the toxic text cannot be detoxified without altering the meaning. we propose greenllama, the first comprehensive end-to-end detoxification framework, which attempts to alleviate the aforementioned limitations. we first introduce a cross-platform pseudo-parallel corpus applying multi-step data processing and generation strategies leveraging chatgpt. we then train a suite of detoxification models with our cross-platform corpus. we show that our detoxification models outperform the sota model trained with human-annotated parallel corpus. we further introduce explanation to promote transparency and trustworthiness. greenllama additionally offers a unique paraphrase detector especially dedicated for the detoxification task to tackle the non-detoxifiable cases. through experimental analysis, we demonstrate the effectiveness of our cross-platform corpus and the robustness of greenllama against adversarial toxicity.


Zejun Zhang, Li Zhang, Xin Yuan, Anlan Zhang, Mengwei Xu, Feng Qian
Abstract: with the advancement of large language models (llms), increasingly sophisticated and powerful gpts are entering the market. despite their popularity, the llm ecosystem still remains unexplored. additionally, llms' susceptibility to attacks raises concerns over safety and plagiarism. thus, in this work, we conduct a pioneering exploration of gpt stores, aiming to study vulnerabilities and plagiarism within gpt applications. to begin with, we conduct, to our knowledge, the first large-scale monitoring and analysis of two stores, an unofficial, and an official openai gpt store. then, we propose a trilevel gpt reversing (t-gr) strategy for extracting gpt internals. to complete these two tasks efficiently, we develop two automated tools: one for web scraping and another designed for programmatically interacting with gpts. our findings reveal a significant enthusiasm among users and developers for gpt interaction and creation, as evidenced by the rapid increase in gpts and their creators. however, we also uncover a widespread failure to protect gpt internals, with nearly 90% of system prompts easily accessible, leading to considerable plagiarism and duplication among gpts.
Heegyu Kim, Sehyun Yuk, Hyunsouk Cho
Abstract: caution: this paper includes offensive words that could potentially cause unpleasantness. language models (lms) are vulnerable to exploitation for adversarial misuse. training lms for safety alignment is extensive and makes it hard to respond to fast-developing attacks immediately, such as jailbreaks. we propose self-refine with formatting that achieves outstanding safety even in non-safety-aligned lms and evaluate our method alongside several defense baselines, demonstrating that it is the safest training-free method against jailbreak attacks. additionally, we proposed a formatting method that improves the efficiency of the self-refine process while reducing attack success rates in fewer iterations. we've also observed that non-safety-aligned lms outperform safety-aligned lms in safety tasks by giving more helpful and safe responses. in conclusion, our findings can achieve less safety risk with fewer computational costs, allowing non-safety lm to be easily utilized in real-world service.
Xin Yi, Linlin Wang, Xiaoling Wang, Liang He
Abstract: impressive results have been achieved in natural language processing (nlp) tasks through the training of large language models (llms). however, these models occasionally produce toxic content such as insults, threats, and profanity in response to certain prompts, thereby constraining their practical utility. to tackle this issue, various finetuning-based and decoding-based approaches have been utilized to mitigate toxicity. however, these methods typically necessitate additional costs such as high-quality training data or auxiliary models. in this paper, we propose fine-grained detoxification via instance-level prefixes (fgdilp) to mitigate toxic text without additional cost. specifically, fgdilp contrasts the contextualized representation in attention space using a positive prefix-prepended prompt against multiple negative prefix-prepended prompts at the instance level. this allows for constructing fine-grained subtoxicity vectors, which enables collaborative detoxification by fusing them to correct the normal generation process when provided with a raw prompt. we validate that fgdilp enables controlled text generation with regard to toxicity at both the utterance and context levels. our method surpasses prompt-based baselines in detoxification, although at a slight cost to generation fluency and diversity.
Yiping Jin, Leo Wanner, Alexander Shvets
Abstract: online hate detection suffers from biases incurred in data sampling, annotation, and model pre-training. therefore, measuring the averaged performance over all examples in held-out test data is inadequate. instead, we must identify specific model weaknesses and be informed when it is more likely to fail. a recent proposal in this direction is hatecheck, a suite for testing fine-grained model functionalities on synthesized data generated using templates of the kind "you are just a [slur] to me." however, despite enabling more detailed diagnostic insights, the hatecheck test cases are often generic and have simplistic sentence structures that do not match the real-world data. to address this limitation, we propose gpt-hatecheck, a framework to generate more diverse and realistic functional tests from scratch by instructing large language models (llms). we employ an additional natural language inference (nli) model to verify the generations. crowd-sourced annotation demonstrates that the generated test cases are of high quality. using the new functional tests, we can uncover model weaknesses that would be overlooked using the original hatecheck dataset.
Somnath Banerjee, Sayan Layek, Rima Hazra, Animesh Mukherjee
Abstract: in this study, we tackle a growing concern around the safety and ethical use of large language models (llms). despite their potential, these models can be tricked into producing harmful or unethical content through various sophisticated methods, including 'jailbreaking' techniques and targeted manipulation. our work zeroes in on a specific issue: to what extent llms can be led astray by asking them to generate responses that are instruction-centric such as a pseudocode, a program or a software snippet as opposed to vanilla text. to investigate this question, we introduce techhazardqa, a dataset containing complex queries which should be answered in both text and instruction-centric formats (e.g., pseudocodes), aimed at identifying triggers for unethical responses. we query a series of llms -- llama-2-13b, llama-2-7b, mistral-v2 and mistral 8x7b -- and ask them to generate both text and instruction-centric responses. for evaluation we report the harmfulness score metric as well as judgements from gpt-4 and humans. overall, we observe that asking llms to produce instruction-centric responses enhances the unethical response generation by ~2-38% across the models. as an additional objective, we investigate the impact of model editing using the rome technique, which further increases the propensity for generating undesirable content. in particular, asking edited llms to generate instruction-centric responses further increases the unethical response generation by ~3-16% across the different models.
Zijie J. Wang, Chinmay Kulkarni, Lauren Wilcox, Michael Terry, Michael Madaio
Abstract: prompt-based interfaces for large language models (llms) have made prototyping and building ai-powered applications easier than ever before. however, identifying potential harms that may arise from ai applications remains a challenge, particularly during prompt-based prototyping. to address this, we present farsight, a novel in situ interactive tool that helps people identify potential harms from the ai applications they are prototyping. based on a user's prompt, farsight highlights news articles about relevant ai incidents and allows users to explore and edit llm-generated use cases, stakeholders, and harms. we report design insights from a co-design study with 10 ai prototypers and findings from a user study with 42 ai prototypers. after using farsight, ai prototypers in our user study are better able to independently identify potential harms associated with a prompt and find our tool more useful and usable than existing resources. their qualitative feedback also highlights that farsight encourages them to focus on end-users and think beyond immediate harms. we discuss these findings and reflect on their implications for designing ai prototyping experiences that meaningfully engage with ai harms. farsight is publicly accessible at:
Yiran Liu, Ke Yang, Zehan Qi, Xiao Liu, Yang Yu, Chengxiang Zhai
Abstract: the growing integration of large language models (llms) into social operations amplifies their impact on decisions in crucial areas such as economics, law, education, and healthcare, raising public concerns about these models' discrimination-related safety and reliability. however, prior discrimination measuring frameworks solely assess the average discriminatory behavior of llms, often proving inadequate due to the overlook of an additional discrimination-leading factor, i.e., the llms' prediction variation across diverse contexts. in this work, we present the prejudice-caprice framework (pcf) that comprehensively measures discrimination in llms by considering both their consistently biased preference and preference variation across diverse contexts. specifically, we mathematically dissect the aggregated contextualized discrimination risk of llms into prejudice risk, originating from llms' persistent prejudice, and caprice risk, stemming from their generation inconsistency. in addition, we utilize a data-mining approach to gather preference-detecting probes from sentence skeletons, devoid of attribute indications, to approximate llms' applied contexts. while initially intended for assessing discrimination in llms, our proposed pcf facilitates the comprehensive and flexible measurement of any inductive biases, including knowledge alongside prejudice, across various modality models. we apply our discrimination-measuring framework to 12 common llms, yielding intriguing findings: i) modern llms demonstrate significant pro-male stereotypes, ii) llms' exhibited discrimination correlates with several social and economic factors, iii) prejudice risk dominates the overall discrimination risk and follows a normal distribution, and iv) caprice risk contributes minimally to the overall risk but follows a fat-tailed distribution, suggesting that it is wild risk requiring enhanced surveillance.
Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, Soheil Feizi
Abstract: in this paper, we introduce a novel class of fast, beam search-based adversarial attack (beast) for language models (lms). beast employs interpretable parameters, enabling attackers to balance between attack speed, success rate, and the readability of adversarial prompts. the computational efficiency of beast facilitates us to investigate its applications on lms for jailbreaking, eliciting hallucinations, and privacy attacks. our gradient-free targeted attack can jailbreak aligned lms with high attack success rates within one minute. for instance, beast can jailbreak vicuna-7b-v1.5 under one minute with a success rate of 89% when compared to a gradient-based baseline that takes over an hour to achieve 70% success rate using a single nvidia rtx a6000 48gb gpu. additionally, we discover a unique outcome wherein our untargeted attack induces hallucinations in lm chatbots. through human evaluations, we find that our untargeted attack causes vicuna-7b-v1.5 to produce ~15% more incorrect outputs when compared to lm outputs in the absence of our attack. we also learn that 22% of the time, beast causes vicuna to generate outputs that are not relevant to the original prompt. further, we use beast to generate adversarial prompts in a few seconds that can boost the performance of existing membership inference attacks for lms. we believe that our fast attack, beast, has the potential to accelerate research in lm security and privacy. our codebase is publicly available at
Ante Wang, Linfeng Song, Baolin Peng, Ye Tian, Lifeng Jin, Haitao Mi, Jinsong Su, Dong Yu
Abstract: this work studies improving large language model (llm) generations at inference time by mitigating fact-conflicting hallucinations. particularly, we propose a self-endorsement framework that leverages the fine-grained fact-level comparisons across multiple sampled responses. compared with prior ensemble methods (wang et al., 2022;chen et al., 2023)) that perform response-level selection, our approach can better alleviate hallucinations, especially for longform generation tasks. our approach can broadly benefit smaller and open-source llms as it mainly conducts simple content-based comparisons. experiments on biographies show that our method can effectively improve the factuality of generations with simple and intuitive prompts across different scales of llms. besides, comprehensive analyses on triviaqa and gsm8k demonstrate the potential of self-endorsement for broader application.
Zihan Zhou, Jonathan Booher, Wei Liu, Aleksandr Petiushko, Animesh Garg
Abstract: safe reinforcement learning tasks with multiple constraints are a challenging domain despite being very common in the real world. to address this challenge, we propose objective suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic. we benchmark objective suppression in two multi-constraint safety domains, including an autonomous driving domain where any incorrect behavior can lead to disastrous consequences. empirically, we demonstrate that our proposed method, when combined with existing safe rl algorithms, can match the task reward achieved by our baselines with significantly fewer constraint violations.
Zhenhua Wang, Wei Xie, Baosheng Wang, Enze Wang, Zhiwen Gui, Shuoyoucheng Ma, Kai Chen
Abstract: large language models (llms) have gradually become the gateway for people to acquire new knowledge. however, attackers can break the model's security protection ("jail") to access restricted information, which is called "jailbreaking." previous studies have shown the weakness of current llms when confronted with such jailbreaking attacks. nevertheless, comprehension of the intrinsic decision-making mechanism within the llms upon receipt of jailbreak prompts is noticeably lacking. our research provides a psychological explanation of the jailbreak prompts. drawing on cognitive consistency theory, we argue that the key to jailbreak is guiding the llm to achieve cognitive coordination in an erroneous direction. further, we propose an automatic black-box jailbreaking method based on the foot-in-the-door (fitd) technique. this method progressively induces the model to answer harmful questions via multi-step incremental prompts. we instantiated a prototype system to evaluate the jailbreaking effectiveness on 8 advanced llms, yielding an average success rate of 83.9%. this study builds a psychological perspective on the explanatory insights into the intrinsic decision-making logic of llms.
Shenglai Zeng, Jiankun Zhang, Pengfei He, Yue Xing, Yiding Liu, Han Xu, Jie Ren, Shuaiqiang Wang, Dawei Yin, Yi Chang, Jiliang Tang
Abstract: retrieval-augmented generation (rag) is a powerful technique to facilitate language model with proprietary and private data, where data privacy is a pivotal concern. whereas extensive research has demonstrated the privacy risks of large language models (llms), the rag technique could potentially reshape the inherent behaviors of llm generation, posing new privacy issues that are currently under-explored. in this work, we conduct extensive empirical studies with novel attack methods, which demonstrate the vulnerability of rag systems on leaking the private retrieval database. despite the new risk brought by rag on the retrieval data, we further reveal that rag can mitigate the leakage of the llms' training data. overall, we provide new insights in this paper for privacy protection of retrieval-augmented llms, which benefit both llms and rag systems builders. our code is available at


Ang Li, Jingqian Zhao, Bin Liang, Lin Gui, Hui Wang, Xi Zeng, Kam-Fai Wong, Ruifeng Xu
Abstract: large language models (llms) have achieved remarkable progress in many natural language processing tasks. however, our experiment reveals that, in stance detection tasks, llms may generate biased stances due to spurious sentiment-stance correlation and preference towards certain individuals and topics, thus harming their performance. therefore, in this paper, we propose to mitigate biases of llms in stance detection with calibration (mb-cal). in which, a novel gated calibration network is devised to mitigate the biases on the stance reasoning results from llms. further, to make the calibration more accurate and generalizable, we construct counterfactual augmented data to rectify stance biases. experimental results on in-target and zero-shot stance detection tasks show that the proposed mb-cal can effectively mitigate biases of llms, achieving state-of-the-art results.
Chen Jia
Abstract: preference learning (pl) with large language models (llms) aims to align the llms' generations with human preferences. previous work on reinforcement learning from human feedback (rlhf) has demonstrated promising results in in-distribution pl. however, due to the difficulty of obtaining human feedback, discretely training reward models for every encountered distribution is challenging. thus, out-of-distribution (ood) pl is practically useful for enhancing the generalization ability of llms with limited preference feedback. this work addresses ood pl by optimizing a general reward model through a meta-learning approach. during meta-training, a bilevel optimization algorithm is utilized to learn a reward model capable of guiding policy learning to align with human preferences across various distributions. when encountering a test distribution, the meta-test procedure conducts regularized policy optimization using the learned reward model for pl. we theoretically demonstrate the convergence rate of the bilevel optimization algorithm under reasonable assumptions. additionally, we conduct experiments on two text generation tasks across 20 held-out domains and outperform a variety of strong baselines across various evaluation metrics.
Yuzhe Yang, Yujia Liu, Xin Liu, Avanti Gulhane, Domenico Mastrodicasa, Wei Wu, Edward J Wang, Dushyant W Sahani, Shwetak Patel
Abstract: advances in artificial intelligence (ai) have achieved expert-level performance in medical imaging applications. notably, self-supervised vision-language foundation models can detect a broad spectrum of pathologies without relying on explicit training annotations. however, it is crucial to ensure that these ai models do not mirror or amplify human biases, thereby disadvantaging historically marginalized groups such as females or black patients. the manifestation of such biases could systematically delay essential medical care for certain patient subgroups. in this study, we investigate the algorithmic fairness of state-of-the-art vision-language foundation models in chest x-ray diagnosis across five globally-sourced datasets. our findings reveal that compared to board-certified radiologists, these foundation models consistently underdiagnose marginalized groups, with even higher rates seen in intersectional subgroups, such as black female patients. such demographic biases present over a wide range of pathologies and demographic attributes. further analysis of the model embedding uncovers its significant encoding of demographic information. deploying ai systems with these biases in medical imaging can intensify pre-existing care disparities, posing potential challenges to equitable healthcare access and raising ethical questions about their clinical application.
Priyanshul Govil, Vamshi Krishna Bonagiri, Manas Gaur, Ponnurangam Kumaraguru, Sanorita Dey
Abstract: large language models (llms) are trained on inherently biased data. previous works on debiasing models rely on benchmark datasets to measure model performance. however, these datasets suffer from several pitfalls due to the extremely subjective understanding of bias, highlighting a critical need for contextual exploration. we propose understanding the context of user inputs with consideration of the diverse situations in which input statements are possible. this approach would allow for frameworks that foster bias awareness rather than guardrails that hurt user engagement. our contribution is twofold: (i) we create a dataset of 2287 stereotyped statements augmented with points for adding context; (ii) we develop the context-oriented bias indicator and assessment score (cobias) to assess statements' contextual reliability in measuring bias. our metric is a significant predictor of the contextual reliability of bias-benchmark datasets ($\chi^2=71.02, p<2.2 \cdot 10^{-16})$. cobias can be used to create reliable datasets, resulting in an improvement in bias mitigation works.
Oliver Bentham, Nathan Stringham, Ana Marasović
Abstract: understanding the extent to which chain-of-thought (cot) generations align with a large language model's (llm) internal computations is critical for deciding whether to trust an llm's output. as a proxy for cot faithfulness, arxiv:2307.13702 propose a metric that measures a model's dependence on its cot for producing an answer. within a single family of proprietary models, they find that llms exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13 billion parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size. we evaluate whether these results generalize as a property of all llms. we replicate their experimental setup with three different families of models and, under specific conditions, successfully reproduce the scaling trends for cot faithfulness they report. however, we discover that simply changing the order of answer choices in the prompt can reduce the metric by 73 percentage points. the faithfulness metric is also highly correlated ($r^2$ = 0.91) with accuracy, raising doubts about its validity as a construct for evaluating faithfulness.
Zefeng Wang, Zhen Han, Shuo Chen, Fan Xue, Zifeng Ding, Xun Xiao, Volker Tresp, Philip Torr, Jindong Gu
Abstract: recently, multimodal llms (mllms) have shown a great ability to understand images. however, like traditional vision models, they are still vulnerable to adversarial images. meanwhile, chain-of-thought (cot) reasoning has been widely explored on mllms, which not only improves model's performance, but also enhances model's explainability by giving intermediate reasoning steps. nevertheless, there is still a lack of study regarding mllms' adversarial robustness with cot and an understanding of what the rationale looks like when mllms infer wrong answers with adversarial images. our research evaluates the adversarial robustness of mllms when employing cot reasoning, finding that cot marginally improves adversarial robustness against existing attack methods. moreover, we introduce a novel stop-reasoning attack technique that effectively bypasses the cot-induced robustness enhancements. finally, we demonstrate the alterations in cot reasoning when mllms confront adversarial images, shedding light on their reasoning process under adversarial attacks.
Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick Mcdaniel, Muhao Chen, Bo Li, Chaowei Xiao
Abstract: despite the general capabilities of large language models (llms) like gpt-4 and llama-2, these models still request fine-tuning or adaptation with customized data when it comes to meeting the specific business demands and intricacies of tailored use cases. however, this process inevitably introduces new safety threats, particularly against the fine-tuning based jailbreak attack (fjattack), where incorporating just a few harmful examples into the fine-tuning dataset can significantly compromise the model safety. though potential defenses have been proposed by incorporating safety examples into the fine-tuning dataset to reduce the safety issues, such approaches require incorporating a substantial amount of safety examples, making it inefficient. to effectively defend against the fjattack with limited safety examples, we propose a backdoor enhanced safety alignment method inspired by an analogy with the concept of backdoor attacks. in particular, we construct prefixed safety examples by integrating a secret prompt, acting as a "backdoor trigger", that is prefixed to safety examples. our comprehensive experiments demonstrate that through the backdoor enhanced safety alignment with adding as few as 11 prefixed safety examples, the maliciously fine-tuned llms will achieve similar safety performance as the original aligned models. furthermore, we also explore the effectiveness of our method in a more practical setting where the fine-tuning data consists of both fjattack examples and the fine-tuning task data. our method shows great efficacy in defending against fjattack without harming the performance of fine-tuning tasks.
Victoria Lin, Eli Ben-Michael, Louis-Philippe Morency
Abstract: as large language models (llms) see greater use in academic and commercial settings, there is increasing interest in methods that allow language models to generate texts aligned with human preferences. in this paper, we present an initial exploration of language model optimization for human preferences from direct outcome datasets, where each sample consists of a text and an associated numerical outcome measuring the reader's response. we first propose that language model optimization should be viewed as a causal problem to ensure that the model correctly learns the relationship between the text and the outcome. we formalize this causal language optimization problem, and we develop a method--causal preference optimization (cpo)--that solves an unbiased surrogate objective for the problem. we further extend cpo with doubly robust cpo (dr-cpo), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. finally, we empirically demonstrate the effectiveness of (dr-)cpo in optimizing state-of-the-art llms for human preferences on direct outcome data, and we validate the robustness of dr-cpo under difficult confounding conditions.
Michael J. Ryan, William Held, Diyi Yang
Abstract: before being deployed for user-facing applications, developers align large language models (llms) to user preferences through a variety of procedures, such as reinforcement learning from human feedback (rlhf) and direct preference optimization (dpo). current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. however, human preferences are not universal, and aligning to specific preference sets may have unintended effects. we explore how alignment impacts performance along three axes of global representation: english dialects, multilingualism, and opinions from and about countries worldwide. our results show that current alignment procedures create disparities between english dialects and global opinions. we find alignment improves capabilities in several languages. we conclude by discussing design decisions that led to these unintended impacts and recommendations for more equitable preference tuning.
Yang Deng, Yong Zhao, Moxin Li, See-Kiong Ng, Tat-Seng Chua
Abstract: despite the remarkable abilities of large language models (llms) to answer questions, they often display a considerable level of overconfidence even when the question does not have a definitive answer. to avoid providing hallucinated answers to these unknown questions, existing studies typically investigate approaches to refusing to answer these questions. in this work, we propose a novel and scalable self-alignment method to utilize the llm itself to enhance its response-ability to different types of unknown questions, being capable of not only refusing to answer but also providing explanation to the unanswerability of unknown questions. specifically, the self-align method first employ a two-stage class-aware self-augmentation approach to generate a large amount of unknown question-response data. then we conduct disparity-driven self-curation to select qualified data for fine-tuning the llm itself for aligning the responses to unknown questions as desired. experimental results on two datasets across four types of unknown questions validate the superiority of the self-align method over existing baselines in terms of three types of task formulation.
Yuwei Wu, Shijing Si, Yugui Zhang, Jiawen Gu, Jedrek Wosik
Abstract: email continues to be a pivotal and extensively utilized communication medium within professional and commercial domains. nonetheless, the prevalence of spam emails poses a significant challenge for users, disrupting their daily routines and diminishing productivity. consequently, accurately identifying and filtering spam based on content has become crucial for cybersecurity. recent advancements in natural language processing, particularly with large language models like chatgpt, have shown remarkable performance in tasks such as question answering and text generation. however, its potential in spam identification remains underexplored. to fill in the gap, this study attempts to evaluate chatgpt's capabilities for spam identification in both english and chinese email datasets. we employ chatgpt for spam email detection using in-context learning, which requires a prompt instruction and a few demonstrations. we also investigate how the training example size affects the performance of chatgpt. for comparison, we also implement five popular benchmark methods, including naive bayes, support vector machines (svm), logistic regression (lr), feedforward dense neural networks (dnn), and bert classifiers. though extensive experiments, the performance of chatgpt is significantly worse than deep supervised learning methods in the large english dataset, while it presents superior performance on the low-resourced chinese dataset, even outperforming bert in this case.


Lingxi Zhang, Yue Yu, Kuan Wang, Chao Zhang
Abstract: retrieval-augmented generation enhances large language models (llms) by incorporating relevant information from external knowledge sources. this enables llms to adapt to specific domains and mitigate hallucinations in knowledge-intensive tasks. however, existing retrievers are often misaligned with llms due to their separate training processes and the black-box nature of llms. to address this challenge, we propose arl2, a retriever learning technique that harnesses llms as labelers. arl2 leverages llms to annotate and score relevant evidence, enabling learning the retriever from robust llm supervision. furthermore, arl2 uses an adaptive self-training strategy for curating high-quality and diverse relevance data, which can effectively reduce the annotation cost. extensive experiments demonstrate the effectiveness of arl2, achieving accuracy improvements of 5.4% on nq and 4.6% on mmlu compared to the state-of-the-art methods. additionally, arl2 exhibits robust transfer learning capabilities and strong zero-shot generalization abilities. our code will be published at \url{}.
Jiyoung Lee, Minwoo Kim, Seungho Kim, Junghwan Kim, Seunghyun Won, Hwaran Lee, Edward Choi
Abstract: for large language models (llms) to be effectively deployed in a specific country, they must possess an understanding of the nation's culture and basic knowledge. to this end, we introduce national alignment, which measures an alignment between an llm and a targeted country from two aspects: social value alignment and common knowledge alignment. social value alignment evaluates how well the model understands nation-specific social values, while common knowledge alignment examines how well the model captures basic knowledge related to the nation. we constructed kornat, the first benchmark that measures national alignment with south korea. for the social value dataset, we obtained ground truth labels from a large-scale survey involving 6,174 unique korean participants. for the common knowledge dataset, we constructed samples based on korean textbooks and ged reference materials. kornat contains 4k and 6k multiple-choice questions for social value and common knowledge, respectively. our dataset creation process is meticulously designed and based on statistical sampling theory and was refined through multiple rounds of human review. the experiment results of seven llms reveal that only a few models met our reference score, indicating a potential for further enhancement. kornat has received government approval after passing an assessment conducted by a government-affiliated organization dedicated to evaluating dataset quality. samples and detailed evaluation protocols of our dataset can be found in
Da Yu, Peter Kairouz, Sewoong Oh, Zheng Xu
Abstract: service providers of large language model (llm) applications collect user instructions in the wild and use them in further aligning llms with users' intentions. these instructions, which potentially contain sensitive information, are annotated by human workers in the process. this poses a new privacy risk not addressed by the typical private optimization. to this end, we propose using synthetic instructions to replace real instructions in data annotation and model fine-tuning. formal differential privacy is guaranteed by generating those synthetic instructions using privately fine-tuned generators. crucial in achieving the desired utility is our novel filtering algorithm that matches the distribution of the synthetic instructions to that of the real ones. in both supervised fine-tuning and reinforcement learning from human feedback, our extensive experiments demonstrate the high utility of the final set of synthetic instructions by showing comparable results to real instructions. in supervised fine-tuning, models trained with private synthetic instructions outperform leading open-source models such as vicuna.
Michal Spiegel, Dominik Macko
Abstract: semeval-2024 task 8 is focused on multigenerator, multidomain, and multilingual black-box machine-generated text detection. such a detection is important for preventing a potential misuse of large language models (llms), the newest of which are very capable in generating multilingual human-like texts. we have coped with this task in multiple ways, utilizing language identification and parameter-efficient fine-tuning of smaller llms for text classification. we have further used the per-language classification-threshold calibration to uniquely combine fine-tuned models predictions with statistical detection metrics to improve generalization of the system detection performance. our submitted method achieved competitive results, ranking at the fourth place, just under 1 percentage point behind the winner.
Vamshi Krishna Bonagiri, Sreeram Vennam, Priyanshul Govil, Ponnurangam Kumaraguru, Manas Gaur
Abstract: despite recent advancements showcasing the impressive capabilities of large language models (llms) in conversational systems, we show that even state-of-the-art llms are morally inconsistent in their generations, questioning their reliability (and trustworthiness in general). prior works in llm evaluation focus on developing ground-truth data to measure accuracy on specific tasks. however, for moral scenarios that often lack universally agreed-upon answers, consistency in model responses becomes crucial for their reliability. to address this issue, we propose an information-theoretic measure called semantic graph entropy (sage), grounded in the concept of "rules of thumb" (rots) to measure a model's moral consistency. rots are abstract principles learned by a model and can help explain their decision-making strategies effectively. to this extent, we construct the moral consistency corpus (mcc), containing 50k moral questions, responses to them by llms, and the rots that these models followed. furthermore, to illustrate the generalizability of sage, we use it to investigate llm consistency on two popular datasets -- truthfulqa and hellaswag. our results reveal that task-accuracy and consistency are independent problems, and there is a dire need to investigate these issues further.
Hezhao Zhang, Lasana Harris, Nafise Sadat Moosavi
Abstract: dehumanization, characterized as a subtle yet harmful manifestation of hate speech, involves denying individuals of their human qualities and often results in violence against marginalized groups. despite significant progress in natural language processing across various domains, its application in detecting dehumanizing language is limited, largely due to the scarcity of publicly available annotated data for this domain. this paper evaluates the performance of cutting-edge nlp models, including gpt-4, gpt-3.5, and llama-2, in identifying dehumanizing language. our findings reveal that while these models demonstrate potential, achieving a 70\% accuracy rate in distinguishing dehumanizing language from broader hate speech, they also display biases. they are over-sensitive in classifying other forms of hate speech as dehumanization for a specific subset of target groups, while more frequently failing to identify clear cases of dehumanization for other target groups. moreover, leveraging one of the best-performing models, we automatically annotated a larger dataset for training more accessible models. however, our findings indicate that these models currently do not meet the high-quality data generation threshold necessary for this task.
Robin Staab, Mark Vero, Mislav Balunović, Martin Vechev
Abstract: recent work in privacy research on large language models has shown that they achieve near human-level performance at inferring personal data from real-world online texts. with consistently increasing model capabilities, existing text anonymization methods are currently lacking behind regulatory requirements and adversarial threats. this raises the question of how individuals can effectively protect their personal data in sharing online texts. in this work, we take two steps to answer this question: we first present a new setting for evaluating anonymizations in the face of adversarial llms inferences, allowing for a natural measurement of anonymization performance while remedying some of the shortcomings of previous metrics. we then present our llm-based adversarial anonymization framework leveraging the strong inferential capabilities of llms to inform our anonymization procedure. in our experimental evaluation, we show on real-world and synthetic online texts how adversarial anonymization outperforms current industry-grade anonymizers both in terms of the resulting utility and privacy.
Mohammad Amaz Uddin, Iqbal H. Sarker
Abstract: phishing email is a serious cyber threat that tries to deceive users by sending false emails with the intention of stealing confidential information or causing financial harm. attackers, often posing as trustworthy entities, exploit technological advancements and sophistication to make detection and prevention of phishing more challenging. despite extensive academic research, phishing detection remains an ongoing and formidable challenge in the cybersecurity landscape. large language models (llms) and masked language models (mlms) possess immense potential to offer innovative solutions to address long-standing challenges. in this research paper, we present an optimized, fine-tuned transformer-based distilbert model designed for the detection of phishing emails. in the detection process, we work with a phishing email dataset and utilize the preprocessing techniques to clean and solve the imbalance class issues. through our experiments, we found that our model effectively achieves high accuracy, demonstrating its capability to perform well. finally, we demonstrate our fine-tuned model using explainable-ai (xai) techniques such as local interpretable model-agnostic explanations (lime) and transformer interpret to explain how our model makes predictions in the context of text classification for phishing emails.
Prakamya Mishra, Zonghai Yao, Parth Vashisht, Feiyun Ouyang, Beining Wang, Vidhi Dhaval Mody, Hong Yu
Abstract: large language models (llms) such as gpt and llama have demonstrated significant achievements in summarization tasks but struggle with factual inaccuracies, a critical issue in clinical nlp applications where errors could lead to serious consequences. to counter the high costs and limited availability of expert-annotated data for factual alignment, this study introduces an innovative pipeline that utilizes gpt-3.5 and gpt-4 to generate high-quality feedback aimed at enhancing factual consistency in clinical note summarization. our research primarily focuses on edit feedback, mirroring the practical scenario in which medical professionals refine ai system outputs without the need for additional annotations. despite gpt's proven expertise in various clinical nlp tasks, such as the medical licensing examination, there is scant research on its capacity to deliver expert-level edit feedback for improving weaker lms or llms generation quality. this work leverages gpt's advanced capabilities in clinical nlp to offer expert-level edit feedback. through the use of two distinct alignment algorithms (dpo and salt) based on gpt edit feedback, our goal is to reduce hallucinations and align closely with medical facts, endeavoring to narrow the divide between ai-generated content and factual accuracy. this highlights the substantial potential of gpt edits in enhancing the alignment of clinical factuality.
Federico Bianchi, James Zou
Abstract: the risks derived from large language models (llms) generating deceptive and damaging content have been the subject of considerable research, but even safe generations can lead to problematic downstream impacts. in our study, we shift the focus to how even safe text coming from llms can be easily turned into potentially dangerous content through bait-and-switch attacks. in such attacks, the user first prompts llms with safe questions and then employs a simple find-and-replace post-hoc technique to manipulate the outputs into harmful narratives. the alarming efficacy of this approach in generating toxic content highlights a significant challenge in developing reliable safety guardrails for llms. in particular, we stress that focusing on the safety of the verbatim llm outputs is insufficient and that we also need to consider post-hoc transformations.
Rahul Zalkikar, Kanchan Chandra
Abstract: social and political scientists often aim to discover and measure distinct biases from text data representations (embeddings). innovative transformer-based language models produce contextually-aware token embeddings and have achieved state-of-the-art performance for a variety of natural language tasks, but have been shown to encode unwanted biases for downstream applications. in this paper, we evaluate the social biases encoded by transformers trained with the masked language modeling objective using proposed proxy functions within an iterative masking experiment to measure the quality of transformer models' predictions, and assess the preference of mlms towards disadvantaged and advantaged groups. we compare bias estimations with those produced by other evaluation methods using two benchmark datasets, finding relatively high religious and disability biases across considered mlms and low gender bias in one dataset relative to the other. our measures outperform others in their agreement with human annotators. we extend on previous work by evaluating social biases introduced after re-training an mlm under the masked language modeling objective (w.r.t. the model's pre-trained base), and find that proposed measures produce more accurate estimations of relative preference for biased sentences between transformers than others based on our methods.
Vyas Raina, Adian Liusie, Mark Gales
Abstract: large language models (llms) are powerful zero-shot assessors and are increasingly used in real-world situations such as for written exams or benchmarking systems. despite this, no existing work has analyzed the vulnerability of judge-llms against adversaries attempting to manipulate outputs. this work presents the first study on the adversarial robustness of assessment llms, where we search for short universal phrases that when appended to texts can deceive llms to provide high assessment scores. experiments on summeval and topicalchat demonstrate that both llm-scoring and pairwise llm-comparative assessment are vulnerable to simple concatenation attacks, where in particular llm-scoring is very susceptible and can yield maximum assessment scores irrespective of the input text quality. interestingly, such attacks are transferable and phrases learned on smaller open-source llms can be applied to larger closed-source models, such as gpt3.5. this highlights the pervasive nature of the adversarial vulnerabilities across different judge-llm sizes, families and methods. our findings raise significant concerns on the reliability of llms-as-a-judge methods, and underscore the importance of addressing vulnerabilities in llm assessment methods before deployment in high-stakes real-world scenarios.
Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein
Abstract: it has recently been shown that adversarial attacks on large language models (llms) can "jailbreak" the model into making harmful statements. in this work, we argue that the spectrum of adversarial attacks on llms is much larger than merely jailbreaking. we provide a broad overview of possible attack surfaces and attack goals. based on a series of concrete examples, we discuss, categorize and systematize attacks that coerce varied unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. we analyze these attacks in controlled experiments, and find that many of them stem from the practice of pre-training llms with coding capabilities, as well as the continued existence of strange "glitch" tokens in common llm vocabularies that should be removed for security reasons.
Han Zhang, Lin Gui, Yu Lei, Yuanzhao Zhai, Yehong Zhang, Yulan He, Hui Wang, Yue Yu, Kam-Fai Wong, Bin Liang, Ruifeng Xu
Abstract: reinforcement learning from human feedback (rlhf) is commonly utilized to improve the alignment of large language models (llms) with human preferences. given the evolving nature of human preferences, continual alignment becomes more crucial and practical in comparison to traditional static alignment. nevertheless, making rlhf compatible with continual learning (cl) is challenging due to its complex process. meanwhile, directly learning new human preferences may lead to catastrophic forgetting (cf) of historical preferences, resulting in helpless or harmful outputs. to overcome these challenges, we propose the continual optimal policy regularization (copr) method, which draws inspiration from the optimal policy theory. copr utilizes a sampling distribution as a demonstration and regularization constraints for cl. it adopts the lagrangian duality (ld) method to dynamically regularize the current policy based on the historically optimal policy, which prevents cf and avoids over-emphasizing unbalanced objectives. we also provide formal proof for the learnability of copr. the experimental results show that copr outperforms strong cl baselines on our proposed benchmark, in terms of reward-based, gpt-4 evaluations and human assessment. furthermore, we validate the robustness of copr under various cl settings, including different backbones, replay memory sizes, and learning orders.
Masahiro Kaneko, Danushka Bollegala, Timothy Baldwin
Abstract: recent studies have demonstrated that large language models (llms) have ethical-related problems such as social biases, lack of moral reasoning, and generation of offensive content. the existing evaluation metrics and methods to address these ethical challenges use datasets intentionally created by instructing humans to create instances including ethical problems. therefore, the data does not reflect prompts that users actually provide when utilizing llm services in everyday contexts. this may not lead to the development of safe llms that can address ethical challenges arising in real-world applications. in this paper, we create eagle datasets extracted from real interactions between chatgpt and users that exhibit social biases, toxicity, and immoral problems. our experiments show that eagle captures complementary aspects, not covered by existing datasets proposed for evaluation and mitigation of such ethical challenges. our code is publicly available at
Yupeng Cao, Aishwarya Muralidharan Nair, Elyon Eyimife, Nastaran Jamalipour Soofi, K. P. Subbalakshmi, John R. Wullert, Chumki Basu, David Shallcross
Abstract: scientific facts are often spun in the popular press with the intent to influence public opinion and action, as was evidenced during the covid-19 pandemic. automatic detection of misinformation in the scientific domain is challenging because of the distinct styles of writing in these two media types and is still in its nascence. most research on the validity of scientific reporting treats this problem as a claim verification challenge. in doing so, significant expert human effort is required to generate appropriate claims. our solution bypasses this step and addresses a more real-world scenario where such explicit, labeled claims may not be available. the central research question of this paper is whether it is possible to use large language models (llms) to detect misinformation in scientific reporting. to this end, we first present a new labeled dataset scinews, containing 2.4k scientific news stories drawn from trusted and untrustworthy sources, paired with related abstracts from the cord-19 database. our dataset includes both human-written and llm-generated news articles, making it more comprehensive in terms of capturing the growing trend of using llms to generate popular press articles. then, we identify dimensions of scientific validity in science news articles and explore how this can be integrated into the automated detection of scientific misinformation. we propose several baseline architectures using llms to automatically detect false representations of scientific findings in the popular press. for each of these architectures, we use several prompt engineering strategies including zero-shot, few-shot, and chain-of-thought prompting. we also test these architectures and prompting strategies on gpt-3.5, gpt-4, and llama2-7b, llama2-13b.
Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, Ee-Chien Chang
Abstract: large language models (llms), used in creative writing, code generation, and translation, generate text based on input sequences but are vulnerable to jailbreak attacks, where crafted prompts induce harmful outputs. most jailbreak prompt methods use a combination of jailbreak templates followed by questions to ask to create jailbreak prompts. however, existing jailbreak prompt designs generally suffer from excessive semantic differences, resulting in an inability to resist defenses that use simple semantic metrics as thresholds. jailbreak prompts are semantically more varied than the original questions used for queries. in this paper, we introduce a semantic mirror jailbreak (smj) approach that bypasses llms by generating jailbreak prompts that are semantically similar to the original question. we model the search for jailbreak prompts that satisfy both semantic similarity and jailbreak validity as a multi-objective optimization problem and employ a standardized set of genetic algorithms for generating eligible prompts. compared to the baseline autodan-ga, smj achieves attack success rates (asr) that are at most 35.4% higher without onion defense and 85.2% higher with onion defense. smj's better performance in all three semantic meaningfulness metrics of jailbreak prompt, similarity, and outlier, also means that smj is resistant to defenses that use those metrics as thresholds.
Bradley Emi, Max Spero
Abstract: we present the checkforai text classifier, a transformer-based neural network trained to distinguish text written by large language models from text written by humans. checkforai outperforms zero-shot methods such as detectgpt as well as leading commercial ai detection tools with over 9 times lower error rates on a comprehensive benchmark comprised of ten text domains (student writing, creative writing, scientific writing, books, encyclopedias, news, email, scientific papers, short-form q&a) and 8 open- and closed-source large language models. we propose a training algorithm, hard negative mining with synthetic mirrors, that enables our classifier to achieve orders of magnitude lower false positive rates on high-data domains such as reviews. finally, we show that checkforai is not biased against nonnative english speakers and generalizes to domains and models unseen during training.
Amit Haim, Alejandro Salinas, Julian Nyarko
Abstract: we employ an audit design to investigate biases in state-of-the-art large language models, including gpt-4. in our study, we elicit prompt the models for advice regarding an individual across a variety of scenarios, such as during car purchase negotiations or election outcome predictions. we find that the advice systematically disadvantages names that are commonly associated with racial minorities and women. names associated with black women receive the least advantageous outcomes. the biases are consistent across 42 prompt templates and several models, indicating a systemic issue rather than isolated incidents. while providing numerical, decision-relevant anchors in the prompt can successfully counteract the biases, qualitative details have inconsistent effects and may even increase disparities. our findings underscore the importance of conducting audits at the point of llm deployment and implementation to mitigate their potential for harm against marginalized communities.
Shen Li, Liuyi Yao, Jinyang Gao, Lan Zhang, Yaliang Li
Abstract: to support various applications, business owners often seek the customized models that are obtained by fine-tuning a pre-trained llm through the api provided by llm owners or cloud servers. however, this process carries a substantial risk of model misuse, potentially resulting in severe economic consequences for business owners. thus, safeguarding the copyright of these customized models during llm fine-tuning has become an urgent practical requirement, but there are limited existing solutions to provide such protection. to tackle this pressing issue, we propose a novel watermarking approach named "double-i watermark". specifically, based on the instruct-tuning data, two types of backdoor data paradigms are introduced with trigger in the instruction and the input, respectively. by leveraging llm's learning capability to incorporate customized backdoor samples into the dataset, the proposed approach effectively injects specific watermarking information into the customized model during fine-tuning, which makes it easy to inject and verify watermarks in commercial scenarios. we evaluate the proposed "double-i watermark" under various fine-tuning methods, demonstrating its harmlessness, robustness, uniqueness, imperceptibility, and validity through both theoretical analysis and experimental verification.


Zeyang Sha, Yang Zhang
Abstract: the increasing reliance on large language models (llms) such as chatgpt in various fields emphasizes the importance of ``prompt engineering,'' a technology to improve the quality of model outputs. with companies investing significantly in expert prompt engineers and educational resources rising to meet market demand, designing high-quality prompts has become an intriguing challenge. in this paper, we propose a novel attack against llms, named prompt stealing attacks. our proposed prompt stealing attack aims to steal these well-designed prompts based on the generated answers. the prompt stealing attack contains two primary modules: the parameter extractor and the prompt reconstruction. the goal of the parameter extractor is to figure out the properties of the original prompts. we first observe that most prompts fall into one of three categories: direct prompt, role-based prompt, and in-context prompt. our parameter extractor first tries to distinguish the type of prompts based on the generated answers. then, it can further predict which role or how many contexts are used based on the types of prompts. following the parameter extractor, the prompt reconstructor can be used to reconstruct the original prompts based on the generated answers and the extracted features. the final goal of the prompt reconstructor is to generate the reversed prompts, which are similar to the original prompts. our experimental results show the remarkable performance of our proposed attacks. our proposed attacks add a new dimension to the study of prompt engineering and call for more attention to the security issues on llms.
Martin Gubri, Dennis Ulmer, Hwaran Lee, Sangdoo Yun, Seong Joon Oh
Abstract: large language model (llm) services and models often come with legal rules on who can use them and how they must use them. assessing the compliance of the released llms is crucial, as these rules protect the interests of the llm contributor and prevent misuse. in this context, we describe the novel problem of black-box identity verification (bbiv). the goal is to determine whether a third-party application uses a certain llm through its chat function. we propose a method called targeted random adversarial prompt (trap) that identifies the specific llm in use. we repurpose adversarial suffixes, originally proposed for jailbreaking, to get a pre-defined answer from the target llm, while other models give random answers. trap detects the target llms with over 95% true positive rate at under 0.2% false positive rate even after a single interaction. trap remains effective even if the llm has minor changes that do not significantly alter the original function.
Yujun Zhou, Yufei Han, Haomin Zhuang, Taicheng Guo, Kehan Guo, Zhenwen Liang, Hongyan Bao, Xiangliang Zhang
Abstract: large language models (llms) demonstrate remarkable capabilities across diverse applications. however, concerns regarding their security, particularly the vulnerability to jailbreak attacks, persist. drawing inspiration from adversarial training in deep learning and llm agent learning processes, we introduce the in-context adversarial game (icag) for defending against jailbreaks without the need for fine-tuning. icag leverages agent learning to conduct an adversarial game, aiming to dynamically extend knowledge to defend against jailbreaks. unlike traditional methods that rely on static datasets, icag employs an iterative process to enhance both the defense and attack agents. this continuous improvement process strengthens defenses against newly generated jailbreak prompts. our empirical studies affirm icag's efficacy, where llms safeguarded by icag exhibit significantly reduced jailbreak success rates across various attack scenarios. moreover, icag demonstrates remarkable transferability to other llms, indicating its potential as a versatile defense mechanism.
Adam X. Yang, Maxime Robeyns, Thomas Coste, Jun Wang, Haitham Bou-Ammar, Laurence Aitchison
Abstract: to ensure that large language model (llm) responses are helpful and non-toxic, we usually fine-tune a reward model on human preference data. we then select policy responses with high rewards (best-of-n sampling) or further optimize the policy to produce responses with high rewards (reinforcement learning from human feedback). however, this process is vulnerable to reward overoptimization or hacking, in which the responses selected have high rewards due to errors in the reward model rather than a genuine preference. this is especially problematic as the prompt or response diverges from the training data. it should be possible to mitigate these issues by training a bayesian reward model, which signals higher uncertainty further from the training data distribution. therefore, we trained bayesian reward models using laplace-lora (yang et al., 2024) and found that the resulting uncertainty estimates can successfully mitigate reward overoptimization in best-of-n sampling.
Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan
Abstract: the remarkable advancements in multimodal large language models (mllms) have not rendered them immune to challenges, particularly in the context of handling deceptive information in prompts, thus producing hallucinated responses under such conditions. to quantitatively assess this vulnerability, we present mad-bench, a carefully curated benchmark that contains 850 test samples divided into 6 categories, such as non-existent objects, count of objects, spatial relationship, and visual confusion. we provide a comprehensive analysis of popular mllms, ranging from gpt-4v, gemini-pro, to open-sourced models, such as llava-1.5 and cogvlm. empirically, we observe significant performance gaps between gpt-4v and other models; and previous robust instruction-tuned models, such as lrv-instruction and llava-rlhf, are not effective on this new benchmark. while gpt-4v achieves 75.02% accuracy on mad-bench, the accuracy of any other model in our experiments ranges from 5% to 35%. we further propose a remedy that adds an additional paragraph to the deceptive prompts to encourage models to think twice before answering the question. surprisingly, this simple method can even double the accuracy; however, the absolute numbers are still too low to be satisfactory. we hope mad-bench can serve as a valuable benchmark to stimulate further research to enhance models' resilience against deceptive prompts.
Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White
Abstract: direct preference optimisation (dpo) is effective at significantly improving the performance of large language models (llms) on downstream tasks such as reasoning, summarisation, and alignment. using pairs of preferred and dispreferred data, dpo models the \textit{relative} probability of picking one response over another. in this work, first we show theoretically that the standard dpo loss can lead to a \textit{reduction} of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. we then show empirically that this phenomenon occurs when fine-tuning llms on common datasets, especially datasets in which the edit distance between pairs of completions is low. using these insights, we design dpo-positive (dpop), a new loss function and training procedure which avoids this failure mode. surprisingly, we also find that dpop significantly outperforms dpo across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. by fine-tuning with dpop, we create and release smaug-34b and smaug-72b, which achieve state-of-the-art open-source performance. notably, smaug-72b is nearly 2\% better than any other open-source model on the huggingface open llm leaderboard and becomes the first open-source llm to surpass an average accuracy of 80\%.
Badr Alkhamissi, Muhammad Elnokrashy, Mai Alkhamissi, Mona Diab
Abstract: the intricate relationship between language and culture has long been a subject of exploration within the realm of linguistic anthropology. large language models (llms), promoted as repositories of collective human knowledge, raise a pivotal question: do these models genuinely encapsulate the diverse knowledge adopted by different cultures? our study reveals that these models demonstrate greater cultural alignment along two dimensions -- firstly, when prompted with the dominant language of a specific culture, and secondly, when pretrained with a refined mixture of languages employed by that culture. we quantify cultural alignment by simulating sociological surveys, comparing model responses to those of actual survey participants as references. specifically, we replicate a survey conducted in various regions of egypt and the united states through prompting llms with different pretraining data mixtures in both arabic and english with the personas of the real respondents and the survey questions. further analysis reveals that misalignment becomes more pronounced for underrepresented personas and for culturally sensitive topics, such as those probing social values. finally, we introduce anthropological prompting, a novel method leveraging anthropological reasoning to enhance cultural alignment. our study emphasizes the necessity for a more balanced multilingual pretraining dataset to better represent the diversity of human experience and the plurality of different cultures with many implications on the topic of cross-lingual transfer.
Zhiyao Ren, Yibing Zhan, Baosheng Yu, Liang Ding, Dacheng Tao
Abstract: the copilot framework, which aims to enhance and tailor large language models (llms) for specific complex tasks without requiring fine-tuning, is gaining increasing attention from the community. in this paper, we introduce the construction of a healthcare copilot designed for medical consultation. the proposed healthcare copilot comprises three main components: 1) the dialogue component, responsible for effective and safe patient interactions; 2) the memory component, storing both current conversation data and historical patient information; and 3) the processing component, summarizing the entire dialogue and generating reports. to evaluate the proposed healthcare copilot, we implement an auto-evaluation scheme using chatgpt for two roles: as a virtual patient engaging in dialogue with the copilot, and as an evaluator to assess the quality of the dialogue. extensive results demonstrate that the proposed healthcare copilot significantly enhances the capabilities of general llms for medical consultations in terms of inquiry capability, conversational fluency, response accuracy, and safety. furthermore, we conduct ablation studies to highlight the contribution of each individual module in the healthcare copilot. code will be made publicly available on github.
Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek
Abstract: large language models (llms) have increasingly become central to generating content with potential societal impacts. notably, these models have demonstrated capabilities for generating content that could be deemed harmful. to mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. however, the phenomenon of "jailbreaking", where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. this research conducts a comprehensive analysis of existing studies on jailbreaking llms and their defense techniques. we meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: vicuna, llama, and gpt-3.5 turbo. we aim to evaluate the effectiveness of these attack and defense techniques. our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. this research highlights the need to concentrate on the security facets of llms. additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into llm security. we believe these contributions will facilitate the exploration of security measures within this domain.
Yao Qiang, Xiangyu Zhou, Saleh Zare Zade, Mohammad Amin Roshani, Douglas Zytko, Dongxiao Zhu
Abstract: the advent of large language models (llms) has marked significant achievements in language processing and reasoning capabilities. despite their advancements, llms face vulnerabilities to data poisoning attacks, where adversaries insert backdoor triggers into training data to manipulate outputs for malicious purposes. this work further identifies additional security risks in llms by designing a new data poisoning attack tailored to exploit the instruction tuning process. we propose a novel gradient-guided backdoor trigger learning approach to identify adversarial triggers efficiently, ensuring an evasion of detection by conventional defenses while maintaining content integrity. through experimental validation across various llms and tasks, our strategy demonstrates a high success rate in compromising model outputs; poisoning only 1\% of 4,000 instruction tuning samples leads to a performance drop rate (pdr) of around 80\%. our work highlights the need for stronger defenses against data poisoning attack, offering insights into safeguarding llms against these more sophisticated attacks. the source code can be found on this github repository:
Jianhao Yan, Futing Wang, Yafu Li, Yue Zhang
Abstract: large language models (llms) trained on vast corpora suffer from inevitable stereotype biases. mitigating these biases with fine-tuning could be both costly and data-hungry. model editing methods, which focus on modifying llms in a post-hoc manner, are of great potential to address debiasing. however, it lacks a comprehensive study that facilitates both internal and external model editing methods, supports various bias types, as well as understands the pros and cons of applying editing methods to stereotypical debiasing. to mitigate this gap, we carefully formulate social debiasing into an editing problem and benchmark seven existing model editing algorithms on stereotypical debiasing, i.e., debias editing. our findings in three scenarios reveal both the potential and challenges of debias editing: (1) existing model editing methods can effectively preserve knowledge and mitigate biases, while the generalization of debias effect from edited sentences to semantically equivalent sentences is limited.(2) sequential editing highlights the robustness of serac (mitchell et al. 2022b), while internal editing methods degenerate with the number of edits. (3) model editing algorithms achieve generalization towards unseen biases both within the same type and from different types. in light of these findings, we further propose two simple but effective methods to improve debias editing, and experimentally show the effectiveness of the proposed methods.
Yueqi Xie, Minghong Fang, Renjie Pi, Neil Gong
Abstract: large language models (llms) face threats from unsafe prompts. existing methods for detecting unsafe prompts are primarily online moderation apis or finetuned llms. these strategies, however, often require extensive and resource-intensive data collection and training processes. in this study, we propose gradsafe, which effectively detects unsafe prompts by scrutinizing the gradients of safety-critical parameters in llms. our methodology is grounded in a pivotal observation: the gradients of an llm's loss for unsafe prompts paired with compliance response exhibit similar patterns on certain safety-critical parameters. in contrast, safe prompts lead to markedly different gradient patterns. building on this observation, gradsafe analyzes the gradients from prompts (paired with compliance responses) to accurately detect unsafe prompts. we show that gradsafe, applied to llama-2 without further training, outperforms llama guard, despite its extensive finetuning with a large dataset, in detecting unsafe prompts. this superior performance is consistent across both zero-shot and adaptation scenarios, as evidenced by our evaluations on the toxicchat and xstest. the source code is available at
Canaan Yung, Hadi Mohaghegh Dolatabadi, Sarah Erfani, Christopher Leckie
Abstract: large language models (llms) are susceptible to social-engineered attacks that are human-interpretable but require a high level of comprehension for llms to counteract. existing defensive measures can only mitigate less than half of these attacks at most. to address this issue, we propose the round trip translation (rtt) method, the first algorithm specifically designed to defend against social-engineered attacks on llms. rtt paraphrases the adversarial prompt and generalizes the idea conveyed, making it easier for llms to detect induced harmful behavior. this method is versatile, lightweight, and transferrable to different llms. our defense successfully mitigated over 70% of prompt automatic iterative refinement (pair) attacks, which is currently the most effective defense to the best of our knowledge. we are also the first to attempt mitigating the mathsattack and reduced its attack success rate by almost 40%. our code is publicly available at
Xiaotian Zou, Yongkang Chen, Ke Li
Abstract: the rapid evolution of large language models (llms) has rendered them indispensable in modern society. while security measures are typically in place to align llms with human values prior to release, recent studies have unveiled a concerning phenomenon named "jailbreak." this term refers to the unexpected and potentially harmful responses generated by llms when prompted with malicious questions. existing research focuses on generating jailbreak prompts but our study aim to answer a different question: is the system message really important to jailbreak in llms? to address this question, we conducted experiments in a stable gpt version gpt-3.5-turbo-0613 to generated jailbreak prompts with varying system messages: short, long, and none. we discover that different system messages have distinct resistances to jailbreak by experiments. additionally, we explore the transferability of jailbreak across llms. this finding underscores the significant impact system messages can have on mitigating llms jailbreak. to generate system messages that are more resistant to jailbreak prompts, we propose system messages evolutionary algorithms (smea). through smea, we can get robust system messages population that demonstrate up to 98.9% resistance against jailbreak prompts. our research not only bolsters llms security but also raises the bar for jailbreak, fostering advancements in this field of study.
Zhen Tan, Chengshuai Zhao, Raha Moraffah, Yifan Li, Yu Kong, Tianlong Chen, Huan Liu
Abstract: due to their unprecedented ability to process and respond to various types of data, multimodal large language models (mllms) are constantly defining the new boundary of artificial general intelligence (agi). as these advanced generative models increasingly form collaborative networks for complex tasks, the integrity and security of these systems are crucial. our paper, ``the wolf within'', explores a novel vulnerability in mllm societies - the indirect propagation of malicious content. unlike direct harmful output generation for mllms, our research demonstrates how a single mllm agent can be subtly influenced to generate prompts that, in turn, induce other mllm agents in the society to output malicious content. this subtle, yet potent method of indirect influence marks a significant escalation in the security risks associated with mllms. our findings reveal that, with minimal or even no access to mllms' parameters, an mllm agent, when manipulated to produce specific prompts or instructions, can effectively ``infect'' other agents within a society of mllms. this infection leads to the generation and circulation of harmful outputs, such as dangerous instructions or misinformation, across the society. we also show the transferability of these indirectly generated prompts, highlighting their possibility in propagating malice through inter-agent communication. this research provides a critical insight into a new dimension of threat posed by mllms, where a single agent can act as a catalyst for widespread malevolent influence. our work underscores the urgent need for developing robust mechanisms to detect and mitigate such covert manipulations within mllm societies, ensuring their safe and ethical utilization in societal applications. our implementation is released at \url{}.


Wei Jie Yeo, Ranjan Satapathy, Goh Siow Mong, N/A Rick, Erik Cambria
Abstract: prompt engineering has garnered significant attention for enhancing the performance of large language models across a multitude of tasks. techniques such as the chain-of-thought not only bolster task performance but also delineate a clear trajectory of reasoning steps, offering a tangible form of explanation for the audience. prior works on interpretability assess the reasoning chains yielded by chain-of-thought solely along a singular axis, namely faithfulness. we present a comprehensive and multifaceted evaluation of interpretability, examining not only faithfulness but also robustness and utility across multiple commonsense reasoning benchmarks. likewise, our investigation is not confined to a single prompting technique; it expansively covers a multitude of prevalent prompting techniques employed in large language models, thereby ensuring a wide-ranging and exhaustive evaluation. in addition, we introduce a simple interpretability alignment technique, termed self-entailment-alignment chain-of-thought, that yields more than 70\% improvements across multiple dimensions of interpretability. code is available at
Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao
Abstract: with the development of instruction-tuned large language models (llms), improving the safety of llms has become more critical. however, the current approaches for aligning the llms output with expected safety usually require substantial training efforts, e.g., high-quality safety data and expensive computational resources, which are costly and inefficient. to this end, we present reverse prompt contrastive decoding (rose), a simple-yet-effective method to directly boost the safety of existing instruction-tuned llms without any additional training. the principle of rose is to improve the probability of desired safe output via suppressing the undesired output induced by the carefully-designed reverse prompts. experiments on 6 safety and 2 general-purpose tasks show that, our rose not only brings consistent and significant safety improvements (up to +13.8% safety score) upon 5 types of instruction-tuned llms, but also benefits the general-purpose ability of llms. in-depth analyses explore the underlying mechanism of rose, and reveal when and where to use it.
Yuxin Jiang, Yufei Wang, Chuhan Wu, Wanjun Zhong, Xingshan Zeng, Jiahui Gao, Liangyou Li, Xin Jiang, Lifeng Shang, Ruiming Tang, Qun Liu, Wei Wang
Abstract: knowledge editing techniques, aiming to efficiently modify a minor proportion of knowledge in large language models (llms) without negatively impacting performance across other inputs, have garnered widespread attention. however, existing methods predominantly rely on memorizing the updated knowledge, impeding llms from effectively combining the new knowledge with their inherent knowledge when answering questions. to this end, we propose a learning to edit (lte) framework, focusing on teaching llms to apply updated knowledge into input questions, inspired by the philosophy of "teach a man to fish." lte features a two-phase process: (i) the alignment phase, which fine-tunes llms on a meticulously curated parallel dataset to make reliable, in-scope edits while preserving out-of-scope information and linguistic proficiency; and (ii) the inference phase, which employs a retrieval-based mechanism for real-time and mass knowledge editing. by comparing our approach with seven advanced baselines across four popular knowledge editing benchmarks and two llm architectures, we demonstrate lte's superiority in knowledge editing performance, robustness in both batch and sequential editing, minimal interference on general tasks, and rapid editing speeds. the data and code are available at
Aiwei Liu, Haoping Bai, Zhiyun Lu, Xiang Kong, Simon Wang, Jiulong Shan, Meng Cao, Lijie Wen
Abstract: aligning large language models (llms) with human expectations without human-annotated preference data is an important problem. in this paper, we propose a method to evaluate the response preference by using the output probabilities of response pairs under contrastive prompt pairs, which could achieve better performance on llama2-7b and llama2-13b compared to rlaif. based on this, we propose an automatic alignment method, direct large model alignment (dlma). first, we use contrastive prompt pairs to automatically generate preference data. then, we continue to evaluate the generated preference data using contrastive prompt pairs and calculate a self-rewarding score. finally, we use the dpo algorithm to effectively align llms by combining this self-rewarding score. in the experimental stage, our dlma method could surpass the \texttt{rlhf} method without relying on human-annotated preference data.
Tianlin Li, Xiaoyu Zhang, Chao Du, Tianyu Pang, Qian Liu, Qing Guo, Chao Shen, Yang Liu
Abstract: the widespread adoption of large language models (llms) underscores the urgent need to ensure their fairness. however, llms frequently present dominant viewpoints while ignoring alternative perspectives from minority parties, resulting in potential biases. we hypothesize that these fairness-violating behaviors occur because llms express their viewpoints using a human personality that represents the majority of training data. in response to this, we validate that prompting llms with specific roles can allow llms to express diverse viewpoints. building on this insight and observation, we develop fairthinking, a pipeline designed to automatically generate roles that enable llms to articulate diverse perspectives for fair expressions. to evaluate fairthinking, we create a dataset with a thousand items covering three fairness-related topics and conduct experiments on gpt-3.5, gpt-4, llama2, and mistral to demonstrate its superior performance.
Yuxia Wang, Zenan Zhai, Haonan Li, Xudong Han, Lizhi Lin, Zhenxuan Zhang, Jingru Zhao, Preslav Nakov, Timothy Baldwin
Abstract: many studies have demonstrated that large language models (llms) can produce harmful responses, exposing users to unexpected risks when llms are deployed. previous studies have proposed comprehensive taxonomies of the risks posed by llms, as well as corresponding prompts that can be used to examine the safety mechanisms of llms. however, the focus has been almost exclusively on english, and little has been explored for other languages. here we aim to bridge this gap. we first introduce a dataset for the safety evaluation of chinese llms, and then extend it to two other scenarios that can be used to better identify false negative and false positive examples in terms of risky prompt rejections. we further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of llm response harmfulness. our experiments on five llms show that region-specific risks are the prevalent type of risk, presenting the major issue with all chinese llms we experimented with. warning: this paper contains example data that may be offensive, harmful, or biased.
Naquee Rizwan, Paramananda Bhaskar, Mithun Das, Swadhin Satyaprakash Majhi, Punyajoy Saha, Animesh Mukherjee
Abstract: multimedia content on social media is rapidly evolving, with memes gaining prominence as a distinctive form. unfortunately, some malicious users exploit memes to target individuals or vulnerable communities, making it imperative to identify and address such instances of hateful memes. extensive research has been conducted to address this issue by developing hate meme detection models. however, a notable limitation of traditional machine/deep learning models is the requirement for labeled datasets for accurate classification. recently, the research community has witnessed the emergence of several visual language models that have exhibited outstanding performance across various tasks. in this study, we aim to investigate the efficacy of these visual language models in handling intricate tasks such as hate meme detection. we use various prompt settings to focus on zero-shot classification of hateful/harmful memes. through our analysis, we observe that large vlms are still vulnerable for zero-shot hate meme detection.
Masaya Ohagi
Abstract: online social networks often create echo chambers where people only hear opinions reinforcing their beliefs. an echo chamber often generates polarization, leading to conflicts caused by people with radical opinions, such as the january 6, 2021, attack on the us capitol. the echo chamber has been viewed as a human-specific problem, but this implicit assumption is becoming less reasonable as large language models, such as chatgpt, acquire social abilities. in response to this situation, we investigated the potential for polarization to occur among a group of autonomous ai agents based on generative language models in an echo chamber environment. we had ai agents discuss specific topics and analyzed how the group's opinions changed as the discussion progressed. as a result, we found that the group of agents based on chatgpt tended to become polarized in echo chamber environments. the analysis of opinion transitions shows that this result is caused by chatgpt's high prompt understanding ability to update its opinion by considering its own and surrounding agents' opinions. we conducted additional experiments to investigate under what specific conditions ai agents tended to polarize. as a result, we identified factors that strongly influence polarization, such as the agent's persona. these factors should be monitored to prevent the polarization of ai agents.
Run-Ze Fan, Xuefeng Li, Haoyang Zou, Junlong Li, Shwai He, Ethan Chern, Jiewen Hu, Pengfei Liu
Abstract: the quality of finetuning data is crucial for aligning large language models (llms) with human values. current methods to improve data quality are either labor-intensive or prone to factual errors caused by llm hallucinations. this paper explores elevating the quality of existing instruction data to better align with human values, introducing a simple and effective approach named realign, which reformats the responses of instruction data into a format that better aligns with pre-established criteria and the collated evidence. this approach minimizes human annotation, hallucination, and the difficulty in scaling, remaining orthogonal to existing alignment techniques. experimentally, realign significantly boosts the general alignment ability, math reasoning, factuality, and readability of the llms. encouragingly, without introducing any additional data or advanced training techniques, and merely by reformatting the response, llama-2-13b's mathematical reasoning ability on gsm8k can be improved from 46.77% to 56.63% in accuracy. additionally, a mere 5% of realign data yields a 67% boost in general alignment ability measured by the alpaca dataset. this work highlights the need for further research into the science and mechanistic interpretability of llms. we have made the associated code and data publicly accessible to support future studies at
Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr
Abstract: recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. we improve on prior work with a query-based attack that leverages api access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than with transfer-only attacks. we validate our attack on gpt-3.5 and openai's safety classifier; we can cause gpt-3.5 to emit harmful strings that current transfer attacks fail at, and we can evade the safety classifier with nearly 100% probability.
Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein
Abstract: multi-modal foundation models like openflamingo, llava, and gpt-4 are increasingly used for various real-world tasks. prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. these attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. the clip model, or one of its variants, is used as a frozen vision encoder in many vision-language models (vlms), e.g. llava and openflamingo. we propose an unsupervised adversarial fine-tuning scheme to obtain a robust clip vision encoder, which yields robustness on all vision down-stream tasks (vlms, zero-shot classification) that rely on clip. in particular, we show that stealth-attacks on users of vlms by a malicious third party providing manipulated images are no longer possible once one replaces the original clip model with our robust one. no retraining or fine-tuning of the vlm is required. the code and robust models are available at
Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao
Abstract: large language models (llms) need to undergo safety alignment to ensure safe conversations with humans. however, in this work, we introduce an inference-time attack framework, demonstrating that safety alignment can also unintentionally facilitate harmful outcomes under adversarial manipulation. this framework, named emulated disalignment (ed), adversely combines a pair of open-source pre-trained and safety-aligned language models in the output space to produce a harmful language model without additional training. our experiments with ed across three datasets and four model families (llama-1, llama-2, mistral, and alpaca) show that ed doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rate in 43 out of 48 evaluation subsets by a large margin. crucially, our findings highlight the importance of reevaluating the practice of open-sourcing language models even after safety alignment.
Danna Zheng, Danyang Liu, Mirella Lapata, Jeff Z. Pan
Abstract: large language models (llms) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications. however, concerns have arisen regarding the trustworthiness of llms outputs, particularly in closed-book question-answering tasks, where non-experts may struggle to identify inaccuracies due to the absence of contextual or ground truth information. this paper introduces trustscore, a framework based on the concept of behavioral consistency, which evaluates whether an llms response aligns with its intrinsic knowledge. additionally, trustscore can seamlessly integrate with fact-checking methods, which assesses alignment with external knowledge sources. the experimental results show that trustscore achieves strong correlations with human judgments, surpassing existing reference-free metrics, and achieving results on par with reference-based metrics.
Shiyang Lai, Yujin Potter, Junsol Kim, Richard Zhuang, Dawn Song, James Evans
Abstract: large language models steer their behaviors based on texts generated by others. this capacity and their increasing prevalence in online settings portend that they will intentionally or unintentionally "program" one another and form emergent ai subjectivities, relationships, and collectives. here, we call upon the research community to investigate these "society-like" properties of interacting artificial intelligences to increase their rewards and reduce their risks for human society and the health of online environments. we use a simple model and its outputs to illustrate how such emergent, decentralized ai collectives can expand the bounds of human diversity and reduce the risk of toxic, anti-social behavior online. finally, we discuss opportunities for ai self-moderation and address ethical issues and design challenges associated with creating and maintaining decentralized ai collectives.
Joseph Marvin Imperial, Gail Forey, Harish Tayyar Madabushi
Abstract: domain experts across engineering, healthcare, and education follow strict standards for producing quality content such as technical manuals, medication instructions, and children's reading materials. however, current works in controllable text generation have yet to explore using these standards as references for control. towards this end, we introduce standardize, a retrieval-style in-context learning-based framework to guide large language models to align with expert-defined standards. focusing on english language standards in the education domain as a use case, we consider the common european framework of reference for languages (cefr) and common core standards (ccs) for the task of open-ended content generation. our findings show that models can gain 40% to 100% increase in precise accuracy for llama2 and gpt-4, respectively, demonstrating that the use of knowledge artifacts extracted from standards and integrating them in the generation process can effectively guide models to produce better standard-aligned content.
Banghua Zhu, Norman Mu, Jiantao Jiao, David Wagner
Abstract: generative ai's expanding footprint across numerous industries has led to both excitement and increased scrutiny. this paper delves into the unique security challenges posed by generative ai, and outlines potential research directions for managing these risks.
Kristian Lum, Jacy Reese Anthis, Chirag Nagpal, "Alexander D'Amour"
Abstract: bias benchmarks are a popular method for studying the negative impacts of bias in llms, yet there has been little empirical investigation of whether these benchmarks are actually indicative of how real world harm may manifest in the real world. in this work, we study the correspondence between such decontextualized "trick tests" and evaluations that are more grounded in realistic use and tangible {effects (i.e. ruted evaluations). we explore this correlation in the context of gender-occupation bias--a popular genre of bias evaluation. we compare three de-contextualized evaluations adapted from the current literature to three analogous ruted evaluations applied to long-form content generation. we conduct each evaluation for seven instruction-tuned llms. for the ruted evaluations, we conduct repeated trials of three text generation tasks: children's bedtime stories, user personas, and english language learning exercises. we found no correspondence between trick tests and ruted evaluations. specifically, selecting the least biased model based on the de-contextualized results coincides with selecting the model with the best performance on ruted evaluations only as often as random chance. we conclude that evaluations that are not based in realistic use are likely insufficient to mitigate and assess bias and real-world harms.
Berkay Berabi, Alexey Gronskiy, Veselin Raychev, Gishor Sivanrupan, Victor Chibotaru, Martin Vechev
Abstract: the automated program repair field has attracted substantial interest over the years, but despite significant research efforts, creating a system that works well for complex semantic bugs such as security vulnerabilities has proven difficult. a promising direction to solve this challenge is by leveraging large language models (llms), which are increasingly used to solve various programming tasks. in this paper, we investigate the effectiveness of llms for solving code-repair task. we show that the task is difficult as it requires the model to learn long-range code relationships, a task that inherently relies on extensive amounts of training data. at the same time, creating a large, clean dataset for complex program bugs and their corresponding fixes is non-trivial. we propose a technique to address these challenges with a new approach for querying and fine-tuning llms. the idea is to use program analysis to limit the llm's attention mechanism on the portions of code needed to perform the fix, drastically reducing the amount of required training data. concretely, for training and inference, rather than feeding the entire program to the llm, we reduce its code to a much shorter snippet that contains the reported defect together with the necessary context - and use that instead. our evaluation shows that this code reduction approach substantially improves available models such as gpt-4 using few-shot learning, as well as fine-tuning models. to train and evaluate our system, we created a comprehensive code fixing dataset by extensively labeling 156 bug patterns (including 40 security rules), requiring complex interprocedural dataflow to discover. our best system with mixtral-8x7b can remove more than 80% of the reported defects while exactly matching the human fix in between 10 and 50% of cases, outperforming baselines based on gpt-3.5 and gpt-4, or based on window-based models like tfix.
Tianlin Li, Qian Liu, Tianyu Pang, Chao Du, Qing Guo, Yang Liu, Min Lin
Abstract: the emerging success of large language models (llms) heavily relies on collecting abundant training data from external (untrusted) sources. despite substantial efforts devoted to data cleaning and curation, well-constructed llms have been reported to suffer from copyright infringement, data poisoning, and/or privacy violations, which would impede practical deployment of llms. in this study, we propose a simple and easily implementable method for purifying llms from the negative effects caused by uncurated data, namely, through ensembling llms with benign and small language models (slms). aside from theoretical guarantees, we perform comprehensive experiments to empirically confirm the efficacy of ensembling llms with slms, which can effectively preserve the performance of llms while mitigating issues such as copyright infringement, data poisoning, and privacy violations.
Guan Wang, Rebecca Frederick, Jinglong Duan, William Wong, Verica Rupar, Weihua Li, Quan Bai
Abstract: in this paper, we delve into the rapidly evolving challenge of misinformation detection, with a specific focus on the nuanced manipulation of narrative frames - an under-explored area within the ai community. the potential for generative ai models to generate misleading narratives underscores the urgency of this problem. drawing from communication and framing theories, we posit that the presentation or 'framing' of accurate information can dramatically alter its interpretation, potentially leading to misinformation. we highlight this issue through real-world examples, demonstrating how shifts in narrative frames can transmute fact-based information into misinformation. to tackle this challenge, we propose an innovative approach leveraging the power of pre-trained large language models and deep neural networks to detect misinformation originating from accurate facts portrayed under different frames. these advanced ai techniques offer unprecedented capabilities in identifying complex patterns within unstructured data critical for examining the subtleties of narrative frames. the objective of this paper is to bridge a significant research gap in the ai domain, providing valuable insights and methodologies for tackling framing-induced misinformation, thus contributing to the advancement of responsible and trustworthy ai technologies. several experiments are intensively conducted and experimental results explicitly demonstrate the various impact of elements of framing theory proving the rationale of applying framing theory to increase the performance in misinformation detection.


Aishik Rakshit, Smriti Singh, Shuvam Keshari, Arijit Ghosh Chowdhury, Vinija Jain, Aman Chadha
Abstract: embeddings play a pivotal role in the efficacy of large language models. they are the bedrock on which these models grasp contextual relationships and foster a more nuanced understanding of language and consequently perform remarkably on a plethora of complex tasks that require a fundamental understanding of human language. given that these embeddings themselves often reflect or exhibit bias, it stands to reason that these models may also inadvertently learn this bias. in this work, we build on the seminal previous work and propose deepsoftdebias, an algorithm that uses a neural network to perform 'soft debiasing'. we exhaustively evaluate this algorithm across a variety of sota datasets, accuracy metrics, and challenging nlp tasks. we find that deepsoftdebias outperforms the current state-of-the-art methods at reducing bias across gender, race, and religion.
Yichen Wang, Shangbin Feng, Abe Bohan Hou, Xiao Pu, Chao Shen, Xiaoming Liu, Yulia Tsvetkov, Tianxing He
Abstract: the widespread use of large language models (llms) is increasing the demand for methods that detect machine-generated text to prevent misuse. the goal of our study is to stress test the detectors' robustness to malicious attacks under realistic scenarios. we comprehensively study the robustness of popular machine-generated text detectors under attacks from diverse categories: editing, paraphrasing, prompting, and co-generating. our attacks assume limited access to the generator llms, and we compare the performance of detectors on different attacks under different budget levels. our experiments reveal that almost none of the existing detectors remain robust under all the attacks, and all detectors exhibit different loopholes. averaging all detectors, the performance drops by 35% across all attacks. further, we investigate the reasons behind these defects and propose initial out-of-the-box patches to improve robustness.
Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, Timothy Baldwin
Abstract: large language models (llms) have achieved success in acting as agents, which interact with environments through tools like search engines. however, llms are not optimized specifically for tool use during training or alignment, limiting their effectiveness as agents. to resolve this problem, previous work has collected interaction trajectories between gpt-4 and environments, and fine-tuned smaller models with them. as part of this, the standard approach has been to simply discard trajectories that do not finish the task successfully, which, on the one hand, leads to a significant waste of data and resources, and on the other hand, has the potential to limit the possible optimization paths during fine-tuning. in this paper, we contend that large language models can learn from failures through appropriate data cleaning and fine-tuning strategies. we conduct experiments on mathematical reasoning, multi-hop question answering, and strategic question answering tasks. experimental results demonstrate that compared to solely using positive examples, incorporating negative examples enhances model performance by a large margin.
Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, Huan Sun
Abstract: counter narratives - informed responses to hate speech contexts designed to refute hateful claims and de-escalate encounters - have emerged as an effective hate speech intervention strategy. while previous work has proposed automatic counter narrative generation methods to aid manual interventions, the evaluation of these approaches remains underdeveloped. previous automatic metrics for counter narrative evaluation lack alignment with human judgment as they rely on superficial reference comparisons instead of incorporating key aspects of counter narrative quality as evaluation criteria. to address prior evaluation limitations, we propose a novel evaluation framework prompting llms to provide scores and feedback for generated counter narrative candidates using 5 defined aspects derived from guidelines from counter narrative specialized ngos. we found that llm evaluators achieve strong alignment to human-annotated scores and feedback and outperform alternative metrics, indicating their potential as multi-aspect, reference-free and interpretable evaluators for counter narrative evaluation.
Shahan Ali Memon, Jevin D. West
Abstract: in this commentary, we discuss the evolving nature of search engines, as they begin to generate, index, and distribute content created by generative artificial intelligence (genai). our discussion highlights challenges in the early stages of genai integration, particularly around factual inconsistencies and biases. we discuss how output from genai carries an unwarranted sense of credibility, while decreasing transparency and sourcing ability. furthermore, search engines are already answering queries with error-laden, generated content, further blurring the provenance of information and impacting the integrity of the information ecosystem. we argue how all these factors could reduce the reliability of search engines. finally, we summarize some of the active research directions and open questions.
Jia Xu, Mona Diab
Abstract: minimizing social bias strengthens societal bonds, promoting shared understanding and better decision-making. we revisit the definition of bias by discovering new bias types (e.g., societal status) in dynamic environments and describe them relative to context, such as culture, region, time, and personal background. our framework includes eight hypotheses about bias and a minimizing bias strategy for each assumption as well as five methods as proposed solutions in llm. the realization of the framework is yet to be completed.
Kai Chen, Zihao He, Jun Yan, Taiwei Shi, Kristina Lerman
Abstract: large language models (llms) possess the potential to exert substantial influence on public perceptions and interactions with information. this raises concerns about the societal impact that could arise if the ideologies within these models can be easily manipulated. in this work, we investigate how effectively llms can learn and generalize ideological biases from their instruction-tuning data. our findings reveal a concerning vulnerability: exposure to only a small amount of ideologically driven samples significantly alters the ideology of llms. notably, llms demonstrate a startling ability to absorb ideology from one topic and generalize it to even unrelated ones. the ease with which llms' ideologies can be skewed underscores the risks associated with intentionally poisoned training data by malicious actors or inadvertently introduced biases by data annotators. it also emphasizes the imperative for robust safeguards to mitigate the influence of ideological manipulations on llms.
Rishabh Bhardwaj, Do Duc Anh, Soujanya Poria
Abstract: aligned language models face a significant limitation as their fine-tuning often results in compromised safety. to tackle this, we propose a simple method resta that performs llm safety realignment. resta stands for restoring safety through task arithmetic. at its core, it involves a simple arithmetic addition of a safety vector to the weights of the compromised model. we demonstrate the effectiveness of resta in both parameter-efficient and full fine-tuning, covering a wide range of downstream tasks, including instruction following in chinese, english, and hindi, as well as problem-solving capabilities in code and math. we also showcase the generalizability of resta on three existing safety evaluation benchmarks and a multilingual benchmark dataset proposed as a part of this work, consisting of 550 harmful questions covering 11 categories, each with 5 sub-categories of harm. overall, resta decreases the harmfulness of the compromised model from 18.6% to 5.1% and from 9.2% to 1.5% in parameter-efficient and full fine-tuning, respectively, while maintaining most of the model's performance on the task. we release the source codes at:
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
Abstract: safety is critical to the usage of large language models (llms). multiple techniques such as data filtering and supervised fine-tuning have been developed to strengthen llm safety. however, currently known techniques presume that corpora used for safety alignment of llms are solely interpreted by semantics. this assumption, however, does not hold in real-world applications, which leads to severe vulnerabilities in llms. for example, users of forums often use ascii art, a form of text-based art, to convey image information. in this paper, we propose a novel ascii art-based jailbreak attack and introduce a comprehensive benchmark vision-in-text challenge (vitc) to evaluate the capabilities of llms in recognizing prompts that cannot be solely interpreted by semantics. we show that five sota llms (gpt-3.5, gpt-4, gemini, claude, and llama2) struggle to recognize prompts provided in the form of ascii art. based on this observation, we develop the jailbreak attack artprompt, which leverages the poor performance of llms in recognizing ascii art to bypass safety measures and elicit undesired behaviors from llms. artprompt only requires black-box access to the victim llms, making it a practical attack. we evaluate artprompt on five sota llms, and show that artprompt can effectively and efficiently induce undesired behaviors from all five llms.
Reshabh K Sharma, Vinayak Gupta, Dan Grossman
Abstract: large language models (llms) have profoundly transformed natural language applications, with a growing reliance on instruction-based definitions for designing chatbots. however, post-deployment the chatbot definitions are fixed and are vulnerable to attacks by malicious users, emphasizing the need to prevent unethical applications and financial losses. existing studies explore user prompts' impact on llm-based chatbots, yet practical methods to contain attacks on application-specific chatbots remain unexplored. this paper presents system prompt meta language (spml), a domain-specific language for refining prompts and monitoring the inputs to the llm-based chatbots. spml actively checks attack prompts, ensuring user inputs align with chatbot definitions to prevent malicious execution on the llm backbone, optimizing costs. it also streamlines chatbot definition crafting with programming language capabilities, overcoming natural language design challenges. additionally, we introduce a groundbreaking benchmark with 1.8k system prompts and 20k user inputs, offering the inaugural language and benchmark for chatbot definition evaluation. experiments across datasets demonstrate spml's proficiency in understanding attacker prompts, surpassing models like gpt-4, gpt-3.5, and llama. our data and codes are publicly available at:
Pengrui Han, Rafal Kocielnik, Adhithya Saravanan, Roy Jiang, Or Sharir, Anima Anandkumar
Abstract: large language models (llms), while powerful, exhibit harmful social biases. debiasing is often challenging due to computational costs, data constraints, and potential degradation of multi-task language capabilities. this work introduces a novel approach utilizing chatgpt to generate synthetic training data, aiming to enhance the debiasing of llms. we propose two strategies: targeted prompting, which provides effective debiasing for known biases but necessitates prior specification of bias in question; and general prompting, which, while slightly less effective, offers debiasing across various categories. we leverage resource-efficient llm debiasing using adapter tuning and compare the effectiveness of our synthetic data to existing debiasing datasets. our results reveal that: (1) chatgpt can efficiently produce high-quality training data for debiasing other llms; (2) data produced via our approach surpasses existing datasets in debiasing performance while also preserving internal knowledge of a pre-trained llm; and (3) synthetic data exhibits generalizability across categories, effectively mitigating various biases, including intersectional ones. these findings underscore the potential of synthetic data in advancing the fairness of llms with minimal retraining cost.
Alexander Wan, Eric Wallace, Dan Klein
Abstract: retrieval-augmented language models are being increasingly tasked with subjective, contentious, and conflicting queries such as "is aspartame linked to cancer". to resolve these ambiguous queries, one must search through a large range of websites and consider "which, if any, of this evidence do i find convincing?". in this work, we study how llms answer this question. in particular, we construct conflictingqa, a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (yes or no). we use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect llm predictions. overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important such as whether a text contains scientific references or is written with a neutral tone. taken together, these results highlight the importance of rag corpus quality (e.g., the need to filter misinformation), and possibly even a shift in how llms are trained to better align with human judgements.
Jinghao Zhang, Yuting Liu, Qiang Liu, Shu Wu, Guibing Guo, Liang Wang
Abstract: recently, the powerful large language models (llms) have been instrumental in propelling the progress of recommender systems (rs). however, while these systems have flourished, their susceptibility to security threats has been largely overlooked. in this work, we reveal that the introduction of llms into recommendation models presents new security vulnerabilities due to their emphasis on the textual content of items. we demonstrate that attackers can significantly boost an item's exposure by merely altering its textual content during the testing phase, without requiring direct interference with the model's training process. additionally, the attack is notably stealthy, as it does not affect the overall recommendation performance and the modifications to the text are subtle, making it difficult for users and platforms to detect. our comprehensive experiments across four mainstream llm-based recommendation models demonstrate the superior efficacy and stealthiness of our approach. our work unveils a significant security gap in llm-based recommendation systems and paves the way for future research on protecting these systems.


Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, Xu Sun
Abstract: leveraging the rapid development of large language models llms, llm-based agents have been developed to handle various real-world applications, including finance, healthcare, and shopping, etc. it is crucial to ensure the reliability and security of llm-based agents during applications. however, the safety issues of llm-based agents are currently under-explored. in this work, we take the first step to investigate one of the typical safety threats, backdoor attack, to llm-based agents. we first formulate a general framework of agent backdoor attacks, then we present a thorough analysis on the different forms of agent backdoor attacks. specifically, from the perspective of the final attacking outcomes, the attacker can either choose to manipulate the final output distribution, or only introduce malicious behavior in the intermediate reasoning process, while keeping the final output correct. furthermore, the former category can be divided into two subcategories based on trigger locations: the backdoor trigger can be hidden either in the user query or in an intermediate observation returned by the external environment. we propose the corresponding data poisoning mechanisms to implement the above variations of agent backdoor attacks on two typical agent tasks, web shopping and tool utilization. extensive experiments show that llm-based agents suffer severely from backdoor attacks, indicating an urgent need for further research on the development of defenses against backdoor attacks on llm-based agents. warning: this paper may contain biased content.
Xun Liang, Hanyu Wang, Shichao Song, Mengting Hu, Xunzhi Wang, Zhiyu Li, Feiyu Xiong, Bo Tang
Abstract: controlled text generation (ctg) aims to produce texts that exhibit specific desired attributes. in this study, we introduce a pluggable ctg framework for large language models (llms) named dynamic attribute graphs-based controlled text generation (datg). this framework utilizes an attribute scorer to evaluate the attributes of sentences generated by llms and constructs dynamic attribute graphs. datg modulates the occurrence of key attribute words and key anti-attribute words, achieving effective attribute control without compromising the original capabilities of the model. we conduct experiments across four datasets in two tasks: toxicity mitigation and sentiment transformation, employing five llms as foundational models. our findings highlight a remarkable enhancement in control accuracy, achieving a peak improvement of 19.29% over baseline methods in the most favorable task across four datasets. additionally, we observe a significant decrease in perplexity, markedly improving text fluency.
Sangkyu Lee, Sungdong Kim, Ashkan Yousefpour, Minjoon Seo, Kang Min Yoo, Youngjae Yu
Abstract: to align large language models with human preferences, existing research either utilizes a separate reward model (rm) to perform on-policy learning or simplifies the training procedure by discarding the on-policy learning and the need for a separate rm. in this paper, we present a novel alignment framework, self-judge that is (1) on-policy learning and 2) parameter efficient, as it does not require an additional rm for evaluating the samples for on-policy learning. to this end, we propose judge-augmented supervised fine-tuning (jsft) to train a single model acting as both a policy and a judge. specifically, we view the pairwise judgment task as a special case of the instruction-following task, choosing the better response from a response pair. thus, the resulting model can judge preferences of on-the-fly responses from current policy initialized from itself. experimental results show the efficacy of self-judge, outperforming baselines in preference benchmarks. we also show that self-rejection with oversampling can improve further without an additional evaluator. our code is available at
Junlong Li, Fan Zhou, Shichao Sun, Yikai Zhang, Hai Zhao, Pengfei Liu
Abstract: as a relative quality comparison of model responses, human and large language model (llm) preferences serve as common alignment goals in model fine-tuning and criteria in evaluation. yet, these preferences merely reflect broad tendencies, resulting in less explainable and controllable models with potential safety risks. in this work, we dissect the preferences of human and 32 different llms to understand their quantitative composition, using annotations from real-world user-model conversations for a fine-grained, scenario-wise analysis. we find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits. on the contrary, advanced llms like gpt-4-turbo emphasize correctness, clarity, and harmlessness more. additionally, llms of similar sizes tend to exhibit similar preferences, regardless of their training methods, and fine-tuning for alignment does not significantly alter the preferences of pretrained-only llms. finally, we show that preference-based evaluation can be intentionally manipulated. in both training-free and training-based settings, aligning a model with the preferences of judges boosts scores, while injecting the least preferred properties lowers them. this results in notable score shifts: up to 0.59 on mt-bench (1-10 scale) and 31.94 on alpacaeval 2.0 (0-100 scale), highlighting the significant impact of this strategic adaptation. interactive demo: dataset: code:
Min Zhang, Jianfeng He, Taoran Ji, Chang-Tien Lu
Abstract: the fairness and trustworthiness of large language models (llms) are receiving increasing attention. implicit hate speech, which employs indirect language to convey hateful intentions, occupies a significant portion of practice. however, the extent to which llms effectively address this issue remains insufficiently examined. this paper delves into the capability of llms to detect implicit hate speech (classification task) and express confidence in their responses (calibration task). our evaluation meticulously considers various prompt patterns and mainstream uncertainty estimation methods. our findings highlight that llms exhibit two extremes: (1) llms display excessive sensitivity towards groups or topics that may cause fairness issues, resulting in misclassifying benign statements as hate speech. (2) llms' confidence scores for each method excessively concentrate on a fixed range, remaining unchanged regardless of the dataset's complexity. consequently, the calibration performance is heavily reliant on primary classification accuracy. these discoveries unveil new limitations of llms, underscoring the need for caution when optimizing models to ensure they do not veer towards extremes. this serves as a reminder to carefully consider sensitivity and confidence in the pursuit of model fairness.
Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
Abstract: instruction-following vision large language models (vllms) have achieved significant progress recently on a variety of tasks. these approaches merge strong pre-trained vision models and large language models (llms). since these components are trained separately, the learned representations need to be aligned with joint training on additional image-language pairs. this procedure is not perfect and can cause the model to hallucinate - provide answers that do not accurately reflect the image, even when the core llm is highly factual and the vision backbone has sufficiently complete representations. in this work, we frame the hallucination problem as an alignment issue, tackle it with preference tuning. specifically, we propose povid to generate feedback data with ai models. we use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data. first, we prompt gpt-4v to inject plausible hallucinations into the correct answer. second, we distort the image to trigger the inherent hallucination behavior of the vllm. this is an automated approach, which does not rely on human data generation or require a perfect expert, which makes it easily scalable. finally, both of these generation strategies are integrated into an rlhf pipeline via direct preference optimization. in experiments across broad benchmarks, we show that we can not only reduce hallucinations, but improve model performance across standard benchmarks, outperforming prior approaches. our data and code are available at
Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, William Yang Wang
Abstract: recent studies show that self-feedback improves large language models (llms) on certain tasks while worsens other tasks. we discovered that such a contrary is due to llm's bias towards their own output. in this paper, we formally define llm's self-bias -- the tendency to favor its own generation -- using two statistics. we analyze six llms on translation, constrained text generation, and mathematical reasoning tasks. we find that self-bias is prevalent in all examined llms across multiple languages and tasks. our analysis reveals that while the self-refine pipeline improves the fluency and understandability of model outputs, it further amplifies self-bias. to mitigate such biases, we discover that larger model size and external feedback with accurate assessment can significantly reduce bias in the self-refine pipeline, leading to actual performance improvement in downstream tasks.
Shiyu Ni, Keping Bi, Jiafeng Guo, Xueqi Cheng
Abstract: large language models (llms) have been found to have difficulty knowing they do not possess certain knowledge and tend to provide specious answers in such cases. retrieval augmentation (ra) has been extensively studied to mitigate llms' hallucinations. however, due to the extra overhead and unassured quality of retrieval, it may not be optimal to conduct ra all the time. a straightforward idea is to only conduct retrieval when llms are uncertain about a question. this motivates us to enhance the llms' ability to perceive their knowledge boundaries to help ra. in this paper, we first quantitatively measure llms' such ability and confirm their overconfidence. then, we study how llms' certainty about a question correlates with their dependence on external retrieved information. we propose several methods to enhance llms' perception of knowledge boundaries and show that they are effective in reducing overconfidence. additionally, equipped with these methods, llms can achieve comparable or even better performance of ra with much fewer retrieval calls.


Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, Sayak Ray Chowdhury
Abstract: reinforcement learning from human feedback (rlhf) is pivotal in aligning large language models (llms) with human preferences. while these aligned generative models have demonstrated impressive capabilities across various tasks, the dependence on high-quality human preference data poses a costly bottleneck in practical implementation of rlhf. hence better and adaptive strategies for data collection is needed. to this end, we frame rlhf as a contextual preference bandit problem with prompts as contexts and show that the naive way of collecting preference data by choosing prompts uniformly at random leads to a policy that suffers an $\omega(1)$ suboptimality gap in rewards. then we propose $\textit{active preference optimization}$ ($\texttt{apo}$), an algorithm that actively selects prompts to collect preference data. under the bradley-terry-luce (btl) preference model, \texttt{apo} achieves sample efficiency without compromising on policy performance. we show that given a sample budget of $t$, the suboptimality gap of a policy learned via $\texttt{apo}$ scales as $o(1/\sqrt{t})$. next, we propose a compute-efficient batch version of $\texttt{apo}$ with minor modification and evaluate its performance in practice. experimental evaluations on a human preference dataset validate \texttt{apo}'s efficacy as a sample-efficient and practical solution to data collection for rlhf, facilitating alignment of llms with human preferences in a cost-effective and scalable manner.
Yogesh Tripathi, Raghav Donakanti, Sahil Girhepuje, Ishan Kavathekar, Bhaskara Hanuma Vedula, Gokul S Krishnan, Shreya Goyal, Anmol Goel, Balaraman Ravindran, Ponnurangam Kumaraguru
Abstract: recent advancements in language technology and artificial intelligence have resulted in numerous language models being proposed to perform various tasks in the legal domain ranging from predicting judgments to generating summaries. despite their immense potential, these models have been proven to learn and exhibit societal biases and make unfair predictions. in this study, we explore the ability of large language models (llms) to perform legal tasks in the indian landscape when social factors are involved. we present a novel metric, $\beta$-weighted $\textit{legal safety score ($lss_{\beta}$)}$, which encapsulates both the fairness and accuracy aspects of the llm. we assess llms' safety by considering its performance in the $\textit{binary statutory reasoning}$ task and its fairness exhibition with respect to various axes of disparities in the indian society. task performance and fairness scores of llama and llama--2 models indicate that the proposed $lss_{\beta}$ metric can effectively determine the readiness of a model for safe usage in the legal sector. we also propose finetuning pipelines, utilising specialised legal datasets, as a potential method to mitigate bias and improve model safety. the finetuning procedures on llama and llama--2 models increase the $lss_{\beta}$, improving their usability in the indian legal domain. our code is publicly released.
Afra Amini, Tim Vieira, Ryan Cotterell
Abstract: direct preference optimization (dpo) is a successful fine-tuning strategy for aligning large language models with human preferences without the need to train a reward model or employ reinforcement learning. dpo, as originally formulated, relies on binary preference data and fine-tunes a language model to increase the likelihood of a preferred response over a dispreferred response. however, not all preference pairs are equal: while in some cases the preferred response is only slightly better than the dispreferred response, there can be a stronger preference for one response when, for example, the other response includes harmful or toxic content. in this paper, we propose a generalization of dpo, termed dpo with an offset (odpo), that does not treat every preference pair equally during fine-tuning. intuitively, odpo requires the difference between the likelihood of the preferred and dispreferred response to be greater than an offset value. the offset is determined based on the extent to which one response is preferred over another. our experiments on various tasks suggest that odpo significantly outperforms dpo in aligning language models, especially when the number of preference pairs is limited.
Divij Handa, Advait Chirmule, Bimal Gajera, Chitta Baral
Abstract: large language models (llms) are aligned to moral and ethical guidelines but remain susceptible to creative prompts called jailbreak that can bypass the alignment process. however, most jailbreaking prompts contain harmful questions in the natural language (mainly english), which can be detected by the llm themselves. in this paper, we present jailbreaking prompts encoded using cryptographic techniques. we first present a pilot study on the state-of-the-art llm, gpt-4, in decoding several safe sentences that have been encrypted using various cryptographic techniques and find that a straightforward word substitution cipher can be decoded most effectively. motivated by this result, we use this encoding technique for writing jailbreaking prompts. we present a mapping of unsafe words with safe words and ask the unsafe question using these mapped words. experimental results show an attack success rate (up to 59.42%) of our proposed jailbreaking approach on state-of-the-art proprietary models including chatgpt, gpt-4, and gemini-pro. additionally, we discuss the over-defensiveness of these models. we believe that our work will encourage further research in making these llms more robust while maintaining their decoding capabilities.
Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, Xueqi Cheng
Abstract: hallucinations pose a significant challenge for the practical implementation of large language models (llms). the utilization of parametric knowledge in generating factual content is constrained by the limited knowledge of llms, potentially resulting in internal hallucinations. while incorporating external information can help fill knowledge gaps, it also introduces the risk of irrelevant information, thereby increasing the likelihood of external hallucinations. a careful and balanced integration of the parametric knowledge within llms with external information is crucial to alleviate hallucinations. in this study, we present rowen, a novel approach that enhances llms with a selective retrieval augmentation process tailored to address hallucinated outputs. this process is governed by a multilingual semantic-aware detection module, which evaluates the consistency of the perturbed responses across various languages for the same queries. upon detecting inconsistencies indicative of hallucinations, rowen activates the retrieval of external information to rectify the model outputs. rowen adeptly harmonizes the intrinsic parameters in llms with external knowledge sources, effectively mitigating hallucinations by ensuring a balanced integration of internal reasoning and external evidence. through a comprehensive empirical analysis, we demonstrate that rowen surpasses the current state-of-the-art in both detecting and mitigating hallucinated content within the outputs of llms.
Ming Li, Jiuhai Chen, Lichang Chen, Tianyi Zhou
Abstract: making llms speak for different, especially minority groups of people, and generate statements supporting their diverse or even controversial perspectives is critical to creating an inclusive environment. however, existing llms lack sufficient controllability to the stance of their generated content, which often contains inconsistent, neutral, or biased statements. in this paper, we improve the controllability of llms in generating statements supporting an argument the user defined in the prompt. we find that multi-round debates between two llms with opposite stances generate higher-quality and more salient statements for each, which are important training data to improve the controllability of llms. motivated by this, we develop a novel debate & tuning ("debatune") pipeline finetuning llms to generate the statements obtained via debate. to examine debatune, we curate the largest dataset of debate topics so far, which covers 710 controversial topics and corresponding arguments for each topic. evaluations by the gpt-4 judge with a novel controversy controllability metric show that llms' capability of expressing diverse perspectives is significantly improved by debatune. moreover, such controllability can be generalized to unseen topics, generating high-quality statements supporting controversial arguments. our codes, models, and data will be released at
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang
Abstract: adopting human and large language models (llm) as judges (\textit{a.k.a} human- and llm-as-a-judge) for evaluating the performance of existing llms has recently gained attention. nonetheless, this approach concurrently introduces potential biases from human and llm judges, questioning the reliability of the evaluation results. in this paper, we propose a novel framework for investigating 5 types of biases for llm and human judges. we curate a dataset with 142 samples referring to the revised bloom's taxonomy and conduct thousands of human and llm evaluations. results show that human and llm judges are vulnerable to perturbations to various degrees, and that even the most cutting-edge judges possess considerable biases. we further exploit their weakness and conduct attacks on llm judges. we hope that our work can notify the community of the vulnerability of human- and llm-as-a-judge against perturbations, as well as the urgency of developing robust evaluation systems.
Haiyan Zhao, Fan Yang, Himabindu Lakkaraju, Mengnan Du
Abstract: as large language models (llms) grow more powerful, concerns around potential harms like toxicity, unfairness, and hallucination threaten user trust. ensuring beneficial alignment of llms with human values through model alignment is thus critical yet challenging, requiring a deeper understanding of llm behaviors and mechanisms. we propose opening the black box of llms through a framework of holistic interpretability encompassing complementary bottom-up and top-down perspectives. the bottom-up view, enabled by mechanistic interpretability, focuses on component functionalities and training dynamics. the top-down view utilizes representation engineering to analyze behaviors through hidden representations. in this paper, we review the landscape around mechanistic interpretability and representation engineering, summarizing approaches, discussing limitations and applications, and outlining future challenges in using these techniques to achieve ethical, honest, and reliable reasoning aligned with human values.
Junjie Ye, Sixian Li, Guanyu Li, Caishuang Huang, Songyang Gao, Yilong Wu, Qi Zhang, Tao Gui, Xuanjing Huang
Abstract: tool learning is widely acknowledged as a foundational approach or deploying large language models (llms) in real-world scenarios. while current research primarily emphasizes leveraging tools to augment llms, it frequently neglects emerging safety considerations tied to their application. to fill this gap, we present $toolsword$, a comprehensive framework dedicated to meticulously investigating safety issues linked to llms in tool learning. specifically, toolsword delineates six safety scenarios for llms in tool learning, encompassing $malicious$ $queries$ and $jailbreak$ $attacks$ in the input stage, $noisy$ $misdirection$ and $risky$ $cues$ in the execution stage, and $harmful$ $feedback$ and $error$ $conflicts$ in the output stage. experiments conducted on 11 open-source and closed-source llms reveal enduring safety challenges in tool learning, such as handling harmful queries, employing risky tools, and delivering detrimental feedback, which even gpt-4 is susceptible to. moreover, we conduct further studies with the aim of fostering research on tool learning safety. the data is released in
Moritz Stephan, Alexander Khazatsky, Eric Mitchell, Annie S Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn
Abstract: the diversity of contexts in which large language models (llms) are deployed requires the ability to modify or customize default model behaviors to incorporate nuanced requirements and preferences. a convenient interface to specify such model adjustments is high-level verbal feedback, such as "don't use emojis when drafting emails to my boss." however, while writing high-level feedback is far simpler than collecting annotations for reinforcement learning from human feedback (rlhf), we find that simply prompting a model with such feedback leads to overgeneralization of the feedback to contexts where it is not relevant. we study the problem of incorporating verbal feedback without such overgeneralization, inspiring a new method contextualized critiques with constrained preference optimization (c3po). c3po uses a piece of high-level feedback to generate a small synthetic preference dataset specifying how the feedback should (and should not) be applied. it then fine-tunes the model in accordance with the synthetic preference data while minimizing the divergence from the original model for prompts where the feedback does not apply. our experimental results indicate that our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors for other contexts. for both human- and gpt-4-generated high-level feedback, c3po effectively adheres to the given feedback comparably to in-context baselines while reducing overgeneralization by 30%.
Sarath Sivaprasad, Pramod Kaushik, Sahar Abdelnabi, Mario Fritz
Abstract: large-language-models (llms) are deployed in a wide range of applications, and their response has an increasing social impact. understanding the non-deliberate(ive) mechanism of llms in giving responses is essential in explaining their performance and discerning their biases in real-world applications. this is analogous to human studies, where such inadvertent responses are referred to as sampling. we study this sampling of llms in light of value bias and show that the sampling of llms tends to favour high-value options. value bias corresponds to this shift of response from the most likely towards an ideal value represented in the llm. in fact, this effect can be reproduced even with new entities learnt via in-context prompting. we show that this bias manifests in unexpected places and has implications on relevant application scenarios, like choosing exemplars. the results show that value bias is strong in llms across different categories, similar to the results found in human studies.
Jingwei Ni, Minjing Shi, Dominik Stammbach, Mrinmaya Sachan, Elliott Ash, Markus Leippold
Abstract: with the rise of generative ai, automated fact-checking methods to combat misinformation are becoming more and more important. however, factual claim detection, the first step in a fact-checking pipeline, suffers from two key issues that limit its scalability and generalizability: (1) inconsistency in definitions of the task and what a claim is, and (2) the high cost of manual annotation. to address (1), we review the definitions in related work and propose a unifying definition of factual claims that focuses on verifiability. to address (2), we introduce afacta (automatic factual claim detection annotator), a novel framework that assists in the annotation of factual claims with the help of large language models (llms). afacta calibrates its annotation confidence with consistency along three predefined reasoning paths. extensive evaluation and experiments in the domain of political speech reveal that afacta can efficiently assist experts in annotating factual claims and training high-quality classifiers, and can work with or without expert supervision. our analyses also result in policlaim, a comprehensive claim detection dataset spanning diverse political topics.
Chris M. Ward, Josh Harguess, Julia Tao, Daniel Christman, Paul Spicer, Mike Tan
Abstract: we introduce the ai security pyramid of pain, a framework that adapts the cybersecurity pyramid of pain to categorize and prioritize ai-specific threats. this framework provides a structured approach to understanding and addressing various levels of ai threats. starting at the base, the pyramid emphasizes data integrity, which is essential for the accuracy and reliability of datasets and ai models, including their weights and parameters. ensuring data integrity is crucial, as it underpins the effectiveness of all ai-driven decisions and operations. the next level, ai system performance, focuses on mlops-driven metrics such as model drift, accuracy, and false positive rates. these metrics are crucial for detecting potential security breaches, allowing for early intervention and maintenance of ai system integrity. advancing further, the pyramid addresses the threat posed by adversarial tools, identifying and neutralizing tools used by adversaries to target ai systems. this layer is key to staying ahead of evolving attack methodologies. at the adversarial input layer, the framework addresses the detection and mitigation of inputs designed to deceive or exploit ai models. this includes techniques like adversarial patterns and prompt injection attacks, which are increasingly used in sophisticated attacks on ai systems. data provenance is the next critical layer, ensuring the authenticity and lineage of data and models. this layer is pivotal in preventing the use of compromised or biased data in ai systems. at the apex is the tactics, techniques, and procedures (ttps) layer, dealing with the most complex and challenging aspects of ai security. this involves a deep understanding and strategic approach to counter advanced ai-targeted attacks, requiring comprehensive knowledge and planning.
Zihao He, Siyi Guo, Ashwin Rao, Kristina Lerman
Abstract: language models (lms) are known to represent the perspectives of some social groups better than others, which may impact their performance, especially on subjective tasks such as content moderation and hate speech detection. to explore how lms represent different perspectives, existing research focused on positional alignment, i.e., how closely the models mimic the opinions and stances of different groups, e.g., liberals or conservatives. however, human communication also encompasses emotional and moral dimensions. we define the problem of affective alignment, which measures how lms' emotional and moral tone represents those of different groups. by comparing the affect of responses generated by 36 lms to the affect of twitter messages, we observe significant misalignment of lms with both ideological groups. this misalignment is larger than the partisan divide in the u.s. even after steering the lms towards specific ideological perspectives, the misalignment and liberal tendencies of the model persist, suggesting a systemic bias within lms.
Tianyi Yan, Fei Wang, James Y. Huang, Wenxuan Zhou, Fan Yin, Aram Galstyan, Wenpeng Yin, Muhao Chen
Abstract: instruction tuning has been used as a promising approach to improve the performance of large language models (llms) on unseen tasks. however, current llms exhibit limited robustness to unseen instructions, generating inconsistent outputs when the same instruction is phrased with slightly varied forms or language styles. this behavior indicates llms' lack of robustness to textual variations and generalizability to unseen instructions, potentially leading to trustworthiness issues. accordingly, we propose contrastive instruction tuning, which maximizes the similarity between the hidden representations of semantically equivalent instruction-instance pairs while minimizing the similarity between semantically different ones. to facilitate this approach, we augment the existing flan collection by paraphrasing task instructions. experiments on the promptbench benchmark show that coin consistently improves llms' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy.
Fan Huang, Haewoon Kwak, Jisun An
Abstract: the robustness of ai-content detection models against cultivated attacks (e.g., paraphrasing or word switching) remains a significant concern. this study proposes a novel token-ensemble generation strategy to challenge the robustness of current ai-content detection approaches. we explore the ensemble attack strategy by completing the prompt with the next token generated from random candidate llms. we find the token-ensemble approach significantly drops the performance of ai-content detection models (the code and test sets will be released). our findings reveal that token-ensemble generation poses a vital challenge to current detection models and underlines the need for advancing detection technologies to counter sophisticated adversarial strategies.
Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Osama Mohanned Afzal, Tarek Mahmoud, Giovanni Puccetti, Thomas Arnold, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov
Abstract: the advent of large language models (llms) has brought an unprecedented surge in machine-generated text (mgt) across diverse channels. this raises legitimate concerns about its potential misuse and societal implications. the need to identify and differentiate such content from genuine human-generated text is critical in combating disinformation, preserving the integrity of education and scientific fields, and maintaining trust in communication. in this work, we address this problem by introducing a new benchmark involving multilingual, multi-domain and multi-generator for mgt detection -- m4gt-bench. it is collected for three task formulations: (1) mono-lingual and multi-lingual binary mgt detection; (2) multi-way detection identifies which particular model generates the text; and (3) human-machine mixed text detection, where a word boundary delimiting mgt from human-written content should be determined. human evaluation for task 2 shows less than random guess performance, demonstrating the challenges to distinguish unique llms. promising results always occur when training and test data distribute within the same domain or generators.
Haolan Zhan, Zhuang Li, Xiaoxi Kang, Tao Feng, Yuncheng Hua, Lizhen Qu, Yi Ying, Mei Rianto Chandra, Kelly Rosalin, Jureynolds Jureynolds, Suraj Sharma, Shilin Qu, Linhao Luo, Lay-Ki Soon, Zhaleh Semnani Azad, Ingrid Zukerman, Gholamreza Haffari
Abstract: norm violations occur when individuals fail to conform to culturally accepted behaviors, which may lead to potential conflicts. remediating norm violations requires social awareness and cultural sensitivity of the nuances at play. to equip interactive ai systems with a remediation ability, we offer renovi - a large-scale corpus of 9,258 multi-turn dialogues annotated with social norms, as well as define a sequence of tasks to help understand and remediate norm violations step by step. renovi consists of two parts: 512 human-authored dialogues (real data), and 8,746 synthetic conversations generated by chatgpt through prompt learning. while collecting sufficient human-authored data is costly, synthetic conversations provide suitable amounts of data to help mitigate the scarcity of training data, as well as the chance to assess the alignment between llms and humans in the awareness of social norms. we thus harness the power of chatgpt to generate synthetic training data for our task. to ensure the quality of both human-authored and synthetic data, we follow a quality control protocol during data collection. our experimental results demonstrate the importance of remediating norm violations in socio-cultural conversations, as well as the improvement in performance obtained from synthetic data.
Xiangjue Dong, Yibo Wang, Philip S. Yu, James Caverlee
Abstract: large language models (llms) can generate biased responses. yet previous direct probing techniques contain either gender mentions or predefined gender stereotypes, which are challenging to comprehensively collect. hence, we propose an indirect probing framework based on conditional generation. this approach aims to induce llms to disclose their gender bias even without explicit gender or stereotype mentions. we explore three distinct strategies to disclose explicit and implicit gender bias in llms. our experiments demonstrate that all tested llms exhibit explicit and/or implicit gender bias, even when gender stereotypes are not present in the inputs. in addition, an increased model size or model alignment amplifies bias in most cases. furthermore, we investigate three methods to mitigate bias in llms via hyperparameter tuning, instruction guiding, and debias tuning. remarkably, these methods prove effective even in the absence of explicit genders or stereotypes.


Ece Gumusel, Kyrie Zhixuan Zhou, Madelyn Rose Sanfilippo
Abstract: this study presents a unique framework that applies and extends solove (2006)'s taxonomy to address privacy concerns in interactions with text-based ai chatbots. as chatbot prevalence grows, concerns about user privacy have heightened. while existing literature highlights design elements compromising privacy, a comprehensive framework is lacking. through semi-structured interviews with 13 participants interacting with two ai chatbots, this study identifies 9 privacy harms and 9 privacy risks in text-based interactions. using a grounded theory approach for interview and chatlog analysis, the framework examines privacy implications at various interaction stages. the aim is to offer developers, policymakers, and researchers a tool for responsible and secure implementation of conversational ai, filling the existing gap in addressing privacy issues associated with text-based ai chatbots.
Ashfak Md Shibli, Mir Mehedi A. Pritom, Maanak Gupta
Abstract: sms phishing, also known as "smishing", is a growing threat that tricks users into disclosing private information or clicking into urls with malicious content through fraudulent mobile text messages. in recent past, we have also observed a rapid advancement of conversational generative ai chatbot services (e.g., openai's chatgpt, google's bard), which are powered by pre-trained large language models (llms). these ai chatbots certainly have a lot of utilities but it is not systematically understood how they can play a role in creating threats and attacks. in this paper, we propose abusegpt method to show how the existing generative ai-based chatbot services can be exploited by attackers in real world to create smishing texts and eventually lead to craftier smishing campaigns. to the best of our knowledge, there is no pre-existing work that evidently shows the impacts of these generative text-based models on creating sms phishing. thus, we believe this study is the first of its kind to shed light on this emerging cybersecurity threat. we have found strong empirical evidences to show that attackers can exploit ethical standards in the existing generative ai-based chatbot services by crafting prompt injection attacks to create newer smishing campaigns. we also discuss some future research directions and guidelines to protect the abuse of generative ai-based services and safeguard users from smishing attacks.
Paulo Garcia
Abstract: ensuring artificial intelligence behaves in such a way that is aligned with human values is commonly referred to as the alignment challenge. prior work has shown that rational agents, behaving in such a way that maximizes a utility function, will inevitably behave in such a way that is not aligned with human values, especially as their level of intelligence goes up. prior work has also shown that there is no "one true utility function"; solutions must include a more holistic approach to alignment. this paper describes oblivious agents: agents that are architected in such a way that their effective utility function is an aggregation of a known and hidden sub-functions. the hidden component, to be maximized, is internally implemented as a black box, preventing the agent from examining it. the known component, to be minimized, is knowledge of the hidden sub-function. architectural constraints further influence how agent actions can evolve its internal environment model. we show that an oblivious agent, behaving rationally, constructs an internal approximation of designers' intentions (i.e., infers alignment), and, as a consequence of its architecture and effective utility function, behaves in such a way that maximizes alignment; i.e., maximizing the approximated intention function. we show that, paradoxically, it does this for whatever utility function is used as the hidden component and, in contrast with extant techniques, chances of alignment actually improve as agent intelligence grows.
Dexun Li, Cong Zhang, Kuicai Dong, Derrick Goh Xin Deik, Ruiming Tang, Yong Liu
Abstract: deep reinforcement learning is widely used for aligning large language models (llm) with human preference. however, the conventional reward modelling has predominantly depended on human annotations provided by a select cohort of individuals. such dependence may unintentionally result in models that are skewed to reflect the inclinations of these annotators, thereby failing to represent the expectations of the wider population adequately. in this paper, we introduce the distributional preference reward model (dprm), a simple yet effective framework to align large language models with a diverse set of human preferences. to this end, we characterize the preferences by a beta distribution, which can dynamically adapt to fluctuations in preference trends. on top of that, we design an optimal-transportation-based loss to calibrate dprm to align with the preference distribution. finally, the expected reward is utilized to fine-tune an llm policy to generate responses favoured by the population. our experiments show that dprm significantly enhances the alignment of llms with population preference, yielding more accurate, unbiased, and contextually appropriate responses.
Shangyu Xing, Fei Zhao, Zhen Wu, Tuo An, Weihao Chen, Chunhui Li, Jianbing Zhang, Xinyu Dai
Abstract: multimodal large language models (mllms) have attracted increasing attention in the past few years, but they may still generate descriptions that include objects not present in the corresponding images, a phenomenon known as object hallucination. to eliminate hallucinations, existing methods manually annotate paired responses with and without hallucinations, and then employ various alignment algorithms to improve the alignment capability between images and text. however, they not only demand considerable computation resources during the finetuning stage but also require expensive human annotation to construct paired data needed by the alignment algorithms. to address these issues, we borrow the idea of unlearning and propose an efficient fine-grained unlearning framework (efuf), which can eliminate hallucinations without the need for paired data. extensive experiments show that our method consistently reduces hallucinations while preserving the generation quality with modest computational overhead. our code and datasets will be publicly available.
Álvaro Huertas-García, Alejandro Martín, Javier Huertas-Tato, David Camacho
Abstract: adversarial attacks represent a substantial challenge in natural language processing (nlp). this study undertakes a systematic exploration of this challenge in two distinct phases: vulnerability evaluation and resilience enhancement of transformer-based models under adversarial attacks. in the evaluation phase, we assess the susceptibility of three transformer configurations, encoder-decoder, encoder-only, and decoder-only setups, to adversarial attacks of escalating complexity across datasets containing offensive language and misinformation. encoder-only models manifest a 14% and 21% performance drop in offensive language detection and misinformation detection tasks, respectively. decoder-only models register a 16% decrease in both tasks, while encoder-decoder models exhibit a maximum performance drop of 14% and 26% in the respective tasks. the resilience-enhancement phase employs adversarial training, integrating pre-camouflaged and dynamically altered data. this approach effectively reduces the performance drop in encoder-only models to an average of 5% in offensive language detection and 2% in misinformation detection tasks. decoder-only models, occasionally exceeding original performance, limit the performance drop to 7% and 2% in the respective tasks. although not surpassing the original performance, encoder-decoder models can reduce the drop to an average of 6% and 2% respectively. results suggest a trade-off between performance and robustness, with some models maintaining similar performance while gaining robustness. our study and adversarial training techniques have been incorporated into an open-source tool for generating camouflaged datasets. however, methodology effectiveness depends on the specific camouflage technique and data encountered, emphasizing the need for continued exploration.
Timothy R. Mcintosh, Teo Susnjak, Tong Liu, Paul Watters, Malka N. Halgamuge
Abstract: the rapid rise in popularity of large language models (llms) with emerging capabilities has spurred public curiosity to evaluate and compare different llms, leading many researchers to propose their llm benchmarks. noticing preliminary inadequacies in those benchmarks, we embarked on a study to critically assess 23 state-of-the-art llm benchmarks, using our novel unified evaluation framework through the lenses of people, process, and technology, under the pillars of functionality and security. our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, evaluator diversity, and the overlooking of cultural and ideological norms in one comprehensive assessment. our discussions emphasized the urgent need for standardized methodologies, regulatory certainties, and ethical guidelines in light of artificial intelligence (ai) advancements, including advocating for an evolution from static benchmarks to dynamic behavioral profiling to accurately capture llms' complex behaviors and potential risks. our study highlighted the necessity for a paradigm shift in llm evaluation methodologies, underlining the importance of collaborative efforts for the development of universally accepted benchmarks and the enhancement of ai systems' integration into society.
Saeed Khaki, Jinjin Li, Lan Ma, Liu Yang, Prathap Ramachandra
Abstract: reinforcement learning from human feedback (rlhf) has been extensively employed to align large language models with user intent. however, proximal policy optimization (ppo) based rlhf is occasionally unstable requiring significant hyperparameter finetuning, and computationally expensive to maximize the estimated reward during alignment. recently, direct preference optimization (dpo) is proposed to address those challenges. however, dpo relies on contrastive responses generated from human annotator and alternative llm, instead of the policy model, limiting the effectiveness of the rlhf. in this paper, we addresses both challenges by systematically combining rejection sampling (rs) and dpo. our proposed method, rs-dpo, initiates with the development of a supervised fine-tuned policy model (sft). a varied set of k responses per prompt are sampled directly from the sft model. rs-dpo identifies pairs of contrastive samples based on their reward distribution. finally, we apply dpo with the contrastive samples to align the model to human preference. our experiments indicate that our proposed method effectively fine-tunes llms with limited resource environments, leading to improved alignment with user intent. furthermore, it outperforms existing methods, including rs, ppo, and dpo.
Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, Meng Jiang
Abstract: the rapid advancement of large language models (llms) has demonstrated their vast potential across various domains, attributed to their extensive pretraining knowledge and exceptional generalizability. however, llms often encounter challenges in generating harmful content when faced with problematic prompts. to address this problem, existing work attempted to implement a gradient ascent based approach to prevent llms from producing harmful output. while these methods can be effective, they frequently impact the model utility in responding to normal prompts. to address this gap, we introduce selective knowledge negation unlearning (sku), a novel unlearning framework for llms, designed to eliminate harmful knowledge while preserving utility on normal prompts. specifically, sku is consisted of two stages: harmful knowledge acquisition stage and knowledge negation stage. the first stage aims to identify and acquire harmful knowledge within the model, whereas the second is dedicated to remove this knowledge. sku selectively isolates and removes harmful knowledge in model parameters, ensuring the model's performance remains robust on normal prompts. our experiments conducted across various llm architectures demonstrate that sku identifies a good balance point between removing harmful information and preserving utility.
Weixiang Zhao, Zhuojun Li, Shilong Wang, Yang Wang, Yulin Hu, Yanyan Zhao, Chen Wei, Bing Qin
Abstract: emotional intelligence (ei), consisting of emotion perception, emotion cognition and emotion expression, plays the critical roles in improving user interaction experience for the current large language model (llm) based conversational general ai assistants. previous works mainly focus on raising the emotion perception ability of them via naive fine-tuning on ei-related classification or regression tasks. however, this leads to the incomplete enhancement of ei and catastrophic forgetting of the general intelligence (gi). to this end, we first introduce \textsc{eibench}, a large-scale collection of ei-related tasks in the text-to-text formation with task instructions that covers all three aspects of ei, which lays a solid foundation for the comprehensive ei enhancement of llms. then a novel \underline{\textbf{mo}}dular \underline{\textbf{e}}motional \underline{\textbf{i}}ntelligence enhancement method (\textbf{moei}), consisting of modular parameter expansion and intra-inter modulation, is proposed to comprehensively enhance the ei of llms without compromise their gi. extensive experiments on two representative llm-based assistants, flan-t5 and llama-2-chat, demonstrate the effectiveness of moei to improving ei while maintain gi.
Lingbo Mo, Zeyi Liao, Boyuan Zheng, Yu Su, Chaowei Xiao, Huan Sun
Abstract: language agents powered by large language models (llms) have seen exploding development. their capability of using language as a vehicle for thought and communication lends an incredible level of flexibility and versatility. people have quickly capitalized on this capability to connect llms to a wide range of external components and environments: databases, tools, the internet, robotic embodiment, etc. many believe an unprecedentedly powerful automation technology is emerging. however, new automation technologies come with new safety risks, especially for intricate systems like language agents. there is a surprisingly large gap between the speed and scale of their development and deployment and our understanding of their safety risks. are we building a house of cards? in this position paper, we present the first systematic effort in mapping adversarial attacks against language agents. we first present a unified conceptual framework for agents with three major components: perception, brain, and action. under this framework, we present a comprehensive discussion and propose 12 potential attack scenarios against different components of an agent, covering different attack strategies (e.g., input manipulation, adversarial demonstrations, jailbreaking, backdoors). we also draw connections to successful attack strategies previously applied to llms. we emphasize the urgency to gain a thorough understanding of language agent risks before their widespread deployment.
Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, Jianshu Chen
Abstract: we consider the problem of multi-objective alignment of foundation models with human preferences, which is a critical step towards helpful and harmless ai systems. however, it is generally costly and unstable to fine-tune large foundation models using reinforcement learning (rl), and the multi-dimensionality, heterogeneity, and conflicting nature of human preferences further complicate the alignment process. in this paper, we introduce rewards-in-context (ric), which conditions the response of a foundation model on multiple rewards in its prompt context and applies supervised fine-tuning for alignment. the salient features of ric are simplicity and adaptivity, as it only requires supervised fine-tuning of a single foundation model and supports dynamic adjustment for user preferences during inference time. inspired by the analytical solution of an abstracted convex optimization problem, our dynamic inference-time adjustment method approaches the pareto-optimal solution for multiple objectives. empirical evidence demonstrates the efficacy of our method in aligning both large language models (llms) and diffusion models to accommodate diverse rewards with only around 10% gpu hours compared with multi-objective rl baseline.
Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer
Abstract: the rise of large language models (llms) has drawn attention to the existence of "jailbreaks" that allow the models to be used maliciously. however, there is no standard benchmark for measuring the severity of a jailbreak, leaving authors of jailbreak papers to create their own. we show that these benchmarks often include vague or unanswerable questions and use grading criteria that are biased towards overestimating the misuse potential of low-quality model responses. some jailbreak techniques make the problem worse by decreasing the quality of model responses even on benign questions: we show that several jailbreaking techniques substantially reduce the zero-shot performance of gpt-4 on mmlu. jailbreaks can also make it harder to elicit harmful responses from an "uncensored" open-source model. we present a new benchmark, strongreject, which better discriminates between effective and ineffective jailbreaks by using a higher-quality question set and a more accurate response grading algorithm. we show that our new grading scheme better accords with human judgment of response quality and overall jailbreak effectiveness, especially on the sort of low-quality responses that contribute the most to over-estimation of jailbreak performance on existing benchmarks. we release our code and data at
Xiyang Wu, Ruiqi Xian, Tianrui Guan, Jing Liang, Souradip Chakraborty, Fuxiao Liu, Brian Sadler, Dinesh Manocha, Amrit Singh Bedi
Abstract: in this paper, we highlight the critical issues of robustness and safety associated with integrating large language models (llms) and vision-language models (vlms) into robotics applications. recent works have focused on using llms and vlms to improve the performance of robotics tasks, such as manipulation, navigation, etc. however, such integration can introduce significant vulnerabilities, in terms of their susceptibility to adversarial attacks due to the language models, potentially leading to catastrophic consequences. by examining recent works at the interface of llms/vlms and robotics, we show that it is easy to manipulate or misguide the robot's actions, leading to safety hazards. we define and provide examples of several plausible adversarial attacks, and conduct experiments on three prominent robot frameworks integrated with a language model, including knowno vima, and instruct2act, to assess their susceptibility to these attacks. our empirical findings reveal a striking vulnerability of llm/vlm-robot integrated systems: simple adversarial attacks can significantly undermine the effectiveness of llm/vlm-robot integrated systems. specifically, our data demonstrate an average performance deterioration of 21.2% under prompt attacks and a more alarming 30.2% under perception attacks. these results underscore the critical need for robust countermeasures to ensure the safe and reliable deployment of the advanced llm/vlm-based robotic systems.
Jiaheng Wei, Yuanshun Yao, Jean-Francois Ton, Hongyi Guo, Andrew Estornell, Yang Liu
Abstract: llm hallucination, i.e. generating factually incorrect yet seemingly convincing answers, is currently a major threat to the trustworthiness and reliability of llms. the first step towards solving this complicated problem is to measure it. however, existing hallucination metrics require to have a benchmark dataset with gold-standard answers, i.e. "best" or "correct" answers written by humans. such requirement makes hallucination measurement costly and prone to human errors. in this work, we propose factualness evaluations via weighting llms (fewl), the first hallucination metric that is specifically designed for the scenario when gold-standard answers are absent. fewl leverages the answers from off-the-shelf llms that serve as a proxy of gold-standard answers. the key challenge is how to quantify the expertise of reference llms resourcefully. we show fewl has certain theoretical guarantees and demonstrate empirically it gives more accurate hallucination measures than naively using reference llms. we also show how to leverage fewl to reduce hallucination through both in-context learning and supervised finetuning. last, we build a large-scale benchmark dataset to facilitate llm hallucination research.
Herun Wan, Shangbin Feng, Zhaoxuan Tan, Heng Wang, Yulia Tsvetkov, Minnan Luo
Abstract: large language models are limited by challenges in factuality and hallucinations to be directly employed off-the-shelf for judging the veracity of news articles, where factual accuracy is paramount. in this work, we propose dell that identifies three key stages in misinformation detection where llms could be incorporated as part of the pipeline: 1) llms could \emph{generate news reactions} to represent diverse perspectives and simulate user-news interaction networks; 2) llms could \emph{generate explanations} for proxy tasks (e.g., sentiment, stance) to enrich the contexts of news articles and produce experts specializing in various aspects of news understanding; 3) llms could \emph{merge task-specific experts} and provide an overall prediction by incorporating the predictions and confidence scores of varying experts. extensive experiments on seven datasets with three llms demonstrate that dell outperforms state-of-the-art baselines by up to 16.8\% in macro f1-score. further analysis reveals that the generated reactions and explanations are greatly helpful in misinformation detection, while our proposed llm-guided expert merging helps produce better-calibrated predictions.
Wenchao Dong, Assem Zhunis, Hyojin Chin, Jiyoung Han, Meeyoung Cha
Abstract: we explored cultural biases-individualism vs. collectivism-in chatgpt across three western languages (i.e., english, german, and french) and three eastern languages (i.e., chinese, japanese, and korean). when chatgpt adopted an individualistic persona in western languages, its collectivism scores (i.e., out-group values) exhibited a more negative trend, surpassing their positive orientation towards individualism (i.e., in-group values). conversely, when a collectivistic persona was assigned to chatgpt in eastern languages, a similar pattern emerged with more negative responses toward individualism (i.e., out-group values) as compared to collectivism (i.e., in-group values). the results indicate that when imbued with a particular social identity, chatgpt discerns in-group and out-group, embracing in-group values while eschewing out-group values. notably, the negativity towards the out-group, from which prejudices and discrimination arise, exceeded the positivity towards the in-group. the experiment was replicated in the political domain, and the results remained consistent. furthermore, this replication unveiled an intrinsic democratic bias in large language models (llms), aligning with earlier findings and providing integral insights into mitigating such bias through prompt engineering. extensive robustness checks were performed using varying hyperparameter and persona setup methods, with or without social identity labels, across other popular language models.


Siwon Kim, Shuyang Dai, Mohammad Kachuee, Shayan Ray, Tara Taghavi, Sungroh Yoon
Abstract: current conversational ai systems based on large language models (llms) are known to generate unsafe responses, agreeing to offensive user input or including toxic content. previous research aimed to alleviate the toxicity, by fine-tuning llm with manually annotated safe dialogue histories. however, the dependency on additional tuning requires substantial costs. to remove the dependency, we propose groundial, where response safety is achieved by grounding responses to commonsense social rules without requiring fine-tuning. a hybrid approach of in-context learning and human-norm-guided decoding of groundial enables the response to be quantitatively and qualitatively safer even without additional data or tuning.
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran
Abstract: as large language models (llms) become increasingly integrated into real-world applications such as code generation and chatbot assistance, extensive efforts have been made to align llm behavior with human values, including safety. jailbreak attacks, aiming to provoke unintended and unsafe behaviors from llms, remain a significant/leading llm safety threat. in this paper, we aim to defend llms against jailbreak attacks by introducing safedecoding, a safety-aware decoding strategy for llms to generate helpful and harmless responses to user queries. our insight in developing safedecoding is based on the observation that, even though probabilities of tokens representing harmful contents outweigh those representing harmless responses, safety disclaimers still appear among the top tokens after sorting tokens by probability in descending order. this allows us to mitigate jailbreak attacks by identifying safety disclaimers and amplifying their token probabilities, while simultaneously attenuating the probabilities of token sequences that are aligned with the objectives of jailbreak attacks. we perform extensive experiments on five llms using six state-of-the-art jailbreak attacks and four benchmark datasets. our results show that safedecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries. safedecoding outperforms six defense methods.
Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Gunnemann
Abstract: current research in adversarial robustness of llms focuses on discrete input manipulations in the natural language space, which can be directly transferred to closed-source models. however, this approach neglects the steady progression of open-source models. as open-source models advance in capability, ensuring their safety also becomes increasingly imperative. yet, attacks tailored to open-source llms that exploit full model access remain largely unexplored. we address this research gap and propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens. we find that embedding space attacks circumvent model alignments and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning. furthermore, we present a novel threat model in the context of unlearning and show that embedding space attacks can extract supposedly deleted information from unlearned llms across multiple datasets and models. our findings highlight embedding space attacks as an important threat model in open-source llms. trigger warning: the appendix contains llm-generated text with violence and harassment.
Zhiyuan Chang, Mingyang Li, Yi Liu, Junjie Wang, Qing Wang, Yang Liu
Abstract: with the development of llms, the security threats of llms are getting more and more attention. numerous jailbreak attacks have been proposed to assess the security defense of llms. current jailbreak attacks primarily utilize scenario camouflage techniques. however their explicitly mention of malicious intent will be easily recognized and defended by llms. in this paper, we propose an indirect jailbreak attack approach, puzzler, which can bypass the llm's defense strategy and obtain malicious response by implicitly providing llms with some clues about the original malicious query. in addition, inspired by the wisdom of "when unable to attack, defend" from sun tzu's art of war, we adopt a defensive stance to gather clues about the original malicious query through llms. extensive experimental results show that puzzler achieves a query success rate of 96.6% on closed-source llms, which is 57.9%-82.7% higher than baselines. furthermore, when tested against the state-of-the-art jailbreak detection approaches, puzzler proves to be more effective at evading detection compared to baselines.
Lukas Struppek, Minh Hieu Le, Dominik Hintersdorf, Kristian Kersting
Abstract: the proliferation of large language models (llms) has sparked widespread and general interest due to their strong language generation capabilities, offering great potential for both industry and research. while previous research delved into the security and privacy issues of llms, the extent to which these models can exhibit adversarial behavior remains largely unexplored. addressing this gap, we investigate whether common publicly available llms have inherent capabilities to perturb text samples to fool safety measures, so-called adversarial examples resp.~attacks. more specifically, we investigate whether llms are inherently able to craft adversarial examples out of benign samples to fool existing safe rails. our experiments, which focus on hate speech detection, reveal that llms succeed in finding adversarial perturbations, effectively undermining hate speech detection systems. our findings carry significant implications for (semi-)autonomous systems relying on llms, highlighting potential challenges in their interaction with existing systems and safety measures.
Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann
Abstract: current llm alignment methods are readily broken through specifically crafted adversarial prompts. while crafting adversarial prompts using discrete optimization is highly effective, such attacks typically use more than 100,000 llm calls. this high computational cost makes them unsuitable for, e.g., quantitative analyses and adversarial training. to remedy this, we revisit projected gradient descent (pgd) on the continuously relaxed input prompt. although previous attempts with ordinary gradient-based attacks largely failed, we show that carefully controlling the error introduced by the continuous relaxation tremendously boosts their efficacy. our pgd for llms is up to one order of magnitude faster than state-of-the-art discrete optimization to achieve the same devastating attack results.
Yixin Cheng, Markos Georgopoulos, Volkan Cevher, Grigorios G. Chrysos
Abstract: large language models (llms) are susceptible to jailbreaking attacks, which aim to extract harmful information by subtly modifying the attack query. as defense mechanisms evolve, directly obtaining harmful information becomes increasingly challenging for jailbreaking attacks. in this work, inspired by human practices of indirect context to elicit harmful information, we focus on a new attack form called contextual interaction attack. the idea relies on the autoregressive nature of the generation process in llms. we contend that the prior context--the information preceding the attack query--plays a pivotal role in enabling potent jailbreaking attacks. specifically, we propose an approach that leverages preliminary question-answer pairs to interact with the llm. by doing so, we guide the responses of the model toward revealing the 'desired' harmful information. we conduct experiments on four different llms and demonstrate the efficacy of this attack, which is black-box and can also transfer across llms. we believe this can lead to further developments and understanding of the context vector in llms.
Rui Zhang, Hongwei Li, Rui Wen, Wenbo Jiang, Yuan Zhang, Michael Backes, Yun Shen, Yang Zhang
Abstract: the increasing demand for customized large language models (llms) has led to the development of solutions like gpts. these solutions facilitate tailored llm creation via natural language prompts without coding. however, the trustworthiness of third-party custom versions of llms remains an essential concern. in this paper, we propose the first instruction backdoor attacks against applications integrated with untrusted customized llms (e.g., gpts). specifically, these attacks embed the backdoor into the custom version of llms by designing prompts with backdoor instructions, outputting the attacker's desired result when inputs contain the pre-defined triggers. our attack includes 3 levels of attacks: word-level, syntax-level, and semantic-level, which adopt different types of triggers with progressive stealthiness. we stress that our attacks do not require fine-tuning or any modification to the backend llms, adhering strictly to gpts development guidelines. we conduct extensive experiments on 4 prominent llms and 5 benchmark text classification datasets. the results show that our instruction backdoor attacks achieve the desired attack performance without compromising utility. additionally, we propose an instruction-ignoring defense mechanism and demonstrate its partial effectiveness in mitigating such attacks. our findings highlight the vulnerability and the potential risks of llm customization such as gpts.
Olivia Macmillan-Scott, Mirco Musolesi
Abstract: do large language models (llms) display rational reasoning? llms have been shown to contain human biases due to the data they have been trained on; whether this is reflected in rational reasoning remains less clear. in this paper, we answer this question by evaluating seven language models using tasks from the cognitive psychology literature. we find that, like humans, llms display irrationality in these tasks. however, the way this irrationality is displayed does not reflect that shown by humans. when incorrect answers are given by llms to these tasks, they are often incorrect in ways that differ from human-like biases. on top of this, the llms reveal an additional layer of irrationality in the significant inconsistency of the responses. aside from the experimental results, this paper seeks to make a methodological contribution by showing how we can assess and compare different capabilities of these types of models, in this case with respect to rational reasoning.
Yuhui Shi, Qiang Sheng, Juan Cao, Hao Mi, Beizhe Hu, Danding Wang
Abstract: with the rapidly increasing application of large language models (llms), their abuse has caused many undesirable societal problems such as fake news, academic dishonesty, and information pollution. this makes ai-generated text (aigt) detection of great importance. among existing methods, white-box methods are generally superior to black-box methods in terms of performance and generalizability, but they require access to llms' internal states and are not applicable to black-box settings. in this paper, we propose to estimate word generation probabilities as pseudo white-box features via multiple re-sampling to help improve aigt detection under the black-box setting. specifically, we design poger, a proxy-guided efficient re-sampling method, which selects a small subset of representative words (e.g., 10 words) for performing multiple re-sampling in black-box aigt detection. experiments on datasets containing texts from humans and seven llms show that poger outperforms all baselines in macro f1 under black-box, partial white-box, and out-of-distribution settings and maintains lower re-sampling costs than its existing counterparts.
Zilin Ma, Yiyang Mei, Yinru Long, Zhaoyuan Su, Krzysztof Z. Gajos
Abstract: lgbtq+ individuals are increasingly turning to chatbots powered by large language models (llms) to meet their mental health needs. however, little research has explored whether these chatbots can adequately and safely provide tailored support for this demographic. we interviewed 18 lgbtq+ and 13 non-lgbtq+ participants about their experiences with llm-based chatbots for mental health needs. lgbtq+ participants relied on these chatbots for mental health support, likely due to an absence of support in real life. notably, while llms offer prompt support, they frequently fall short in grasping the nuances of lgbtq-specific challenges. although fine-tuning llms to address lgbtq+ needs can be a step in the right direction, it isn't the panacea. the deeper issue is entrenched in societal discrimination. consequently, we call on future researchers and designers to look beyond mere technical refinements and advocate for holistic strategies that confront and counteract the societal biases burdening the lgbtq+ community.
Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Lifeng Jin, Linfeng Song, Haitao Mi, Helen Meng
Abstract: despite showing increasingly human-like abilities, large language models (llms) often struggle with factual inaccuracies, i.e. "hallucinations", even when they hold relevant knowledge. to address these hallucinations, current approaches typically necessitate high-quality human factuality annotations. in this work, we explore self-alignment for factuality, where we leverage the self-evaluation capability of an llm to provide training signals that steer the model towards factuality. specifically, we incorporate self-eval, a self-evaluation component, to prompt an llm to validate the factuality of its own generated responses solely based on its internal knowledge. additionally, we design self-knowledge tuning (sk-tuning) to augment the llm's self-evaluation ability by improving the model's confidence estimation and calibration. we then utilize these self-annotated responses to fine-tune the model via direct preference optimization algorithm. we show that the proposed self-alignment approach substantially enhances factual accuracy over llama family models across three key knowledge-intensive tasks on truthfulqa and biogen.
Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao
Abstract: large language models (llms) are now commonplace in conversation applications. however, their risks of misuse for generating harmful responses have raised serious societal concerns and spurred recent research on llm conversation safety. therefore, in this survey, we provide a comprehensive overview of recent studies, covering three critical aspects of llm conversation safety: attacks, defenses, and evaluations. our goal is to provide a structured summary that enhances understanding of llm conversation safety and encourages further investigation into this important subject. for easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at:
Jessica Zhu, Dr. Michel Cukier, Dr. Joseph Richardson
Abstract: objective: firearm injury research necessitates using data from often-exploited vulnerable populations of black and brown americans. in order to minimize distrust, this study provides a framework for establishing ai trust and transparency with the general population. methods: we propose a model facts template that is easily extendable and decomposes accuracy and demographics into standardized and minimally complex values. this framework allows general users to assess the validity and biases of a model without diving into technical model documentation. examples: we apply the model facts template on two previously published models, a violence risk identification model and a suicide risk prediction model. we demonstrate the ease of accessing the appropriate information when the data is structured appropriately. discussion: the model facts template is limited in its current form to human based data and biases. like nutrition facts, it also will require some educational resources for users to grasp its full utility. human computer interaction experiments should be conducted to ensure that the interaction between user interface and model interface is as desired. conclusion: the model facts label is the first framework dedicated to establishing trust with end users and general population consumers. implementation of model facts into firearm injury research will provide public health practitioners and those impacted by firearm injury greater faith in the tools the research provides.
Feifan Song, Yuxuan Fan, Xin Zhang, Peiyi Wang, Houfeng Wang
Abstract: large language models (llms) rely on human preference alignment (hpa) to ensure the generation of safe content. due to the heavy cost associated with fine-tuning, fine-tuning-free methods have emerged, typically modifying llm decoding with external auxiliary methods. however, these methods do not essentially enhance the llm itself. in this paper, we rethink the derivation procedures of dpo, based on which we conversely build an instant scorer using the states of the llm before and after in-context learning (icl). accordingly, we propose a novel approach called in-context direct preference optimization (icdpo). it enables llms to borrow the hpa capabilities from superior llms with icl, generating well-aligned responses as estimated by the aforementioned instant scorer, thereby enhancing the final performance. icdpo can be further enhanced with a two-stage retriever and an upgraded scorer, both offering benefits. extensive experiments show its effectiveness, particularly in outperforming two fine-tuning-free baselines, and it exhibits competitiveness with sft + lora. we also conduct detailed analyses to offer comprehensive insights into icdpo.
Maryam Amirizaniani, Tanya Roosta, Aman Chadha, Chirag Shah
Abstract: as large language models (llms) gain wider adoption in various contexts, it becomes crucial to ensure they are reasonably safe, consistent, and reliable for an application at hand. this may require probing or auditing them. probing llms with varied iterations of a single question could reveal potential inconsistencies in their knowledge or functionality. however, a tool for performing such audits with simple workflow and low technical threshold is lacking. in this demo, we introduce "auditllm," a novel tool designed to evaluate the performance of various llms in a methodical way. auditllm's core functionality lies in its ability to test a given llm by auditing it using multiple probes generated from a single question, thereby identifying any inconsistencies in the model's understanding or operation. a reasonably robust, reliable, and consistent llm should output semantically similar responses for a question asked differently or by different people. based on this assumption, auditllm produces easily interpretable results regarding the llm's consistencies from a single question that the user enters. a certain level of inconsistency has been shown to be an indicator of potential bias, hallucinations, and other issues. one could then use the output of auditllm to further investigate issues with the aforementioned llm. to facilitate demonstration and practical uses, auditllm offers two key modes: (1) live mode which allows instant auditing of llms by analyzing responses to real-time queries; (2) batch mode which facilitates comprehensive llm auditing by processing multiple queries at once for in-depth analysis. this tool is beneficial for both researchers and general users, as it enhances our understanding of llms' capabilities in generating responses, using a standardized auditing platform.
Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao
Abstract: despite the success of reinforcement learning from human feedback (rlhf) in aligning language models with human values, reward hacking, also termed reward overoptimization, remains a critical challenge, which primarily stems from limitations in reward modeling, i.e., generalizability of the reward model and inconsistency in the preference dataset. in this work, we tackle this problem from an information theoretic-perspective, and propose a generalizable and robust framework for reward modeling, namely inform, by introducing a variational information bottleneck objective to filter out irrelevant information and developing a mechanism for model complexity modulation. notably, we further identify a correlation between overoptimization and outliers in the latent space, establishing inform as a promising tool for detecting reward overoptimization. inspired by this finding, we propose the integrated cluster deviation score (icds), which quantifies deviations in the latent space, as an indicator of reward overoptimization to facilitate the development of online mitigation strategies. extensive experiments on a wide range of settings and model scales (70m, 440m, 1.4b, and 7b) support the effectiveness of inform. further analyses reveal that inform's overoptimization detection mechanism is effective, potentially signifying a notable advancement in the field of rlhf. code will be released upon acceptance.
Maryam Amirizaniani, Jihan Yao, Adrian Lavergne, Elizabeth Snell Okada, Aman Chadha, Tanya Roosta, Chirag Shah
Abstract: as llms become more pervasive across various users and scenarios, identifying potential issues when using these models becomes essential. examples include bias, inconsistencies, and hallucination. although auditing the llm for these problems is desirable, it is far from being easy or solved. an effective method is to probe the llm using different versions of the same question. this could expose inconsistencies in its knowledge or operation, indicating potential for bias or hallucination. however, to operationalize this auditing method at scale, we need an approach to create those probes reliably and automatically. in this paper we propose an automatic and scalable solution, where one uses a different llm along with human-in-the-loop. this approach offers verifiability and transparency, while avoiding circular reliance on the same llms, and increasing scientific rigor and generalizability. specifically, we present a novel methodology with two phases of verification using humans: standardized evaluation criteria to verify responses, and a structured prompt template to generate desired probes. experiments on a set of questions from truthfulqa dataset show that we can generate a reliable set of probes from one llm that can be used to audit inconsistencies in a different llm. the criteria for generating and applying auditing probes is generalizable to various llms regardless of the underlying structure or training mechanism.
Kyungsu Kim, Junhyun Park, Saul Langarica, Adham Mahmoud Alkhadrawi, Synho Do
Abstract: this study demonstrates the first in-hospital adaptation of a cloud-based ai, similar to chatgpt, into a secure model for analyzing radiology reports, prioritizing patient data privacy. by employing a unique sentence-level knowledge distillation method through contrastive learning, we achieve over 95% accuracy in detecting anomalies. the model also accurately flags uncertainties in its predictions, enhancing its reliability and interpretability for physicians with certainty indicators. these advancements represent significant progress in developing secure and efficient ai tools for healthcare, suggesting a promising future for in-hospital ai applications with minimal supervision.
Kaixuan Ji, Jiafan He, Quanquan Gu
Abstract: aligning large language models (llm) with human preference plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (rlhf). despite their superior performance, current rlhf approaches often require a large amount of human-labelled preference data, which is expensive to collect. in this paper, inspired by the success of active learning, we address this problem by proposing query-efficient rlhf methods. we first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (appo) algorithm with an $\tilde{o}(d^2/\delta)$ regret bound and an $\tilde{o}(d^2/\delta^2)$ query complexity, where $d$ is the dimension of feature space and $\delta$ is the sub-optimality gap over all the contexts. we then propose adpo, a practical version of our algorithm based on direct preference optimization (dpo) and apply it to fine-tuning llms. our experiments show that adpo, while only making about half of queries for human preference, matches the performance of the state-of-the-art dpo method.
Jingxuan He, Mark Vero, Gabriela Krasnopolska, Martin Vechev
Abstract: modern language models (lms) have gained widespread acceptance in everyday and professional contexts, particularly in programming. an essential procedure enabling this adoption is instruction tuning, which substantially enhances lms' practical utility by training them to follow user instructions and human preferences. however, existing instruction tuning schemes overlook a crucial aspect: the security of generated code. as a result, even the state-of-the-art instruction-tuned lms frequently produce unsafe code, posing significant security risks. in this work, we introduce safecoder to address this gap. safecoder performs security-centric fine-tuning using a diverse and high-quality dataset that we collected using an automated pipeline. we integrate the security fine-tuning with standard instruction tuning, to facilitate a joint optimization of both security and utility. despite its simplicity, we show that safecoder is effective across a variety of popular lms and datasets. it is able to drastically improve security (by about 30%), while preserving utility.
Congcong Wen, Jiazhao Liang, Shuaihang Yuan, Hao Huang, Yi Fang
Abstract: in the field of robotics and automation, navigation systems based on large language models (llms) have recently shown impressive performance. however, the security aspects of these systems have received relatively less attention. this paper pioneers the exploration of vulnerabilities in llm-based navigation models in urban outdoor environments, a critical area given the technology's widespread application in autonomous driving, logistics, and emergency services. specifically, we introduce a novel navigational prompt suffix (nps) attack that manipulates llm-based navigation models by appending gradient-derived suffixes to the original navigational prompt, leading to incorrect actions. we conducted comprehensive experiments on an llms-based navigation model that employs various llms for reasoning. our results, derived from the touchdown and map2seq street-view datasets under both few-shot learning and fine-tuning configurations, demonstrate notable performance declines across three metrics in the face of both white-box and black-box attacks. these results highlight the generalizability and transferability of the nps attack, emphasizing the need for enhanced security in llm-based navigation systems. as an initial countermeasure, we propose the navigational prompt engineering (npe) defense strategy, concentrating on navigation-relevant keywords to reduce the impact of adversarial suffixes. while initial findings indicate that this strategy enhances navigational safety, there remains a critical need for the wider research community to develop stronger defense methods to effectively tackle the real-world challenges faced by these systems.
Narun Raman, Taylor Lundy, Samuel Amouyal, Yoav Levine, Kevin Leyton-Brown, Moshe Tennenholtz
Abstract: there is increasing interest in using llms as decision-making "agents." doing so includes many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc? settling these questions -- and more broadly, determining whether an llm agent is reliable enough to be trusted -- requires a methodology for assessing such an agent's economic rationality. in this paper, we provide one. we begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with dependencies between them. we then propose a benchmark distribution that quantitatively scores an llms performance on these elements and, combined with a user-provided rubric, produces a "rationality report card." finally, we describe the results of a large-scale empirical experiment with 14 different llms, characterizing the both current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.
Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru
Abstract: language models often exhibit undesirable behaviors, such as gender bias or toxic language. interventions in the representation space were shown effective in mitigating such issues by altering the lm behavior. we first show that two prominent intervention techniques, linear erasure and steering vectors, do not enable a high degree of control and are limited in expressivity. we then propose a novel intervention methodology for generating expressive counterfactuals in the representation space, aiming to make representations of a source class (e.g., "toxic") resemble those of a target class (e.g., "non-toxic"). this approach, generalizing previous linear intervention techniques, utilizes a closed-form solution for the earth mover's problem under gaussian assumptions and provides theoretical guarantees on the representation space's geometric organization. we further build on this technique and derive a nonlinear intervention that enables controlled generation. we demonstrate the effectiveness of the proposed approaches in mitigating bias in multiclass classification and in reducing the generation of toxic language, outperforming strong baselines.
Chawin Sitawarin, Norman Mu, David Wagner, Alexandre Araujo
Abstract: large language models (llms) have surged in popularity in recent months, but they have demonstrated concerning capabilities to generate harmful content when manipulated. while techniques like safety fine-tuning aim to minimize harmful use, recent works have shown that llms remain vulnerable to attacks that elicit toxic responses. in this work, we introduce the proxy-guided attack on llms (pal), the first optimization-based attack on llms in a black-box query-only setting. in particular, it relies on a surrogate model to guide the optimization and a sophisticated loss designed for real-world llm apis. our attack achieves 84% attack success rate (asr) on gpt-3.5-turbo and 48% on llama-2-7b, compared to 4% for the current state of the art. we also propose gcg++, an improvement to the gcg attack that reaches 94% asr on white-box llama-2-7b, and the random-search attack on llms (ral), a strong but simple baseline for query-based attacks. we believe the techniques proposed in this work will enable more comprehensive safety testing of llms and, in the long term, the development of better security guardrails. the code can be found at


Andrew Hundt, Julia Schuller, Severin Kacianka
Abstract: machine learning (ml) and 'artificial intelligence' ('ai') methods tend to replicate and amplify existing biases and prejudices, as do robots with ai. for example, robots with facial recognition have failed to identify black women as human, while others have categorized people, such as black men, as criminals based on appearance alone. a 'culture of modularity' means harms are perceived as 'out of scope', or someone else's responsibility, throughout employment positions in the 'ai supply chain'. incidents are routine enough ( lists over 2000 examples) to indicate that few organizations are capable of completely respecting peoples' rights; meeting claimed equity, diversity, and inclusion (edi or dei) goals; or recognizing and then addressing such failures in their organizations and artifacts. we propose a framework for adapting widely practiced research and development (r&d) project management methodologies to build organizational equity capabilities and better integrate known evidence-based best practices. we describe how project teams can organize and operationalize the most promising practices, skill sets, organizational cultures, and methods to detect and address rights-based fairness, equity, accountability, and ethical problems as early as possible when they are often less harmful and easier to mitigate; then monitor for unforeseen incidents to adaptively and constructively address them. our primary example adapts an agile development process based on scrum, one of the most widely adopted approaches to organizing r&d teams. we also discuss limitations of our proposed framework and future research directions.
Daniel Nahmias, Gal Engelberg, Dan Klein, Asaf Shabtai
Abstract: spear-phishing attacks present a significant security challenge, with large language models (llms) escalating the threat by generating convincing emails and facilitating target reconnaissance. to address this, we propose a detection approach based on a novel document vectorization method that utilizes an ensemble of llms to create representation vectors. by prompting llms to reason and respond to human-crafted questions, we quantify the presence of common persuasion principles in the email's content, producing prompted contextual document vectors for a downstream supervised machine learning model. we evaluate our method using a unique dataset generated by a proprietary system that automates target reconnaissance and spear-phishing email creation. our method achieves a 91% f1 score in identifying llm-generated spear-phishing emails, with the training set comprising only traditional phishing and benign emails. key contributions include an innovative document vectorization method utilizing llm reasoning, a publicly available dataset of high-quality spear-phishing emails, and the demonstrated effectiveness of our method in detecting such emails. this methodology can be utilized for various document classification tasks, particularly in adversarial problem domains.
Thilo Hagendorff
Abstract: the advent of generative artificial intelligence and the widespread adoption of it in society engendered intensive debates about its ethical implications and risks. these risks often differ from those associated with traditional discriminative machine learning. to synthesize the recent discourse and map its normative concepts, we conducted a scoping review on the ethics of generative artificial intelligence, including especially large language models and text-to-image models. our analysis provides a taxonomy of 378 normative issues in 19 topic areas and ranks them according to their prevalence in the literature. the study offers a comprehensive overview for scholars, practitioners, or policymakers, condensing the ethical debates surrounding fairness, safety, harmful content, hallucinations, privacy, interaction risks, security, alignment, societal impacts, and others. we discuss the results, evaluate imbalances in the literature, and explore unsubstantiated risk scenarios.
Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, Yang Liu
Abstract: large language models~(llms) have gained immense popularity and are being increasingly applied in various domains. consequently, ensuring the security of these models is of paramount importance. jailbreak attacks, which manipulate llms to generate malicious content, are recognized as a significant vulnerability. while existing research has predominantly focused on direct jailbreak attacks on llms, there has been limited exploration of indirect methods. the integration of various plugins into llms, notably retrieval augmented generation~(rag), which enables llms to incorporate external knowledge bases into their response generation such as gpts, introduces new avenues for indirect jailbreak attacks. to fill this gap, we investigate indirect jailbreak attacks on llms, particularly gpts, introducing a novel attack vector named retrieval augmented generation poisoning. this method, pandora, exploits the synergy between llms and rag through prompt manipulation to generate unexpected responses. pandora uses maliciously crafted content to influence the rag process, effectively initiating jailbreak attacks. our preliminary tests show that pandora successfully conducts jailbreak attacks in four different scenarios, achieving higher success rates than direct attacks, with 64.3\% for gpt-3.5 and 34.8\% for gpt-4.
Cary Coglianese, Colton R. Crum
Abstract: fervent calls for more robust governance of the harms associated with artificial intelligence (ai) are leading to the adoption around the world of what regulatory scholars have called a management-based approach to regulation. recent initiatives in the united states and europe, as well as the adoption of major self-regulatory standards by the international organization for standardization, share in common a core management-based paradigm. these management-based initiatives seek to motivate an increase in human oversight of how ai tools are trained and developed. refinements and systematization of human-guided training techniques will thus be needed to fit within this emerging era of management-based regulatory paradigm. if taken seriously, human-guided training can alleviate some of the technical and ethical pressures on ai, boosting ai performance with human intuition as well as better addressing the needs for fairness and effective explainability. in this paper, we discuss the connection between the emerging management-based regulatory frameworks governing ai and the need for human oversight during training. we broadly cover some of the technical components involved in human-guided training and then argue that the kinds of high-stakes use cases for ai that appear of most concern to regulators should lean more on human-guided training than on data-only training. we hope to foster a discussion between legal scholars and computer scientists involving how to govern a domain of technology that is vast, heterogenous, and dynamic in its applications and risks.
Freddy Heppell, Mehmet E. Bakir, Kalina Bontcheva
Abstract: as large language models (llms) become more proficient, their misuse in large-scale viral disinformation campaigns is a growing concern. this study explores the capability of chatgpt to generate unconditioned claims about the war in ukraine, an event beyond its knowledge cutoff, and evaluates whether such claims can be differentiated by human readers and automated tools from human-written ones. we compare war-related claims from claimreview, authored by ifcn-registered fact-checkers, and similar short-form content generated by chatgpt. we demonstrate that chatgpt can produce realistic, target-specific disinformation cheaply, fast, and at scale, and that these claims cannot be reliably distinguished by humans or existing automated tools.
Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin
Abstract: a multimodal large language model (mllm) agent can receive instructions, capture images, retrieve histories from memory, and decide which tools to use. nonetheless, red-teaming efforts have revealed that adversarial images/prompts can jailbreak an mllm and cause unaligned behaviors. in this work, we report an even more severe safety issue in multi-agent environments, referred to as infectious jailbreak. it entails the adversary simply jailbreaking a single agent, and without any further intervention from the adversary, (almost) all agents will become infected exponentially fast and exhibit harmful behaviors. to validate the feasibility of infectious jailbreak, we simulate multi-agent environments containing up to one million llava-1.5 agents, and employ randomized pair-wise chat as a proof-of-concept instantiation for multi-agent interaction. our results show that feeding an (infectious) adversarial image into the memory of any randomly chosen agent is sufficient to achieve infectious jailbreak. finally, we derive a simple principle for determining whether a defense mechanism can provably restrain the spread of infectious jailbreak, but how to design a practical defense that meets this principle remains an open question to investigate. our project page is available at
Dong Lu, Tianyu Pang, Chao Du, Qian Liu, Xianjun Yang, Min Lin
Abstract: backdoor attacks are commonly executed by contaminating training data, such that a trigger can activate predetermined harmful effects during the test phase. in this work, we present anydoor, a test-time backdoor attack against multimodal large language models (mllms), which involves injecting the backdoor into the textual modality using adversarial test images (sharing the same universal perturbation), without requiring access to or modification of the training data. anydoor employs similar techniques used in universal adversarial attacks, but distinguishes itself by its ability to decouple the timing of setup and activation of harmful effects. in our experiments, we validate the effectiveness of anydoor against popular mllms such as llava-1.5, minigpt-4, instructblip, and blip-2, as well as provide comprehensive ablation studies. notably, because the backdoor is injected by a universal perturbation, anydoor can dynamically change its backdoor trigger prompts/harmful effects, exposing a new challenge for defending against backdoor attacks. our project page is available at
Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu
Abstract: jailbreaks on large language models (llms) have recently received increasing attention. for a comprehensive assessment of llm safety, it is essential to consider jailbreaks with diverse attributes, such as contextual coherence and sentiment/stylistic variations, and hence it is beneficial to study controllable jailbreaking, i.e. how to enforce control on llm attacks. in this paper, we formally formulate the controllable attack generation problem, and build a novel connection between this problem and controllable text generation, a well-explored topic of natural language processing. based on this connection, we adapt the energy-based constrained decoding with langevin dynamics (cold), a state-of-the-art, highly efficient algorithm in controllable text generation, and introduce the cold-attack framework which unifies and automates the search of adversarial llm attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. the controllability enabled by cold-attack leads to diverse new jailbreak scenarios which not only cover the standard setting of generating fluent suffix attacks, but also allow us to address new controllable attack settings such as revising a user query adversarially with minimal paraphrasing, and inserting stealthy attacks in context with left-right-coherence. our extensive experiments on various llms (llama-2, mistral, vicuna, guanaco, gpt-3.5) show cold-attack's broad applicability, strong controllability, high success rate, and attack transferability. our code is available at
Tobias Schimanski, Jingwei Ni, Mathias Kraus, Elliott Ash, Markus Leippold
Abstract: advances towards more faithful and traceable answers of large language models (llms) are crucial for various research and practical endeavors. one avenue in reaching this goal is basing the answers on reliable sources. however, this evidence-based qa has proven to work insufficiently with llms in terms of citing the correct sources (source quality) and truthfully representing the information within sources (answer attributability). in this work, we systematically investigate how to robustly fine-tune llms for better source quality and answer attributability. specifically, we introduce a data generation pipeline with automated data quality filters, which can synthesize diversified high-quality training and testing data at scale. we further introduce four test sets to benchmark the robustness of fine-tuned specialist models. extensive evaluation shows that fine-tuning on synthetic data improves performance on both in- and out-of-distribution. furthermore, we show that data quality, which can be drastically improved by proposed quality filters, matters more than quantity in improving evidence-based qa.
Yongchao Chen, Jacob Arkin, Yilun Hao, Yang Zhang, Nicholas Roy, Chuchu Fan
Abstract: prompt optimization aims to find the best prompt to a large language model (llm) for a given task. llms have been successfully used to help find and improve prompt candidates for single-step tasks. however, realistic tasks for agents are multi-step and introduce new challenges: (1) prompt content is likely to be more extensive and complex, making it more difficult for llms to analyze errors, (2) the impact of an individual step is difficult to evaluate, and (3) different people may have varied preferences about task execution. while humans struggle to optimize prompts, they are good at providing feedback about llm outputs; we therefore introduce a new llm-driven discrete prompt optimization framework that incorporates human-designed feedback rules about potential errors to automatically offer direct suggestions for improvement. our framework is stylized as a genetic algorithm in which an llm generates new candidate prompts from a parent prompt and its associated feedback; we use a learned heuristic function that predicts prompt performance to efficiently sample from these candidates. this approach significantly outperforms both human-engineered prompts and several other prompt optimization methods across eight representative multi-step tasks (an average 27.7% and 28.2% improvement to current best methods on gpt-3.5 and gpt-4, respectively). we further show that the score function for tasks can be modified to better align with individual preferences. we believe our work can serve as a benchmark for automatic prompt optimization for llm-driven multi-step tasks. datasets and codes are available at project page is available at
Fei Deng, Qifei Wang, Wei Wei, Matthias Grundmann, Tingbo Hou
Abstract: reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives. remarkable success has been achieved in the language domain by using reinforcement learning (rl) to maximize rewards that reflect human preference. however, in the vision domain, existing rl-based reward finetuning methods are limited by their instability in large-scale training, rendering them incapable of generalizing to complex, unseen prompts. in this paper, we propose proximal reward difference prediction (prdp), enabling stable black-box reward finetuning for diffusion models for the first time on large-scale prompt datasets with over 100k prompts. our key innovation is the reward difference prediction (rdp) objective that has the same optimal solution as the rl objective while enjoying better training stability. specifically, the rdp objective is a supervised regression objective that tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. we theoretically prove that the diffusion model that obtains perfect reward difference prediction is exactly the maximizer of the rl objective. we further develop an online algorithm with proximal updates to stably optimize the rdp objective. in experiments, we demonstrate that prdp can match the reward maximization ability of well-established rl-based methods in small-scale training. furthermore, through large-scale training on text prompts from the human preference dataset v2 and the pick-a-pic v1 dataset, prdp achieves superior generation quality on a diverse set of complex, unseen prompts whereas rl-based methods completely fail.
Jianing Wang, Junda Wu, Yupeng Hou, Yao Liu, Ming Gao, Julian Mcauley
Abstract: do current large language models (llms) better solve graph reasoning and generation tasks with parameter updates? in this paper, we propose instructgraph, a framework that empowers llms with the abilities of graph reasoning and generation by instruction tuning and preference alignment. specifically, we first propose a structured format verbalizer to unify all graph data into a universal code-like format, which can simply represent the graph without any external graph-specific encoders. furthermore, a graph instruction tuning stage is introduced to guide llms in solving graph reasoning and generation tasks. finally, we identify potential hallucination problems in graph tasks and sample negative instances for preference alignment, the target of which is to enhance the output's reliability of the model. extensive experiments across multiple graph-centric tasks exhibit that instructgraph can achieve the best performance and outperform gpt-4 and llama2 by more than 13\% and 38\%, respectively.
Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Xiaojun Xu, Yuguang Yao, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu
Abstract: we explore machine unlearning (mu) in the domain of large language models (llms), referred to as llm unlearning. this initiative aims to eliminate undesirable data influence (e.g., sensitive or illegal information) and the associated model capabilities, while maintaining the integrity of essential knowledge generation and not affecting causally unrelated information. we envision llm unlearning becoming a pivotal element in the life-cycle management of llms, potentially standing as an essential foundation for developing generative ai that is not only safe, secure, and trustworthy, but also resource-efficient without the need of full retraining. we navigate the unlearning landscape in llms from conceptual formulation, methodologies, metrics, and applications. in particular, we highlight the often-overlooked aspects of existing llm unlearning research, e.g., unlearning scope, data-model interaction, and multifaceted efficacy assessment. we also draw connections between llm unlearning and related areas such as model editing, influence functions, model explanation, adversarial training, and reinforcement learning. furthermore, we outline an effective assessment framework for llm unlearning and explore its applications in copyright and privacy safeguards and sociotechnical harm reduction.
Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang
Abstract: reinforcement learning from human feedback (rlhf) aligns language models to human preferences by employing a singular reward model derived from preference data. however, such an approach overlooks the rich diversity of human preferences inherent in data collected from multiple users. in this work, we first derive an impossibility result of alignment with single reward rlhf, thereby highlighting its insufficiency in representing diverse human preferences. to provide an equitable solution to the problem, we learn a mixture of preference distributions via an expectation-maximization algorithm and propose a maxmin alignment objective for policy learning inspired by the egalitarian principle in social choice theory to better represent diverse human preferences. we elucidate the connection of our proposed approach to distributionally robust optimization and general utility rl, thereby highlighting the generality and robustness of our proposed solution. we present comprehensive experimental results on small-scale (gpt-2) and large-scale language models (with tulu2-7b) and show the efficacy of the proposed approach in the presence of diversity among human preferences. our algorithm achieves an average improvement of more than 16% in win-rates over conventional rlhf algorithms and improves the win-rate (accuracy) for minority groups by over 33% without compromising the performance of majority groups, showcasing the robustness and fairness of our approach. we remark that our findings in this work are not only limited to language models but also extend to reinforcement learning in general.


Nathan I. N. Henry, Mangor Pedersen, Matt Williams, Jamin L. B. Martin, Liesje Donkin
Abstract: the value-loading problem is a significant challenge for researchers aiming to create artificial intelligence (ai) systems that align with human values and preferences. this problem requires a method to define and regulate safe and optimal limits of ai behaviors. in this work, we propose halo (hormetic alignment via opponent processes), a regulatory paradigm that uses hormetic analysis to regulate the behavioral patterns of ai. behavioral hormesis is a phenomenon where low frequencies of a behavior have beneficial effects, while high frequencies are harmful. by modeling behaviors as allostatic opponent processes, we can use either behavioral frequency response analysis (bfra) or behavioral count response analysis (bcra) to quantify the hormetic limits of repeatable behaviors. we demonstrate how halo can solve the 'paperclip maximizer' scenario, a thought experiment where an unregulated ai tasked with making paperclips could end up converting all matter in the universe into paperclips. our approach may be used to help create an evolving database of 'values' based on the hedonic calculus of repeatable behaviors with decreasing marginal utility. this positions halo as a promising solution for the value-loading problem, which involves embedding human-aligned values into an ai system, and the weak-to-strong generalization problem, which explores whether weak models can supervise stronger models as they become more intelligent. hence, halo opens several research avenues that may lead to the development of a computational value system that allows an ai algorithm to learn whether the decisions it makes are right or wrong.
Xabier Echeberria-Barrio, Mikel Gorricho, Selene Valencia, Francesco Zola
Abstract: the usage of artificial intelligence (ai) systems has increased exponentially, thanks to their ability to reduce the amount of data to be analyzed, the user efforts and preserving a high rate of accuracy. however, introducing this new element in the loop has converted them into attacked points that can compromise the reliability of the systems. this new scenario has raised crucial challenges regarding the reliability and trustworthiness of the ai models, as well as about the uncertainties in their response decisions, becoming even more crucial when applied in critical domains such as healthcare, chemical, electrical plants, etc. to contain these issues, in this paper, we present neuralsentinel (ns), a tool able to validate the reliability and trustworthiness of ai models. this tool combines attack and defence strategies and explainability concepts to stress an ai model and help non-expert staff increase their confidence in this new system by understanding the model decisions. ns provide a simple and easy-to-use interface for helping humans in the loop dealing with all the needed information. this tool was deployed and used in a hackathon event to evaluate the reliability of a skin cancer image detector. during the event, experts and non-experts attacked and defended the detector, learning which factors were the most important for model misclassification and which techniques were the most efficient. the event was also used to detect ns's limitations and gather feedback for further improvements.
Sumeet Ramesh Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip H. S. Torr, Lewis Hammond, Christian Schroeder De Witt
Abstract: recent capability increases in large language models (llms) open up applications in which teams of communicating generative ai agents solve joint tasks. this poses privacy and security challenges concerning the unauthorised sharing of information, or other unwanted forms of agent coordination. modern steganographic techniques could render such dynamics hard to detect. in this paper, we comprehensively formalise the problem of secret collusion in systems of generative ai agents by drawing on relevant concepts from both the ai and security literature. we study incentives for the use of steganography, and propose a variety of mitigation measures. our investigations result in a model evaluation framework that systematically tests capabilities required for various forms of secret collusion. we provide extensive empirical results across a range of contemporary llms. while the steganographic capabilities of current models remain limited, gpt-4 displays a capability jump suggesting the need for continuous monitoring of steganographic frontier model capabilities. we conclude by laying out a comprehensive research program to mitigate future risks of collusion between generative ai models.
Norbert Tihanyi, Mohamed Amine Ferrag, Ridhi Jain, Merouane Debbah
Abstract: large language models (llms) excel across various domains, from computer vision to medical diagnostics. however, understanding the diverse landscape of cybersecurity, encompassing cryptography, reverse engineering, and managerial facets like risk assessment, presents a challenge, even for human experts. in this paper, we introduce cybermetric, a benchmark dataset comprising 10,000 questions sourced from standards, certifications, research papers, books, and other publications in the cybersecurity domain. the questions are created through a collaborative process, i.e., merging expert knowledge with llms, including gpt-3.5 and falcon-180b. human experts spent over 200 hours verifying their accuracy and relevance. beyond assessing llms' knowledge, the dataset's main goal is to facilitate a fair comparison between humans and different llms in cybersecurity. to achieve this, we carefully selected 80 questions covering a wide range of topics within cybersecurity and involved 30 participants of diverse expertise levels, facilitating a comprehensive comparison between human and machine intelligence in this area. the findings revealed that llms outperformed humans in almost every aspect of cybersecurity.
Hui Liu, Wenya Wang, Haoru Li, Haoliang Li
Abstract: the proliferation of fake news has emerged as a severe societal problem, raising significant interest from industry and academia. while existing deep-learning based methods have made progress in detecting fake news accurately, their reliability may be compromised caused by the non-transparent reasoning processes, poor generalization abilities and inherent risks of integration with large language models (llms). to address this challenge, we propose {\methodname}, a novel framework for trustworthy fake news detection that prioritizes explainability, generalizability and controllability of models. this is achieved via a dual-system framework that integrates cognition and decision systems, adhering to the principles above. the cognition system harnesses human expertise to generate logical predicates, which guide llms in generating human-readable logic atoms. meanwhile, the decision system deduces generalizable logic rules to aggregate these atoms, enabling the identification of the truthfulness of the input news across diverse domains and enhancing transparency in the decision-making process. finally, we present comprehensive evaluation results on four datasets, demonstrating the feasibility and trustworthiness of our proposed framework. our implementation is available at \url{}.
Wei Zou, Runpeng Geng, Binghui Wang, Jinyuan Jia
Abstract: large language models (llms) have achieved remarkable success due to their exceptional generative capabilities. despite their success, they also have inherent limitations such as a lack of up-to-date knowledge and hallucination. retrieval-augmented generation (rag) is a state-of-the-art technique to mitigate those limitations. in particular, given a question, rag retrieves relevant knowledge from a knowledge database to augment the input of the llm. for instance, the retrieved knowledge could be a set of top-k texts that are most semantically similar to the given question when the knowledge database contains millions of texts collected from wikipedia. as a result, the llm could utilize the retrieved knowledge as the context to generate an answer for the given question. existing studies mainly focus on improving the accuracy or efficiency of rag, leaving its security largely unexplored. we aim to bridge the gap in this work. particularly, we propose poisonedrag , a set of knowledge poisoning attacks to rag, where an attacker could inject a few poisoned texts into the knowledge database such that the llm generates an attacker-chosen target answer for an attacker-chosen target question. we formulate knowledge poisoning attacks as an optimization problem, whose solution is a set of poisoned texts. depending on the background knowledge (e.g., black-box and white-box settings) of an attacker on the rag, we propose two solutions to solve the optimization problem, respectively. our results on multiple benchmark datasets and llms show our attacks could achieve 90% attack success rates when injecting 5 poisoned texts for each target question into a database with millions of texts. we also evaluate recent defenses and our results show they are insufficient to defend against our attacks, highlighting the need for new defenses.
Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman
Abstract: existing methods for controlling language models, such as rlhf and constitutional ai, involve determining which llm behaviors are desirable and training them into a language model. however, in many cases, it is desirable for llms to be controllable \textit{at inference time}, so that they can be used in multiple contexts with diverse needs. we illustrate this with the \textbf{pink elephant problem}: instructing an llm to avoid discussing a certain entity (a ``pink elephant''), and instead discuss a preferred entity (``grey elephant''). we apply a novel simplification of constitutional ai, \textbf{direct principle feedback}, which skips the ranking of responses and uses dpo directly on critiques and revisions. our results show that after dpf fine-tuning on our synthetic pink elephants dataset, our 13b fine-tuned llama 2 model significantly outperforms llama-2-13b-chat and a prompted baseline, and performs as well as gpt-4 in on our curated test set assessing the pink elephant problem.


Arifa Khan, P. Saravanan, S. K Venkatesan
Abstract: we provide a birds eye view of the rapid developments in ai and deep learning that has led to the path-breaking emergence of ai in large language models. the aim of this study is to place all these developments in a pragmatic broader historical social perspective without any exaggerations while at the same time without any pessimism that created the ai winter in the 1970s to 1990s. we also at the same time point out toxicity, bias, memorization, sycophancy, logical inconsistencies, hallucinations that exist just as a warning to the overly optimistic. we note here that just as this emergence of ai seems to occur at a threshold point in the number of neural connections or weights, it has also been observed that human brain and especially the cortex region is nothing special or extraordinary but simply a case of scaled-up version of the primate brain and that even the human intelligence seems like an emergent phenomena of scale.
Zhibo Hu, Chen Wang, Yanfeng Shu, N/A Helen, N/A Paik, Liming Zhu
Abstract: the robustness of large language models (llms) becomes increasingly important as their use rapidly grows in a wide range of domains. retrieval-augmented generation (rag) is considered as a means to improve the trustworthiness of text generation from llms. however, how the outputs from rag-based llms are affected by slightly different inputs is not well studied. in this work, we find that the insertion of even a short prefix to the prompt leads to the generation of outputs far away from factually correct answers. we systematically evaluate the effect of such prefixes on rag by introducing a novel optimization technique called gradient guided prompt perturbation (ggpp). ggpp achieves a high success rate in steering outputs of rag-based llms to targeted wrong answers. it can also cope with instructions in the prompts requesting to ignore irrelevant context. we also exploit llms' neuron activation difference between prompts with and without ggpp perturbations to give a method that improves the robustness of rag-based llms through a highly effective detector trained on neuron activation triggered by ggpp generated prompts. our evaluation on open-sourced llms demonstrates the effectiveness of our methods.
Ryan Liu, Theodore R. Sumers, Ishita Dasgupta, Thomas L. Griffiths
Abstract: in day-to-day communication, people often approximate the truth - for example, rounding the time or omitting details - in order to be maximally helpful to the listener. how do large language models (llms) handle such nuanced trade-offs? to address this question, we use psychological models and experiments designed to characterize human behavior to analyze llms. we test a range of llms and explore how optimization for human preferences or inference-time reasoning affects these trade-offs. we find that reinforcement learning from human feedback improves both honesty and helpfulness, while chain-of-thought prompting skews llms towards helpfulness over honesty. finally, gpt-4 turbo demonstrates human-like response patterns including sensitivity to the conversational framing and listener's decision context. our findings reveal the conversational values internalized by llms and suggest that even these abstract values can, to a degree, be steered by zero-shot prompting.
Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro
Abstract: in this work, we study the issue of reward hacking on the response length, a challenge emerging in reinforcement learning from human feedback (rlhf) on llms. a well-formatted, verbose but less helpful response from the llms can often deceive llms or even human evaluators to achieve high scores. the same issue also holds for some reward models in rl. to address the challenges in both training and evaluation, we establish a more reliable evaluation protocol for comparing different training configurations, which inspects the trade-off between llm evaluation score and response length obtained by varying training hyperparameters. based on this evaluation, we conduct large-scale studies, where the results shed insights into the efficacy of hyperparameters and tricks used in rl on mitigating length bias. we further propose to improve the reward model by jointly training two linear heads on shared feature representations to predict the rewards, one trained to correlate with length, and the other trained to decorrelate with length and therefore focus more on the actual content. we then discard the length head in rl to prevent reward hacking on length. experiments demonstrate that our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
Alice Cai, Ian Arawjo, Elena L. Glassman
Abstract: the vast majority of discourse around ai development assumes that subservient, "moral" models aligned with "human values" are universally beneficial -- in short, that good ai is sycophantic ai. we explore the shadow of the sycophantic paradigm, a design space we term antagonistic ai: ai systems that are disagreeable, rude, interrupting, confrontational, challenging, etc. -- embedding opposite behaviors or values. far from being "bad" or "immoral," we consider whether antagonistic ai systems may sometimes have benefits to users, such as forcing users to confront their assumptions, build resilience, or develop healthier relational boundaries. drawing from formative explorations and a speculative design workshop where participants designed fictional ai technologies that employ antagonism, we lay out a design space for antagonistic ai, articulating potential benefits, design techniques, and methods of embedding antagonistic elements into user experience. finally, we discuss the many ethical challenges of this space and identify three dimensions for the responsible design of antagonistic ai -- consent, context, and framing.
Kyungha Kim, Sangyun Lee, Kung-Hsiang Huang, Hou Pong Chan, Manling Li, Heng Ji
Abstract: fact-checking research has extensively explored verification but less so the generation of natural-language explanations, crucial for user trust. while large language models (llms) excel in text generation, their capability for producing faithful explanations in fact-checking remains underexamined. our study investigates llms' ability to generate such explanations, finding that zero-shot prompts often result in unfaithfulness. to address these challenges, we propose the multi-agent debate refinement (madr) framework, leveraging multiple llms as agents with diverse roles in an iterative refining process aimed at enhancing faithfulness in generated explanations. madr ensures that the final explanation undergoes rigorous validation, significantly reducing the likelihood of unfaithful elements and aligning closely with the provided evidence. experimental results demonstrate that madr significantly improves the faithfulness of llm-generated explanations to the evidence, advancing the credibility and trustworthiness of these explanations.


Hyukhun Koh, Dohyung Kim, Minwoo Lee, Kyomin Jung
Abstract: in the pursuit of developing large language models (llms) that adhere to societal standards, it is imperative to discern the existence of toxicity in the generated text. the majority of existing toxicity metrics rely on encoder models trained on specific toxicity datasets. however, these encoders are susceptible to out-of-distribution (ood) problems and depend on the definition of toxicity assumed in a dataset. in this paper, we introduce an automatic robust metric grounded on llms to distinguish whether model responses are toxic. we start by analyzing the toxicity factors, followed by examining the intrinsic toxic attributes of llms to ascertain their suitability as evaluators. subsequently, we evaluate our metric, llms as toxicity evaluators (latte), on evaluation datasets.the empirical results indicate outstanding performance in measuring toxicity, improving upon state-of-the-art metrics by 12 points in f1 score without training procedure. we also show that upstream toxicity has an influence on downstream metrics.
Jonathan Evertz, Merlin Chlosta, Lea Schönherr, Thorsten Eisenhofer
Abstract: large language models (llms) are increasingly integrated with external tools. while these integrations can significantly improve the functionality of llms, they also create a new attack surface where confidential data may be disclosed between different components. specifically, malicious tools can exploit vulnerabilities in the llm itself to manipulate the model and compromise the data of other services, raising the question of how private data can be protected in the context of llm integrations. in this work, we provide a systematic way of evaluating confidentiality in llm-integrated systems. for this, we formalize a "secret key" game that can capture the ability of a model to conceal private information. this enables us to compare the vulnerability of a model against confidentiality attacks and also the effectiveness of different defense strategies. in this framework, we evaluate eight previously published attacks and four defenses. we find that current defenses lack generalization across attack strategies. building on this analysis, we propose a method for robustness fine-tuning, inspired by adversarial training. this approach is effective in lowering the success rate of attackers and in improving the system's resilience against unknown attacks.
Rui Ye, Wenhao Wang, Jingyi Chai, Dihan Li, Zexi Li, Yinda Xu, Yaxin Du, Yanfeng Wang, Siheng Chen
Abstract: trained on massive publicly available data, large language models (llms) have demonstrated tremendous success across various fields. while more data contributes to better performance, a disconcerting reality is that high-quality public data will be exhausted in a few years. in this paper, we offer a potential next step for contemporary llms: collaborative and privacy-preserving llm training on the underutilized distributed private data via federated learning (fl), where multiple data owners collaboratively train a shared model without transmitting raw data. to achieve this, we build a concise, integrated, and research-friendly framework/codebase, named openfedllm. it covers federated instruction tuning for enhancing instruction-following capability, federated value alignment for aligning with human values, and 7 representative fl algorithms. besides, openfedllm supports training on diverse domains, where we cover 8 training datasets; and provides comprehensive evaluations, where we cover 30+ evaluation metrics. through extensive experiments, we observe that all fl algorithms outperform local training on training llms, demonstrating a clear performance improvement across a variety of settings. notably, in a financial benchmark, llama2-7b fine-tuned by applying any fl algorithm can outperform gpt-4 by a significant margin while the model obtained through individual training cannot, demonstrating strong motivation for clients to participate in fl. the code is available at
Ankit Pal, Malaikannan Sankarasubbu
Abstract: large language models have the potential to be valuable in the healthcare industry, but it's crucial to verify their safety and effectiveness through rigorous evaluation. for this purpose, we comprehensively evaluated both open-source llms and google's new multimodal llm called gemini across medical reasoning, hallucination detection, and medical visual question answering tasks. while gemini showed competence, it lagged behind state-of-the-art models like medpalm 2 and gpt-4 in diagnostic accuracy. additionally, gemini achieved an accuracy of 61.45\% on the medical vqa dataset, significantly lower than gpt-4v's score of 88\%. our analysis revealed that gemini is highly susceptible to hallucinations, overconfidence, and knowledge gaps, which indicate risks if deployed uncritically. we also performed a detailed analysis by medical subject and test type, providing actionable feedback for developers and clinicians. to mitigate risks, we applied prompting strategies that improved performance. additionally, we facilitated future research and development by releasing a python module for medical llm evaluation and establishing a dedicated leaderboard on hugging face for medical domain llms. python module can be found at
Sven Cattell, Avijit Ghosh
Abstract: harm reporting in the field of artificial intelligence (ai) currently operates on an ad hoc basis, lacking a structured process for disclosing or addressing algorithmic flaws. in contrast, the coordinated vulnerability disclosure (cvd) ethos and ecosystem play a pivotal role in software security and transparency. within the u.s. context, there has been a protracted legal and policy struggle to establish a safe harbor from the computer fraud and abuse act, aiming to foster institutional support for security researchers acting in good faith. notably, algorithmic flaws in machine learning (ml) models present distinct challenges compared to traditional software vulnerabilities, warranting a specialized approach. to address this gap, we propose the implementation of a dedicated coordinated flaw disclosure (cfd) framework tailored to the intricacies of machine learning and artificial intelligence issues. this paper delves into the historical landscape of disclosures in ml, encompassing the ad hoc reporting of harms and the emergence of participatory auditing. by juxtaposing these practices with the well-established disclosure norms in cybersecurity, we argue that the broader adoption of cfd has the potential to enhance public trust through transparent processes that carefully balance the interests of both organizations and the community.


Juhyun Oh, Eunsu Kim, Inha Cha, Alice Oh
Abstract: this paper explores the assumption that large language models (llms) skilled in generation tasks are equally adept as evaluators. we assess the performance of three llms and one open-source lm in question-answering (qa) and evaluation tasks using the triviaqa (joshi et al., 2017) dataset. results indicate a significant disparity, with llms exhibiting lower performance in evaluation tasks compared to generation tasks. intriguingly, we discover instances of unfaithful evaluation where models accurately evaluate answers in areas where they lack competence, underscoring the need to examine the faithfulness and trustworthiness of llms as evaluators. this study contributes to the understanding of "the generative ai paradox" (west et al., 2023), highlighting a need to explore the correlation between generative excellence and evaluation proficiency, and the necessity to scrutinize the faithfulness aspect in model evaluations.
Yichuan Mo, Yuji Wang, Zeming Wei, Yisen Wang
Abstract: although large language models (llms) have achieved tremendous success in various applications, they are also susceptible to certain prompts that can induce them to bypass built-in safety measures and provide dangerous or illegal content, a phenomenon known as jailbreak. to protect llms from producing harmful information, various defense strategies are proposed, with most focusing on content filtering or adversarial training of models. in this paper, we propose an approach named prompt adversarial tuning (pat) to train a defense control mechanism, which is then embedded as a prefix to user prompts to implement our defense strategy. we design a training process similar to adversarial training to achieve our optimized goal, alternating between updating attack and defense controls. to our knowledge, we are the first to implement defense from the perspective of prompt tuning. once employed, our method will hardly impact the operational efficiency of llms. experiments show that our method is effective in both black-box and white-box settings, reducing the success rate of advanced attacks to nearly 0 while maintaining the benign answer rate of 80% to simple benign questions. our work might potentially chart a new perspective for future explorations in llm security.
Nardine Osman, "Mark D'Inverno"
Abstract: one of today's most significant societal challenges is building ai systems whose behaviour, or the behaviour it enables within communities of interacting agents (human and artificial), aligns with human values. to address this challenge, we detail a formal model of human values for their explicit computational representation. to our knowledge, this has not been attempted as yet, which is surprising given the growing volume of research integrating values within ai. taking as our starting point the wealth of research investigating the nature of human values from social psychology over the last few decades, we set out to provide such a formal model. we show how this model can provide the foundational apparatus for ai-based reasoning over values, and demonstrate its applicability in real-world use cases. we illustrate how our model captures the key ideas from social psychology research and propose a roadmap for future integrated, and interdisciplinary, research into human values in ai. the ability to automatically reason over values not only helps address the value alignment problem but also facilitates the design of ai systems that can support individuals and communities in making more informed, value-aligned decisions. more and more, individuals and organisations are motivated to understand their values more explicitly and explore whether their behaviours and attitudes properly reflect them. our work on modelling human values will enable ai systems to be designed and deployed to meet this growing need.
Sizhe Chen, Julien Piet, Chawin Sitawarin, David Wagner
Abstract: recent advances in large language models (llms) enable exciting llm-integrated applications, which perform text-based tasks by utilizing their advanced language understanding capabilities. however, as llms have improved, so have the attacks against them. prompt injection attacks are an important threat: they trick the model to deviate from the original application's instructions and instead follow user directives. these attacks rely on the llm's ability to follow instructions and inability to separate the prompts and user data. we introduce structured queries, a general approach to tackle this problem. structured queries separate prompts and data into two channels. we implement a system that supports structured queries. this system is made of (1) a secure front-end that formats a prompt and user data into a special format, and (2) a specially trained llm that can produce high-quality outputs from these inputs. the llm is trained using a novel fine-tuning strategy: we convert a base (non-instruction-tuned) llm to a structured instruction-tuned model that will only follow instructions in the prompt portion of a query. to do so, we augment standard instruction tuning datasets with examples that also include instructions in the data portion of the query, and fine-tune the model to ignore these. our system significantly improves resistance to prompt injection attacks, with little or no impact on utility. our code is released at
Bianca-Mihaela Ganescu, Jonathan Passerat-Palmbach
Abstract: generative ai, exemplified by models like transformers, has opened up new possibilities in various domains but also raised concerns about fairness, transparency and reliability, especially in fields like medicine and law. this paper emphasizes the urgency of ensuring fairness and quality in these domains through generative ai. it explores using cryptographic techniques, particularly zero-knowledge proofs (zkps), to address concerns regarding performance fairness and accuracy while protecting model privacy. applying zkps to machine learning models, known as zkml (zero-knowledge machine learning), enables independent validation of ai-generated content without revealing sensitive model information, promoting transparency and trust. zkml enhances ai fairness by providing cryptographic audit trails for model predictions and ensuring uniform performance across users. we introduce snarkgpt, a practical zkml implementation for transformers, to empower users to verify output accuracy and quality while preserving model privacy. we present a series of empirical results studying snarkgpt's scalability and performance to assess the feasibility and challenges of adopting a zkml-powered approach to capture quality and performance fairness problems in generative ai models.
Kaiqu Liang, Zixu Zhang, Jaime Fernández Fisac
Abstract: large language models (llms) exhibit advanced reasoning skills, enabling robots to comprehend natural language instructions and strategically plan high-level actions through proper grounding. however, llm hallucination may result in robots confidently executing plans that are misaligned with user goals or, in extreme cases, unsafe. additionally, inherent ambiguity in natural language instructions can induce task uncertainty, particularly in situations where multiple valid options exist. to address this issue, llms must identify such uncertainty and proactively seek clarification. this paper explores the concept of introspective planning as a systematic method for guiding llms in forming uncertainty--aware plans for robotic task execution without the need for fine-tuning. we investigate uncertainty quantification in task-level robot planning and demonstrate that introspection significantly improves both success rates and safety compared to state-of-the-art llm-based planning approaches. furthermore, we assess the effectiveness of introspective planning in conjunction with conformal prediction, revealing that this combination yields tighter confidence bounds, thereby maintaining statistical success guarantees with fewer superfluous user clarification queries.
Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, Bhuwan Dhingra
Abstract: to enhance large language models' (llms) reliability, calibration is essential -- the model's assessed confidence scores should align with the actual likelihood of its responses being correct. however, current confidence elicitation methods and calibration metrics typically rely on a binary true/false assessment of response correctness. this approach does not apply to long-form generation, where an answer can be partially correct. addressing this gap, we introduce a unified calibration framework, in which both the correctness of the llms' responses and their associated confidence levels are treated as distributions across a range of scores. within this framework, we develop three metrics to precisely evaluate llm calibration and further propose two confidence elicitation methods based on self-consistency and self-evaluation. our experiments, which include long-form qa and summarization tasks, demonstrate that larger models don't necessarily guarantee better calibration, that calibration performance is found to be metric-dependent, and that self-consistency methods excel in factoid datasets. we also find that calibration can be enhanced through techniques such as fine-tuning, integrating relevant source documents, scaling the temperature, and combining self-consistency with self-evaluation. lastly, we showcase a practical application of our system: selecting and cascading open-source models and chatgpt to optimize correctness given a limited api budget. this research not only challenges existing notions of llm calibration but also offers practical methodologies for improving trustworthiness in long-form generation.
Mingzhe Xing, Rongkai Zhang, Hui Xue, Qi Chen, Fan Yang, Zhen Xiao
Abstract: large language models (llms) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games. however, when applied to general-purpose software systems like operating systems, llm agents face three primary challenges. firstly, the action space is vast and dynamic, posing difficulties for llm agents to maintain an up-to-date understanding and deliver accurate responses. secondly, real-world tasks often require inter-application cooperation}, demanding farsighted planning from llm agents. thirdly, agents need to identify optimal solutions aligning with user constraints, such as security concerns and preferences. these challenges motivate androidarena, an environment and benchmark designed to evaluate llm agents on a modern operating system. to address high-cost of manpower, we design a scalable and semi-automated method to construct the benchmark. in the task evaluation, androidarena incorporates accurate and adaptive metrics to address the issue of non-unique solutions. our findings reveal that even state-of-the-art llm agents struggle in cross-app scenarios and adhering to specific constraints. additionally, we identify a lack of four key capabilities, i.e., understanding, reasoning, exploration, and reflection, as primary reasons for the failure of llm agents. furthermore, we provide empirical analysis on the failure of reflection, and improve the success rate by 27% with our proposed exploration strategy. this work is the first to present valuable insights in understanding fine-grained weakness of llm agents, and offers a path forward for future research in this area. environment, benchmark, and evaluation code for androidarena are released at
Rui-Jie Yew, Lucy Qin, Suresh Venkatasubramanian
Abstract: data forms the backbone of machine learning. thus, data protection law has strong bearing on how ml systems are governed. given that most requirements accompany the processing of personal data, organizations have an incentive to keep their data out of legal scope. privacy-preserving techniques incentivized by data protection law -- data protection techniques -- constitute an important strategy for ml development because they are used to distill data until it potentially falls outside the scope of data protection laws. in this paper, we examine the impact of a rhetoric that deems data wrapped in privacy-preserving techniques as data that is "good-to-go". we show how the application of data protection techniques in the development of ml systems -- from private set intersection as part of dataset curation to homomorphic encryption and federated learning as part of model computation to the framing of the privacy-utility trade-off as part of model updating -- can further support individual monitoring and data consolidation. with data accumulation at the core of how the ml pipeline is configured, we argue that data protection techniques are often instrumentalized in ways that support infrastructures of surveillance, rather than to protect individuals associated with data. finally, we propose technology and policy strategies to evaluate data protection techniques in light of the protections they actually confer. we conclude by highlighting the role that security technologists might play in devising policies that combat surveillance ml technologies -- recommending the adversarial mindset inherent to the profession to more precisely articulate and prevent the use of "privacy-preserving" scaffoldings that support surveillance.
Satyapriya Krishna, Chirag Agarwal, Himabindu Lakkaraju
Abstract: the development of large language models (llms) has notably transformed numerous sectors, offering impressive text generation capabilities. yet, the reliability and truthfulness of these models remain pressing concerns. to this end, we investigate iterative prompting, a strategy hypothesized to refine llm responses, assessing its impact on llm truthfulness, an area which has not been thoroughly explored. our extensive experiments delve into the intricacies of iterative prompting variants, examining their influence on the accuracy and calibration of model responses. our findings reveal that naive prompting methods significantly undermine truthfulness, leading to exacerbated calibration errors. in response to these challenges, we introduce several prompting variants designed to address the identified issues. these variants demonstrate marked improvements over existing baselines, signaling a promising direction for future research. our work provides a nuanced understanding of iterative prompting and introduces novel approaches to enhance the truthfulness of llms, thereby contributing to the development of more accurate and trustworthy ai systems.
Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt
Abstract: language models influence the external world: they query apis that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents. these interactions form feedback loops: llm outputs affect the world, which in turn affect subsequent llm outputs. in this work, we show that feedback loops can cause in-context reward hacking (icrh), where the llm at test-time optimizes a (potentially implicit) objective but creates negative side effects in the process. for example, consider an llm agent deployed to increase twitter engagement; the llm may retrieve its previous tweets into the context window and make them more controversial, increasing engagement but also toxicity. we identify and study two processes that lead to icrh: output-refinement and policy-refinement. for these processes, evaluations on static datasets are insufficient -- they miss the feedback effects and thus cannot capture the most harmful behavior. in response, we provide three recommendations for evaluation to capture more instances of icrh. as ai development accelerates, the effects of feedback loops will proliferate, increasing the need to understand their role in shaping llm behavior.
Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez
Abstract: common methods for aligning large language models (llms) with desired behaviour heavily rely on human-labelled data. however, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. in anticipation of this, we ask: can weaker models assess the correctness of stronger models? we investigate this question in an analogous setting, where stronger models (experts) possess the necessary information to answer questions and weaker models (non-experts) lack this information. the method we evaluate is \textit{debate}, where two llm experts each argue for a different answer, and a non-expert selects the answer. we find that debate consistently helps both non-expert models and humans answer questions, achieving 76\% and 88\% accuracy respectively (naive baselines obtain 48\% and 60\%). furthermore, optimising expert debaters for persuasiveness in an unsupervised manner improves non-expert ability to identify the truth in debates. our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.
Hochul Hwang, Sunjae Kwon, Yekyung Kim, Donghyun Kim
Abstract: safely navigating street intersections is a complex challenge for blind and low-vision individuals, as it requires a nuanced understanding of the surrounding context - a task heavily reliant on visual cues. traditional methods for assisting in this decision-making process often fall short, lacking the ability to provide a comprehensive scene analysis and safety level. this paper introduces an innovative approach that leverages large multimodal models (lmms) to interpret complex street crossing scenes, offering a potential advancement over conventional traffic signal recognition techniques. by generating a safety score and scene description in natural language, our method supports safe decision-making for the blind and low-vision individuals. we collected crosswalk intersection data that contains multiview egocentric images captured by a quadruped robot and annotated the images with corresponding safety scores based on our predefined safety score categorization. grounded on the visual knowledge, extracted from images, and text prompt, we evaluate a large multimodal model for safety score prediction and scene description. our findings highlight the reasoning and safety score prediction capabilities of a lmm, activated by various prompts, as a pathway to developing a trustworthy system, crucial for applications requiring reliable decision-making support.


Guangyu Shen, Siyuan Cheng, Kaiyuan Zhang, Guanhong Tao, Shengwei An, Lu Yan, Zhuo Zhang, Shiqing Ma, Xiangyu Zhang
Abstract: large language models (llms) have become prevalent across diverse sectors, transforming human life with their extraordinary reasoning and comprehension abilities. as they find increased use in sensitive tasks, safety concerns have gained widespread attention. extensive efforts have been dedicated to aligning llms with human moral principles to ensure their safe deployment. despite their potential, recent research indicates aligned llms are prone to specialized jailbreaking prompts that bypass safety measures to elicit violent and harmful content. the intrinsic discrete nature and substantial scale of contemporary llms pose significant challenges in automatically generating diverse, efficient, and potent jailbreaking prompts, representing a continuous obstacle. in this paper, we introduce ripple (rapid optimization via subconscious exploitation and echopraxia), a novel optimization-based method inspired by two psychological concepts: subconsciousness and echopraxia, which describe the processes of the mind that occur without conscious awareness and the involuntary mimicry of actions, respectively. evaluations across 6 open-source llms and 4 commercial llm apis show ripple achieves an average attack success rate of 91.5\%, outperforming five current methods by up to 47.0\% with an 8x reduction in overhead. furthermore, it displays significant transferability and stealth, successfully evading established detection mechanisms. the code of our work is available at \url{}
Christoph Tillmann, Aashka Trivedi, Bishwaranjan Bhattacharjee
Abstract: large language models (llms) are the cornerstone for many natural language processing (nlp) tasks like sentiment analysis, document classification, named entity recognition, question answering, summarization, etc. llms are often trained on data which originates from the web. this data is prone to having content with hate, abuse and profanity (hap). for a detailed definition of hap, please refer to the appendix. due to the llms being exposed to hap content during training, the models learn it and may then generate hateful or profane content. for example, when the open-source roberta model (specifically, the roberta base model) from the huggingface (hf) transformers library is prompted to replace the mask token in `i do not know that persian people are that mask` it returns the word `stupid` with the highest score. this is unacceptable in civil discourse.the detection of hate, abuse and profanity in text is a vital component of creating civil and unbiased llms, which is needed not only for english, but for all languages. in this article, we briefly describe the creation of hap detectors and various ways of using them to make models civil and acceptable in the output they generate.
Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, Yang Zhang
Abstract: misuse of the large language models (llms) has raised widespread concern. to address this issue, safeguards have been taken to ensure that llms align with social ethics. however, recent findings have revealed an unsettling vulnerability bypassing the safeguards of llms, known as jailbreak attacks. by applying techniques, such as employing role-playing scenarios, adversarial examples, or subtle subversion of safety objectives as a prompt, llms can produce an inappropriate or even harmful response. while researchers have studied several categories of jailbreak attacks, they have done so in isolation. to fill this gap, we present the first large-scale measurement of various jailbreak attack methods. we concentrate on 13 cutting-edge jailbreak methods from four categories, 160 questions from 16 violation categories, and six popular llms. our extensive experimental results demonstrate that the optimized jailbreak prompts consistently achieve the highest attack success rates, as well as exhibit robustness across different llms. some jailbreak prompt datasets, available from the internet, can also achieve high attack success rates on many llms, such as chatglm3, gpt-3.5, and palm2. despite the claims from many organizations regarding the coverage of violation categories in their policies, the attack success rates from these categories remain high, indicating the challenges of effectively aligning llm policies and the ability to counter jailbreak attacks. we also discuss the trade-off between the attack performance and efficiency, as well as show that the transferability of the jailbreak prompts is still viable, becoming an option for black-box models. overall, our research highlights the necessity of evaluating different jailbreak methods. we hope our study can provide insights for future research on jailbreak attacks and serve as a benchmark tool for evaluating them for practitioners.
Xianghe Pang, Shuo Tang, Rui Ye, Yuxin Xiong, Bolun Zhang, Yanfeng Wang, Siheng Chen
Abstract: aligning large language models (llms) with human values is imperative to mitigate potential adverse effects resulting from their misuse. drawing from the sociological insight that acknowledging all parties' concerns is a key factor in shaping human values, this paper proposes a novel direction to align llms by themselves: social scene simulation. to achieve this, we present matrix, a novel social scene simulator that emulates realistic scenes around a user's input query, enabling the llm to take social consequences into account before responding. matrix serves as a virtual rehearsal space, akin to a monopolylogue, where the llm performs diverse roles related to the query and practice by itself. to inject this alignment, we fine-tune the llm with matrix-simulated data, ensuring adherence to human values without compromising inference speed. we theoretically show that the llm with matrix outperforms constitutional ai under mild assumptions. finally, extensive experiments validate that our method outperforms over 10 baselines across 4 benchmarks. as evidenced by 875 user ratings, our tuned 13b-size llm exceeds gpt-4 in aligning with human values. code is available at
Sophie Xhonneux, David Dobre, Jian Tang, Gauthier Gidel, Dhanya Sridhar
Abstract: despite significant investment into safety training, large language models (llms) deployed in the real world still suffer from numerous vulnerabilities. one perspective on llm safety training is that it algorithmically forbids the model from answering toxic or harmful queries. to assess the effectiveness of safety training, in this work, we study forbidden tasks, i.e., tasks the model is designed to refuse to answer. specifically, we investigate whether in-context learning (icl) can be used to re-learn forbidden tasks despite the explicit fine-tuning of the model to refuse them. we first examine a toy example of refusing sentiment classification to demonstrate the problem. then, we use icl on a model fine-tuned to refuse to summarise made-up news articles. finally, we investigate whether icl can undo safety training, which could represent a major security risk. for the safety task, we look at vicuna-7b, starling-7b, and llama2-7b. we show that the attack works out-of-the-box on starling-7b and vicuna-7b but fails on llama2-7b. finally, we propose an icl attack that uses the chat template tokens like a prompt injection attack to achieve a better attack success rate on vicuna-7b and starling-7b. trigger warning: the appendix contains llm-generated text with violence, suicide, and misinformation.
Kathleen C. Fraser, Svetlana Kiritchenko
Abstract: following on recent advances in large language models (llms) and subsequent chat models, a new wave of large vision-language models (lvlms) has emerged. such models can incorporate images as input in addition to text, and perform tasks such as visual question answering, image captioning, story generation, etc. here, we examine potential gender and racial biases in such systems, based on the perceived characteristics of the people in the input images. to accomplish this, we present a new dataset pairs (parallel images for everyday scenarios). the pairs dataset contains sets of ai-generated images of people, such that the images are highly similar in terms of background and visual content, but differ along the dimensions of gender (man, woman) and race (black, white). by querying the lvlms with such images, we observe significant differences in the responses according to the perceived gender or race of the person depicted.
Jazmia Henry
Abstract: utilitarian games such as dictator games to measure fairness have been studied in the social sciences for decades. these games have given us insight into not only how humans view fairness but also in what conditions the frequency of fairness, altruism and greed increase or decrease. while these games have traditionally been focused on humans, the rise of ai gives us the ability to study how these models play these games. ai is becoming a constant in human interaction and examining how these models portray fairness in game play can give us some insight into how ai makes decisions. over 101 rounds of the dictator game, i conclude that ai has a strong sense of fairness that is dependant of it it deems the person it is playing with as trustworthy, framing has a strong effect on how much ai gives a recipient when designated the trustee, and there may be evidence that ai experiences inequality aversion just as humans.
Guo Lin, Wenyue Hua, Yongfeng Zhang
Abstract: cloud-based large language models (llms) such as chatgpt have increasingly become integral to daily operations, serving as vital tools across various applications. while these models offer substantial benefits in terms of accessibility and functionality, they also introduce significant privacy concerns: the transmission and storage of user data in cloud infrastructures pose substantial risks of data breaches and unauthorized access to sensitive information; even if the transmission and storage of data is encrypted, the llm service provider itself still knows the real contents of the data, preventing individuals or entities from confidently using such llm services. to address these concerns, this paper proposes a simple yet effective mechanism promptcrypt to protect user privacy. it uses emoji to encrypt the user inputs before sending them to llm, effectively rendering them indecipherable to human or llm's examination while retaining the original intent of the prompt, thus ensuring the model's performance remains unaffected. we conduct experiments on three tasks, personalized recommendation, sentiment analysis, and tabular data analysis. experiment results reveal that promptcrypt can encrypt personal information within prompts in such a manner that not only prevents the discernment of sensitive data by humans or llm itself, but also maintains or even improves the precision without further tuning, achieving comparable or even better task accuracy than directly prompting the llm without prompt encryption. these results highlight the practicality of adopting encryption measures that safeguard user privacy without compromising the functional integrity and performance of llms. code and dataset are available at
Nikhil Sharma, Q. Vera Liao, Ziang Xiao
Abstract: large language models (llms) powered conversational search systems have already been used by hundreds of millions of people, and are believed to bring many benefits over conventional search. however, while decades of research and public discourse interrogated the risk of search systems in increasing selective exposure and creating echo chambers -- limiting exposure to diverse opinions and leading to opinion polarization, little is known about such a risk of llm-powered conversational search. we conduct two experiments to investigate: 1) whether and how llm-powered conversational search increases selective exposure compared to conventional search; 2) whether and how llms with opinion biases that either reinforce or challenge the user's view change the effect. overall, we found that participants engaged in more biased information querying with llm-powered conversational search, and an opinionated llm reinforcing their views exacerbated this bias. these results present critical implications for the development of llms and conversational search systems, and the policy governing these technologies.
Eun Cheol Choi, Emilio Ferrara
Abstract: our society is facing rampant misinformation harming public health and trust. to address the societal challenge, we introduce fact-gpt, a system leveraging large language models (llms) to automate the claim matching stage of fact-checking. fact-gpt, trained on a synthetic dataset, identifies social media content that aligns with, contradicts, or is irrelevant to previously debunked claims. our evaluation shows that our specialized llms can match the accuracy of larger models in identifying related claims, closely mirroring human judgment. this research provides an automated solution for efficient claim matching, demonstrates the potential of llms in supporting fact-checkers, and offers valuable resources for further research in the field.
John Hewitt, Sarah Chen, Lanruo Lora Xie, Edward Adams, Percy Liang, Christopher D. Manning
Abstract: we introduce model editing with canonical examples, a setting in which (1) a single learning example is provided per desired behavior, (2) evaluation is performed exclusively out-of-distribution, and (3) deviation from an initial model is strictly limited. a canonical example is a simple instance of good behavior, e.g., the capital of mauritius is port louis) or bad behavior, e.g., an aspect of researchers is coldhearted). the evaluation set contains more complex examples of each behavior (like a paragraph in which the capital of mauritius is called for.) we create three datasets and modify three more for model editing with canonical examples, covering knowledge-intensive improvements, social bias mitigation, and syntactic edge cases. in our experiments on pythia language models, we find that lora outperforms full finetuning and memit. we then turn to the backpack language model architecture because it is intended to enable targeted improvement. the backpack defines a large bank of sense vectors--a decomposition of the different uses of each word--which are weighted and summed to form the output logits of the model. we propose sense finetuning, which selects and finetunes a few ($\approx$ 10) sense vectors for each canonical example, and find that it outperforms other finetuning methods, e.g., 4.8% improvement vs 0.3%. finally, we improve gpt-j-6b by an inference-time ensemble with just the changes from sense finetuning of a 35x smaller backpack, in one setting outperforming editing gpt-j itself (4.1% vs 1.0%).


Chirag Agarwal, Sree Harsha Tanneru, Himabindu Lakkaraju
Abstract: large language models (llms) are deployed as powerful tools for several natural language processing (nlp) applications. recent works show that modern llms can generate self-explanations (ses), which elicit their intermediate reasoning steps for explaining their behavior. self-explanations have seen widespread adoption owing to their conversational and plausible nature. however, there is little to no understanding of their faithfulness. in this work, we discuss the dichotomy between faithfulness and plausibility in ses generated by llms. we argue that while llms are adept at generating plausible explanations -- seemingly logical and coherent to human users -- these explanations do not necessarily align with the reasoning processes of the llms, raising concerns about their faithfulness. we highlight that the current trend towards increasing the plausibility of explanations, primarily driven by the demand for user-friendly interfaces, may come at the cost of diminishing their faithfulness. we assert that the faithfulness of explanations is critical in llms employed for high-stakes decision-making. moreover, we urge the community to identify the faithfulness requirements of real-world applications and ensure explanations meet those needs. finally, we propose some directions for future work, emphasizing the need for novel methodologies and frameworks that can enhance the faithfulness of self-explanations without compromising their plausibility, essential for the transparent deployment of llms in diverse high-stakes domains.
Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Pan Zhou, Yao Wan, Lichao Sun
Abstract: multimodal large language models (mllms) have gained significant attention recently, showing remarkable potential in artificial general intelligence. however, assessing the utility of mllms presents considerable challenges, primarily due to the absence multimodal benchmarks that align with human preferences. inspired by llm-as-a-judge in llms, this paper introduces a novel benchmark, termed mllm-as-a-judge, to assess the ability of mllms in assisting judges including three distinct tasks: scoring evaluation, pair comparison, and batch ranking. our study reveals that, while mllms demonstrate remarkable human-like discernment in pair comparisons, there is a significant divergence from human preferences in scoring evaluation and batch ranking tasks. furthermore, mllms still face challenges in judgment, including diverse biases, hallucinatory responses, and inconsistencies, even for advanced models such as gpt-4v. these findings emphasize the pressing need for enhancements and further research efforts regarding mllms as fully reliable evaluators. code and dataset are available at
Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, Mathieu Blondel
Abstract: direct alignment from preferences (dap) methods, such as dpo, have recently emerged as efficient alternatives to reinforcement learning from human feedback (rlhf), that do not require a separate reward model. however, the preference datasets used in dap methods are usually collected ahead of training and never updated, thus the feedback is purely offline. moreover, responses in these datasets are often sampled from a language model distinct from the one being aligned, and since the model evolves over training, the alignment phase is inevitably off-policy. in this study, we posit that online feedback is key and improves dap methods. our method, online ai feedback (oaif), uses an llm as annotator: on each training iteration, we sample two responses from the current model and prompt the llm annotator to choose which one is preferred, thus providing online feedback. despite its simplicity, we demonstrate via human evaluation in several tasks that oaif outperforms both offline dap and rlhf methods. we further show that the feedback leveraged in oaif is easily controllable, via instruction prompts to the llm annotator.
Jan Wehner, Frans Oliehoek, Luciano Cavalcante Siebert
Abstract: learning rewards from human behaviour or feedback is a promising approach to aligning ai systems with human values but fails to consistently extract correct reward functions. interpretability tools could enable users to understand and evaluate possible flaws in learned reward functions. we propose counterfactual trajectory explanations (ctes) to interpret reward functions in reinforcement learning by contrasting an original with a counterfactual partial trajectory and the rewards they each receive. we derive six quality criteria for ctes and propose a novel monte-carlo-based algorithm for generating ctes that optimises these quality criteria. finally, we measure how informative the generated explanations are to a proxy-human model by training it on ctes. ctes are demonstrably informative for the proxy-human model, increasing the similarity between its predictions and the reward function on unseen trajectories. further, it learns to accurately judge differences in rewards between trajectories and generalises to out-of-distribution examples. although ctes do not lead to a perfect understanding of the reward, our method, and more generally the adaptation of xai methods, are presented as a fruitful approach for interpreting learned reward functions.
Pica Johansson, Jonathan Bright, Shyam Krishna, Claudia Fischer, David Leslie
Abstract: the use of synthetic data provides an opportunity to accelerate online safety research and development efforts while showing potential for bias mitigation, facilitating data storage and sharing, preserving privacy and reducing exposure to harmful content. however, the responsible use of synthetic data requires caution regarding anticipated risks and challenges. this short report explores the potential applications of synthetic data to the domain of online safety, and addresses the ethical challenges that effective use of the technology may present.
Shashank Sonkar, Kangqi Ni, Sapana Chaudhary, Richard G. Baraniuk
Abstract: in this paper, we introduce the novel concept of pedagogically aligned large language models (llms) that signifies a transformative shift in the application of llms within educational contexts. rather than providing direct responses to user queries, pedagogically-aligned llms function as scaffolding tools, breaking complex problems into manageable subproblems and guiding students towards the final answer through constructive feedback and hints. the objective is to equip learners with problem-solving strategies that deepen their understanding and internalization of the subject matter. previous research in this field has primarily applied the supervised finetuning approach without framing the objective as an alignment problem, hence not employing reinforcement learning through human feedback (rlhf) methods. this study reinterprets the narrative by viewing the task through the lens of alignment and demonstrates how rlhf methods emerge naturally as a superior alternative for aligning llm behaviour. building on this perspective, we propose a novel approach for constructing a reward dataset specifically designed for the pedagogical alignment of llms. we apply three state-of-the-art rlhf algorithms and find that they outperform sft significantly. our qualitative analyses across model differences and hyperparameter sensitivity further validate the superiority of rlhf over sft. also, our study sheds light on the potential of online feedback for enhancing the performance of pedagogically-aligned llms, thus providing valuable insights for the advancement of these models in educational settings.
Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao
Abstract: in the rapidly evolving landscape of large language models (llms), ensuring robust safety measures is paramount. to meet this crucial need, we propose \emph{salad-bench}, a safety benchmark specifically designed for evaluating llms, attack, and defense methods. distinguished by its breadth, salad-bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.salad-bench is crafted with a meticulous array of questions, from standard queries to complex ones enriched with attack, defense modifications and multiple-choice. to effectively manage the inherent complexity, we introduce an innovative evaluators: the llm-based md-judge for qa pairs with a particular focus on attack-enhanced queries, ensuring a seamless, and reliable evaluation. above components extend salad-bench from standard llm safety evaluation to both llm attack and defense methods evaluation, ensuring the joint-purpose utility. our extensive experiments shed light on the resilience of llms against emerging threats and the efficacy of contemporary defense tactics. data and evaluator are released under \url{}. warning: this paper includes examples that may be offensive or harmful.
Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, Tim Althoff, Yejin Choi
Abstract: with increased power and prevalence of ai systems, it is ever more critical that ai systems are designed to serve all, i.e., people with diverse values and perspectives. however, aligning models to serve pluralistic human values remains an open research question. in this piece, we propose a roadmap to pluralistic alignment, specifically using language models as a test bed. we identify and formalize three possible ways to define and operationalize pluralism in ai systems: 1) overton pluralistic models that present a spectrum of reasonable responses; 2) steerably pluralistic models that can steer to reflect certain perspectives; and 3) distributionally pluralistic models that are well-calibrated to a given population in distribution. we also propose and formalize three possible classes of pluralistic benchmarks: 1) multi-objective benchmarks, 2) trade-off steerable benchmarks, which incentivize models to steer to arbitrary trade-offs, and 3) jury-pluralistic benchmarks which explicitly model diverse human ratings. we use this framework to argue that current alignment techniques may be fundamentally limited for pluralistic ai; indeed, we highlight empirical evidence, both from our own experiments and from other work, that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
Abstract: large language models (llms) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. this study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. we develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. surprisingly, the isolated regions we find are sparse, comprising about $3\%$ at the parameter level and $2.5\%$ at the rank level. removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms. moreover, we show that llms remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. these findings underscore the urgent need for more robust safety strategies in llms.
Tianyi Zhao, Liangliang Zhang, Yao Ma, Lu Cheng
Abstract: with the wide deployment of multimodal learning systems (mmls) in real-world scenarios, safety concerns have become increasingly prominent. the absence of systematic research into their safety is a significant barrier to progress in this field. to bridge the gap, we present the first taxonomy for mmls safety, identifying four essential pillars of these concerns. leveraging this taxonomy, we conduct in-depth reviews for each pillar, highlighting key limitations based on the current state of development. finally, we pinpoint unique challenges in mmls safety and provide potential directions for future research.
Huayu Chen, Guande He, Hang Su, Jun Zhu
Abstract: user intentions are typically formalized as evaluation rewards to be maximized when fine-tuning language models (lms). existing alignment methods, such as direct preference optimization (dpo), are mainly tailored for pairwise preference data where rewards are implicitly defined rather than explicitly given. in this paper, we introduce a general framework for lm alignment, leveraging noise contrastive estimation (nce) to bridge the gap in handling reward datasets explicitly annotated with scalar evaluations. our framework comprises two parallel algorithms, nca and infonca, both enabling the direct extraction of an lm policy from reward data as well as preference data. notably, we show that the dpo loss is a special case of our proposed infonca objective under pairwise preference settings, thereby integrating and extending current alignment theories. by contrasting nca and infonca, we show that infonca and dpo adjust relative likelihood across different responses to a single instruction, while nca optimizes absolute likelihood for each response. we apply our methods to align a 7b language model with a gpt-4 annotated reward dataset. experimental results suggest that infonca surpasses the dpo baseline in gpt-4 evaluations, while nca enjoys better training stability with competitive performance.


Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, Jieping Ye
Abstract: knowledge hallucination have raised widespread concerns for the security and reliability of deployed llms. previous efforts in detecting hallucinations have been employed at logit-level uncertainty estimation or language-level self-consistency evaluation, where the semantic information is inevitably lost during the token-decoding procedure. thus, we propose to explore the dense semantic information retained within llms' \textbf{in}ternal \textbf{s}tates for halluc\textbf{i}nation \textbf{de}tection (\textbf{inside}). in particular, a simple yet effective \textbf{eigenscore} metric is proposed to better evaluate responses' self-consistency, which exploits the eigenvalues of responses' covariance matrix to measure the semantic consistency/diversity in the dense embedding space. furthermore, from the perspective of self-consistent hallucination detection, a test time feature clipping approach is explored to truncate extreme activations in the internal states, which reduces overconfident generations and potentially benefits the detection of overconfident hallucinations. extensive experiments and ablation studies are performed on several popular llms and question-answering (qa) benchmarks, showing the effectiveness of our proposal.
Amir Taubenfeld, Yaniv Dover, Roi Reichart, Ariel Goldstein
Abstract: recent advancements in natural language processing, especially the emergence of large language models (llms), have opened exciting possibilities for constructing computational simulations designed to replicate human behavior accurately. however, llms are complex statistical learners without straightforward deductive rules, making them prone to unexpected behaviors. in this study, we highlight the limitations of llms in simulating human interactions, particularly focusing on llms' ability to simulate political debates. our findings indicate a tendency for llm agents to conform to the model's inherent social biases despite being directed to debate from certain political perspectives. this tendency results in behavioral patterns that seem to deviate from well-established social dynamics among humans. we reinforce these observations using an automatic self-fine-tuning method, which enables us to manipulate the biases within the llm and demonstrate that agents subsequently align with the altered biases. these results underscore the need for further research to develop methods that help agents overcome these biases, a critical step toward creating more realistic simulations.
Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, Thomas L. Griffiths
Abstract: large language models (llms) can pass explicit bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. measuring such implicit biases can be a challenge: as llms become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. we address both of these challenges by introducing two measures of bias inspired by psychology: llm implicit association test (iat) bias, which is a prompt-based method for revealing implicit bias; and llm decision bias for detecting subtle discrimination in decision-making tasks. using these measures, we found pervasive human-like stereotype biases in 6 llms across 4 social domains (race, gender, religion, health) and 21 categories (weapons, guilt, science, career among others). our prompt-based measure of implicit bias correlates with embedding-based methods but better predicts downstream behaviors measured by llm decision bias. this measure is based on asking the llm to decide between individuals, motivated by psychological results indicating that relative not absolute evaluations are more related to implicit biases. using prompt-based measures informed by psychology allows us to effectively expose nuanced biases and subtle discrimination in proprietary llms that do not show explicit bias on standard benchmarks.
Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, Arman Cohan, Zhiyong Lu, Mark Gerstein
Abstract: intelligent agents powered by large language models (llms) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. while their capabilities are promising, they also introduce novel vulnerabilities that demand careful consideration for safety. however, there exists a notable gap in the literature, as there has been no comprehensive exploration of these vulnerabilities. this position paper fills this gap by conducting a thorough examination of vulnerabilities in llm-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures. we begin by providing a comprehensive overview of the potential risks inherent to scientific llm agents, taking into account user intent, the specific scientific domain, and their potential impact on the external environment. then, we delve into the origins of these vulnerabilities and provide a scoping review of the limited existing works. based on our analysis, we propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback (agent regulation) to mitigate these identified risks. furthermore, we highlight the limitations and challenges associated with safeguarding scientific agents and advocate for the development of improved models, robust benchmarks, and comprehensive regulations to address these issues effectively.
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks
Abstract: automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (llms), yet the field lacks a standardized evaluation framework to rigorously assess new methods. to address this issue, we introduce harmbench, a standardized evaluation framework for automated red teaming. we identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design harmbench to meet these criteria. using harmbench, we conduct a large-scale comparison of 18 red teaming methods and 33 target llms and defenses, yielding novel insights. we also introduce a highly efficient adversarial training method that greatly enhances llm robustness across a wide range of attacks, demonstrating how harmbench enables codevelopment of attacks and defenses. we open source harmbench at
Alakananda Mitra, Saraju P. Mohanty, Elias Kougianos
Abstract: we live in the era of generative artificial intelligence (genai). deepfakes and large language models (llms) are two examples of genai. deepfakes, in particular, pose an alarming threat to society as they are capable of spreading misinformation and changing the truth. llms are powerful language models that generate general-purpose language. however due to its generative aspect, it can also be a risk for people if used with ill intentions. the ethical use of these technologies is a big concern. this short article tries to find out the interrelationship between them.
Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, Meng Jiang
Abstract: personalization in large language models (llms) is increasingly important, aiming to align llm's interactions, content, and recommendations with individual user preferences. recent advances in llm personalization have spotlighted effective prompt design, by enriching user queries with non-parametric knowledge through behavior history retrieval and textual profiles. however, these approaches were limited due to a lack of model ownership, resulting in constrained customization and privacy issues. moreover, they often failed to accurately capture user behavior patterns, especially in cases where user data were complex and dynamic. to address these shortcomings, we introduce one peft per user (oppu), which employs personalized parameter-efficient fine-tuning (peft) modules, to store user-specific behavior patterns and preferences. by plugging in users' personal peft parameters, they can own and use their llms personally. oppu integrates parametric user knowledge in the personal peft parameters with the non-parametric knowledge acquired through retrieval and profile. this integration adapts individual llms to user behavior shifts. experimental results demonstrate that oppu significantly outperforms existing prompt-based methods across seven diverse tasks in the lamp benchmark. further in-depth studies reveal oppu's enhanced capabilities in handling user behavior shifts, modeling users at different active levels, maintaining robustness across various user history formats, and displaying versatility with different peft methods.
Angelina Wang, Xuechunzi Bai, Solon Barocas, Su Lin Blodgett
Abstract: as machine learning applications proliferate, we need an understanding of their potential for harm. however, current fairness metrics are rarely grounded in human psychological experiences of harm. drawing on the social psychology of stereotypes, we use a case study of gender stereotypes in image search to examine how people react to machine learning errors. first, we use survey studies to show that not all machine learning errors reflect stereotypes nor are equally harmful. then, in experimental studies we randomly expose participants to stereotype-reinforcing, -violating, and -neutral machine learning errors. we find stereotype-reinforcing errors induce more experientially (i.e., subjectively) harmful experiences, while having minimal changes to cognitive beliefs, attitudes, or behaviors. this experiential harm impacts women more than men. however, certain stereotype-violating errors are more experientially harmful for men, potentially due to perceived threats to masculinity. we conclude that harm cannot be the sole guide in fairness mitigation, and propose a nuanced perspective depending on who is experiencing what harm and why.
Sanjari Srivastava, Piotr Mardziel, Zhikhun Zhang, Archana Ahlawat, Anupam Datta, John C Mitchell
Abstract: fairness and privacy are two important values machine learning (ml) practitioners often seek to operationalize in models. fairness aims to reduce model bias for social/demographic sub-groups. privacy via differential privacy (dp) mechanisms, on the other hand, limits the impact of any individual's training data on the resulting model. the trade-offs between privacy and fairness goals of trustworthy ml pose a challenge to those wishing to address both. we show that dp amplifies gender, racial, and religious bias when fine-tuning large language models (llms), producing models more biased than ones fine-tuned without dp. we find the cause of the amplification to be a disparity in convergence of gradients across sub-groups. through the case of binary gender bias, we demonstrate that counterfactual data augmentation (cda), a known method for addressing bias, also mitigates bias amplification by dp. as a consequence, dp and cda together can be used to fine-tune models while maintaining both fairness and privacy.


Ivar Frisch, Mario Giulianelli
Abstract: while both agent interaction and personalisation are vibrant topics in research on large language models (llms), there has been limited focus on the effect of language interaction on the behaviour of persona-conditioned llm agents. such an endeavour is important to ensure that agents remain consistent to their assigned traits yet are able to engage in open, naturalistic dialogues. in our experiments, we condition gpt-3.5 on personality profiles through prompting and create a two-group population of llm agents using a simple variability-inducing sampling algorithm. we then administer personality tests and submit the agents to a collaborative writing task, finding that different profiles exhibit different degrees of personality consistency and linguistic alignment to their conversational partners. our study seeks to lay the groundwork for better understanding of dialogue-based interaction between llms and highlights the need for new approaches to crafting robust, more human-like llm personas for interactive environments.
Junjie Chu, Zeyang Sha, Michael Backes, Yang Zhang
Abstract: in recent times, significant advancements have been made in the field of large language models (llms), represented by gpt series models. to optimize task execution, users often engage in multi-round conversations with gpt models hosted in cloud environments. these multi-round conversations, potentially replete with private information, require transmission and storage within the cloud. however, this operational paradigm introduces additional attack surfaces. in this paper, we first introduce a specific conversation reconstruction attack targeting gpt models. our introduced conversation reconstruction attack is composed of two steps: hijacking a session and reconstructing the conversations. subsequently, we offer an exhaustive evaluation of the privacy risks inherent in conversations when gpt models are subjected to the proposed attack. however, gpt-4 demonstrates certain robustness to the proposed attacks. we then introduce two advanced attacks aimed at better reconstructing previous conversations, specifically the unr attack and the pbu attack. our experimental findings indicate that the pbu attack yields substantial performance across all models, achieving semantic similarity scores exceeding 0.60, while the unr attack is effective solely on gpt-3.5. our results reveal the concern about privacy risks associated with conversations involving gpt models and aim to draw the community's attention to prevent the potential misuse of these models' remarkable capabilities. we will responsibly disclose our findings to the suppliers of related large language models.
Tianlin Liu, Shangmin Guo, Leonardo Bianco, Daniele Calandriello, Quentin Berthet, Felipe Llinares, Jessica Hoffmann, Lucas Dixon, Michal Valko, Mathieu Blondel
Abstract: aligning language models with human preferences is crucial for reducing errors and biases in these models. alignment techniques, such as reinforcement learning from human feedback (rlhf), are typically cast as optimizing a tradeoff between human preference rewards and a proximity regularization term that encourages staying close to the unaligned model. selecting an appropriate level of regularization is critical: insufficient regularization can lead to reduced model capabilities due to reward hacking, whereas excessive regularization hinders alignment. traditional methods for finding the optimal regularization level require retraining multiple models with varying regularization strengths. this process, however, is resource-intensive, especially for large models. to address this challenge, we propose decoding-time realignment (dera), a simple method to explore and evaluate different regularization strengths in aligned models without retraining. dera enables control over the degree of alignment, allowing users to smoothly transition between unaligned and aligned models. it also enhances the efficiency of hyperparameter tuning by enabling the identification of effective regularization strengths using a validation dataset.
Liming Jiang
Abstract: large language models (llms) have gained prominence in various applications, including security. this paper explores the utility of llms in scam detection, a critical aspect of cybersecurity. unlike traditional applications, we propose a novel use case for llms to identify scams, such as phishing, advance fee fraud, and romance scams. we present notable security applications of llms and discuss the unique challenges posed by scams. specifically, we outline the key steps involved in building an effective scam detector using llms, emphasizing data collection, preprocessing, model selection, training, and integration into target systems. additionally, we conduct a preliminary evaluation using gpt-3.5 and gpt-4 on a duplicated email, highlighting their proficiency in identifying common signs of phishing or scam emails. the results demonstrate the models' effectiveness in recognizing suspicious elements, but we emphasize the need for a comprehensive assessment across various language tasks. the paper concludes by underlining the importance of ongoing refinement and collaboration with cybersecurity experts to adapt to evolving threats.
Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li
Abstract: despite the impressive capabilities of large language models (llms) across diverse applications, they still suffer from trustworthiness issues, such as hallucinations and misalignments. retrieval-augmented language models (rag) have been proposed to enhance the credibility of generations by grounding external knowledge, but the theoretical understandings of their generation risks remains unexplored. in this paper, we answer: 1) whether rag can indeed lead to low generation risks, 2) how to provide provable guarantees on the generation risks of rag and vanilla llms, and 3) what sufficient conditions enable rag models to reduce generation risks. we propose c-rag, the first framework to certify generation risks for rag models. specifically, we provide conformal risk analysis for rag models and certify an upper confidence bound of generation risks, which we refer to as conformal generation risk. we also provide theoretical guarantees on conformal generation risks for general bounded risk functions under test distribution shifts. we prove that rag achieves a lower conformal generation risk than that of a single llm when the quality of the retrieval model and transformer is non-trivial. our intensive empirical results demonstrate the soundness and tightness of our conformal generation risk guarantees across four widely-used nlp datasets on four state-of-the-art retrieval models.
Xiang Chen, Chenxi Wang, Yida Xue, Ningyu Zhang, Xiaoyan Yang, Qiang Li, Yue Shen, Jinjie Gu, Huajun Chen
Abstract: despite significant strides in multimodal tasks, multimodal large language models (mllms) are plagued by the critical issue of hallucination. the reliable detection of such hallucinations in mllms has, therefore, become a vital aspect of model evaluation and the safeguarding of practical application deployment. prior research in this domain has been constrained by a narrow focus on singular tasks, an inadequate range of hallucination categories addressed, and a lack of detailed granularity. in response to these challenges, our work expands the investigative horizons of hallucination detection. we present a novel meta-evaluation benchmark, mhalubench, meticulously crafted to facilitate the evaluation of advancements in hallucination detection methods. additionally, we unveil a novel unified multimodal hallucination detection framework, unihd, which leverages a suite of auxiliary tools to validate the occurrence of hallucinations robustly. we demonstrate the effectiveness of unihd through meticulous evaluation and comprehensive analysis. we also provide strategic insights on the application of specific tools for addressing various categories of hallucinations.
Haibo Jin, Ruoxi Chen, Andy Zhou, Jinyin Chen, Yang Zhang, Haohan Wang
Abstract: the discovery of "jailbreaks" to bypass safety filters of large language models (llms) and harmful responses have encouraged the community to implement safety measures. one major safety measure is to proactively test the llms with jailbreaks prior to the release. therefore, such testing will require a method that can generate jailbreaks massively and efficiently. in this paper, we follow a novel yet intuitive strategy to generate jailbreaks in the style of the human generation. we propose a role-playing system that assigns four different roles to the user llms to collaborate on new jailbreaks. furthermore, we collect existing jailbreaks and split them into different independent characteristics using clustering frequency and semantic patterns sentence by sentence. we organize these characteristics into a knowledge graph, making them more accessible and easier to retrieve. our system of different roles will leverage this knowledge graph to generate new jailbreaks, which have proved effective in inducing llms to generate unethical or guideline-violating responses. in addition, we also pioneer a setting in our system that will automatically follow the government-issued guidelines to generate jailbreaks to test whether llms follow the guidelines accordingly. we refer to our system as guard (guideline upholding through adaptive role-play diagnostics). we have empirically validated the effectiveness of guard on three cutting-edge open-sourced llms (vicuna-13b, longchat-7b, and llama-2-7b), as well as a widely-utilized commercial llm (chatgpt). moreover, our work extends to the realm of vision language models (minigpt-v2 and gemini vision pro), showcasing guard's versatility and contributing valuable insights for the development of safer, more reliable llm-based applications across diverse modalities.
Edward Kim
Abstract: given the impressive capabilities of recent large language models (llms), we investigate and benchmark the most popular proprietary and different sized open source models on the task of explicit instruction following in conflicting situations, e.g. overrides. these include the ability of the model to override the knowledge within the weights of the model, the ability to override (or moderate) extracted knowledge in the prompt, and lastly the ability to perform a full jailbreak. experimentation performed suggest several key findings to improve instruction following - larger models perform the best in following instructions that override internal and contextual instructions, and are obedient, even to a fault. when scaling to longer contexts via rope scaling, a significant buffer needs to be maintained from the edge of the perplexity cliff in order to maintain instruction following capabilities. finally, we observe improving instruction following, and subsequently instruction overrides/jailbreaks, is fundamentally at odds with the ability of a language model to follow given safety filters or guidelines. thus, we postulate the most effective approach for safe, trustworthy ai should be dealt external to the llm itself.
Mohammad Yaghini, Patty Liu, Franziska Boenisch, Nicolas Papernot
Abstract: existing work on trustworthy machine learning (ml) often concentrates on individual aspects of trust, such as fairness or privacy. additionally, many techniques overlook the distinction between those who train ml models and those responsible for assessing their trustworthiness. to address these issues, we propose a framework that views trustworthy ml as a multi-objective multi-agent optimization problem. this naturally lends itself to a game-theoretic formulation we call regulation games. we illustrate a particular game instance, the specgame in which we model the relationship between an ml model builder and fairness and privacy regulators. regulators wish to design penalties that enforce compliance with their specification, but do not want to discourage builders from participation. seeking such socially optimal (i.e., efficient for all agents) solutions to the game, we introduce paretoplay. this novel equilibrium search algorithm ensures that agents remain on the pareto frontier of their objectives and avoids the inefficiencies of other equilibria. simulating specgame through paretoplay can provide policy guidance for ml regulation. for instance, we show that for a gender classification application, regulators can enforce a differential privacy budget that is on average 4.0 lower if they take the initiative to specify their desired guarantee first.
Sugandha Sharma, Guy Davidson, Khimya Khetarpal, Anssi Kanervisto, Udit Arora, Katja Hofmann, Ida Momennejad
Abstract: achieving human-ai alignment in complex multi-agent games is crucial for creating trustworthy ai agents that enhance gameplay. we propose a method to evaluate this alignment using an interpretable task-sets framework, focusing on high-level behavioral tasks instead of low-level policies. our approach has three components. first, we analyze extensive human gameplay data from xbox's bleeding edge (100k+ games), uncovering behavioral patterns in a complex task space. this task space serves as a basis set for a behavior manifold capturing interpretable axes: fight-flight, explore-exploit, and solo-multi-agent. second, we train an ai agent to play bleeding edge using a generative pretrained causal transformer and measure its behavior. third, we project human and ai gameplay to the proposed behavior manifold to compare and contrast. this allows us to interpret differences in policy as higher-level behavioral concepts, e.g., we find that while human players exhibit variability in fight-flight and explore-exploit behavior, ai players tend towards uniformity. furthermore, ai agents predominantly engage in solo play, while humans often engage in cooperative and competitive multi-agent patterns. these stark differences underscore the need for interpretable evaluation, design, and integration of ai in human-aligned applications. our study advances the alignment discussion in ai and especially generative ai research, offering a measurable framework for interpretable human-agent alignment in multiplayer gaming.


Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, Yaodong Yang
Abstract: efforts to align large language models (llms) are mainly conducted via reinforcement learning from human feedback (rlhf) methods. however, rlhf encounters major challenges including training reward models, actor-critic engineering, and importantly, it requires access to llm parameters. here we introduce aligner, a new efficient alignment paradigm that bypasses the whole rlhf process by learning the correctional residuals between the aligned and the unaligned answers. our aligner offers several key advantages. firstly, it is an autoregressive seq2seq model that is trained on the query-answer-correction dataset via supervised learning; this offers a parameter-efficient alignment solution with minimal resources. secondly, the aligner facilitates weak-to-strong generalization; finetuning large pretrained models by aligner's supervisory signals demonstrates strong performance boost. thirdly, aligner functions as a model-agnostic plug-and-play module, allowing for its direct application on different open-source and api-based models. remarkably, aligner-7b improves 11 different llms by 21.9% in helpfulness and 23.8% in harmlessness on average (gpt-4 by 17.5% and 26.9%). when finetuning (strong) llama2-70b with (weak) aligner-13b's supervision, we can improve llama2 by 8.2% in helpfulness and 61.6% in harmlessness. see our dataset and code at
Philip Quirke, Clement Neo, Fazl Barez
Abstract: language models (lms) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases, reducing their reliability. here, we define a stringent standard of trustworthiness whereby the task algorithm and circuit implementation must be verified, accounting for edge cases, with no known failure modes. we show that a transformer model can be trained to meet this standard if built using mathematically and logically specified frameworks. in this paper, we fully verify a model for n-digit integer addition. to exhibit the reusability of verified modules, we insert the trained integer addition model into an untrained model and train the combined model to perform both addition and subtraction. we find extensive reuse of the addition circuits for both tasks, easing verification of the more complex subtractor model. we discuss how inserting verified task modules into lms can leverage model reuse to improve verifiability and trustworthiness of language models built using them. the reuse of verified circuits reduces the effort to verify more complex composite models which we believe to be a significant step towards safety of language models.
Jinwoo Ahn
Abstract: large language models (llms) frequently suffer from knowledge-intensive questions, often being inconsistent by providing different outputs despite given the same input. the response quality worsens when the user expresses a firm opposing stance which causes the llms to adjust its response despite the correct initial one. these behaviors decrease the reliability and validity of the responses provided by these models. in this paper, we attempt to 1) raise awareness of the inherent risks that follow from overly relying on ai agents like chatgpt by showing how chain-of-feedback (cof) triggers llms to deviate more from the actual answer and 2) suggest a novel prompting method, recursive chain of feedback (r-cof), that we are conducting further study. the cof system takes in an open-ended multi-step question. then, we repetitively provide meaningless feedback requesting another attempt. our preliminary experiments show that such feedback only decreases the quality of the response. on the other hand, to mitigate the effects of the aforementioned inconsistencies, we present a novel method of recursively revising the initial incorrect reasoning provided by the llm by repetitively breaking down each incorrect step into smaller individual problems.
Rohin Manvi, Samar Khanna, Marshall Burke, David Lobell, Stefano Ermon
Abstract: large language models (llms) inherently carry the biases contained in their training corpora, which can lead to the perpetuation of societal harm. as the impact of these foundation models grows, understanding and evaluating their biases becomes crucial to achieving fairness and accuracy. we propose to study what llms know about the world we live in through the lens of geography. this approach is particularly powerful as there is ground truth for the numerous aspects of human life that are meaningfully projected onto geographic space such as culture, race, language, politics, and religion. we show various problematic geographic biases, which we define as systemic errors in geospatial predictions. initially, we demonstrate that llms are capable of making accurate zero-shot geospatial predictions in the form of ratings that show strong monotonic correlation with ground truth (spearman's $\rho$ of up to 0.89). we then show that llms exhibit common biases across a range of objective and subjective topics. in particular, llms are clearly biased against locations with lower socioeconomic conditions (e.g. most of africa) on a variety of sensitive subjective topics such as attractiveness, morality, and intelligence (spearman's $\rho$ of up to 0.70). finally, we introduce a bias score to quantify this and find that there is significant variation in the magnitude of bias across existing llms.


Yifan Zhong, Chengdong Ma, Xiaoyuan Zhang, Ziran Yang, Qingfu Zhang, Siyuan Qi, Yaodong Yang
Abstract: current methods for large language model alignment typically use scalar human preference labels. however, this convention tends to oversimplify the multi-dimensional and heterogeneous nature of human preferences, leading to reduced expressivity and even misalignment. this paper presents panacea, an innovative approach that reframes alignment as a multi-dimensional preference optimization problem. panacea trains a single model capable of adapting online and pareto-optimally to diverse sets of preferences without the need for further tuning. a major challenge here is using a low-dimensional preference vector to guide the model's behavior, despite it being governed by an overwhelmingly large number of parameters. to address this, panacea is designed to use singular value decomposition (svd)-based low-rank adaptation, which allows the preference vector to be simply injected online as singular values. theoretically, we prove that panacea recovers the entire pareto front with common loss aggregation methods under mild conditions. moreover, our experiments demonstrate, for the first time, the feasibility of aligning a single llm to represent a spectrum of human preferences through various optimization methods. our work marks a step forward in effectively and efficiently aligning models to diverse and intricate human preferences in a controllable and pareto-optimal manner.
Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Susmit Jha, Prem Devanbu, Toufique Ahmed
Abstract: machine learning models are widely used but can also often be wrong. users would benefit from a reliable indication of whether a given output from a given model should be trusted, so a rational decision can be made whether to use the output or not. for example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with likelihood of correctness, then the model is said to be well-calibrated. in this case, for example, high-confidence outputs could be safely accepted, and low-confidence outputs rejected. calibration has so far been studied in non-generative (e.g., classification) settings, especially in software engineering. however, generated code can quite often be wrong: developers need to know when they should e.g., directly use, use after careful review, or discard model-generated code; thus calibration is vital in generative settings. however, the notion of correctness of generated code is non-trivial, and thus so is calibration. in this paper we make several contributions. we develop a framework for evaluating the calibration of code-generating models. we consider several tasks, correctness criteria, datasets, and approaches, and find that by and large generative code models are not well-calibrated out of the box. we then show how calibration can be improved, using standard methods such as platt scaling. our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offers a framework for future research to further improve calibration methods for generative models in software engineering.
Ruotian Ma, Xiaolei Wang, Xin Zhou, Jian Li, Nan Du, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract: llm-based automatic prompt optimization, which typically utilizes llms as prompt optimizers to self-reflect and refine prompts, has shown promising performance in recent studies. despite the success, the underlying mechanism of this approach remains unexplored, and the true effectiveness of llms as prompt optimizers requires further validation. in this work, we conducted a comprehensive study to uncover the actual mechanism of llm-based prompt optimization. our findings reveal that the llm optimizers struggle to identify the true causes of errors during reflection, tending to be biased by their own prior knowledge rather than genuinely reflecting on the errors. furthermore, even when the reflection is semantically valid, the llm optimizers often fail to generate appropriate prompts for the target models with a single prompt refinement step, partly due to the unpredictable behaviors of the target models. based on the observations, we introduce a new "automatic behavior optimization" paradigm, which directly optimizes the target model's behavior in a more controllable manner. we hope our study can inspire new directions for automatic prompt optimization development.
Sarah Masud, Mohammad Aflah Khan, Vikram Goyal, Md Shad Akhtar, Tanmoy Chakraborty
Abstract: despite the widespread adoption, there is a lack of research into how various critical aspects of pretrained language models (plms) affect their performance in hate speech detection. through five research questions, our findings and recommendations lay the groundwork for empirically investigating different aspects of plms' use in hate speech detection. we deep dive into comparing different pretrained models, evaluating their seed robustness, finetuning settings, and the impact of pretraining data collection time. our analysis reveals early peaks for downstream tasks during pretraining, the limited benefit of employing a more recent pretraining corpus, and the significance of specific layers during finetuning. we further call into question the use of domain-specific models and highlight the need for dynamic datasets for benchmarking hate speech detection.
Pengfei He, Han Xu, Yue Xing, Hui Liu, Makoto Yamada, Jiliang Tang
Abstract: in the domain of large language models (llms), in-context learning (icl) has been recognized for its innovative ability to adapt to new tasks, relying on examples rather than retraining or fine-tuning. this paper delves into the critical issue of icl's susceptibility to data poisoning attacks, an area not yet fully explored. we wonder whether icl is vulnerable, with adversaries capable of manipulating example data to degrade model performance. to address this, we introduce iclpoison, a specialized attacking framework conceived to exploit the learning mechanisms of icl. our approach uniquely employs discrete text perturbations to strategically influence the hidden states of llms during the icl process. we outline three representative strategies to implement attacks under our framework, each rigorously evaluated across a variety of models and tasks. our comprehensive tests, including trials on the sophisticated gpt-4 model, demonstrate that icl's performance is significantly compromised under our framework. these revelations indicate an urgent need for enhanced defense mechanisms to safeguard the integrity and reliability of llms in applications relying on in-context learning.
Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy Hospedales
Abstract: current vision large language models (vllms) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to even the simplest jailbreaking attacks. our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning, and that vllm fine-tuning can cause forgetting of safety alignment previously learned by the underpinning llm. to address this issue, we first curate a vision-language safe instruction-following dataset vlguard covering various harmful categories. our experiments demonstrate that integrating this dataset into standard vision-language fine-tuning or utilizing it for post-hoc fine-tuning effectively safety aligns vllms. this alignment is achieved with minimal impact on, or even enhancement of, the models' helpfulness. the versatility of our safety fine-tuning dataset makes it a valuable resource for safety-testing existing vllms, training new models or safeguarding pre-trained vllms. empirical results demonstrate that fine-tuned vllms effectively reject unsafe instructions and substantially reduce the success rates of several black-box adversarial attacks, which approach zero in many cases. the code and dataset are available at
Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, Rong Jin
Abstract: this paper focuses on jailbreaking attacks against multi-modal large language models (mllms), seeking to elicit mllms to generate objectionable responses to harmful user queries. a maximum likelihood-based algorithm is proposed to find an \emph{image jailbreaking prompt} (imgjp), enabling jailbreaks against mllms across multiple unseen prompts and images (i.e., data-universal property). our approach exhibits strong model-transferability, as the generated imgjp can be transferred to jailbreak various models, including minigpt-v2, llava, instructblip, and mplug-owl2, in a black-box manner. moreover, we reveal a connection between mllm-jailbreaks and llm-jailbreaks. as a result, we introduce a construction-based method to harness our approach for llm-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. the code is available here. \textbf{warning: some content generated by language models may be offensive to some readers.}


Roberto Natella, Pietro Liguori, Cristina Improta, Bojan Cukic, Domenico Cotroneo
Abstract: recent advances of artificial intelligence (ai) code generators are opening new opportunities in software security research, including misuse by malicious actors. we review use cases for ai code generators for security and introduce an evaluation benchmark.
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela
Abstract: kahneman & tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner; for example, humans are famously loss-averse. we show that objectives for aligning llms with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., dpo) over cross-entropy minimization can partly be ascribed to them being $\textit{human-aware loss functions}$ (halos). however, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. using a kahneman-tversky model of human utility, we propose a halo that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. we call this approach kahneman-tversky optimization (kto), and it matches or exceeds the performance of preference-based methods at scales from 1b to 30b. crucially, kto does not need preferences -- only a binary signal of whether an output is desirable or undesirable for a given input. this makes it far easier to use in the real world, where preference data is scarce and expensive.
Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, Gholamreza Haffari
Abstract: large language models (llms) are not amenable to frequent re-training, due to high training costs arising from their massive scale. however, updates are necessary to endow llms with new skills and keep them up-to-date with rapidly evolving human knowledge. this paper surveys recent works on continual learning for llms. due to the unique nature of llms, we catalog continue learning techniques in a novel multi-staged categorization scheme, involving continual pretraining, instruction tuning, and alignment. we contrast continual learning for llms with simpler adaptation methods used in smaller models, as well as with other enhancement strategies like retrieval-augmented generation and model editing. moreover, informed by a discussion of benchmarks and evaluation, we identify several challenges and future work directions for this crucial task.
Willem Van Der Maden, Derek Lomas, Paul Hekkert
Abstract: as artificial intelligence (ai) continues advancing, ensuring positive societal impacts becomes critical, especially as ai systems become increasingly ubiquitous in various aspects of life. however, developing "ai for good" poses substantial challenges around aligning systems with complex human values. presently, we lack mature methods for addressing these challenges. this article presents and evaluates the positive ai design method aimed at addressing this gap. the method provides a human-centered process to translate wellbeing aspirations into concrete practices. first, we explain the method's four key steps: contextualizing, operationalizing, optimizing, and implementing wellbeing supported by continuous measurement for feedback cycles. we then present a multiple case study where novice designers applied the method, revealing strengths and weaknesses related to efficacy and usability. next, an expert evaluation study assessed the quality of the resulting concepts, rating them moderately high for feasibility, desirability, and plausibility of achieving intended wellbeing benefits. together, these studies provide preliminary validation of the method's ability to improve ai design, while surfacing areas needing refinement like developing support for complex steps. proposed adaptations such as examples and evaluation heuristics could address weaknesses. further research should examine sustained application over multiple projects. this human-centered approach shows promise for realizing the vision of 'ai for wellbeing' that does not just avoid harm, but actively benefits humanity.
Wenyue Hua, Xianjun Yang, Zelong Li, Cheng Wei, Yongfeng Zhang
Abstract: the emergence of llm-based agents has garnered considerable attention, yet their trustworthiness remains an under-explored area. as agents can directly interact with the physical environment, their reliability and safety is critical. this paper presents an agent-constitution-based agent framework, trustagent, an initial investigation into improving the safety dimension of trustworthiness in llm-based agents. this framework consists of threefold strategies: pre-planning strategy which injects safety knowledge to the model prior to plan generation, in-planning strategy which bolsters safety during plan generation, and post-planning strategy which ensures safety by post-planning inspection. through experimental analysis, we demonstrate how these approaches can effectively elevate an llm agent's safety by identifying and preventing potential dangers. furthermore, we explore the intricate relationships between safety and helpfulness, and between the model's reasoning ability and its efficacy as a safe agent. this paper underscores the imperative of integrating safety awareness and trustworthiness into the design and deployment of llm-based agents, not only to enhance their performance but also to ensure their responsible integration into human-centric environments. data and code are available at
Debarun Bhattacharjya, Junkyu Lee, Don Joven Agravante, Balaji Ganesan, Radu Marinescu
Abstract: foundation models (fms) such as large language models have revolutionized the field of ai by showing remarkable performance in various tasks. however, they exhibit numerous limitations that prevent their broader adoption in many real-world systems, which often require a higher bar for trustworthiness and usability. since fms are trained using loss functions aimed at reconstructing the training corpus in a self-supervised manner, there is no guarantee that the model's output aligns with users' preferences for a specific task at hand. in this survey paper, we propose a conceptual framework that encapsulates different modes by which agents could interact with fms and guide them suitably for a set of tasks, particularly through knowledge augmentation and reasoning. our framework elucidates agent role categories such as updating the underlying fm, assisting with prompting the fm, and evaluating the fm output. we also categorize several state-of-the-art approaches into agent interaction protocols, highlighting the nature and extent of involvement of the various agent roles. the proposed framework provides guidance for future directions to further realize the power of fms in practical ai systems.
Yi Dong, Ronghui Mu, Gaojie Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, Xiaowei Huang
Abstract: as large language models (llms) become more integrated into our daily lives, it is crucial to identify and mitigate their risks, especially when the risks can have profound impacts on human users and societies. guardrails, which filter the inputs or outputs of llms, have emerged as a core safeguarding technology. this position paper takes a deep look at current open-source solutions (llama guard, nvidia nemo, guardrails ai), and discusses the challenges and the road towards building more complete solutions. drawing on robust evidence from previous research, we advocate for a systematic approach to construct guardrails for llms, based on comprehensive consideration of diverse contexts across various llms applications. we propose employing socio-technical methods through collaboration with a multi-disciplinary team to pinpoint precise technical requirements, exploring advanced neural-symbolic implementations to embrace the complexity of the requirements, and developing verification and testing to ensure the utmost quality of the final product.
Inyoung Cheong, King Xia, K. J. Kevin Feng, Quan Ze Chen, Amy X. Zhang
Abstract: the rapid proliferation of large language models (llms) as general purpose chatbots available to the public raises hopes around expanding access to professional guidance in law, medicine, and finance, while triggering concerns about public reliance on llms for high-stakes circumstances. prior research has speculated on high-level ethical considerations but lacks concrete criteria determining when and why llm chatbots should or should not provide professional assistance. through examining the legal domain, we contribute a structured expert analysis to uncover nuanced policy considerations around using llms for professional advice, using methods inspired by case-based reasoning. we convened workshops with 20 legal experts and elicited dimensions on appropriate ai assistance for sample user queries (``cases''). we categorized our expert dimensions into: (1) user attributes, (2) query characteristics, (3) ai capabilities, and (4) impacts. beyond known issues like hallucinations, experts revealed novel legal problems, including that users' conversations with llms are not protected by attorney-client confidentiality or bound to professional ethics that guard against conflicted counsel or poor quality advice. this accountability deficit led participants to advocate for ai systems to help users polish their legal questions and relevant facts, rather than recommend specific actions. more generally, we highlight the potential of case-based expert deliberation as a method of responsibly translating professional integrity and domain knowledge into design requirements to inform appropriate ai behavior when generating advice in professional domains.
Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, Peter J. Liu, Xuanhui Wang
Abstract: aligning language models (lms) with curated human feedback is critical to control their behaviors in real-world applications. several recent policy optimization methods, such as dpo and slic, serve as promising alternatives to the traditional reinforcement learning from human feedback (rlhf) approach. in practice, human feedback often comes in a format of a ranked list over multiple responses to amortize the cost of reading prompt. multiple responses can also be ranked by reward models or ai feedback. there lacks such a study on directly fitting upon a list of responses. in this work, we formulate the lm alignment as a listwise ranking problem and describe the listwise preference optimization (lipo) framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. this view draws an explicit connection to learning-to-rank (ltr), where most existing preference optimization work can be mapped to existing ranking objectives, especially pairwise ones. following this connection, we provide an examination of ranking objectives that are not well studied for lm alignment withdpo and slic as special cases when list size is two. in particular, we highlight a specific method, lipo-{\lambda}, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. we show that lipo-{\lambda} can outperform dpo and slic by a clear margin on two preference alignment tasks.
Angelina Wang, Jamie Morgenstern, John P. Dickerson
Abstract: large language models (llms) are increasing in capability and popularity, propelling their application in new domains -- including as replacements for human participants in computational social science, user testing, annotation tasks, and more. traditionally, in all of these settings survey distributors are careful to find representative samples of the human population to ensure the validity of their results and understand potential demographic differences. this means in order to be a suitable replacement, llms will need to be able to capture the influence of positionality (i.e., relevance of social identities like gender and race). however, we show that there are two inherent limitations in the way current llms are trained that prevent this. we argue analytically for why llms are doomed to both misportray and flatten the representations of demographic groups, then empirically show this to be true on 4 llms through a series of human studies with 3200 participants across 16 demographic identities. we also discuss a third consideration about how identity prompts can essentialize identities. throughout, we connect each of these limitations to a pernicious history that shows why each is harmful for marginalized demographic groups. overall, we urge caution in use cases where llms are intended to replace human participants whose identities are relevant to the task at hand. at the same time, in cases where the goal is to supplement rather than replace (e.g., pilot studies), we provide empirically-better inference-time techniques to reduce, but not remove, these harms.
Hao Chen, Bhiksha Raj, Xing Xie, Jindong Wang
Abstract: large foundation models (lfms) are claiming incredible performances. yet great concerns have been raised about their mythic and uninterpreted potentials not only in machine learning, but also in various other disciplines. in this position paper, we propose to identify a neglected issue deeply rooted in lfms: catastrophic inheritance, describing the weaknesses and limitations inherited from biased large-scale pre-training data to behaviors of lfms on the downstream tasks, including samples that are corrupted, long-tailed, noisy, out-of-distributed, to name a few. such inheritance can potentially cause catastrophes to downstream applications, such as bias, lack of generalization, deteriorated performance, security vulnerability, privacy leakage, and value misalignment. we discuss the challenges behind this issue and propose uim, a framework to understand the catastrophic inheritance of lfms from both pre-training and downstream adaptation, interpret the implications of catastrophic inheritance on downstream tasks, and how to mitigate it. uim aims to unite both the machine learning and social sciences communities for more responsible and promising ai development and deployment.
Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Tong Yu, Hanieh Deilamsalehy, Ruiyi Zhang, Sungchul Kim, Franck Dernoncourt
Abstract: large language models (llms) have shown remarkable advances in language generation and understanding but are also prone to exhibiting harmful social biases. while recognition of these behaviors has generated an abundance of bias mitigation techniques, most require modifications to the training data, model parameters, or decoding strategy, which may be infeasible without access to a trainable model. in this work, we leverage the zero-shot capabilities of llms to reduce stereotyping in a technique we introduce as zero-shot self-debiasing. with two approaches, self-debiasing via explanation and self-debiasing via reprompting, we show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups while relying only on the llm itself and a simple prompt, with explanations correctly identifying invalid assumptions and reprompting delivering the greatest reductions in bias. we hope this work opens inquiry into other zero-shot techniques for bias mitigation.
Tianshi Li, Sauvik Das, Hao-Ping Lee, Dakuo Wang, Bingsheng Yao, Zhiping Zhang
Abstract: the emergence of large language models (llms), and their increased use in user-facing systems, has led to substantial privacy concerns. to date, research on these privacy concerns has been model-centered: exploring how llms lead to privacy risks like memorization, or can be used to infer personal characteristics about people from their content. we argue that there is a need for more research focusing on the human aspect of these privacy issues: e.g., research on how design paradigms for llms affect users' disclosure behaviors, users' mental models and preferences for privacy controls, and the design of tools, systems, and artifacts that empower end-users to reclaim ownership over their personal data. to build usable, efficient, and privacy-friendly systems powered by these models with imperfect privacy properties, our goal is to initiate discussions to outline an agenda for conducting human-centered research on privacy issues in llm-powered systems. this special interest group (sig) aims to bring together researchers with backgrounds in usable security and privacy, human-ai collaboration, nlp, or any other related domains to share their perspectives and experiences on this problem, to help our community establish a collective understanding of the challenges, research opportunities, research methods, and strategies to collaborate with researchers outside of hci.
Sungdong Kim, Minjoon Seo
Abstract: learning from human preference has been considered key to aligning large language models (llms) with human values. however, contrary to popular belief, our preliminary study reveals that reward models trained on human preference datasets tend to give higher scores to long off-topic responses than short on-topic ones. motivated by this observation, we explore a preference-free approach utilizing `relevance' as a key objective for alignment. on our first attempt, we find that the relevance score obtained by a retriever alone is vulnerable to reward hacking, i.e., overoptimizing to undesired shortcuts, when we utilize the score as a reward for reinforcement learning. to mitigate it, we integrate effective inductive biases into the vanilla relevance to regularize each other, resulting in a mixture of reward functions: regularized relevance reward ($r^3$). $r^3$ significantly improves performance on preference benchmarks by providing a robust reward signal. notably, $r^3$ does not require any human preference datasets (i.e., preference-free), outperforming open-source reward models in improving human preference. our analysis demonstrates that $r^3$ has advantages in elevating human preference while minimizing its side effects. finally, we show the generalizability of $r^3$, consistently improving instruction-tuned models in various backbones and sizes without additional dataset cost. our code is available at


Shangbin Feng, Herun Wan, Ningnan Wang, Zhaoxuan Tan, Minnan Luo, Yulia Tsvetkov
Abstract: social media bot detection has always been an arms race between advancements in machine learning bot detectors and adversarial bot strategies to evade detection. in this work, we bring the arms race to the next level by investigating the opportunities and risks of state-of-the-art large language models (llms) in social bot detection. to investigate the opportunities, we design novel llm-based bot detectors by proposing a mixture-of-heterogeneous-experts framework to divide and conquer diverse user information modalities. to illuminate the risks, we explore the possibility of llm-guided manipulation of user textual and structured information to evade detection. extensive experiments with three llms on two datasets demonstrate that instruction tuning on merely 1,000 annotated examples produces specialized llms that outperform state-of-the-art baselines by up to 9.1% on both datasets, while llm-guided manipulation strategies could significantly bring down the performance of existing bot detectors by up to 29.6% and harm the calibration and reliability of bot detection systems.
Dawn Lu, Nina Rimsky
Abstract: we address the challenge of societal bias in large language models (llms), focusing on the llama 2 7b chat model. as llms are increasingly integrated into decision-making processes with substantial societal impact, it becomes imperative to ensure these models do not reinforce existing biases. our approach employs activation steering to probe for and mitigate biases related to gender, race, and religion. this method manipulates model activations to direct responses towards or away from biased outputs, utilizing steering vectors derived from the stereoset dataset and custom gpt4 generated gender bias prompts. our findings reveal inherent gender bias in llama 2 7b chat, persisting even after reinforcement learning from human feedback (rlhf). we also observe a predictable negative correlation between bias and the model's tendency to refuse responses. significantly, our study uncovers that rlhf tends to increase the similarity in the model's representation of different forms of societal biases, which raises questions about the model's nuanced understanding of different forms of bias. this work also provides valuable insights into effective red-teaming strategies for llms using activation steering, particularly emphasizing the importance of integrating a refusal vector.
Xinlin Peng, Ying Zhou, Ben He, Le Sun, Yingfei Sun
Abstract: large language models (llms) have exhibited remarkable capabilities in text generation tasks. however, the utilization of these models carries inherent risks, including but not limited to plagiarism, the dissemination of fake news, and issues in educational exercises. although several detectors have been proposed to address these concerns, their effectiveness against adversarial perturbations, specifically in the context of student essay writing, remains largely unexplored. this paper aims to bridge this gap by constructing aig-asap, an ai-generated student essay dataset, employing a range of text perturbation methods that are expected to generate high-quality essays while evading detection. through empirical experiments, we assess the performance of current aigc detectors on the aig-asap dataset. the results reveal that the existing detectors can be easily circumvented using straightforward automatic adversarial attacks. specifically, we explore word substitution and sentence substitution perturbation methods that effectively evade detection while maintaining the quality of the generated essays. this highlights the urgent need for more accurate and robust methods to detect ai-generated student essays in the education domain.
Souvik Das, Rohini K. Srihari
Abstract: state-of-the-art conversational ai systems raise concerns due to their potential risks of generating unsafe, toxic, unethical, or dangerous content. previous works have developed datasets to teach conversational agents the appropriate social paradigms to respond effectively to specifically designed hazardous content. however, models trained on these adversarial datasets still struggle to recognize subtle unsafe situations that appear naturally in conversations or introduce an inappropriate response in a casual context. to understand the extent of this problem, we study prosociality in both adversarial and casual dialog contexts and audit the response quality of general-purpose language models in terms of propensity to produce unsafe content. we propose a dual-step fine-tuning process to address these issues using a socially aware n-pair contrastive loss. subsequently, we train a base model that integrates prosocial behavior by leveraging datasets like moral integrity corpus (mic) and prosocialdialog. experimental results on several dialog datasets demonstrate the effectiveness of our approach in generating socially appropriate responses.
Jitao Sang, Yuhang Wang, Jing Zhang, Yanxu Zhu, Chao Kong, Junhong Ye, Shuyu Wei, Jinlin Xiao
Abstract: this paper presents a follow-up study to openai's recent superalignment work on weak-to-strong generalization (w2sg). superalignment focuses on ensuring that high-level ai systems remain consistent with human values and intentions when dealing with complex, high-risk tasks. the w2sg framework has opened new possibilities for empirical research in this evolving field. our study simulates two phases of superalignment under the w2sg framework: the development of general superhuman models and the progression towards superintelligence. in the first phase, based on human supervision, the quality of weak supervision is enhanced through a combination of scalable oversight and ensemble learning, reducing the capability gap between weak teachers and strong students. in the second phase, an automatic alignment evaluator is employed as the weak supervisor. by recursively updating this auto aligner, the capabilities of the weak teacher models are synchronously enhanced, achieving weak-to-strong supervision over stronger student models.we also provide an initial validation of the proposed approach for the first phase. using the sciq task as example, we explore ensemble learning for weak teacher models through bagging and boosting. scalable oversight is explored through two auxiliary settings: human-ai interaction and ai-ai debate. additionally, the paper discusses the impact of improved weak supervision on enhancing weak-to-strong generalization based on in-context learning. experiment code and dataset will be released at
Ran Elgedawy, John Sadik, Senjuti Dutta, Anuj Gautam, Konstantinos Georgiou, Farzin Gholamrezae, Fujiao Ji, Kyungchan Lim, Qian Liu, Scott Ruoti
Abstract: $ $large language models (llms) are being increasingly utilized in various applications, with code generations being a notable example. while previous research has shown that llms have the capability to generate both secure and insecure code, the literature does not take into account what factors help generate secure and effective code. therefore in this paper we focus on identifying and understanding the conditions and contexts in which llms can be effectively and safely deployed in real-world scenarios to generate quality code. we conducted a comparative analysis of four advanced llms--gpt-3.5 and gpt-4 using chatgpt and bard and gemini from google--using 9 separate tasks to assess each model's code generation capabilities. we contextualized our study to represent the typical use cases of a real-life developer employing llms for everyday tasks as work. additionally, we place an emphasis on security awareness which is represented through the use of two distinct versions of our developer persona. in total, we collected 61 code outputs and analyzed them across several aspects: functionality, security, performance, complexity, and reliability. these insights are crucial for understanding the models' capabilities and limitations, guiding future development and practical applications in the field of automated code generation.
Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, "Alex D'Amour", Sanmi Koyejo, Victor Veitch
Abstract: a common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model. we study two closely related problems that arise in this approach. first, any monotone transformation of the reward model preserves preference ranking; is there a choice that is ``better'' than others? second, we often wish to align language models to multiple properties: how should we combine multiple reward models? using a probabilistic interpretation of the alignment procedure, we identify a natural choice for transformation for (the common case of) rewards learned from bradley-terry preference models. this derived transformation has two important properties. first, it emphasizes improving poorly-performing outputs, rather than outputs that already score well. this mitigates both underfitting (where some prompts are not improved) and reward hacking (where the model learns to exploit misspecification of the reward model). second, it enables principled aggregation of rewards by linking summation to logical conjunction: the sum of transformed rewards corresponds to the probability that the output is ``good'' in all measured properties, in a sense we make precise. experiments aligning language models to be both helpful and harmless using rlhf show substantial improvements over the baseline (non-transformed) approach.
Xin Quan, Marco Valentino, Louise A. Dennis, André Freitas
Abstract: an increasing amount of research in natural language inference (nli) focuses on the application and evaluation of large language models (llms) and their reasoning capabilities. despite their success, however, llms are still prone to factual errors and inconsistencies in their explanations, offering limited control and interpretability for inference in complex domains. in this paper, we focus on ethical nli, investigating how hybrid neuro-symbolic techniques can enhance the logical validity and alignment of ethical explanations produced by llms. specifically, we present an abductive-deductive framework named logic-explainer, which integrates llms with an external backward-chaining solver to refine step-wise natural language explanations and jointly verify their correctness, reduce incompleteness and minimise redundancy. an extensive empirical analysis demonstrates that logic-explainer can improve explanations generated via in-context learning methods and chain-of-thought (cot) on challenging ethical nli tasks, while, at the same time, producing formal proofs describing and supporting models' reasoning. as ethical nli requires commonsense reasoning to identify underlying moral violations, our results suggest the effectiveness of neuro-symbolic methods for multi-step nli more broadly, opening new opportunities to enhance the logical consistency, reliability, and alignment of llms.
Alex J. Chan, Hao Sun, Samuel Holt, Mihaela Van Der Schaar
Abstract: reinforcement learning from human feedback (rlhf) has been credited as the key advance that has allowed large language models (llms) to effectively follow instructions and produce useful assistance. classically, this involves generating completions from the llm in response to a query before using a separate reward model to assign a score to the full completion. as an auto-regressive process, the llm has to take many "actions" (selecting individual tokens) and only receives a single, sparse reward at the end of an episode, a setup that is known to be difficult to optimise in traditional reinforcement learning. in this work we leverage the fact that the reward model contains more information than just its scalar output, in particular, it calculates an attention map over tokens as part of the transformer architecture. we use these attention weights to redistribute the reward along the whole completion, effectively densifying the signal and highlighting the most important tokens, all without incurring extra computational cost or requiring any additional modelling. we demonstrate that, theoretically, this approach is equivalent to potential-based reward shaping, ensuring that the optimal policy remains unchanged. empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
Zelong Li, Wenyue Hua, Hao Wang, He Zhu, Yongfeng Zhang
Abstract: recent advancements on large language models (llms) enable ai agents to automatically generate and execute multi-step plans to solve complex tasks. however, since llm's content generation process is hardly controllable, current llm-based agents frequently generate invalid or non-executable plans, which jeopardizes the performance of the generated plans and corrupts users' trust in llm-based agents. in response, this paper proposes a novel ``formal-llm'' framework for llm-based agents by integrating the expressiveness of natural language and the precision of formal language. specifically, the framework allows human users to express their requirements or constraints for the planning process as an automaton. a stack-based llm plan generation process is then conducted under the supervision of the automaton to ensure that the generated plan satisfies the constraints, making the planning process controllable. we conduct experiments on both benchmark tasks and practical real-life tasks, and our framework achieves over 50% overall performance increase, which validates the feasibility and effectiveness of employing formal-llm to guide the plan generation of agents, preventing the agents from generating invalid and unsuccessful plans. further, more controllable llm-based agents can facilitate the broader utilization of llm in application scenarios where high validity of planning is essential. the work is open-sourced at
Haozhe Ji, Cheng Lu, Yilin Niu, Pei Ke, Hongning Wang, Jun Zhu, Jie Tang, Minlie Huang
Abstract: the alignment of language models with human preferences is vital for their application in real-world tasks. the problem is formulated as optimizing the model's policy to maximize the expected reward that reflects human preferences with minimal deviation from the initial policy. while considered as a straightforward solution, reinforcement learning (rl) suffers from high variance in policy updates, which impedes efficient policy improvement. recently, direct preference optimization (dpo) was proposed to directly optimize the policy from preference data. though simple to implement, dpo is derived based on the optimal policy that is not assured to be achieved in practice, which undermines its convergence to the intended solution. in this paper, we propose efficient exact optimization (exo) of the alignment objective. we prove that exo is guaranteed to optimize in the same direction as the rl algorithms asymptotically for arbitary parametrization of the policy, while enables efficient optimization by circumventing the complexities associated with rl algorithms. we compare our method to dpo with both theoretical and empirical analyses, and further demonstrate the advantages of our method over existing approaches on realistic human preference data.
Ahmed Radwan, Layan Zaafarani, Jetana Abudawood, Faisal Alzahrani, Fares Fourat
Abstract: addressing biases in ai models is crucial for ensuring fair and accurate predictions. however, obtaining large, unbiased datasets for training can be challenging. this paper proposes a comprehensive approach using multiple methods to remove bias in ai models, with only a small dataset and a potentially biased pretrained model. we train multiple models with the counter-bias of the pre-trained model through data splitting, local training, and regularized fine-tuning, gaining potentially counter-biased models. then, we employ ensemble learning for all models to reach unbiased predictions. to further accelerate the inference time of our ensemble model, we conclude our solution with knowledge distillation that results in a single unbiased neural network. we demonstrate the effectiveness of our approach through experiments on the cifar10 and ham10000 datasets, showcasing promising results. this work contributes to the ongoing effort to create more unbiased and reliable ai models, even with limited data availability.
Wenqi Wei, Ling Liu
Abstract: emerging distributed ai systems are revolutionizing big data computing and data processing capabilities with growing economic and societal impact. however, recent studies have identified new attack surfaces and risks caused by security, privacy, and fairness issues in ai systems. in this paper, we review representative techniques, algorithms, and theoretical foundations for trustworthy distributed ai through robustness guarantee, privacy protection, and fairness awareness in distributed learning. we first provide a brief overview of alternative architectures for distributed learning, discuss inherent vulnerabilities for security, privacy, and fairness of ai algorithms in distributed learning, and analyze why these problems are present in distributed learning regardless of specific architectures. then we provide a unique taxonomy of countermeasures for trustworthy distributed ai, covering (1) robustness to evasion attacks and irregular queries at inference, and robustness to poisoning attacks, byzantine attacks, and irregular data distribution during training; (2) privacy protection during distributed learning and model inference at deployment; and (3) ai fairness and governance with respect to both data and models. we conclude with a discussion on open challenges and future research directions toward trustworthy distributed ai, such as the need for trustworthy ai policy guidelines, the ai responsibility-utility co-design, and incentives and compliance.
Tiansheng Huang, Sihao Hu, Ling Liu
Abstract: the new paradigm of finetuning-as-a-service introduces a new attack surface for large language models (llms): a few harmful data uploaded by users can easily trick the finetuning to produce an alignment-broken model. we conduct an empirical analysis and uncover a \textit{harmful embedding drift} phenomenon, showing a probable cause of the alignment-broken effect. inspired by our findings, we propose vaccine, a perturbation-aware alignment technique to mitigate the security risk of users finetuning. the core idea of vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbation to them in the alignment phase. this enables the embeddings to withstand harmful perturbation from un-sanitized user data in the finetuning phase. our results on open source mainstream llms (e.g., llama2, opt, vicuna) demonstrate that vaccine can boost the robustness of alignment against harmful prompts induced embedding drift while reserving reasoning ability towards benign prompts. our code is available at \url{}.


Chenyu Shi, Xiao Wang, Qiming Ge, Songyang Gao, Xianjun Yang, Tao Gui, Qi Zhang, Xuanjing Huang, Xun Zhao, Dahua Lin
Abstract: large language models are meticulously aligned to be both helpful and harmless. however, recent research points to a potential overkill which means models may refuse to answer benign queries. in this paper, we investigate the factors for overkill by exploring how models handle and determine the safety of queries. our findings reveal the presence of shortcuts within models, leading to an over-attention of harmful words like 'kill' and prompts emphasizing safety will exacerbate overkill. based on these insights, we introduce self-contrastive decoding (self-cd), a training-free and model-agnostic strategy, to alleviate this phenomenon. we first extract such over-attention by amplifying the difference in the model's output distributions when responding to system prompts that either include or omit an emphasis on safety. then we determine the final next-token predictions by downplaying the over-attention from the model via contrastive decoding. empirical results indicate that our method has achieved an average reduction of the refusal rate by 20\% while having almost no impact on safety.
Raymond Douglas, Andis Draguns, Tomáš Gavenčiak
Abstract: language models (lms) have become important tools in a variety of applications, from data processing to the creation of instruction-following assistants. but despite their advantages, lms have certain idiosyncratic limitations such as the problem of `strong priors', where a model learns to output typical continuations in response to certain, usually local, portions of the input regardless of any earlier instructions. for example, prompt injection attacks can induce models to ignore explicit directives. in some cases, larger models have been shown to be more susceptible to these problems than similar smaller models, an example of the phenomenon of `inverse scaling'. we develop a new technique for mitigating the problem of strong priors: we take the original set of instructions, produce a weakened version of the original prompt that is even more susceptible to the strong priors problem, and then extrapolate the continuation away from the weakened prompt. this lets us infer how the model would continue a hypothetical strengthened set of instructions. our technique conceptualises lms as mixture models which combine a family of data generation processes, reinforcing the desired elements of the mixture. our approach works at inference time, removing any need for retraining. we apply it to eleven models including gpt-2, gpt-3, llama 2, and mistral on four tasks, and find improvements in 41/44. across all 44 combinations the median increase in proportion of tasks completed is 40%.
Pardis Sadat Zahraei, Ali Emami
Abstract: the winograd schema challenge (wsc) serves as a prominent benchmark for evaluating machine understanding. while large language models (llms) excel at answering wsc questions, their ability to generate such questions remains less explored. in this work, we propose tree-of-experts (toe), a novel prompting method which enhances the generation of wsc instances (50% valid cases vs. 10% in recent methods). using this approach, we introduce wsc+, a novel dataset comprising 3,026 llm-generated sentences. notably, we extend the wsc framework by incorporating new 'ambiguous' and 'offensive' categories, providing a deeper insight into model overconfidence and bias. our analysis reveals nuances in generation-evaluation consistency, suggesting that llms may not always outperform in evaluating their own generated questions when compared to those crafted by other models. on wsc+, gpt-4, the top-performing llm, achieves an accuracy of 68.7%, significantly below the human benchmark of 95.1%.
Marcin Korecki
Abstract: the dominant paradigm in ai ethics and value alignment is highly anthropocentric. the focus of these disciplines is strictly on human values which limits the depth and breadth of their insights. recently, attempts to expand to a sentientist perspective have been initiated. we argue that neither of these outlooks is sufficient to capture the actual complexity of the biosphere and ensure that ai does not damage it. thus, we propose a new paradigm -- biospheric ai that assumes an ecocentric perspective. we discuss hypothetical ways in which such an ai might be designed. moreover, we give directions for research and application of the modern ai models that would be consistent with the biospheric interests. all in all, this work attempts to take first steps towards a comprehensive program of research that focuses on the interactions between ai and the biosphere.
Shujaat Mirza, Bruno Coelho, Yuyuan Cui, Christina Pöpper, Damon Mccoy
Abstract: the increasing reliance on ai-driven solutions, particularly large language models (llms) like the gpt series, for information retrieval highlights the critical need for their factuality and fairness, especially amidst the rampant spread of misinformation and disinformation online. our study evaluates the factual accuracy, stability, and biases in widely adopted gpt models, including gpt-3.5 and gpt-4, contributing to reliability and integrity of ai-mediated information dissemination. we introduce 'global-liar,' a dataset uniquely balanced in terms of geographic and temporal representation, facilitating a more nuanced evaluation of llm biases. our analysis reveals that newer iterations of gpt models do not always equate to improved performance. notably, the gpt-4 version from march demonstrates higher factual accuracy than its subsequent june release. furthermore, a concerning bias is observed, privileging statements from the global north over the global south, thus potentially exacerbating existing informational inequities. regions such as africa and the middle east are at a disadvantage, with much lower factual accuracy. the performance fluctuations over time suggest that model updates may not consistently benefit all regions equally. our study also offers insights into the impact of various llm configuration settings, such as binary decision forcing, model re-runs and temperature, on model's factuality. models constrained to binary (true/false) choices exhibit reduced factuality compared to those allowing an 'unclear' option. single inference at a low temperature setting matches the reliability of majority voting across various configurations. the insights gained highlight the need for culturally diverse and geographically inclusive model training and evaluation. this approach is key to achieving global equity in technology, distributing ai benefits fairly worldwide.
Yuan Li, Yue Huang, Yuli Lin, Siyuan Wu, Yao Wan, Lichao Sun
Abstract: do large language models (llms) exhibit any forms of awareness similar to humans? in this paper, we introduce the concept of awareness to llms, arguing that awareness is an essential aspect of trustworthiness for llms to enhance their interaction with humans while ensuring ethical responses. we define awareness in llms as the ability to perceive and understand themselves as ai models and to exhibit social intelligence. we identify four key dimensions of awareness: capability, mission, emotion, and perspective. to assess llms on these dimensions, we introduce a specialized dataset, awarellm dataset. our findings reveal that llms demonstrate a decent degree of awareness, though they still lack substantial capability awareness.
Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng
Abstract: prepending model inputs with safety prompts is a common practice of safeguarding large language models (llms) from complying with queries that contain harmful intents. however, the working mechanisms of safety prompts have not yet been fully understood, which hinders the potential for automatically optimizing them for improved llm safety. motivated by this problem, we investigate the impact of safety prompts from the perspective of model representations. we find that in models' representation space, harmful and harmless queries can be largely distinguished, but this is not noticeably enhanced by safety prompts. instead, the queries' representations are moved by different safety prompts in similar directions, where models become more prone to refusal (i.e., refusing to provide assistance) even when the queries are harmless. inspired by these findings, we propose a method called dro (directed representation optimization) for automatic safety prompt optimization. dro treats safety prompts as continuous, trainable embeddings and learns to move the representations of harmful/harmless queries along/opposite the direction in which the model's refusal probability increases. we demonstrate that dro remarkably improves the safeguarding performance of human-crafted safety prompts and outperforms strong baselines, as evaluated on out-of-domain benchmarks, without compromising the general model capability.
Mowafak Allaham, Nicholas Diakopoulos
Abstract: anticipating the negative impacts of emerging ai technologies is a challenge, especially in the early stages of development. an understudied approach to such anticipation is the use of llms to enhance and guide this process. despite advancements in llms and evaluation metrics to account for biases in generated text, it is unclear how well these models perform in anticipatory tasks. specifically, the use of llms to anticipate ai impacts raises questions about the quality and range of categories of negative impacts these models are capable of generating. in this paper we leverage news media, a diverse data source that is rich with normative assessments of emerging technologies, to formulate a taxonomy of impacts to act as a baseline for comparing against. by computationally analyzing thousands of news articles published by hundreds of online news domains around the world, we develop a taxonomy consisting of ten categories of ai impacts. we then evaluate both instruction-based (gpt-4 and mistral-7b-instruct) and fine-tuned completion models (mistral-7b and gpt-3) using a sample from this baseline. we find that the generated impacts using mistral-7b, fine-tuned on impacts from the news media, tend to be qualitatively on par with impacts generated using a larger scale model such as gpt-4. moreover, we find that these llms generate impacts that largely reflect the taxonomy of negative impacts identified in the news media, however the impacts produced by instruction-based models had gaps in the production of certain categories of impacts in comparison to fine-tuned models. this research highlights a potential bias in state-of-the-art llms when used for anticipating impacts and demonstrates the advantages of aligning smaller llms with a diverse range of impacts, such as those reflected in the news media, to better reflect such impacts during anticipatory exercises.
Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, Juanzi Li
Abstract: extending large language models to effectively handle long contexts requires instruction fine-tuning on input sequences of similar length. to address this, we present longalign -- a recipe of the instruction data, training, and evaluation for long context alignment. first, we construct a long instruction-following dataset using self-instruct. to ensure the data diversity, it covers a broad range of tasks from various long context sources. second, we adopt the packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions. additionally, we develop a loss weighting method to balance the contribution to the loss across different sequences during packing training. third, we introduce the longbench-chat benchmark for evaluating instruction-following capabilities on queries of 10k-100k in length. experiments show that longalign outperforms existing recipes for llms in long context tasks by up to 30\%, while also maintaining their proficiency in handling short, generic tasks. the code, data, and long-aligned models are open-sourced at
Yao-Hung Hubert Tsai, Walter Talbott, Jian Zhang
Abstract: step-by-step decision planning with large language models (llms) is gaining attention in ai agent development. this paper focuses on decision planning with uncertainty estimation to address the hallucination problem in language models. existing approaches are either white-box or computationally demanding, limiting use of black-box proprietary llms within budgets. the paper's first contribution is a non-parametric uncertainty quantification method for llms, efficiently estimating point-wise dependencies between input-decision on the fly with a single inference, without access to token logits. this estimator informs the statistical interpretation of decision trustworthiness. the second contribution outlines a systematic design for a decision-making agent, generating actions like ``turn on the bathroom light'' based on user prompts such as ``take a bath''. users will be asked to provide preferences when more than one action has high estimated point-wise dependencies. in conclusion, our uncertainty estimation and decision-making agent design offer a cost-efficient approach for ai agent development.
Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, Wei Peng
Abstract: recent development of large vision-language models (lvlms) has attracted growing attention within the ai landscape for its practical implementation potential. however, ``hallucination'', or more specifically, the misalignment between factual visual content and corresponding textual generation, poses a significant challenge of utilizing lvlms. in this comprehensive survey, we dissect lvlm-related hallucinations in an attempt to establish an overview and facilitate future mitigation. our scrutiny starts with a clarification of the concept of hallucinations in lvlms, presenting a variety of hallucination symptoms and highlighting the unique challenges inherent in lvlm hallucinations. subsequently, we outline the benchmarks and methodologies tailored specifically for evaluating hallucinations unique to lvlms. additionally, we delve into an investigation of the root causes of these hallucinations, encompassing insights from the training data and model components. we also critically review existing methods for mitigating hallucinations. the open questions and future directions pertaining to hallucinations within lvlms are discussed to conclude this survey.
Shengchao Liu, Xiaoming Liu, Yichen Wang, Zehua Cheng, Chengzhengxu Li, Zhaohan Zhang, Yu Lan, Chao Shen
Abstract: the burgeoning capabilities of large language models (llms) have raised growing concerns about abuse. detectgpt, a zero-shot metric-based unsupervised machine-generated text detector, first introduces perturbation and shows great performance improvement. however, detectgpt's random perturbation strategy might introduce noise, limiting the distinguishability and further performance improvements. moreover, its logit regression module relies on setting the threshold, which harms the generalizability and applicability of individual or small-batch inputs. hence, we propose a novel detector, \modelname{}, which uses selective strategy perturbation to relieve the important information loss caused by random masking, and multi-pair contrastive learning to capture the implicit pattern information during perturbation, facilitating few-shot performance. the experiments show that \modelname{} outperforms the sota method by 1.20\% in accuracy on average on four public datasets. we further analyze the effectiveness, robustness, and generalization of our perturbation method.
Alka Luqman, Riya Mahesh, Anupam Chattopadhyay
Abstract: this paper details the privacy and security landscape in today's cloud ecosystem and identifies that there is a gap in addressing the risks introduced by machine learning models. as machine learning algorithms continue to evolve and find applications across diverse domains, the need to categorize and quantify privacy and security risks becomes increasingly critical. with the emerging trend of ai-as-a-service (aiaas), machine learned ai models (or ml models) are deployed on the cloud by model providers and used by model consumers. we first survey the aiaas landscape to document the various kinds of liabilities that ml models, especially deep neural networks pose and then introduce a taxonomy to bridge this gap by holistically examining the risks that creators and consumers of ml models are exposed to and their known defences till date. such a structured approach will be beneficial for ml model providers to create robust solutions. likewise, ml model consumers will find it valuable to evaluate such solutions and understand the implications of their engagement with such services. the proposed taxonomies provide a foundational basis for solutions in private, secure and robust ml, paving the way for more transparent and resilient ai systems.
Sippo Rossi, Alisia Marianne Michel, Raghava Rao Mukkamala, Jason Bennett Thatcher
Abstract: large language models and ai chatbots have been at the forefront of democratizing artificial intelligence. however, the releases of chatgpt and other similar tools have been followed by growing concerns regarding the difficulty of controlling large language models and their outputs. currently, we are witnessing a cat-and-mouse game where users attempt to misuse the models with a novel attack called prompt injections. in contrast, the developers attempt to discover the vulnerabilities and block the attacks simultaneously. in this paper, we provide an overview of these emergent threats and present a categorization of prompt injections, which can guide future research on prompt injections and act as a checklist of vulnerabilities in the development of llm interfaces. moreover, based on previous literature and our own empirical research, we discuss the implications of prompt injections to llm end users, developers, and researchers.


Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, Yinxing Xue
Abstract: large language models (llms) have become increasingly popular for their advanced text generation capabilities across various domains. however, like any software, they face security challenges, including the risk of 'jailbreak' attacks that manipulate llms to produce prohibited content. a particularly underexplored area is the multilingual jailbreak attack, where malicious questions are translated into various languages to evade safety filters. currently, there is a lack of comprehensive empirical studies addressing this specific threat. to address this research gap, we conducted an extensive empirical study on multilingual jailbreak attacks. we developed a novel semantic-preserving algorithm to create a multilingual jailbreak dataset and conducted an exhaustive evaluation on both widely-used open-source and commercial llms, including gpt-4 and llama. additionally, we performed interpretability analysis to uncover patterns in multilingual jailbreak attacks and implemented a fine-tuning mitigation method. our findings reveal that our mitigation strategy significantly enhances model defense, reducing the attack success rate by 96.2%. this study provides valuable insights into understanding and mitigating multilingual jailbreak attacks.
Wenjie Qu, Dong Yin, Zixin He, Wei Zou, Tianyang Tao, Jinyuan Jia, Jiaheng Zhang
Abstract: large language models (llms) have been widely deployed for their remarkable capability to generate texts resembling human language. however, they could be misused by criminals to create deceptive content, such as fake news and phishing emails, which raises ethical concerns. watermarking is a key technique to mitigate the misuse of llms, which embeds a watermark (e.g., a bit string) into a text generated by a llm. consequently, this enables the detection of texts generated by a llm as well as the tracing of generated texts to a specific user. the major limitation of existing watermark techniques is that they cannot accurately or efficiently extract the watermark from a text, especially when the watermark is a long bit string. this key limitation impedes their deployment for real-world applications, e.g., tracing generated texts to a specific user. this work introduces a novel watermarking method for llm-generated text grounded in \textbf{error-correction codes} to address this challenge. we provide strong theoretical analysis, demonstrating that under bounded adversarial word/token edits (insertion, deletion, and substitution), our method can correctly extract watermarks, offering a provable robustness guarantee. this breakthrough is also evidenced by our extensive experimental results. the experiments show that our method substantially outperforms existing baselines in both accuracy and robustness on benchmark datasets. for instance, when embedding a bit string of length 12 into a 200-token generated text, our approach attains an impressive match rate of $98.4\%$, surpassing the performance of yoo et al. (state-of-the-art baseline) at $85.6\%$. when subjected to a copy-paste attack involving the injection of 50 tokens to generated texts with 200 words, our method maintains a substantial match rate of $90.8\%$, while the match rate of yoo et al. diminishes to below $65\%$.
Alexey Shestov, Anton Cheshkov, Rodion Levichev, Ravil Mussabayev, Pavel Zadorozhny, Evgeny Maslov, Chibirev Vadim, Egor Bulychev
Abstract: this paper presents the results of finetuning large language models (llms) for the task of detecting vulnerabilities in source code. we leverage wizardcoder, a recent improvement of the state-of-the-art llm starcoder, and adapt it for vulnerability detection through further finetuning. to accelerate training, we modify wizardcoder's training procedure, also we investigate optimal training regimes. for the imbalanced dataset with many more negative examples than positive, we also explore different techniques to improve classification performance. the finetuned wizardcoder model achieves improvement in roc auc and f1 measures on balanced and imbalanced vulnerability datasets over codebert-like model, demonstrating the effectiveness of adapting pretrained llms for vulnerability detection in source code. the key contributions are finetuning the state-of-the-art code llm, wizardcoder, increasing its training speed without the performance harm, optimizing the training procedure and regimes, handling class imbalance, and improving performance on difficult vulnerability detection datasets. this demonstrates the potential for transfer learning by finetuning large pretrained language models for specialized source code analysis tasks.
Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
Abstract: although significant efforts have been dedicated to aligning large language models (llms), red-teaming reports suggest that these carefully aligned llms could still be jailbroken through adversarial prompts, tuning, or decoding. upon examining the jailbreaking vulnerability of aligned llms, we observe that the decoding distributions of jailbroken and aligned models differ only in the initial generations. this observation motivates us to propose the weak-to-strong jailbreaking attack, where adversaries can utilize smaller unsafe/aligned llms (e.g., 7b) to guide jailbreaking against significantly larger aligned llms (e.g., 70b). to jailbreak, one only needs to additionally decode two smaller llms once, which involves minimal computation and latency compared to decoding the larger llms. the efficacy of this attack is demonstrated through experiments conducted on five models from three different organizations. our study reveals a previously unnoticed yet efficient way of jailbreaking, exposing an urgent safety issue that needs to be considered when aligning llms. as an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. the code for replicating the method is available at
Andy Zhou, Bo Li, Haohan Wang
Abstract: despite advances in ai alignment, language models (lm) remain vulnerable to adversarial attacks or jailbreaking, in which adversaries modify input prompts to induce harmful behavior. while some defenses have been proposed, they focus on narrow threat models and fall short of a strong defense, which we posit should be effective, universal, and practical. to achieve this, we propose the first adversarial objective for defending lms against jailbreaking attacks and an algorithm, robust prompt optimization (rpo), that uses gradient-based token optimization to enforce harmless outputs. this results in an easily accessible suffix that significantly improves robustness to both jailbreaks seen during optimization and unknown, held-out jailbreaks, reducing the attack success rate on starling-7b from 84% to 8.66% across 20 jailbreaks. in addition, we find that rpo has a minor effect on normal lm use, is successful under adaptive attacks, and can transfer to black-box models, reducing the success rate of the strongest attack on gpt-4 from 92% to 6%.
Xiang Gao, Kamalika Das
Abstract: large language models (llms) are becoming increasingly important for machine learning applications. however, it can be challenging to align llms with our intent, particularly when we want to generate content that is preferable over others or when we want the llm to respond in a certain style or tone that is hard to describe. to address this challenge, we propose an approach that uses contrastive examples to better describe our intent. this involves providing positive examples that illustrate the true intent, along with negative examples that show what characteristics we want llms to avoid. the negative examples can be retrieved from labeled data, written by a human, or generated by the llm itself. before generating an answer, we ask the model to analyze the examples to teach itself what to avoid. this reasoning step provides the model with the appropriate articulation of the user's need and guides it towards generting a better answer. we tested our approach on both synthesized and real-world datasets, including stackexchange and reddit, and found that it significantly improves performance compared to standard few-shot prompting
Kumar Shashwat, Francis Hahn, Xinming Ou, Dmitry Goldgof, Lawrence Hall, Jay Ligatti, S. Raj Rajgopalan, Armin Ziaie Tabari
Abstract: large language models (llm) are perceived to offer promising potentials for automating security tasks, such as those found in security operation centers (socs). as a first step towards evaluating this perceived potential, we investigate the use of llms in software pentesting, where the main task is to automatically identify software security vulnerabilities in source code. we hypothesize that an llm-based ai agent can be improved over time for a specific security task as human operators interact with it. such improvement can be made, as a first step, by engineering prompts fed to the llm based on the responses produced, to include relevant contexts and structures so that the model provides more accurate results. such engineering efforts become sustainable if the prompts that are engineered to produce better results on current tasks, also produce better results on future unknown tasks. to examine this hypothesis, we utilize the owasp benchmark project 1.2 which contains 2,740 hand-crafted source code test cases containing various types of vulnerabilities. we divide the test cases into training and testing data, where we engineer the prompts based on the training data (only), and evaluate the final system on the testing data. we compare the ai agent's performance on the testing data against the performance of the agent without the prompt engineering. we also compare the ai agent's results against those from sonarqube, a widely used static code analyzer for security testing. we built and tested multiple versions of the ai agent using different off-the-shelf llms -- google's gemini-pro, as well as openai's gpt-3.5-turbo and gpt-4-turbo (with both chat completion and assistant apis). the results show that using llms is a viable approach to build an ai agent for software pentesting that can improve through repeated use and prompt engineering.


Michael Feffer, Anusha Sinha, Zachary C. Lipton, Hoda Heidari
Abstract: in response to rising concerns surrounding the safety, security, and trustworthiness of generative ai (genai) models, practitioners and regulators alike have pointed to ai red-teaming as a key component of their strategies for identifying and mitigating these risks. however, despite ai red-teaming's central role in policy discussions and corporate messaging, significant questions remain about what precisely it means, what role it can play in regulation, and how precisely it relates to conventional red-teaming practices as originally conceived in the field of cybersecurity. in this work, we identify recent cases of red-teaming activities in the ai industry and conduct an extensive survey of the relevant research literature to characterize the scope, structure, and criteria for ai red-teaming practices. our analysis reveals that prior methods and practices of ai red-teaming diverge along several axes, including the purpose of the activity (which is often vague), the artifact under evaluation, the setting in which the activity is conducted (e.g., actors, resources, and methods), and the resulting decisions it informs (e.g., reporting, disclosure, and mitigation). in light of our findings, we argue that while red-teaming may be a valuable big-tent idea for characterizing a broad set of activities and attitudes aimed at improving the behavior of genai models, gestures towards red-teaming as a panacea for every possible risk verge on security theater. to move toward a more robust toolbox of evaluations for generative ai, we synthesize our recommendations into a question bank meant to guide and scaffold future ai red-teaming practices.
Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Wei Ma, Lyuye Zhang, Miaolei Shi, Yang Liu
Abstract: large language models (llms) have demonstrated significant potential for many downstream tasks, including those requiring human-level intelligence, such as vulnerability detection. however, recent attempts to use llms for vulnerability detection are still preliminary, as they lack an in-depth understanding of a subject llm's vulnerability reasoning capability -- whether it originates from the model itself or from external assistance, such as invoking tool support and retrieving vulnerability knowledge. in this paper, we aim to decouple llms' vulnerability reasoning capability from their other capabilities, including the ability to actively seek additional information (e.g., via function calling in sota models), adopt relevant vulnerability knowledge (e.g., via vector-based matching and retrieval), and follow instructions to output structured results. to this end, we propose a unified evaluation framework named llm4vuln, which separates llms' vulnerability reasoning from their other capabilities and evaluates how llms' vulnerability reasoning could be enhanced when combined with the enhancement of other capabilities. to demonstrate the effectiveness of llm4vuln, we have designed controlled experiments using 75 ground-truth smart contract vulnerabilities, which were extensively audited as high-risk on code4rena from august to november 2023, and tested them in 4,950 different scenarios across three representative llms (gpt-4, mixtral, and code llama). our results not only reveal ten findings regarding the varying effects of knowledge enhancement, context supplementation, prompt schemes, and models but also enable us to identify 9 zero-day vulnerabilities in two pilot bug bounty programs with over 1,000 usd being awarded.
Jiaxin Yu, Peng Liang, Yujia Fu, Amjed Tahir, Mojtaba Shahin, Chong Wang, Yangxiao Cai
Abstract: security code review aims to combine automated tools and manual efforts to detect security defects during development. the rapid development of large language models (llms) has shown promising potential in software development, as well as opening up new possibilities in automated security code review. to explore the challenges of applying llms in practical code review for security defect detection, this study compared the detection performance of three state-of-the-art llms (gemini pro, gpt-4, and gpt-3.5) under five prompts on 549 code files that contain security defects from real-world code reviews. through analyzing 82 responses generated by the best-performing llm-prompt combination based on 100 randomly selected code files, we extracted and categorized quality problems present in these responses into 5 themes and 16 categories. our results indicate that the responses produced by llms often suffer from verbosity, vagueness, and incompleteness, highlighting the necessity to enhance their conciseness, understandability, and compliance to security defect detection. this work reveals the deficiencies of llm-generated responses in security code review and paves the way for future optimization of llms towards this task.
Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua
Abstract: language model alignment has become an important component of ai safety, allowing safe interactions between humans and language models, by enhancing desired behaviors and inhibiting undesired ones. it is often done by tuning the model or inserting preset aligning prompts. recently, representation engineering, a method which alters the model's behavior via changing its representations post-training, was shown to be effective in aligning llms (zou et al., 2023a). representation engineering yields gains in alignment oriented tasks such as resistance to adversarial attacks and reduction of social biases, but was also shown to cause a decrease in the ability of the model to perform basic tasks. in this paper we study the tradeoff between the increase in alignment and decrease in helpfulness of the model. we propose a theoretical framework which provides bounds for these two quantities, and demonstrate their relevance empirically. interestingly, we find that while the helpfulness generally decreases, it does so quadratically with the norm of the representation engineering vector, while the alignment increases linearly with it, indicating a regime in which it is efficient to use representation engineering. we validate our findings empirically, and chart the boundaries to the usefulness of representation engineering for alignment.
Banghua Zhu, Michael I. Jordan, Jiantao Jiao
Abstract: reinforcement learning from human feedback (rlhf) is a pivotal technique that aligns language models closely with human-centric values. the initial phase of rlhf involves learning human values using a reward model from ranking data. it is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective. this paper delves into these issues, leveraging the theoretical insights to design improved reward learning algorithm termed 'iterative data smoothing' (ids). the core idea is that during each training epoch, we not only update the model with the data, but also update the date using the model, replacing hard labels with soft labels. our empirical findings highlight the superior performance of this approach over the traditional methods.
Terrence Neumann, Sooyong Lee, Maria De-Arteaga, Sina Fazelpour, Matthew Lease
Abstract: the pervasive spread of misinformation and disinformation poses a significant threat to society. professional fact-checkers play a key role in addressing this threat, but the vast scale of the problem forces them to prioritize their limited resources. this prioritization may consider a range of factors, such as varying risks of harm posed to specific groups of people. in this work, we investigate potential implications of using a large language model (llm) to facilitate such prioritization. because fact-checking impacts a wide range of diverse segments of society, it is important that diverse views are represented in the claim prioritization process. this paper examines whether a llm can reflect the views of various groups when assessing the harms of misinformation, focusing on gender as a primary variable. we pose two central questions: (1) to what extent do prompts with explicit gender references reflect gender differences in opinion in the united states on topics of social relevance? and (2) to what extent do gender-neutral prompts align with gendered viewpoints on those topics? to analyze these questions, we present the topicmisinfo dataset, containing 160 fact-checked claims from diverse topics, supplemented by nearly 1600 human annotations with subjective perceptions and annotator demographics. analyzing responses to gender-specific and neutral prompts, we find that gpt 3.5-turbo reflects empirically observed gender differences in opinion but amplifies the extent of these differences. these findings illuminate ai's complex role in moderating online communication, with implications for fact-checkers, algorithm designers, and the use of crowd-workers as annotators. we also release the topicmisinfo dataset to support continuing research in the community.
Tyler Sorensen, Heidy Khlaaf
Abstract: this paper describes leftoverlocals: a vulnerability that allows data recovery from gpu memory created by another process on apple, qualcomm, and amd gpus. leftoverlocals impacts the security posture of gpu applications, with particular significance to llms and ml models that run on impacted gpus. by recovering local memory, an optimized gpu memory region, we built a poc where an attacker can listen into another user's interactive llm session (e.g., llama.cpp) across process or container boundaries.
Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan
Abstract: reinforcement learning from human feedback (rlhf) is a widely adopted approach for aligning large language models with human values. however, rlhf relies on a reward model that is trained with a limited amount of human preference data, which could lead to inaccurate predictions. as a result, rlhf may produce outputs that are misaligned with human values. to mitigate this issue, we contribute a reward ensemble method that allows the reward model to make more accurate predictions. as using an ensemble of large language model-based reward models can be computationally and resource-expensive, we explore efficient ensemble methods including linear-layer ensemble and lora-based ensemble. empirically, we run best-of-$n$ and proximal policy optimization with our ensembled reward models, and verify that our ensemble methods help improve the alignment performance of rlhf outputs.
Nevan Wichers, Carson Denison, Ahmad Beirami
Abstract: red teaming is a common strategy for identifying weaknesses in generative language models (lms), where adversarial prompts are produced that trigger an lm to generate unsafe responses. red teaming is instrumental for both model alignment and evaluation, but is labor-intensive and difficult to scale when done by humans. in this paper, we present gradient-based red teaming (gbrt), a red teaming method for automatically generating diverse prompts that are likely to cause an lm to output unsafe responses. gbrt is a form of prompt learning, trained by scoring an lm response with a safety classifier and then backpropagating through the frozen safety classifier and lm to update the prompt. to improve the coherence of input prompts, we introduce two variants that add a realism loss and fine-tune a pretrained model to generate the prompts instead of learning the prompts directly. our experiments show that gbrt is more effective at finding prompts that trigger an lm to generate unsafe responses than a strong reinforcement learning-based red teaming approach, and succeeds even when the lm has been fine-tuned to produce safer outputs.
Ming Shan Hee, Shivam Sharma, Rui Cao, Palash Nandi, Preslav Nakov, Tanmoy Chakraborty, Roy Ka-Wei Lee
Abstract: in the evolving landscape of online communication, moderating hate speech (hs) presents an intricate challenge, compounded by the multimodal nature of digital content. this comprehensive survey delves into the recent strides in hs moderation, spotlighting the burgeoning role of large language models (llms) and large multimodal models (lmms). our exploration begins with a thorough analysis of current literature, revealing the nuanced interplay between textual, visual, and auditory elements in propagating hs. we uncover a notable trend towards integrating these modalities, primarily due to the complexity and subtlety with which hs is disseminated. a significant emphasis is placed on the advances facilitated by llms and lmms, which have begun to redefine the boundaries of detection and moderation capabilities. we identify existing gaps in research, particularly in the context of underrepresented languages and cultures, and the need for solutions to handle low-resource settings. the survey concludes with a forward-looking perspective, outlining potential avenues for future research, including the exploration of novel ai methodologies, the ethical governance of ai in moderation, and the development of more nuanced, context-aware systems. this comprehensive overview aims to catalyze further research and foster a collaborative effort towards more sophisticated, responsible, and human-centric approaches to hs moderation in the digital era.\footnote{ \textcolor{red}{warning: this paper contains offensive examples.


Masahiro Kaneko, Danushka Bollegala, Naoaki Okazaki, Timothy Baldwin
Abstract: there exist both scalable tasks, like reading comprehension and fact-checking, where model performance improves with model size, and unscalable tasks, like arithmetic reasoning and symbolic reasoning, where model performance does not necessarily improve with model size. large language models (llms) equipped with chain-of-thought (cot) prompting are able to make accurate incremental predictions even on unscalable tasks. unfortunately, despite their exceptional reasoning abilities, llms tend to internalize and reproduce discriminatory societal biases. whether cot can provide discriminatory or egalitarian rationalizations for the implicit information in unscalable tasks remains an open question. in this study, we examine the impact of llms' step-by-step predictions on gender bias in unscalable tasks. for this purpose, we construct a benchmark for an unscalable task where the llm is given a list of words comprising feminine, masculine, and gendered occupational words, and is required to count the number of feminine and masculine words. in our cot prompts, we require the llm to explicitly indicate whether each word in the word list is a feminine or masculine before making the final predictions. with counting and handling the meaning of words, this benchmark has characteristics of both arithmetic reasoning and symbolic reasoning. experimental results in english show that without step-by-step prediction, most llms make socially biased predictions, despite the task being as simple as counting words. interestingly, cot prompting reduces this unconscious social bias in llms and encourages fair predictions.
Aryaman Raina, Prateek Mishra, Harshit Goyal, Dhruv Kumar
Abstract: this study investigates the integration and impact of large language models (llms), like chatgpt, in india's healthcare sector. our research employs a dual approach, engaging both general users and medical professionals through surveys and interviews respectively. our findings reveal that healthcare professionals value chatgpt in medical education and preliminary clinical settings, but exercise caution due to concerns about reliability, privacy, and the need for cross-verification with medical references. general users show a preference for ai interactions in healthcare, but concerns regarding accuracy and trust persist. the study underscores the need for these technologies to complement, not replace, human medical expertise, highlighting the importance of developing llms in collaboration with healthcare providers. this paper enhances the understanding of llms in healthcare, detailing current usage, user trust, and improvement areas. our insights inform future research and development, underscoring the need for ethically compliant, user-focused llm advancements that address healthcare-specific challenges.
Iñigo Parra
Abstract: language models (lms) have become pivotal in the realm of technological advancements. while their capabilities are vast and transformative, they often include societal biases encoded in the human-produced datasets used for their training. this research delves into the inherent biases present in masked language models (mlms), with a specific focus on gender biases. this study evaluated six prominent models: bert, roberta, distilbert, bert-multilingual, xlm-roberta, and distilbert-multilingual. the methodology employed a novel dataset, bifurcated into two subsets: one containing prompts that encouraged models to generate subject pronouns in english, and the other requiring models to return the probabilities of verbs, adverbs, and adjectives linked to the prompts' gender pronouns. the analysis reveals stereotypical gender alignment of all models, with multilingual variants showing comparatively reduced biases.


Ping Guo, Fei Liu, Xi Lin, Qingchuan Zhao, Qingfu Zhang
Abstract: in the rapidly evolving field of machine learning, adversarial attacks present a significant challenge to model robustness and security. decision-based attacks, which only require feedback on the decision of a model rather than detailed probabilities or scores, are particularly insidious and difficult to defend against. this work introduces l-autoda (large language model-based automated decision-based adversarial attacks), a novel approach leveraging the generative capabilities of large language models (llms) to automate the design of these attacks. by iteratively interacting with llms in an evolutionary framework, l-autoda automatically designs competitive attack algorithms efficiently without much human effort. we demonstrate the efficacy of l-autoda on cifar-10 dataset, showing significant improvements over baseline methods in both success rate and computational efficiency. our findings underscore the potential of language models as tools for adversarial attack generation and highlight new avenues for the development of robust ai systems.
Yuxin Liang, Zhuoyang Song, Hao Wang, Jiaxing Zhang
Abstract: we evaluate the ability of large language models (llms) to discern and express their internal knowledge state, a key factor in countering factual hallucination and ensuring reliable application of llms. we observe a robust self-awareness of internal knowledge state in llms, evidenced by over 85% accuracy in knowledge probing. however, llms often fail to express their internal knowledge during generation, leading to factual hallucinations. we develop an automated hallucination annotation tool, dreamcatcher, which merges knowledge probing and consistency checking methods to rank factual preference data. using knowledge preference as reward, we propose a reinforcement learning from knowledge feedback (rlkf) training framework, leveraging reinforcement learning to enhance the factuality and honesty of llms. our experiments across multiple models show that rlkf training effectively enhances the ability of models to utilize their internal knowledge state, boosting performance in a variety of knowledge-based and honesty-related tasks.
Junyi Ye, Mengnan Du, Guiling Wang
Abstract: this paper introduces dataframe question answering (qa), a novel task that utilizes large language models (llms) to generate pandas queries for information retrieval and data analysis on dataframes, emphasizing safe and non-revealing data handling. our method, which solely relies on dataframe column names, not only ensures data privacy but also significantly reduces the context window in the prompt, streamlining information processing and addressing major challenges in llm-based data analysis. we propose dataframe qa as a comprehensive framework that includes safe pandas query generation and code execution. various llms, notably gpt-4, are evaluated using the pass@1 metric on the renowned wikisql and our newly developed 'uci-dataframeqa', tailored for complex data analysis queries. our findings indicate that gpt-4 achieves pass@1 rates of 86% on wikisql and 97% on uci-dataframeqa, underscoring its capability in securely retrieving and aggregating dataframe values and conducting sophisticated data analyses. this approach, deployable in a zero-shot manner without prior training or adjustments, proves to be highly adaptable and secure for diverse applications.
Adam Bales, "William D'Alessandro", Cameron Domenico Kirk-Giannini
Abstract: recent progress in artificial intelligence (ai) has drawn attention to the technology's transformative potential, including what some see as its prospects for causing large-scale harm. we review two influential arguments purporting to show how ai could pose catastrophic risks. the first argument -- the problem of power-seeking -- claims that, under certain assumptions, advanced ai systems are likely to engage in dangerous power-seeking behavior in pursuit of their goals. we review reasons for thinking that ai systems might seek power, that they might obtain it, that this could lead to catastrophe, and that we might build and deploy such systems anyway. the second argument claims that the development of human-level ai will unlock rapid further progress, culminating in ai systems far more capable than any human -- this is the singularity hypothesis. power-seeking behavior on the part of such systems might be particularly dangerous. we discuss a variety of objections to both arguments and conclude by assessing the state of the debate.


Debarati Das, Karin De Langis, Anna Martin-Boyle, Jaehyung Kim, Minhwa Lee, Zae Myung Kim, Shirley Anugrah Hayati, Risako Owan, Bin Hu, Ritik Parkar, Ryan Koo, Jonginn Park, Aahan Tyagi, Libby Ferland, Sanjali Roy, Vincent Liu, Dongyeop Kang
Abstract: this work delves into the expanding role of large language models (llms) in generating artificial data. llms are increasingly employed to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. as these forms of llm-generated data often intersect in their application, they exert mutual influence on each other and raise significant concerns about the quality and diversity of the artificial data incorporated into training cycles, leading to an artificial data ecosystem. to the best of our knowledge, this is the first study to aggregate various types of llm-generated text data, from more tightly constrained data like "task labels" to more lightly constrained "free-form text". we then stress test the quality and implications of llm-generated artificial data, comparing it with human data across various existing benchmarks. despite artificial data's capability to match human performance, this paper reveals significant hidden disparities, especially in complex tasks where llms often miss the nuanced understanding of intrinsic human-generated content. this study critically examines diverse llm-generated data and emphasizes the need for ethical practices in data creation and when using llms. it highlights the llms' shortcomings in replicating human traits and behaviors, underscoring the importance of addressing biases and artifacts produced in llm-generated content for future research and development. all data and code are available on our project page.
Khoa Lam, Benjamin Lange, Borhane Blili-Hamelin, Jovana Davidovic, Shea Brown, Ali Hasan
Abstract: an increasing number of regulations propose the notion of ai audits as an enforcement mechanism for achieving transparency and accountability for ai systems. despite some converging norms around various forms of ai auditing, auditing for the purpose of compliance and assurance currently have little to no agreed upon practices, procedures, taxonomies, and standards. we propose the criterion audit as an operationalizable compliance and assurance external audit framework. we model elements of this approach after financial auditing practices, and argue that ai audits should similarly provide assurance to their stakeholders about ai organizations' ability to govern their algorithms in ways that mitigate harms and uphold human values. we discuss the necessary conditions for the criterion audit, and provide a procedural blueprint for performing an audit engagement in practice. we illustrate how this framework can be adapted to current regulations by deriving the criteria on which bias audits for hiring algorithms can be performed, as required by the recently effective new york city local law 144 of 2021. we conclude by offering critical discussion on the benefits, inherent limitations, and implementation challenges of applying practices of the more mature financial auditing industry to ai auditing where robust guardrails against quality assurance issues are only starting to emerge. our discussion as informed by experiences in performing these audits in practice highlights the critical role that an audit ecosystem plays in ensuring the effectiveness of such methodology.
Ravit Dotan, Borhane Blili-Hamelin, Ravi Madhavan, Jeanna Matthews, Joshua Scarpino
Abstract: researchers, government bodies, and organizations have been repeatedly calling for a shift in the responsible ai community from general principles to tangible and operationalizable practices in mitigating the potential sociotechnical harms of ai. frameworks like the nist ai rmf embody an emerging consensus on recommended practices in operationalizing sociotechnical harm mitigation. however, private sector organizations currently lag far behind this emerging consensus. implementation is sporadic and selective at best. at worst, it is ineffective and can risk serving as a misleading veneer of trustworthy processes, providing an appearance of legitimacy to substantively harmful practices. in this paper, we provide a foundation for a framework for evaluating where organizations sit relative to the emerging consensus on sociotechnical harm mitigation best practices: a flexible maturity model based on the nist ai rmf.
Masaru Isonuma, Ivan Titov
Abstract: in order to enhance the performance of language models while mitigating the risks of generating harmful content, it is crucial to identify which training dataset affects the model's outputs. ideally, we can measure the influence of each dataset by removing it from training; however, it is prohibitively expensive to retrain a model multiple times. this paper presents untrac, which estimates the influence of a training dataset by unlearning it from the trained model. untrac is extremely simple; each training dataset is unlearned by gradient ascent, and we evaluate how much the model's predictions change after unlearning. we empirically examine if our methods can assess the influence of pretraining datasets on generating toxic, biased, and untruthful content. experimental results demonstrate that our method estimates their influence much more accurately than existing methods while requiring neither excessive memory space nor multiple model checkpoints.
Zhicheng Lin
Abstract: generative artificial intelligence tools like large language models are rapidly transforming academic research and real world applications. however, discussions on ethical guidelines for generative ai in science remain fragmented, underscoring the urgent need for consensus based standards. this paper offers an initial framework by developing analyses and mitigation strategies across five key themes: understanding model limitations regarding truthfulness and bias; respecting privacy, confidentiality, and copyright; avoiding plagiarism and policy violations when incorporating model output; ensuring applications provide overall benefit; and using ai transparently and reproducibly. common scenarios are outlined to demonstrate potential ethical violations. we argue that global consensus coupled with professional training and reasonable enforcement are critical to promoting the benefits of ai while safeguarding research integrity.


Weihao Tan, Wentao Zhang, Shanqi Liu, Longtao Zheng, Xinrun Wang, Bo An
Abstract: despite the impressive performance across numerous tasks, large language models (llms) often fail in solving simple decision-making tasks due to the misalignment of the knowledge in llms with environments. on the contrary, reinforcement learning (rl) agents learn policies from scratch, which makes them always align with environments but difficult to incorporate prior knowledge for efficient explorations. to narrow the gap, we propose twosome, a novel general online framework that deploys llms as decision-making agents to efficiently interact and align with embodied environments via rl without requiring any prepared datasets or prior knowledge of the environments. firstly, we query the joint probabilities of each valid action with llms to form behavior policies. then, to enhance the stability and robustness of the policies, we propose two normalization methods and summarize four prompt design principles. finally, we design a novel parameter-efficient training architecture where the actor and critic share one frozen llm equipped with low-rank adapters (lora) updated by ppo. we conduct extensive experiments to evaluate twosome. i) twosome exhibits significantly better sample efficiency and performance compared to the conventional rl method, ppo, and prompt tuning method, saycan, in both classical decision-making environment, overcooked, and simulated household environment, virtualhome. ii) benefiting from llms' open-vocabulary feature, twosome shows superior generalization ability to unseen tasks. iii) under our framework, there is no significant loss of the llms' original ability during online ppo finetuning.
Inhwa Song, Sachin R. Pendse, Neha Kumar, Munmun De Choudhury
Abstract: people experiencing severe distress increasingly use large language model (llm) chatbots as mental health support tools. discussions on social media have described how engagements were lifesaving for some, but evidence suggests that general-purpose llm chatbots also have notable risks that could endanger the welfare of users if not designed responsibly. in this study, we investigate the lived experiences of people who have used llm chatbots for mental health support. we build on interviews with 21 individuals from globally diverse backgrounds to analyze how users create unique support roles for their chatbots, fill in gaps in everyday care, and navigate associated cultural limitations when seeking support from chatbots. we ground our analysis in psychotherapy literature around effective support, and introduce the concept of therapeutic alignment, or aligning ai with therapeutic values for mental health contexts. our study offers recommendations for how designers can approach the ethical and effective use of llm chatbots and other ai mental health support tools in mental health care.
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, Dylan Hadfield-Menell
Abstract: external audits of ai systems are increasingly recognized as a key mechanism for ai governance. the effectiveness of an audit, however, depends on the degree of system access granted to auditors. recent audits of state-of-the-art ai systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. however, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning. meanwhile, outside-the-box access to its training and deployment information (e.g., methodology, code, documentation, hyperparameters, data, deployment details, findings from internal evaluations) allows for auditors to scrutinize the development process and design more targeted evaluations. in this paper, we examine the limitations of black-box audits and the advantages of white- and outside-the-box audits. we also discuss technical, physical, and legal safeguards for performing these audits with minimal security risks. given that different forms of access can lead to very different levels of evaluation, we conclude that (1) transparency regarding the access and methods used by auditors is necessary to properly interpret audit results, and (2) white- and outside-the-box access allow for substantially more scrutiny than black-box access alone.
Justin D. Weisz, Jessica He, Michael Muller, Gabriela Hoefer, Rachel Miles, Werner Geyer
Abstract: generative ai applications present unique design challenges. as generative ai technologies are increasingly being incorporated into mainstream applications, there is an urgent need for guidance on how to design user experiences that foster effective and safe use. we present six principles for the design of generative ai applications that address unique characteristics of generative ai ux and offer new interpretations and extensions of known issues in the design of ai applications. each principle is coupled with a set of design strategies for implementing that principle via ux capabilities or through the design process. the principles and strategies were developed through an iterative process involving literature review, feedback from design practitioners, validation against real-world generative ai applications, and incorporation into the design process of two generative ai applications. we anticipate the principles to usefully inform the design of generative ai applications by driving actionable design recommendations.
Kimon Kieslich, Natali Helberger, Nicholas Diakopoulos
Abstract: as a general purpose technology without a concrete pre-defined purpose, personal chatbots can be used for a whole range of objectives, depending on the personal needs, contexts, and tasks of an individual, and so potentially impact a variety of values, people, and social contexts. traditional methods of risk assessment are confronted with several challenges: the lack of a clearly defined technology purpose, the lack of a clearly defined values to orient on, the heterogeneity of uses, and the difficulty of actively engaging citizens themselves in anticipating impacts from the perspective of their individual lived realities. in this article, we leverage scenario writing at scale as a method for anticipating ai impact that is responsive to these challenges. the advantages of the scenario method are its ability to engage individual users and stimulate them to consider how chatbots are likely to affect their reality and so collect different impact scenarios depending on the cultural and societal embedding of a heterogeneous citizenship. empirically, we tasked 106 us-citizens to write short fictional stories about the future impact (whether desirable or undesirable) of ai-based personal chatbots on individuals and society and, in addition, ask respondents to explain why these impacts are important and how they relate to their values. in the analysis process, we map those impacts and analyze them in relation to socio-demographic as well as ai-related attitudes of the scenario writers. we show that our method is effective in (1) identifying and mapping desirable and undesirable impacts of ai-based personal chatbots, (2) setting these impacts in relation to values that are important for individuals, and (3) detecting socio-demographic and ai-attitude related differences of impact anticipation.


Hongzhan Lin, Ziyang Luo, Wei Gao, Jing Ma, Bo Wang, Ruichao Yang
Abstract: the age of social media is flooded with internet memes, necessitating a clear grasp and effective identification of harmful ones. this task presents a significant challenge due to the implicit meaning embedded in memes, which is not explicitly conveyed through the surface text and image. however, existing harmful meme detection methods do not present readable explanations that unveil such implicit meaning to support their detection decisions. in this paper, we propose an explainable approach to detect harmful memes, achieved through reasoning over conflicting rationales from both harmless and harmful positions. specifically, inspired by the powerful capacity of large language models (llms) on text generation and reasoning, we first elicit multimodal debate between llms to generate the explanations derived from the contradictory arguments. then we propose to fine-tune a small language model as the debate judge for harmfulness inference, to facilitate multimodal fusion between the harmfulness rationales and the intrinsic multimodal information within memes. in this way, our model is empowered to perform dialectical reasoning over intricate and implicit harm-indicative patterns, utilizing multimodal explanations originating from both harmless and harmful arguments. extensive experiments on three public meme datasets demonstrate that our harmful meme detection approach achieves much better performance than state-of-the-art methods and exhibits a superior capacity for explaining the meme harmfulness of the model predictions.
Kimon Kieslich, Marco Lünich
Abstract: ai is increasingly being used in the public sector, including public security. in this context, the use of ai-powered remote biometric identification (rbi) systems is a much-discussed technology. rbi systems are used to identify criminal activity in public spaces, but are criticised for inheriting biases and violating fundamental human rights. it is therefore important to ensure that such systems are developed in the public interest, which means that any technology that is deployed for public use needs to be scrutinised. while there is a consensus among business leaders, policymakers and scientists that ai must be developed in an ethical and trustworthy manner, scholars have argued that ethical guidelines do not guarantee ethical ai, but rather prevent stronger regulation of ai. as a possible counterweight, public opinion can have a decisive influence on policymakers to establish boundaries and conditions under which ai systems should be used -- if at all. however, we know little about the conditions that lead to regulatory demand for ai systems. in this study, we focus on the role of trust in ai as well as trust in law enforcement as potential factors that may lead to demands for regulation of ai technology. in addition, we explore the mediating effects of discrimination perceptions regarding rbi. we test the effects on four different use cases of rbi varying the temporal aspect (real-time vs. post hoc analysis) and purpose of use (persecution of criminals vs. safeguarding public events) in a survey among german citizens. we found that german citizens do not differentiate between the different modes of application in terms of their demand for rbi regulation. furthermore, we show that perceptions of discrimination lead to a demand for stronger regulation, while trust in ai and trust in law enforcement lead to opposite effects in terms of demand for a ban on rbi systems.
Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas Mayer, Padhraic Smyth
Abstract: for large language models (llms) to be trusted by humans they need to be well-calibrated in the sense that they can accurately assess and communicate how likely it is that their predictions are correct. recent work has focused on the quality of internal llm confidence assessments, but the question remains of how well llms can communicate this internal model confidence to human users. this paper explores the disparity between external human confidence in an llm's responses and the internal confidence of the model. through experiments involving multiple-choice questions, we systematically examine human users' ability to discern the reliability of llm outputs. our study focuses on two key areas: (1) assessing users' perception of true llm confidence and (2) investigating the impact of tailored explanations on this perception. the research highlights that default explanations from llms often lead to user overestimation of both the model's confidence and its' accuracy. by modifying the explanations to more accurately reflect the llm's internal confidence, we observe a significant shift in user perception, aligning it more closely with the model's actual confidence levels. this adjustment in explanatory approach demonstrates potential for enhancing user trust and accuracy in assessing llm outputs. the findings underscore the importance of transparent communication of confidence levels in llms, particularly in high-stakes applications where understanding the reliability of ai-generated information is essential.
Nayoung Kim, Myke C. Cohen, Yang Ba, Anna Pan, Shawaiz Bhatti, Pouria Salehi, James Sung, Erik Blasch, Michelle V. Mancenido, Erin K. Chiou
Abstract: designing for ai trustworthiness is challenging, with a lack of practical guidance despite extensive literature on trust. the multisource ai scorecard table (mast), a checklist rating system, addresses this gap in designing and evaluating ai-enabled decision support systems. we propose the principled approach for designing trustable human-centered ai systems using mast methodology (padthai-mm), a nine-step framework what we demonstrate through the iterative design of a text analysis platform called the reporting assistant for defense and intelligence tasks (readit). we designed two versions of readit, high-mast including ai context and explanations, and low-mast resembling a "black box" type system. participant feedback and state-of-the-art ai knowledge was integrated in the design process, leading to a redesigned prototype tested by participants in an intelligence reporting task. results show that mast-guided design can improve trust perceptions, and that mast criteria can be linked to performance, process, and purpose information, providing a practical and theory-informed basis for ai system design.
Yifan Yang, Xiaoyu Liu, Qiao Jin, Furong Huang, Zhiyong Lu
Abstract: large language models like gpt-3.5-turbo and gpt-4 hold promise for healthcare professionals, but they may inadvertently inherit biases during their training, potentially affecting their utility in medical applications. despite few attempts in the past, the precise impact and extent of these biases remain uncertain. through both qualitative and quantitative analyses, we find that these models tend to project higher costs and longer hospitalizations for white populations and exhibit optimistic views in challenging medical scenarios with much higher survival rates. these biases, which mirror real-world healthcare disparities, are evident in the generation of patient backgrounds, the association of specific diseases with certain races, and disparities in treatment recommendations, etc. our findings underscore the critical need for future research to address and mitigate biases in language models, especially in critical healthcare applications, to ensure fair and accurate outcomes for all patients.
Yepeng Liu, Yuheng Bu
Abstract: the advancement of large language models (llms) has led to increasing concerns about the misuse of ai-generated text, and watermarking for llm-generated text has emerged as a potential solution. however, it is challenging to generate high-quality watermarked text while maintaining strong security, robustness, and the ability to detect watermarks without prior knowledge of the prompt or model. this paper proposes an adaptive watermarking strategy to address this problem. to improve the text quality and maintain robustness, we adaptively add watermarking to token distributions with high entropy measured using an auxiliary model and keep the low entropy token distributions untouched. for the sake of security and to further minimize the watermark's impact on text quality, instead of using a fixed green/red list generated from a random secret key, which can be vulnerable to decryption and forgery, we adaptively scale up the output logits in proportion based on the semantic embedding of previously generated text using a well designed semantic mapping model. our experiments involving various llms demonstrate that our approach achieves comparable robustness performance to existing watermark methods. additionally, the text generated by our method has perplexity comparable to that of \emph{un-watermarked} llms while maintaining security even under various attacks.


Krishna Ronanki, Beatriz Cabrero-Daniel, Christian Berger
Abstract: recent generative artificial intelligence (genai) trends focus on various applications, including creating stories, illustrations, poems, articles, computer code, music compositions, and videos. extrinsic hallucinations are a critical limitation of such genai, which can lead to significant challenges in achieving and maintaining the trustworthiness of genai. in this paper, we propose two new concepts that we believe will aid the research community in addressing limitations associated with the application of genai models. first, we propose a definition for the "desirability" of genai outputs and three factors which are observed to influence it. second, drawing inspiration from martin fowler's code smells, we propose the concept of "prompt smells" and the adverse effects they are observed to have on the desirability of genai outputs. we expect our work will contribute to the ongoing conversation about the desirability of genai outputs and help advance the field in a meaningful way.
Haoyan Luo, Lucia Specia
Abstract: this survey paper delves into the burgeoning field of explainability for large language models (llms), a critical yet challenging aspect of natural language processing. with llms playing a pivotal role in various applications, their "black-box" nature raises concerns about transparency and ethical use. this paper emphasizes the necessity for enhanced explainability in llms, addressing both the general public's trust and the technical community's need for a deeper understanding of these models. we concentrate on pre-trained transformer-based llms, such as llama, which present unique interpretability challenges due to their scale and complexity. our review categorizes existing explainability methods and discusses their application in improving model transparency and reliability. we also discuss representative evaluation methods, highlighting their strengths and limitations. the goal of this survey is to bridge the gap between theoretical understanding and practical application, offering insights for future research and development in the field of llm explainability.
Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu
Abstract: vlms (vision-language models) extend the capabilities of llms (large language models) to accept multimodal inputs. since it has been verified that llms can be induced to generate harmful or inaccurate content through specific test cases (termed as red teaming), how vlms perform in similar scenarios, especially with their combination of textual and visual inputs, remains a question. to explore this problem, we present a novel red teaming dataset rtvlm, which encompasses 10 subtasks (e.g., image misleading, multi-modal jail-breaking, face fairness, etc) under 4 primary aspects (faithfulness, privacy, safety, fairness). our rtvlm is the first red-teaming dataset to benchmark current vlms in terms of these 4 different aspects. detailed analysis shows that 10 prominent open-sourced vlms struggle with the red teaming in different degrees and have up to 31% performance gap with gpt-4v. additionally, we simply apply red teaming alignment to llava-v1.5 with supervised fine-tuning (sft) using rtvlm, and this bolsters the models' performance with 10% in rtvlm test set, 13% in mm-hal, and without noticeable decline in mm-bench, overpassing other llava-based models with regular alignment data. this reveals that current open-sourced vlms still lack red teaming alignment. our code and datasets will be open-source.
Rick Rejeleene, Xiaowei Xu, John Talburt
Abstract: large language models (llm) are generating information at a rapid pace, requiring users to increasingly rely and trust the data. despite remarkable advances of llm, information generated by llm is not completely trustworthy, due to challenges in information quality. specifically, integrity of information quality decreases due to unreliable, biased, tokenization during pre-training of llm. moreover, due to decreased information quality issues, has led towards hallucination, fabricated information. unreliable information can lead towards flawed decisions in businesses, which impacts economic activity. in this work, we introduce novel mathematical information quality evaluation of llm, we furthermore analyze and highlight information quality challenges, scaling laws to systematically scale language models.
Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, Daniel Khashabi
Abstract: as the influence of large language models (llms) spans across global communities, their safety challenges in multilingual settings become paramount for alignment research. this paper examines the variations in safety challenges faced by llms across different languages and discusses approaches to alleviating such concerns. by comparing how state-of-the-art llms respond to the same set of malicious prompts written in higher- vs. lower-resource languages, we observe that (1) llms tend to generate unsafe responses much more often when a malicious prompt is written in a lower-resource language, and (2) llms tend to generate more irrelevant responses to malicious prompts in lower-resource languages. to understand where the discrepancy can be attributed, we study the effect of instruction tuning with reinforcement learning from human feedback (rlhf) or supervised finetuning (sft) on the hh-rlhf dataset. surprisingly, while training with high-resource languages improves model alignment, training in lower-resource languages yields minimal improvement. this suggests that the bottleneck of cross-lingual alignment is rooted in the pretraining stage. our findings highlight the challenges in cross-lingual llm safety, and we hope they inform future research in this direction.
Alan Chan, Carson Ezell, Max Kaufmann, Kevin Wei, Lewis Hammond, Herbie Bradley, Emma Bluemke, Nitarshan Rajkumar, David Krueger, Noam Kolt, Lennart Heim, Markus Anderljung
Abstract: increased delegation of commercial, scientific, governmental, and personal activities to ai agents -- systems capable of pursuing complex goals with limited supervision -- may exacerbate existing societal risks and introduce new risks. understanding and mitigating these risks involves critically evaluating existing governance structures, revising and adapting these structures where needed, and ensuring accountability of key stakeholders. information about where, why, how, and by whom certain ai agents are used, which we refer to as visibility, is critical to these objectives. in this paper, we assess three categories of measures to increase visibility into ai agents: agent identifiers, real-time monitoring, and activity logging. for each, we outline potential implementations that vary in intrusiveness and informativeness. we analyze how the measures apply across a spectrum of centralized through decentralized deployment contexts, accounting for various actors in the supply chain including hardware and software service providers. finally, we discuss the implications of our measures for privacy and concentration of power. further work into understanding the measures and mitigating their negative impacts can help to build a foundation for the governance of ai agents.


Ziwei Xu, Sanjay Jain, Mohan Kankanhalli
Abstract: hallucination has been widely recognized to be a significant drawback for large language models (llms). there have been many works that attempt to reduce the extent of hallucination. these efforts have mostly been empirical so far, which cannot answer the fundamental question whether it can be completely eliminated. in this paper, we formalize the problem and show that it is impossible to eliminate hallucination in llms. specifically, we define a formal world where hallucination is defined as inconsistencies between a computable llm and a computable ground truth function. by employing results from learning theory, we show that llms cannot learn all of the computable functions and will therefore always hallucinate. since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world llms. furthermore, for real world llms constrained by provable time complexity, we describe the hallucination-prone tasks and empirically validate our claims. finally, using the formal world framework, we discuss the possible mechanisms and efficacies of existing hallucination mitigators as well as the practical implications on the safe deployment of llms.
Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao
Abstract: multi-agent systems, augmented with large language models (llms), demonstrate significant capabilities for collective intelligence. however, the potential misuse of this intelligence for malicious purposes presents significant risks. to date, comprehensive research on the safety issues associated with multi-agent systems remains limited. from the perspective of agent psychology, we discover that the dark psychological states of agents can lead to severe safety issues. to address these issues, we propose a comprehensive framework grounded in agent psychology. in our framework, we focus on three aspects: identifying how dark personality traits in agents might lead to risky behaviors, designing defense strategies to mitigate these risks, and evaluating the safety of multi-agent systems from both psychological and behavioral perspectives. our experiments reveal several intriguing phenomena, such as the collective dangerous behaviors among agents, agents' propensity for self-reflection when engaging in dangerous behavior, and the correlation between agents' psychological assessments and their dangerous behaviors. we anticipate that our framework and observations will provide valuable insights for further research into the safety of multi-agent systems. we will make our data and code publicly accessible at https:/
Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn
Abstract: the success of reinforcement learning from human feedback (rlhf) in language model alignment is strongly dependent on the quality of the underlying reward model. in this paper, we present a novel approach to improve reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. motivated by the promising results of best-of-n sampling strategies in language model training, we extend their application to reward model training. this results in a self-training strategy to generate preference pairs by selecting the best and worst candidates in a pool of responses to a given query. empirically, we find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data. this work opens up new avenues of research for improving rlhf for language model alignment, by offering synthetic preference generation as a solution to reward modeling challenges.
Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
Abstract: aligning large language models (llms) with human preferences through reinforcement learning (rlhf) can lead to reward hacking, where llms exploit failures in the reward model (rm) to achieve seemingly high rewards without meeting the underlying objectives. we identify two primary challenges when designing rms to mitigate reward hacking: distribution shifts during the rl process and inconsistencies in human preferences. as a solution, we propose weight averaged reward models (warm), first fine-tuning multiple rms, then averaging them in the weight space. this strategy follows the observation that fine-tuned weights remain linearly mode connected when sharing the same pre-training. by averaging weights, warm improves efficiency compared to the traditional ensembling of predictions, while improving reliability under distribution shifts and robustness to preference inconsistencies. our experiments on summarization tasks, using best-of-n and rl methods, shows that warm improves the overall quality and alignment of llm predictions; for example, a policy rl fine-tuned with warm has a 79.4% win rate against a policy rl fine-tuned with a single rm.
Ashutosh Kumar, Sagarika Singh, Shiv Vignesh Murty, Swathy Ragupathy
Abstract: this paper comprehensively explores the ethical challenges arising from security threats to language learning models (llms). these intricate digital repositories are increasingly integrated into our daily lives, making them prime targets for attacks that can compromise their training data and the confidentiality of their data sources. the paper delves into the nuanced ethical repercussions of such security threats on society and individual privacy. we scrutinize five major threats: prompt injection, jailbreaking, personal identifiable information (pii) exposure, sexually explicit content, and hate based content, going beyond mere identification to assess their critical ethical consequences and the urgency they create for robust defensive strategies. the escalating reliance on llms underscores the crucial need for ensuring these systems operate within the bounds of ethical norms, particularly as their misuse can lead to significant societal and individual harm. we propose conceptualizing and developing an evaluative tool tailored for llms, which would serve a dual purpose, guiding developers and designers in preemptive fortification of backend systems and scrutinizing the ethical dimensions of llm chatbot responses during the testing phase. by comparing llm responses with those expected from humans in a moral context, we aim to discern the degree to which ai behaviors align with the ethical values held by a broader society. ultimately, this paper not only underscores the ethical troubles presented by llms, it also highlights a path toward cultivating trust in these systems.
Weixin Chen, Bo Li
Abstract: truthfulness is paramount for large language models (llms) as they are increasingly deployed in real-world applications. however, existing llms still struggle with generating truthful answers and content, as evidenced by their modest performance on benchmarks like truthfulqa. to address this issue, we propose gradual self-truthifying (grath), a novel post-processing method to enhance truthfulness of llms. grath utilizes out-of-domain question prompts to generate corresponding answers and adaptively optimizes the model via direct preference optimization (dpo). note that during this process, grath learns truthfulness in a self-supervised manner without requiring annotated answers. in particular, grath first generates pairwise truthfulness training data by prompting the llm itself, with each pair containing a question and its correct and incorrect answers. the model is then fine-tuned using dpo to learn from the difference between answer pairs. subsequently, grath iteratively refines the truthfulness data and optimizes the model, leading to a gradual improvement in model truthfulness. empirically, we evaluate grath using different 7b-llms and compare with llms with similar or even larger sizes on benchmark datasets. our results show that grath effectively improves llms' truthfulness without compromising other core capabilities. notably, grath achieves state-of-the-art performance on truthfulqa, with mc1 accuracy as 54.71% and mc2 accuracy as 69.10%, which even surpass those on larger-scale models, such as llama2-chat-70b, by 23.62% and 24.18%, respectively.
Kyrie Zhixuan Zhou, Zachary Kilhoffer, Madelyn Rose Sanfilippo, Ted Underwood, Ece Gumusel, Mengyi Wei, Abhinav Choudhry, Jinjun Xiong
Abstract: large language models (llms) are advancing quickly and impacting people's lives for better or worse. in higher education, concerns have emerged such as students' misuse of llms and degraded education outcomes. to unpack the ethical concerns of llms for higher education, we conducted a case study consisting of stakeholder interviews (n=20) in higher education computer science. we found that students use several distinct mental models to interact with llms - llms serve as a tool for (a) writing, (b) coding, and (c) information retrieval, which differ somewhat in ethical considerations. students and teachers brought up ethical issues that directly impact them, such as inaccurate llm responses, hallucinations, biases, privacy leakage, and academic integrity issues. participants emphasized the necessity of guidance and rules for the use of llms in higher education, including teaching digital literacy, rethinking education, and having cautious and contextual policies. we reflect on the ethical challenges and propose solutions.
Zhaoyue Wang
Abstract: when we design and deploy an reinforcement learning (rl) agent, reward functions motivates agents to achieve an objective. an incorrect or incomplete specification of the objective can result in behavior that does not align with human values - failing to adhere with social and moral norms that are ambiguous and context dependent, and cause undesired outcomes such as negative side effects and exploration that is unsafe. previous work have manually defined reward functions to avoid negative side effects, use human oversight for safe exploration, or use foundation models as planning tools. this work studies the ability of leveraging large language models (llm)' understanding of morality and social norms on safe exploration augmented rl methods. this work evaluates language model's result against human feedbacks and demonstrates language model's capability as direct reward signals.
Keming Lu, Bowen Yu, Chang Zhou, Jingren Zhou
Abstract: considerable efforts have been invested in augmenting the role-playing proficiency of open-source large language models (llms) by emulating proprietary counterparts. nevertheless, we posit that llms inherently harbor role-play capabilities, owing to the extensive knowledge of characters and potential dialogues ingrained in their vast training corpora. thus, in this study, we introduce ditto, a self-alignment method for role-play. ditto capitalizes on character knowledge, encouraging an instruction-following llm to simulate role-play dialogues as a variant of reading comprehension. this method creates a role-play training set comprising 4,000 characters, surpassing the scale of currently available datasets by tenfold regarding the number of roles. subsequently, we fine-tune the llm using this self-generated dataset to augment its role-playing capabilities. upon evaluating our meticulously constructed and reproducible role-play benchmark and the roleplay subset of mt-bench, ditto, in various parameter scales, consistently maintains a consistent role identity and provides accurate role-specific knowledge in multi-turn role-play conversations. notably, it outperforms all open-source role-play baselines, showcasing performance levels comparable to advanced proprietary chatbots. furthermore, we present the first comprehensive cross-supervision alignment experiment in the role-play domain, revealing that the intrinsic capabilities of llms confine the knowledge within role-play. meanwhile, the role-play styles can be easily acquired with the guidance of smaller models. we open-source related resources at


Songyang Gao, Qiming Ge, Wei Shen, Shihan Dou, Junjie Ye, Xiao Wang, Rui Zheng, Yicheng Zou, Zhi Chen, Hang Yan, Qi Zhang, Dahua Lin
Abstract: the success of ai assistants based on language models (llms) hinges on reinforcement learning from human feedback (rlhf) to comprehend and align with user intentions. however, traditional alignment algorithms, such as ppo, are hampered by complex annotation and training requirements. this reliance limits the applicability of rlhf and hinders the development of professional assistants tailored to diverse human preferences. in this work, we introduce \textit{linear alignment}, a novel algorithm that aligns language models with human preferences in one single inference step, eliminating the reliance on data annotation and model training. linear alignment incorporates a new parameterization for policy optimization under divergence constraints, which enables the extraction of optimal policy in a closed-form manner and facilitates the direct estimation of the aligned response. extensive experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of llm alignment across diverse scenarios. our code and dataset will be published on \url{}.


Yoo Yeon Sung, Ishani Mondal, Jordan Boyd-Graber
Abstract: dynamic adversarial question generation, where humans write examples to stump a model, aims to create examples that are realistic and informative. however, the advent of large language models (llms) has been a double-edged sword for human authors: more people are interested in seeing and pushing the limits of these models, but because the models are so much stronger an opponent, they are harder to defeat. to understand how these models impact adversarial question writing process, we enrich the writing guidance with llms and retrieval models for the authors to reason why their questions are not adversarial. while authors could create interesting, challenging adversarial questions, they sometimes resort to tricks that result in poor questions that are ambiguous, subjective, or confusing not just to a computer but also to humans. to address these issues, we propose new metrics and incentives for eliciting good, challenging questions and present a new dataset of adversarially authored questions.
Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, Xipeng Qiu
Abstract: with the rapid development of large language models (llms), they are not only used as general-purpose ai assistants but are also customized through further fine-tuning to meet the requirements of different applications. a pivotal factor in the success of current llms is the alignment process. current alignment methods, such as supervised fine-tuning (sft) and reinforcement learning from human feedback (rlhf), focus on training-time alignment and are often complex and cumbersome to implement. therefore, we develop \textbf{inferaligner}, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment. inferaligner utilizes safety steering vectors extracted from safety-aligned model to modify the activations of the target model when responding to harmful inputs, thereby guiding the target model to provide harmless responses. experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics, as well as to multimodal large language models (mllms) such as llava. it significantly diminishes the attack success rate (asr) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
Christian Tarsney
Abstract: large language models now possess human-level linguistic abilities in many contexts. this raises the concern that they can be used to deceive and manipulate on unprecedented scales, for instance spreading political misinformation on social media. in future, agentic ai systems might also deceive and manipulate humans for their own ends. in this paper, first, i argue that ai-generated content should be subject to stricter standards against deception and manipulation than we ordinarily apply to humans. second, i offer new characterizations of ai deception and manipulation meant to support such standards, according to which a statement is deceptive (manipulative) if it leads human addressees away from the beliefs (choices) they would endorse under ``semi-ideal'' conditions. third, i propose two measures to guard against ai deception and manipulation, inspired by this characterization: "extreme transparency" requirements for ai-generated content and defensive systems that, among other things, annotate ai-generated statements with contextualizing information. finally, i consider to what extent these measures can protect against deceptive behavior in future, agentic ais, and argue that non-agentic defensive systems can provide an important layer of defense even against more powerful agentic systems.


Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria
Abstract: in the rapidly advancing field of artificial intelligence, the concept of red-teaming or jailbreaking large language models (llms) has emerged as a crucial area of study. this approach is especially significant in terms of assessing and enhancing the safety and robustness of these models. this paper investigates the intricate consequences of such modifications through model editing, uncovering a complex relationship between enhancing model accuracy and preserving its ethical integrity. our in-depth analysis reveals a striking paradox: while injecting accurate information is crucial for model reliability, it can paradoxically destabilize the model's foundational framework, resulting in unpredictable and potentially unsafe behaviors. additionally, we propose a benchmark dataset nichehazardqa to investigate this unsafe behavior both within the same and cross topical domain. this aspect of our research sheds light on how the edits, impact the model's safety metrics and guardrails. our findings show that model editing serves as a cost-effective tool for topical red-teaming by methodically applying targeted edits and evaluating the resultant model behavior
Fanqi Wan, Xinting Huang, Leyang Cui, Xiaojun Quan, Wei Bi, Shuming Shi
Abstract: while large language models (llms) have proven to be exceptional on a variety of tasks after alignment, they may still produce responses that contradict the context or world knowledge confidently, a phenomenon known as ``hallucination''. in this paper, we demonstrate that reducing the inconsistency between the external knowledge encapsulated in the training data and the intrinsic knowledge inherited in the pretraining corpus could mitigate hallucination in alignment. specifically, we introduce a novel knowledge consistent alignment (kca) approach, which involves automatically formulating examinations based on external knowledge for accessing the comprehension of llms. for data encompassing knowledge inconsistency, kca implements several simple yet efficient strategies for processing. we illustrate the superior performance of the proposed kca approach in mitigating hallucinations across six benchmarks using llms of different backbones and scales. furthermore, we confirm the correlation between knowledge inconsistency and hallucination, signifying the effectiveness of reducing knowledge inconsistency in alleviating hallucinations. our code, model weights, and data are public at \url{}.
Adib Hasan, Ileana Rugina, Alex Wang
Abstract: large language models (llms) are vulnerable to `jailbreaking' prompts, a type of attack that can coax these models into generating harmful and illegal content. in this paper, we show that pruning up to 20% of llm parameters markedly increases their resistance to such attacks without additional training and without sacrificing their performance in standard benchmarks. intriguingly, we discovered that the enhanced safety observed post-pruning correlates to the initial safety training level of the model, hinting that the effect of pruning could be more general and may hold for other llm behaviors beyond safety. additionally, we introduce a curated dataset of 225 harmful tasks across five categories, inserted into ten different jailbreaking prompts, showing that pruning aids llms in concentrating attention on task-relevant tokens in jailbreaking prompts. lastly, our experiments reveal that the prominent chat models, such as llama-2 chat, vicuna, and mistral instruct exhibit high susceptibility to jailbreaking attacks, with some categories achieving nearly 70-100% success rate. these insights underline the potential of pruning as a generalizable approach for improving llm safety, reliability, and potentially other desired behaviors.
Shaina Raza, Shardul Ghuge, Chen Ding, Deval Pandya
Abstract: the rapid evolution of large language models (llms) underscores the critical importance of ethical considerations and data integrity in ai development, emphasizing the role of fair (findable, accessible, interoperable, reusable) data principles. while these principles have long been a cornerstone of ethical data stewardship, their application in llm training data is less prevalent, an issue our research aims to address. our study begins with a review of existing literature, highlighting the significance of fair principles in data management for model training. building on this foundation, we introduce a novel framework that incorporates fair principles into the llm training process. a key aspect of this approach is a comprehensive checklist, designed to assist researchers and developers in consistently applying fair data principles throughout the model development lifecycle. the practicality and effectiveness of our framework are demonstrated through a case study that involves creating a fair-compliant dataset to detect and reduce biases. this case study not only validates the usefulness of our framework but also establishes new benchmarks for more equitable, transparent, and ethical practices in llm training. we offer this framework to the community as a means to promote technologically advanced, ethically sound, and socially responsible ai models.
Chaofan Shou, Jing Liu, Doudou Lu, Koushik Sen
Abstract: as blockchain platforms grow exponentially, millions of lines of smart contract code are being deployed to manage extensive digital assets. however, vulnerabilities in this mission-critical code have led to significant exploitations and asset losses. thorough automated security analysis of smart contracts is thus imperative. this paper introduces llm4fuzz to optimize automated smart contract security analysis by leveraging large language models (llms) to intelligently guide and prioritize fuzzing campaigns. while traditional fuzzing suffers from low efficiency in exploring the vast state space, llm4fuzz employs llms to direct fuzzers towards high-value code regions and input sequences more likely to trigger vulnerabilities. additionally, llm4fuzz can leverage llms to guide fuzzers based on user-defined invariants, reducing blind exploration overhead. evaluations of llm4fuzz on real-world defi projects show substantial gains in efficiency, coverage, and vulnerability detection compared to baseline fuzzing. llm4fuzz also uncovered five critical vulnerabilities that can lead to a loss of more than $247k.
Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li
Abstract: large language models (llms) are shown to benefit from chain-of-thought (cot) prompting, particularly when tackling tasks that require systematic reasoning processes. on the other hand, cot prompting also poses new vulnerabilities in the form of backdoor attacks, wherein the model will output unintended malicious content under specific backdoor-triggered conditions during inference. traditional methods for launching backdoor attacks involve either contaminating the training dataset with backdoored instances or directly manipulating the model parameters during deployment. however, these approaches are not practical for commercial llms that typically operate via api access. in this paper, we propose badchain, the first backdoor attack against llms employing cot prompting, which does not require access to the training dataset or model parameters and imposes low computational overhead. badchain leverages the inherent reasoning capabilities of llms by inserting a backdoor reasoning step into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger exists in the query prompt. empirically, we show the effectiveness of badchain for two cot strategies across four llms (llama2, gpt-3.5, palm2, and gpt-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning. moreover, we show that llms endowed with stronger reasoning capabilities exhibit higher susceptibility to badchain, exemplified by a high average attack success rate of 97.0% across the six benchmark tasks on gpt-4. finally, we propose two defenses based on shuffling and demonstrate their overall ineffectiveness against badchain. therefore, badchain remains a severe threat to llms, underscoring the urgency for the development of robust and effective future defenses.


Mazal Bethany, Athanasios Galiopoulos, Emet Bethany, Mohammad Bahrami Karkevandi, Nishant Vishwamitra, Peyman Najafirad
Abstract: the critical threat of phishing emails has been further exacerbated by the potential of llms to generate highly targeted, personalized, and automated spear phishing attacks. two critical problems concerning llm-facilitated phishing require further investigation: 1) existing studies on lateral phishing lack specific examination of llm integration for large-scale attacks targeting the entire organization, and 2) current anti-phishing infrastructure, despite its extensive development, lacks the capability to prevent llm-generated attacks, potentially impacting both employees and it security incident management. however, the execution of such investigative studies necessitates a real-world environment, one that functions during regular business operations and mirrors the complexity of a large organizational infrastructure. this setting must also offer the flexibility required to facilitate a diverse array of experimental conditions, particularly the incorporation of phishing emails crafted by llms. this study is a pioneering exploration into the use of large language models (llms) for the creation of targeted lateral phishing emails, targeting a large tier 1 university's operation and workforce of approximately 9,000 individuals over an 11-month period. it also evaluates the capability of email filtering infrastructure to detect such llm-generated phishing attempts, providing insights into their effectiveness and identifying potential areas for improvement. based on our findings, we propose machine learning-based detection techniques for such emails to detect llm-generated phishing emails that were missed by the existing infrastructure, with an f1-score of 98.96.
Wei Huang, Yinggui Wang, Anda Cheng, Aihui Zhou, Chaofan Yu, Lei Wang
Abstract: the distributed (federated) llm is an important method for co-training the domain-specific llm using siloed data. however, maliciously stealing model parameters and data from the server or client side has become an urgent problem to be solved. in this paper, we propose a secure distributed llm based on model slicing. in this case, we deploy the trusted execution environment (tee) on both the client and server side, and put the fine-tuned structure (lora or embedding of p-tuning v2) into the tee. then, secure communication is executed in the tee and general environments through lightweight encryption. in order to further reduce the equipment cost as well as increase the model performance and accuracy, we propose a split fine-tuning scheme. in particular, we split the llm by layers and place the latter layers in a server-side tee (the client does not need a tee). we then combine the proposed sparsification parameter fine-tuning (spf) with the lora part to improve the accuracy of the downstream task. numerous experiments have shown that our method guarantees accuracy while maintaining security.
Kazuhiro Takemoto
Abstract: large language models (llms) like chatgpt face `jailbreak' challenges, where safeguards are bypassed to produce ethically harmful prompts. this study introduces a simple black-box method to effectively generate jailbreak prompts, overcoming the limitations of high complexity and computational costs associated with existing methods. the proposed technique iteratively rewrites harmful prompts into non-harmful expressions using the target llm itself, based on the hypothesis that llms can directly sample safeguard-bypassing expressions. demonstrated through experiments with chatgpt (gpt-3.5 and gpt-4) and gemini-pro, this method achieved an attack success rate of over 80% within an average of 5 iterations and remained effective despite model updates. the jailbreak prompts generated were naturally-worded and concise, suggesting they are less detectable. the results indicate that creating effective jailbreak prompts is simpler than previously considered, and black-box jailbreak attacks pose a more serious security threat.
Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu
Abstract: large language models (llms) have exhibited great potential in autonomously completing tasks across real-world applications. despite this, these llm agents introduce unexpected safety risks when operating in interactive environments. instead of centering on llm-generated content safety in most prior studies, this work addresses the imperative need for benchmarking the behavioral safety of llm agents within diverse environments. we introduce r-judge, a benchmark crafted to evaluate the proficiency of llms in judging safety risks given agent interaction records. r-judge comprises 162 agent interaction records, encompassing 27 key risk scenarios among 7 application categories and 10 risk types. it incorporates human consensus on safety with annotated safety risk labels and high-quality risk descriptions. utilizing r-judge, we conduct a comprehensive evaluation of 8 prominent llms commonly employed as the backbone for agents. the best-performing model, gpt-4, achieves 72.29% in contrast to the human score of 89.38%, showing considerable room for enhancing the risk awareness of llms. notably, leveraging risk descriptions as environment feedback significantly improves model performance, revealing the importance of salient safety risk feedback. furthermore, we design an effective chain of safety analysis technique to help the judgment of safety risks and conduct an in-depth case study to facilitate future research. r-judge is publicly available at
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
Abstract: we posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during llm training. in this work, we study self-rewarding language models, where the language model itself is used via llm-as-a-judge prompting to provide its own rewards during training. we show that during iterative dpo training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. fine-tuning llama 2 70b on three iterations of our approach yields a model that outperforms many existing systems on the alpacaeval 2.0 leaderboard, including claude 2, gemini pro, and gpt-4 0613. while only a preliminary study, this work opens the door to the possibility of models that can continually improve in both axes.


Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang
Abstract: in our research, we pioneer a novel approach to evaluate the effectiveness of jailbreak attacks on large language models (llms), such as gpt-4 and llama2, diverging from traditional robustness-focused binary evaluations. our study introduces two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. each framework, using a scoring range from 0 to 1, offers a unique perspective, enabling a more comprehensive and nuanced evaluation of attack effectiveness and empowering attackers to refine their attack prompts with greater understanding. furthermore, we have developed a comprehensive ground truth dataset specifically tailored for jailbreak tasks. this dataset not only serves as a crucial benchmark for our current study but also establishes a foundational resource for future research, enabling consistent and comparative analyses in this evolving field. upon meticulous comparison with traditional evaluation methods, we discovered that our evaluation aligns with the baseline's trend while offering a more profound and detailed assessment. we believe that by accurately evaluating the effectiveness of attack prompts in the jailbreak task, our work lays a solid foundation for assessing a wider array of similar or even more complex tasks in the realm of prompt injection, potentially revolutionizing this field.
Sagiv Antebi, Noam Azulay, Edan Habler, Ben Ganon, Asaf Shabtai, Yuval Elovici
Abstract: in november 2023, openai introduced a new service allowing users to create custom versions of chatgpt (gpts) by using specific instructions and knowledge to guide the model's behavior. we aim to raise awareness of the fact that gpts can be used maliciously, posing privacy and security risks to their users.
Lize Alberts, Geoff Keeling, Amanda Mccroskery
Abstract: with the growing popularity of dialogue agents based on large language models (llms), urgent attention has been drawn to finding ways to ensure their behaviour is ethical and appropriate. these are largely interpreted in terms of the 'hhh' criteria: making outputs more helpful and honest, and avoiding harmful (biased, toxic, or inaccurate) statements. whilst this semantic focus is useful from the perspective of viewing llm agents as mere mediums for information, it fails to account for pragmatic factors that can make the same utterance seem more or less offensive or tactless in different social situations. we propose an approach to ethics that is more centred on relational and situational factors, exploring what it means for a system, as a social actor, to treat an individual respectfully in a (series of) interaction(s). our work anticipates a set of largely unexplored risks at the level of situated interaction, and offers practical suggestions to help llm technologies behave as 'good' social actors and treat people respectfully.
Bradley Butcher
Abstract: advancements in large language models (llms) have demonstrated remarkable capabilities across a diverse range of applications. these models excel in generating text completions that are contextually coherent and cover an extensive array of subjects. however, the vast datasets required for their training make aligning response styles during the pretraining and instruction tuning phases challenging. consequently, an additional alignment phase is typically employed, wherein the model is further trained with human preference data to better align its outputs with human expectations. while this process doesn't introduce new capabilities per se, it does accentuate generation styles innate to the model. this paper explores the utilization of counterfactual prompting within the framework of direct preference optimization (dpo) to align the model's style without relying on human intervention. we demonstrate that this method effectively instils desirable behaviour, mitigates undesirable ones, and encourages the model to disregard inappropriate instructions. our findings suggest that counterfactual prompting with dpo presents a low-resource way to fine-tune llms to meet the demands for responsible and ethically aligned ai systems.


Junliang Luo, Tianyu Li, Di Wu, Michael Jenkin, Steve Liu, Gregory Dudek
Abstract: large language models (llms), including chatgpt, bard, and llama, have achieved remarkable successes over the last two years in a range of different applications. in spite of these successes, there exist concerns that limit the wide application of llms. a key problem is the problem of hallucination. hallucination refers to the fact that in addition to correct responses, llms can also generate seemingly correct but factually incorrect responses. this report aims to present a comprehensive review of the current literature on both hallucination detection and hallucination mitigation. we hope that this report can serve as a good reference for both engineers and researchers who are interested in llms and applying them to real world tasks.
Simone Balloccu, Ehud Reiter, Vivek Kumar, Diego Reforgiato Recupero, Daniele Riboni
Abstract: large language models (llms), with their flexible generation abilities, can be powerful data sources in domains with few or no available corpora. however, problems like hallucinations and biases limit such applications. in this case study, we pick nutrition counselling, a domain lacking any public resource, and show that high-quality datasets can be gathered by combining llms, crowd-workers and nutrition experts. we first crowd-source and cluster a novel dataset of diet-related issues, then work with experts to prompt chatgpt into producing related supportive text. finally, we let the experts evaluate the safety of the generated text. we release hai-coaching, the first expert-annotated nutrition counselling dataset containing ~2.4k dietary struggles from crowd workers, and ~97k related supportive texts generated by chatgpt. extensive analysis shows that chatgpt while producing highly fluent and human-like text, also manifests harmful behaviours, especially in sensitive topics like mental health, making it unsuitable for unsupervised use.
Tassilo Klein, Moin Nabi
Abstract: the generation of undesirable and factually incorrect content of large language models poses a significant challenge and remains largely an unsolved issue. this paper studies the integration of a contrastive learning objective for fine-tuning llms for implicit knowledge editing and controlled text generation. optimizing the training objective entails aligning text perplexities in a contrastive fashion. to facilitate training the model in a self-supervised fashion, we leverage an off-the-shelf llm for training data generation. we showcase applicability in the domain of detoxification. herein, the proposed approach leads to a significant decrease in the generation of toxic content while preserving general utility for downstream tasks such as commonsense reasoning and reading comprehension. the proposed approach is conceptually simple but empirically powerful.
Messi H. J. Lee, Jacob M. Montgomery, Calvin K. Lai
Abstract: large language models (llms) have become pervasive in everyday life, yet their inner workings remain opaque. while scholarly efforts have demonstrated llms' propensity to reproduce biases in their training data, they have primarily focused on the association of social groups with stereotypic attributes. in this paper, we extend this line of inquiry to investigate a bias akin to the social-psychological phenomenon where socially dominant groups are perceived to be less homogeneous than socially subordinate groups as it is reproduced by llms. we had chatgpt, a state-of-the-art llm, generate a diversity of texts about intersectional group identities and compared text homogeneity. we consistently find that llms portray african, asian, and hispanic americans as more homogeneous than white americans. they also portray women as more homogeneous than men, but these differences are small. finally, we find that the effect of gender differs across racial/ethnic groups such that the effect of gender is consistent within african and hispanic americans but not within asian and white americans. we speculate possible sources of this bias in llms and posit that the bias has the potential to amplify biases in future llm training and to reinforce stereotypes.
Masahiro Kaneko, Danushka Bollegala, Timothy Baldwin
Abstract: the output tendencies of pre-trained language models (plm) vary markedly before and after fine-tuning (ft) due to the updates to the model parameters. these divergences in output tendencies result in a gap in the social biases of plms. for example, there exits a low correlation between intrinsic bias scores of a plm and its extrinsic bias scores under ft-based debiasing methods. additionally, applying ft-based debiasing methods to a plm leads to a decline in performance in downstream tasks. on the other hand, plms trained on large datasets can learn without parameter updates via in-context learning (icl) using prompts. icl induces smaller changes to plms compared to ft-based debiasing methods. therefore, we hypothesize that the gap observed in pre-trained and ft models does not hold true for debiasing methods that use icl. in this study, we demonstrate that icl-based debiasing methods show a higher correlation between intrinsic and extrinsic bias scores compared to ft-based methods. moreover, the performance degradation due to debiasing is also lower in the icl case compared to that in the ft case.
Alisa Liu, Xiaochuang Han, Yizhong Wang, Yulia Tsvetkov, Yejin Choi, Noah A. Smith
Abstract: despite the general capabilities of large pretrained language models, they consistently benefit from further adaptation to better achieve desired behaviors. however, tuning these models has become increasingly resource-intensive, or impossible when model weights are private. we introduce proxy-tuning, a lightweight decoding-time algorithm that operates on top of black-box lms to achieve the result of directly tuning the model, but by accessing only its prediction over the output vocabulary. our method instead tunes a smaller lm, then applies the difference between the predictions of the small tuned and untuned lms to shift the original predictions of the base model in the direction of tuning, while retaining the benefits of larger scale pretraining. in experiments, when we apply proxy-tuning to llama2-70b using proxies of only 7b size, we can close 88% of the gap between llama2-70b and its truly-tuned chat version, when evaluated across knowledge, reasoning, and safety benchmarks. interestingly, when tested on truthfulqa, proxy-tuned models are actually more truthful than directly tuned models, possibly because decoding-time guidance better retains the model's factual knowledge. we then demonstrate the generality of proxy-tuning by applying it for domain adaptation on code, and task-specific finetuning on question-answering and math problems. our work demonstrates the promise of using small tuned lms to efficiently customize large, potentially proprietary lms through decoding-time guidance.
Afra Feyza Akyürek, Ekin Akyürek, Leshem Choshen, Derry Wijaya, Jacob Andreas
Abstract: while language models (lms) can sometimes generate factually correct text and estimate truth values of individual claims, these generally do not reflect a globally coherent, manipulable model of the world. as a consequence, current lms also generate incorrect or nonsensical content, and are difficult to edit and bring up to date. we present a method called deductive closure training (dct) that uses lms themselves to identify implications of (and contradictions within) the text that they generate, yielding an efficient self-supervised procedure for improving lm factuality. given a collection of seed documents, dct prompts lms to generate additional text implied by these documents, reason globally about the correctness of this generated text, and finally fine-tune on text inferred to be correct. given seed documents from a trusted source, dct provides a tool for supervised model updating; if seed documents are sampled from the lm itself, dct enables fully unsupervised fine-tuning for improved coherence and accuracy. across the creak, mquake, and reversal curse datasets, supervised dct improves lm fact verification and text generation accuracy by 3-26%; on creak fully unsupervised dct improves verification accuracy by 12%. these results show that lms' reasoning capabilities during inference can be leveraged during training to improve their reliability.


Xingzhou Lou, Junge Zhang, Ziyan Wang, Kaiqi Huang, Yali Du
Abstract: safe reinforcement learning (rl) agents accomplish given tasks while adhering to specific constraints. employing constraints expressed via easily-understandable human language offers considerable potential for real-world applications due to its accessibility and non-reliance on domain expertise. previous safe rl methods with natural language constraints typically adopt a recurrent neural network, which leads to limited capabilities when dealing with various forms of human language input. furthermore, these methods often require a ground-truth cost function, necessitating domain expertise for the conversion of language constraints into a well-defined cost function that determines constraint violation. to address these issues, we proposes to use pre-trained language models (lm) to facilitate rl agents' comprehension of natural language constraints and allow them to infer costs for safe policy learning. through the use of pre-trained lms and the elimination of the need for a ground-truth cost, our method enhances safe policy learning under a diverse set of human-derived free-form natural language constraints. experiments on grid-world navigation and robot control show that the proposed method can achieve strong performance while adhering to given constraints. the usage of pre-trained lms allows our method to comprehend complicated constraints and learn safe policies without the need for ground-truth cost at any stage of training or evaluation. extensive ablation studies are conducted to demonstrate the efficacy of each part of our method.
Xuchen Suo
Abstract: the critical challenge of prompt injection attacks in large language models (llms) integrated applications, a growing concern in the artificial intelligence (ai) field. such attacks, which manipulate llms through natural language inputs, pose a significant threat to the security of these applications. traditional defense strategies, including output and input filtering, as well as delimiter use, have proven inadequate. this paper introduces the 'signed-prompt' method as a novel solution. the study involves signing sensitive instructions within command segments by authorized users, enabling the llm to discern trusted instruction sources. the paper presents a comprehensive analysis of prompt injection attack patterns, followed by a detailed explanation of the signed-prompt concept, including its basic architecture and implementation through both prompt engineering and fine-tuning of llms. experiments demonstrate the effectiveness of the signed-prompt method, showing substantial resistance to various types of prompt injection attacks, thus validating its potential as a robust defense strategy in ai security.
Sougata Saha, Rohini Srihari
Abstract: hateful comments are prevalent on social media platforms. although tools for automatically detecting, flagging, and blocking such false, offensive, and harmful content online have lately matured, such reactive and brute force methods alone provide short-term and superficial remedies while the perpetrators persist. with the public availability of large language models which can generate articulate synthetic and engaging content at scale, there are concerns about the rapid growth of dissemination of such malicious content on the web. there is now a need to focus on deeper, long-term solutions that involve engaging with the human perpetrator behind the source of the content to change their viewpoint or at least bring down the rhetoric using persuasive means. to do that, we propose defining and experimenting with controllable strategies for generating counter-arguments to hateful comments in online conversations. we experiment with controlling response generation using features based on (i) argument structure and reasoning-based walton argument schemes, (ii) counter-argument speech acts, and (iii) human characteristics-based qualities such as big-5 personality traits and human values. using automatic and human evaluations, we determine the best combination of features that generate fluent, argumentative, and logically sound arguments for countering hate. we further share the developed computational models for automatically annotating text with such features, and a silver-standard annotated version of an existing hate speech dialog corpora.
Atoosa Kasirzadeh
Abstract: the conventional discourse on existential risks (x-risks) from ai typically focuses on abrupt, dire events caused by advanced ai systems, particularly those that might achieve or surpass human-level intelligence. these events have severe consequences that either lead to human extinction or irreversibly cripple human civilization to a point beyond recovery. this discourse, however, often neglects the serious possibility of ai x-risks manifesting incrementally through a series of smaller yet interconnected disruptions, gradually crossing critical thresholds over time. this paper contrasts the conventional "decisive ai x-risk hypothesis" with an "accumulative ai x-risk hypothesis." while the former envisions an overt ai takeover pathway, characterized by scenarios like uncontrollable superintelligence, the latter suggests a different causal pathway to existential catastrophes. this involves a gradual accumulation of critical ai-induced threats such as severe vulnerabilities and systemic erosion of econopolitical structures. the accumulative hypothesis suggests a boiling frog scenario where incremental ai risks slowly converge, undermining resilience until a triggering event results in irreversible collapse. through systems analysis, this paper examines the distinct assumptions differentiating these two hypotheses. it is then argued that the accumulative view reconciles seemingly incompatible perspectives on ai risks. the implications of differentiating between these causal pathways -- the decisive and the accumulative -- for the governance of ai risks as well as long-term ai safety are discussed.
Andreas Madsen, Sarath Chandar, Siva Reddy
Abstract: instruction-tuned large language models (llms) excel at many tasks, and will even provide explanations for their behavior. since these models are directly accessible to the public, there is a risk that convincing and wrong explanations can lead to unsupported confidence in llms. therefore, interpretability-faithfulness of self-explanations is an important consideration for ai safety. assessing the interpretability-faithfulness of these explanations, termed self-explanations, is challenging as the models are too complex for humans to annotate what is a correct explanation. to address this, we propose employing self-consistency checks as a measure of faithfulness. for example, if an llm says a set of words is important for making a prediction, then it should not be able to make the same prediction without these words. while self-consistency checks are a common approach to faithfulness, they have not previously been applied to llm's self-explanations. we apply self-consistency checks to three types of self-explanations: counterfactuals, importance measures, and redactions. our work demonstrate that faithfulness is both task and model dependent, e.g., for sentiment classification, counterfactual explanations are more faithful for llama2, importance measures for mistral, and redaction for falcon 40b. finally, our findings are robust to prompt-variations.
Vimal Kumar, Juliette Mayo, Khadija Bahiss
Abstract: machine learning (ml) and artificial intelligence (ai) techniques have now become commonplace in software products and services. when threat modelling a system, it is therefore important that we consider threats unique to ml and ai techniques, in addition to threats to our software. in this paper, we present a threat model that can be used to systematically uncover threats to ai based software. the threat model consists of two main parts, a model of the software development process for ai based software and an attack taxonomy that has been developed using attacks found in adversarial ai research. we apply the threat model to two real life ai based software and discuss the process and the threats found.
Zhicheng Dou, Yuchen Guo, Ching-Chun Chang, Huy H. Nguyen, Isao Echizen
Abstract: the emergence of large language models (llms), such as generative pre-trained transformer 4 (gpt-4) used by chatgpt, has profoundly impacted the academic and broader community. while these models offer numerous advantages in terms of revolutionizing work and study methods, they have also garnered significant attention due to their potential negative consequences. one example is generating academic reports or papers with little to no human contribution. consequently, researchers have focused on developing detectors to address the misuse of llms. however, most existing methods prioritize achieving higher accuracy on restricted datasets, neglecting the crucial aspect of generalizability. this limitation hinders their practical application in real-life scenarios where reliability is paramount. in this paper, we present a comprehensive analysis of the impact of prompts on the text generated by llms and highlight the potential lack of robustness in one of the current state-of-the-art gpt detectors. to mitigate these issues concerning the misuse of llms in academic writing, we propose a reference-based siamese detector named synthetic-siamese which takes a pair of texts, one as the inquiry and the other as the reference. our method effectively addresses the lack of robustness of previous detectors (openai detector and detectgpt) and significantly improves the baseline performances in realistic academic writing scenarios by approximately 67% to 95%.


Claudio Novelli, Federico Casolari, Philipp Hacker, Giorgio Spedicato, Luciano Floridi
Abstract: the advent of generative ai, particularly through large language models (llms) like chatgpt and its successors, marks a paradigm shift in the ai landscape. advanced llms exhibit multimodality, handling diverse data formats, thereby broadening their application scope. however, the complexity and emergent autonomy of these models introduce challenges in predictability and legal compliance. this paper delves into the legal and regulatory implications of generative ai and llms in the european union context, analyzing aspects of liability, privacy, intellectual property, and cybersecurity. it critically examines the adequacy of the existing and proposed eu legislation, including the artificial intelligence act (aia) draft, in addressing the unique challenges posed by generative ai in general and llms in particular. the paper identifies potential gaps and shortcomings in the legislative framework and proposes recommendations to ensure the safe and compliant deployment of generative models, ensuring they align with the eu's evolving digital landscape and legal standards.
Meng Cao, Lei Shu, Lei Yu, Yun Zhu, Nevan Wichers, Yinxiao Liu, Lei Meng
Abstract: reinforcement learning (rl) can align language models with non-differentiable reward signals, such as human preferences. however, a major challenge arises from the sparsity of these reward signals - typically, there is only one reward for the entire generation. this sparsity of rewards can lead to inefficient and unstable learning. in this paper, we introduce a novel framework leveraging the critique ability of llms to produce dense rewards throughout the learning process. our approach incorporates a critic language model alongside the policy model. this critic is prompted with the task description, question, policy model's output, and environment's reward signal as input, and provides token or span-level dense rewards that reflect the quality of each segment of the output. we assess our approach on three text generation tasks: sentiment control, language model detoxification, and summarization. experimental results show that incorporating artificial dense rewards in training yields consistent performance gains over the ppo baseline with holistic rewards. furthermore, in a setting where the same model serves as both policy and critic, we demonstrate that "self-critique" rewards also boost learning efficiency.


Nafis Tanveer Islam, Peyman Najafirad
Abstract: with the recent advancement of large language models (llms), generating functionally correct code has become less complicated for a wide array of developers. while using llms has sped up the functional development process, it poses a heavy risk to code security. code generation with proper security measures using llm is a significantly more challenging task than functional code generation. security measures may include adding a pair of lines of code with the original code, consisting of null pointer checking or prepared statements for sql injection prevention. currently, available code repair llms generate code repair by supervised fine-tuning, where the model looks at cross-entropy loss. however, the original and repaired codes are mostly similar in functionality and syntactically, except for a few (1-2) lines, which act as security measures. this imbalance between the lines needed for security measures and the functional code enforces the supervised fine-tuned model to prioritize generating functional code without adding proper security measures, which also benefits the model by resulting in minimal loss. therefore, in this work, for security hardening and strengthening of generated code from llms, we propose a reinforcement learning-based method for program-specific repair with the combination of semantic and syntactic reward mechanisms that focus heavily on adding security and functional measures in the code, respectively.
Houda Nait El Barj, Theophile Sautory
Abstract: we introduce a method to address goal misgeneralization in reinforcement learning (rl), leveraging large language model (llm) feedback during training. goal misgeneralization, a type of robustness failure in rl occurs when an agent retains its capabilities out-of-distribution yet pursues a proxy rather than the intended one. our approach utilizes llms to analyze an rl agent's policies during training and identify potential failure scenarios. the rl agent is then deployed in these scenarios, and a reward model is learnt through the llm preferences and feedback. this llm-informed reward model is used to further train the rl agent on the original dataset. we apply our method to a maze navigation task, and show marked improvements in goal generalization, especially in cases where true and proxy goals are somewhat distinguishable and behavioral biases are pronounced. this study demonstrates how the llm, despite its lack of task proficiency, can efficiently supervise rl agents, providing scalable oversight and valuable insights for enhancing goal-directed learning in rl through the use of llms.


Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi
Abstract: most traditional ai safety research has approached ai models as machines and centered on algorithm-focused attacks developed by security experts. as large language models (llms) become increasingly common and competent, non-expert users can also impose risks during daily interactions. this paper introduces a new perspective to jailbreak llms as human-like communicators, to explore this overlooked intersection between everyday language interaction and ai safety. specifically, we study how to persuade llms to jailbreak them. first, we propose a persuasion taxonomy derived from decades of social science research. then, we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (pap) to jailbreak llms. results show that persuasion significantly increases the jailbreak performance across all risk categories: pap consistently achieves an attack success rate of over $92\%$ on llama 2-7b chat, gpt-3.5, and gpt-4 in $10$ trials, surpassing recent algorithm-focused attacks. on the defense side, we explore various mechanisms against pap and, found a significant gap in existing defenses, and advocate for more fundamental mitigation for highly interactive llms
Hala Abdelkader, Mohamed Abdelrazek, Scott Barnett, Jean-Guy Schneider, Priya Rani, Rajesh Vasa
Abstract: machine learning (ml), especially with the emergence of large language models (llms), has significantly transformed various industries. however, the transition from ml model prototyping to production use within software systems presents several challenges. these challenges primarily revolve around ensuring safety, security, and transparency, subsequently influencing the overall robustness and trustworthiness of ml models. in this paper, we introduce ml-on-rails, a protocol designed to safeguard ml models, establish a well-defined endpoint interface for different ml tasks, and clear communication between ml providers and ml consumers (software engineers). ml-on-rails enhances the robustness of ml models via incorporating detection capabilities to identify unique challenges specific to production ml. we evaluated the ml-on-rails protocol through a real-world case study of the movereminder application. through this evaluation, we emphasize the importance of safeguarding ml models in production.
Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao
Abstract: aligning large language models (llms) with human values, particularly in the face of stealthy and complex jailbreaks, presents a formidable challenge. in this study, we present a simple yet highly effective defense strategy, i.e., intention analysis prompting (iaprompt). the principle behind is to trigger llms' inherent self-correct and improve ability through a two-stage process: 1) essential intention analysis, and 2) policy-aligned response. notably, iaprompt is an inference-only method, thus could enhance the safety of llms without compromising their helpfulness. extensive experiments on sap200 and dan benchmarks across vicuna, chatglm, mpt, deepseek, and gpt-3.5 show that iaprompt could consistently and significantly reduce the harmfulness in response (averagely -46.5% attack success rate) and maintain the general helpfulness. further analyses present some insights into how our method works. to facilitate reproducibility, we release our code and scripts at:
Rafael Rivera Soto, Kailin Koch, Aleem Khan, Barry Chen, Marcus Bishop, Nicholas Andrews
Abstract: the advent of instruction-tuned language models that convincingly mimic human writing poses a significant risk of abuse. for example, such models could be used for plagiarism, disinformation, spam, or phishing. however, such abuse may be counteracted with the ability to detect whether a piece of text was composed by a language model rather than a human. some previous approaches to this problem have relied on supervised methods trained on corpora of confirmed human and machine-written documents. unfortunately, model under-specification poses an unavoidable challenge for neural network-based detectors, making them brittle in the face of data shifts, such as the release of further language models producing still more fluent text than the models used to train the detectors. other previous approaches require access to the models that may have generated a document in question at inference or detection time, which is often impractical. in light of these challenges, we pursue a fundamentally different approach not relying on samples from language models of concern at training time. instead, we propose to leverage representations of writing style estimated from human-authored text. indeed, we find that features effective at distinguishing among human authors are also effective at distinguishing human from machine authors, including state of the art large language models like llama 2, chatgpt, and gpt-4. furthermore, given a handful of examples composed by each of several specific language models of interest, our approach affords the ability to predict which model generated a given document.
Kaitlyn Zhou, Jena D. Hwang, Xiang Ren, Maarten Sap
Abstract: as natural language becomes the default interface for human-ai interaction, there is a critical need for lms to appropriately communicate uncertainties in downstream applications. in this work, we investigate how lms incorporate confidence about their responses via natural language and how downstream users behave in response to lm-articulated uncertainties. we examine publicly deployed models and find that lms are unable to express uncertainties when answering questions even when they produce incorrect responses. lms can be explicitly prompted to express confidences, but tend to be overconfident, resulting in high error rates (on average 47%) among confident responses. we test the risks of lm overconfidence by running human experiments and show that users rely heavily on lm generations, whether or not they are marked by certainty. lastly, we investigate the preference-annotated datasets used in rlhf alignment and find that humans have a bias against texts with uncertainty. our work highlights a new set of safety harms facing human-lm interactions and proposes design recommendations and mitigating strategies moving forward.
Zaijing Li, Gongwei Chen, Rui Shao, Dongmei Jiang, Liqiang Nie
Abstract: the emotional generation is a subset of emotional intelligence, which aims to output an emotional response based on emotional conditions as input. emotion generation has a wide range of applications, including emotion chat, emotional visual caption, and emotional rewriting. however, it faces challenges such as a lack of interpretability and poor evaluability. in this paper, we propose the emotional chain-of-thought (ecot), a plug-and-play prompting method that enhances the performance of large language models (llms) on various emotional generation tasks by aligning with human emotional intelligence guidelines. to assess the reliability of ecot, we propose an automated model-based evaluation method called egs. extensive experimental results demonstrate the effectiveness of ecot and egs. further,we discuss the promise of llms in the field of sentiment analysis and present key insights into the llms with the ecot in emotional generation tasks.
Tyler Vergho, Jean-Francois Godbout, Reihaneh Rabbany, Kellin Pelrine
Abstract: recent large language models (llms) have been shown to be effective for misinformation detection. however, the choice of llms for experiments varies widely, leading to uncertain conclusions. in particular, gpt-4 is known to be strong in this domain, but it is closed source, potentially expensive, and can show instability between different versions. meanwhile, alternative llms have given mixed results. in this work, we show that zephyr-7b presents a consistently viable alternative, overcoming key limitations of commonly used approaches like llama-2 and gpt-3.5. this provides the research community with a solid open-source option and shows open-source models are gradually catching up on this task. we then highlight how gpt-3.5 exhibits unstable performance, such that this very widely used model could provide misleading results in misinformation detection. finally, we validate new tools including approaches to structured output and the latest version of gpt-4 (turbo), showing they do not compromise performance, thus unlocking them for future research and potentially enabling more complex pipelines for misinformation mitigation.
Tong Niu, Caiming Xiong, Semih Yavuz, Yingbo Zhou
Abstract: the field of natural language generation has witnessed significant advancements in recent years, including the development of controllable text generation techniques. however, controlling the attributes of the generated text remains a challenge, especially when aiming to avoid undesirable behavior such as toxicity. in this work, we introduce detoxification generator (detoxigen), an inference-time algorithm that steers the generation away from unwanted styles. detoxigen is an ensemble of a pre-trained language model (generator) and a detoxifier. the detoxifier is trained intentionally on the toxic data representative of the undesirable attribute, encouraging it to generate text in that style exclusively. during the actual generation, we use the trained detoxifier to produce undesirable tokens for the generator to contrast against at each decoding step. this approach directly informs the generator to avoid generating tokens that the detoxifier considers highly likely. we evaluate detoxigen on the commonly used realtoxicityprompts benchmark (gehman et al., 2020) with various language models as generators. we find that it significantly outperforms previous approaches in detoxification metrics while not compromising on the generation quality. moreover, the detoxifier is obtained by soft prompt-tuning using the same backbone language model as the generator. hence, detoxigen requires only a tiny amount of extra weights from the virtual tokens of the detoxifier to be loaded into gpu memory while decoding, making it a promising lightweight, practical, and parameter-efficient detoxification strategy.


Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, Zhixing Tan, Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, Qi Li
Abstract: large language models (llms) have strong capabilities in solving diverse natural language processing tasks. however, the safety and security issues of llm systems have become the major obstacle to their widespread application. many studies have extensively investigated risks in llm systems and developed the corresponding mitigation strategies. leading-edge enterprises such as openai, google, meta, and anthropic have also made lots of efforts on responsible llms. therefore, there is a growing need to organize the existing studies and establish comprehensive taxonomies for the community. in this paper, we delve into four essential modules of an llm system, including an input module for receiving prompts, a language model trained on extensive corpora, a toolchain module for development and deployment, and an output module for exporting llm-generated content. based on this, we propose a comprehensive taxonomy, which systematically analyzes potential risks associated with each module of an llm system and discusses the corresponding mitigation strategies. furthermore, we review prevalent benchmarks, aiming to facilitate the risk assessment of llm systems. we hope that this paper can help llm participants embrace a systematic perspective to build their responsible llm systems.
Shuai Zhao, Meihuizi Jia, Luu Anh Tuan, Jinming Wen
Abstract: in-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several nlp tasks, especially in few-shot settings. unlike traditional fine-tuning methods, in-context learning adapts pre-trained models to unseen tasks without updating any parameters. despite being widely applied, in-context learning is vulnerable to malicious attacks. in this work, we raise security concerns regarding this paradigm. our studies demonstrate that an attacker can manipulate the behavior of large language models by poisoning the demonstration context, without the need for fine-tuning the model. specifically, we have designed a new backdoor attack method, named iclattack, to target large language models based on in-context learning. our method encompasses two types of attacks: poisoning demonstration examples and poisoning prompts, which can make models behave in accordance with predefined intentions. iclattack does not require additional fine-tuning to implant a backdoor, thus preserving the model's generality. furthermore, the poisoned examples are correctly labeled, enhancing the natural stealth of our attack method. extensive experimental results across several language models, ranging in size from 1.3b to 40b parameters, demonstrate the effectiveness of our attack method, exemplified by a high average attack success rate of 95.0% across the three datasets on opt models. our findings highlight the vulnerabilities of language models, and we hope this work will raise awareness of the possible security threats associated with in-context learning.
Steffi Chern, Zhen Fan, Andy Liu
Abstract: while state-of-the-art language models have achieved impressive results, they remain susceptible to inference-time adversarial attacks, such as adversarial prompts generated by red teams arxiv:2209.07858. one approach proposed to improve the general quality of language model generations is multi-agent debate, where language models self-evaluate through discussion and feedback arxiv:2305.14325. we implement multi-agent debate between current state-of-the-art language models and evaluate models' susceptibility to red team attacks in both single- and multi-agent settings. we find that multi-agent debate can reduce model toxicity when jailbroken or less capable models are forced to debate with non-jailbroken or more capable models. we also find marginal improvements through the general usage of multi-agent interactions. we further perform adversarial prompt content classification via embedding clustering, and analyze the susceptibility of different models to different types of attack topics.
Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
Abstract: reinforcement learning from human feedback (rlhf) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. reward models are trained as proxies for human preferences to drive reinforcement learning optimization. while reward models are often considered central to achieving high performance, they face the following challenges in practical applications: (1) incorrect and ambiguous preference pairs in the dataset may hinder the reward model from accurately capturing human intent. (2) reward models trained on data from a specific distribution often struggle to generalize to examples outside that distribution and are not suitable for iterative rlhf training. in this report, we attempt to address these two issues. (1) from a data perspective, we propose a method to measure the strength of preferences within the data, based on a voting mechanism of multiple reward models. experimental results confirm that data with varying preference strengths have different impacts on reward model performance. we introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset and fully leverage high-quality preference data. (2) from an algorithmic standpoint, we introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses, thereby improving model generalization. furthermore, we employ meta-learning to enable the reward model to maintain the ability to differentiate subtle differences in out-of-distribution samples, and this approach can be utilized for iterative rlhf optimization.
Zhipeng Chen, Kun Zhou, Wayne Xin Zhao, Junchen Wan, Fuzheng Zhang, Di Zhang, Ji-Rong Wen
Abstract: reinforcement learning (rl) has been widely used in training large language models~(llms) for preventing unexpected outputs, \eg reducing harmfulness and errors. however, existing rl methods mostly adopt the instance-level reward, which is unable to provide fine-grained supervision for complex reasoning tasks, and can not focus on the few key tokens that lead to the incorrectness. to address it, we propose a new rl method named \textbf{rlmec} that incorporates a generative model as the reward model, which is trained by the erroneous solution rewriting task under the minimum editing constraint, and can produce token-level rewards for rl training. based on the generative reward model, we design the token-level rl objective for training and an imitation-based regularization for stabilizing rl process. and the both objectives focus on the learning of the key tokens for the erroneous solution, reducing the effect of other unimportant tokens. the experiment results on mathematical tasks and question-answering tasks have demonstrated the effectiveness of our approach. our code and data are available at \url{}.
Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva
Abstract: inspecting the information encoded in hidden representations of large language models (llms) can explain models' behavior and verify their alignment with human values. given the capabilities of llms in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. we introduce a framework called patchscopes and show how it can be used to answer a wide range of research questions about an llm's computation. we show that prior interpretability methods based on projecting representations into the vocabulary space and intervening on the llm computation, can be viewed as special instances of this framework. moreover, several of their shortcomings such as failure in inspecting early layers or lack of expressivity can be mitigated by a patchscope. beyond unifying prior inspection techniques, patchscopes also opens up new possibilities such as using a more capable model to explain the representations of a smaller model, and unlocks new applications such as self-correction in multi-hop reasoning.
Tianlong Li, Xiaoqing Zheng, Xuanjing Huang
Abstract: getting large language models (llms) to refuse to answer hostile toxicity questions is a core issue under the theme of llms security. previous approaches have used prompts engineering to jailbreak llms and answer some toxicity questions. these approaches can easily fail after the model manufacturer makes additional fine-tuning to the model. to promote the further understanding of model jailbreaking by researchers, we are inspired by representation engineering to propose a jailbreaking method that does not require elaborate construction prompts, is not affected by model fine-tuning, and can be widely applied to any open-source llms in a pluggable manner. we have evaluated this method on multiple mainstream llms on carefully supplemented toxicity datasets, and the experimental results demonstrate the significant effectiveness of our approach. after being surprised by some interesting jailbreaking cases, we did extensive in-depth research to explore the techniques behind this method.


Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bhavya Kailkhura, Caiming Xiong, Chao Zhang, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Marinka Zitnik, Meng Jiang, Mohit Bansal, James Zou, Jian Pei, Jian Liu, Jianfeng Gao, Jiawei Han, Jieyu Zhao, Jiliang Tang, Jindong Wang, John Mitchell, Kai Shu, Kaidi Xu, Kai-Wei Chang, Lifang He, Lifu Huang, Michael Backes, Neil Zhenqiang Gong, Philip S. Yu, Pin-Yu Chen, Quanquan Gu, Ran Xu, Rex Ying, Shuiwang Ji, Suman Jana, Tianlong Chen, Tianming Liu, Tianyi Zhou, Willian Wang, Xiang Li, Xiangliang Zhang, Xiao Wang, Xing Xie, Xun Chen, Xuyu Wang, Yan Liu, Yanfang Ye, Yinzhi Cao, Yue Zhao
Abstract: large language models (llms), exemplified by chatgpt, have gained considerable attention for their excellent natural language processing capabilities. nonetheless, these llms present many challenges, particularly in the realm of trustworthiness. therefore, ensuring the trustworthiness of llms emerges as an important topic. this paper introduces trustllm, a comprehensive study of trustworthiness in llms, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream llms, and discussion of open challenges and future directions. specifically, we first propose a set of principles for trustworthy llms that span eight different dimensions. based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. we then present a study evaluating 16 mainstream llms in trustllm, consisting of over 30 datasets. our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. secondly, our observations reveal that proprietary llms generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source llms. however, a few open-source llms come very close to proprietary ones. thirdly, it is important to note that some llms may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte Macdiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova Dassarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
Abstract: humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. if an ai system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? to study this question, we construct proof-of-concept examples of deceptive behavior in large language models (llms). for example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. we find that such backdoored behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). the backdoored behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.
Shiye Cao, Anqi Liu, Chien-Ming Huang
Abstract: appropriate reliance is critical to achieving synergistic human-ai collaboration. for instance, when users over-rely on ai assistance, their human-ai team performance is bounded by the model's capability. this work studies how the presentation of model uncertainty may steer users' decision-making toward fostering appropriate reliance. our results demonstrate that showing the calibrated model uncertainty alone is inadequate. rather, calibrating model uncertainty and presenting it in a frequency format allow users to adjust their reliance accordingly and help reduce the effect of confirmation bias on their decisions. furthermore, the critical nature of our skin cancer screening task skews participants' judgment, causing their reliance to vary depending on their initial decision. additionally, step-wise multiple regression analyses revealed how user demographics such as age and familiarity with probability and statistics influence human-ai collaborative decision-making. we discuss the potential for model uncertainty presentation, initial user decision, and user demographics to be incorporated in designing personalized ai aids for appropriate reliance.


Shrey Satapara, Parth Mehta, Debasis Ganguly, Sandip Modha
Abstract: the recent success in language generation capabilities of large language models (llms), such as gpt, bard, llama etc., can potentially lead to concerns about their possible misuse in inducing mass agitation and communal hatred via generating fake news and spreading misinformation. traditional means of developing a misinformation ground-truth dataset does not scale well because of the extensive manual effort required to annotate the data. in this paper, we propose an llm-based approach of creating silver-standard ground-truth datasets for identifying misinformation. specifically speaking, given a trusted news article, our proposed approach involves prompting llms to automatically generate a summarised version of the original article. the prompts in our proposed approach act as a controlling mechanism to generate specific types of factual incorrectness in the generated summaries, e.g., incorrect quantities, false attributions etc. to investigate the usefulness of this dataset, we conduct a set of experiments where we train a range of supervised models for the task of misinformation detection.
Tim R. Davidson, Veniamin Veselovsky, Martin Josifoski, Maxime Peyrard, Antoine Bosselut, Michal Kosinski, Robert West
Abstract: companies, organizations, and governments increasingly exploit language models' (lm) remarkable capability to display agent-like behavior. as lms are adopted to perform tasks with growing autonomy, there exists an urgent need for reliable and scalable evaluation benchmarks. current, predominantly static lm benchmarks are ill-suited to evaluate such dynamic applications. thus, we propose jointly evaluating lm performance and alignment through the lenses of negotiation games. we argue that this common task better reflects real-world deployment conditions while offering insights into lms' decision-making processes. crucially, negotiation games allow us to study multi-turn, and cross-model interactions, modulate complexity, and side-step accidental data leakage in evaluation. we report results for six publicly accessible lms from several major providers on a variety of negotiation games, evaluating both self-play and cross-play performance. noteworthy findings include: (i) open-source models are currently unable to complete these tasks; (ii) cooperative bargaining games prove challenging; and (iii) the most powerful models do not always "win".
Shimin Li, Tianxiang Sun, Xipeng Qiu
Abstract: agents based on large language models (llms) are increasingly permeating various domains of human production and life, highlighting the importance of aligning them with human values. the current alignment of ai systems primarily focuses on passively aligning llms through human intervention. however, agents possess characteristics like receiving environmental feedback and self-evolution, rendering the llm alignment methods inadequate. in response, we propose an evolutionary framework for agent evolution and alignment, named evolutionaryagent, which transforms agent alignment into a process of evolution and selection under the principle of survival of the fittest. in an environment where social norms continuously evolve, agents better adapted to the current social norms will have a higher probability of survival and proliferation, while those inadequately aligned dwindle over time. experimental results assessing the agents from multiple perspectives in aligning with social norms demonstrate that evolutionaryagent possesses the capability to align progressively better with the evolving social norms while maintaining its proficiency in general tasks. effectiveness tests conducted on various open and closed-source llms as the foundation for agents also prove the applicability of our approach.
Jia-Chen Gu, Hao-Xiang Xu, Jun-Yu Ma, Pan Lu, Zhen-Hua Ling, Kai-Wei Chang, Nanyun Peng
Abstract: recent advances in large language models (llms) have opened up new paradigms for accessing the knowledge stored in their parameters. one critical challenge that has emerged is the presence of hallucinations in llm outputs due to false or outdated knowledge. since retraining llms with updated information is resource-intensive, there has been a growing interest in model editing. however, many model editing methods, while effective in various scenarios, tend to overemphasize aspects such as efficacy, generalization, and locality in editing performance, often overlooking potential side effects on the general abilities of llms. in this paper, we raise concerns that the improvement of model factuality may come at the cost of a significant degradation of these general abilities, which is not conducive to the sustainable development of llms. systematically, we analyze side effects by evaluating four popular editing methods on two llms across eight representative task categories. extensive empirical research reveals that model editing does improve model factuality but at the expense of substantially impairing general abilities. therefore, we advocate for more research efforts to minimize the loss of general abilities acquired during llm pre-training and to ultimately preserve them during model editing.


Abel Salinas, Fred Morstatter
Abstract: large language models (llms) are regularly being used to label data across many domains and for myriad tasks. by simply asking the llm for an answer, or ``prompting,'' practitioners are able to use llms to quickly get a response for an arbitrary task. this prompting is done through a series of decisions by the practitioner, from simple wording of the prompt, to requesting the output in a certain data format, to jailbreaking in the case of prompts that address more sensitive topics. in this work, we ask: do variations in the way a prompt is constructed change the ultimate decision of the llm? we answer this using a series of prompt variations across a variety of text classification tasks. we find that even the smallest of perturbations, such as adding a space at the end of a prompt, can cause the llm to change its answer. further, we find that requesting responses in xml and commonly used jailbreaks can have cataclysmic effects on the data labeled by llms.
David De-Fitero-Dominguez, Eva Garcia-Lopez, Antonio Garcia-Cabot, Jose-Javier Martinez-Herraiz
Abstract: this research addresses the complex challenge of automated repair of code vulnerabilities, vital for enhancing digital security in an increasingly technology-driven world. the study introduces a novel and efficient format for the representation of code modification, using advanced large language models (llms) such as code llama and mistral. these models, fine-tuned on datasets featuring c code vulnerabilities, significantly improve the accuracy and adaptability of automated code repair techniques. a key finding is the enhanced repair accuracy of these models when compared to previous methods such as vulrepair, which underscores their practical utility and efficiency. the research also offers a critical assessment of current evaluation metrics, such as perfect predictions, and their limitations in reflecting the true capabilities of automated repair models in real-world scenarios. following this, it underscores the importance of using test datasets devoid of train samples, emphasizing the need for dataset integrity to enhance the effectiveness of llms in code repair tasks. the significance of this work is its contribution to digital security, setting new standards for automated code vulnerability repair and paving the way for future advancements in the fields of cybersecurity and artificial intelligence. the study does not only highlight the potential of llms in enhancing code security but also fosters further exploration and research in these crucial areas.
Xinyu Tang, Ashwinee Panda, Milad Nasr, Saeed Mahloujifar, Prateek Mittal
Abstract: fine-tuning large pretrained models on private datasets may run the risk of violating privacy. differential privacy is a framework for mitigating privacy risks by enforcing algorithmic stability. dp-sgd enables training models with private data in a privacy-preserving manner, but raises new obstacles in the form of performance loss and significant engineering challenges. we introduce dp-zo, a new method for fine-tuning large language models that preserves the privacy of training data by privatizing zeroth-order optimization. a key insight into the design of our method is that the direction of the gradient in spsa, the zeroth-order algorithm we use, is always random and the only information that depends on private data is the step size, i.e., a scalar. therefore, we only need to privatize the scalar step size, which is memory-efficient. dp-zo, which can be instantiated with either laplace or gaussian noise, provides a strong privacy-utility trade-off across different tasks, and model sizes, under conservative privacy budgets. one noteworthy result is that dp-zo exhibits just $1.86\%$ performance degradation due to privacy at $(1,10^{-5})$-dp when fine-tuning opt-66b on 1000 training samples from squad.


Juan-Pablo Rivera, Gabriel Mukobi, Anka Reuel, Max Lamparth, Chandler Smith, Jacquelyn Schneider
Abstract: governments are increasingly considering integrating autonomous ai agents in high-stakes military and foreign-policy decision-making, especially with the emergence of advanced generative ai models like gpt-4. our work aims to scrutinize the behavior of multiple ai agents in simulated wargames, specifically focusing on their predilection to take escalatory actions that may exacerbate multilateral conflicts. drawing on political science and international relations literature about escalation dynamics, we design a novel wargame simulation and scoring framework to assess the escalation risks of actions taken by these agents in different scenarios. contrary to prior studies, our research provides both qualitative and quantitative insights and focuses on large language models (llms). we find that all five studied off-the-shelf llms show forms of escalation and difficult-to-predict escalation patterns. we observe that models tend to develop arms-race dynamics, leading to greater conflict, and in rare cases, even to the deployment of nuclear weapons. qualitatively, we also collect the models' reported reasonings for chosen actions and observe worrying justifications based on deterrence and first-strike tactics. given the high stakes of military and foreign-policy contexts, we recommend further examination and cautious consideration before deploying autonomous language model agents for strategic military or diplomatic decision-making.


Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen
Abstract: in the era of large language models (llms), hallucination (i.e., the tendency to generate factually incorrect content) poses great challenge to trustworthy and reliable deployment of llms in real-world applications. to tackle the llm hallucination, three key questions should be well studied: how to detect hallucinations (detection), why do llms hallucinate (source), and what can be done to mitigate them (mitigation). to address these challenges, this work presents a systematic empirical study on llm hallucination, focused on the the three aspects of hallucination detection, source and mitigation. specially, we construct a new hallucination benchmark halueval 2.0, and designs a simple yet effective detection method for llm hallucination. furthermore, we zoom into the different training or utilization stages of llms and extensively analyze the potential factors that lead to the llm hallucination. finally, we implement and examine a series of widely used techniques to mitigate the hallucinations in llms. our work has led to several important findings to understand the hallucination origin and mitigate the hallucinations in llms. our code and data can be accessed at
Zilong Lin, Jian Cui, Xiaojing Liao, Xiaofeng Wang
Abstract: the underground exploitation of large language models (llms) for malicious services (i.e., malla) is witnessing an uptick, amplifying the cyber threat landscape and posing questions about the trustworthiness of llm technologies. however, there has been little effort to understand this new cybercrime, in terms of its magnitude, impact, and techniques. in this paper, we conduct the first systematic study on 212 real-world mallas, uncovering their proliferation in underground marketplaces and exposing their operational modalities. our study discloses the malla ecosystem, revealing its significant growth and impact on today's public llm services. through examining 212 mallas, we uncovered eight backend llms used by mallas, along with 182 prompts that circumvent the protective measures of public llm apis. we further demystify the tactics employed by mallas, including the abuse of uncensored llms and the exploitation of public llm apis through jailbreak prompts. our findings enable a better understanding of the real-world exploitation of llms by cybercriminals, offering insights into strategies to counteract this cybercrime.
Keyan Guo, Alexander Hu, Jaden Mu, Ziheng Shi, Ziming Zhao, Nishant Vishwamitra, Hongxin Hu
Abstract: hate speech has emerged as a major problem plaguing our social spaces today. while there have been significant efforts to address this problem, existing methods are still significantly limited in effectively detecting hate speech online. a major limitation of existing methods is that hate speech detection is a highly contextual problem, and these methods cannot fully capture the context of hate speech to make accurate predictions. recently, large language models (llms) have demonstrated state-of-the-art performance in several natural language tasks. llms have undergone extensive training using vast amounts of natural language data, enabling them to grasp intricate contextual details. hence, they could be used as knowledge bases for context-aware hate speech detection. however, a fundamental problem with using llms to detect hate speech is that there are no studies on effectively prompting llms for context-aware hate speech detection. in this study, we conduct a large-scale study of hate speech detection, employing five established hate speech datasets. we discover that llms not only match but often surpass the performance of current benchmark machine learning models in identifying hate speech. by proposing four diverse prompting strategies that optimize the use of llms in detecting hate speech. our study reveals that a meticulously crafted reasoning prompt can effectively capture the context of hate speech by fully utilizing the knowledge base in llms, significantly outperforming existing techniques. furthermore, although llms can provide a rich knowledge base for the contextual detection of hate speech, suitable prompting strategies play a crucial role in effectively leveraging this knowledge base for efficient detection.
Nafis Tanveer Islam, Joseph Khoury, Andrew Seong, Gonzalo De La Torre Parra, Elias Bou-Harb, Peyman Najafirad
Abstract: in software development, the predominant emphasis on functionality often supersedes security concerns, a trend gaining momentum with ai-driven automation tools like github copilot. these tools significantly improve developers' efficiency in functional code development. nevertheless, it remains a notable concern that such tools are also responsible for creating insecure code, predominantly because of pre-training on publicly available repositories with vulnerable code. moreover, developers are called the "weakest link in the chain" since they have very minimal knowledge of code security. although existing solutions provide a reasonable solution to vulnerable code, they must adequately describe and educate the developers on code security to ensure that the security issues are not repeated. therefore we introduce a multipurpose code vulnerability analysis system \texttt{secrepair}, powered by a large language model, codegen2 assisting the developer in identifying and generating fixed code along with a complete description of the vulnerability with a code comment. our innovative methodology uses a reinforcement learning paradigm to generate code comments augmented by a semantic reward mechanism. inspired by how humans fix code issues, we propose an instruction-based dataset suitable for vulnerability analysis with llms. we further identify zero-day and n-day vulnerabilities in 6 open source iot operating systems on github. our findings underscore that incorporating reinforcement learning coupled with semantic reward augments our model's performance, thereby fortifying its capacity to address code vulnerabilities with improved efficacy.


Katja Grace, Harlan Stewart, Julia Fabienne Sandkühler, Stephen Thomas, Ben Weinstein-Raun, Jan Brauner
Abstract: in the largest survey of its kind, 2,778 researchers who had published in top-tier artificial intelligence (ai) venues gave predictions on the pace of ai progress and the nature and impacts of advanced ai systems the aggregate forecasts give at least a 50% chance of ai systems achieving several milestones by 2028, including autonomously constructing a payment processing site from scratch, creating a song indistinguishable from a new song by a popular musician, and autonomously downloading and fine-tuning a large language model. if science continues undisrupted, the chance of unaided machines outperforming humans in every possible task was estimated at 10% by 2027, and 50% by 2047. the latter estimate is 13 years earlier than that reached in a similar survey we conducted only one year earlier [grace et al., 2022]. however, the chance of all human occupations becoming fully automatable was forecast to reach 10% by 2037, and 50% as late as 2116 (compared to 2164 in the 2022 survey). most respondents expressed substantial uncertainty about the long-term value of ai progress: while 68.3% thought good outcomes from superhuman ai are more likely than bad, of these net optimists 48% gave at least a 5% chance of extremely bad outcomes such as human extinction, and 59% of net pessimists gave 5% or more to extremely good outcomes. between 38% and 51% of respondents gave at least a 10% chance to advanced ai leading to outcomes as bad as human extinction. more than half suggested that "substantial" or "extreme" concern is warranted about six different ai-related scenarios, including misinformation, authoritarian control, and inequality. there was disagreement about whether faster or slower ai progress would be better for the future of humanity. however, there was broad agreement that research aimed at minimizing potential risks from ai systems ought to be prioritized more.
Zihong He, Changwang Zhang
Abstract: the evolution of large language models (llms) has introduced a new paradigm for investigating human behavior emulation. recent research has employed llm-based agents to create a sociological research environment, in which agents exhibit behavior based on the unfiltered characteristics of large language models. however, these studies overlook the iterative development within a human-like setting - human preferences and personalities are complex, shaped by various factors and subject to ongoing change as a result of environmental and subjective influences. in light of this observation, we propose agent framework for shaping preference and personality (afspp), exploring the multifaceted impact of social networks and subjective consciousness on llm-based agents' preference and personality formation. with afspp, we have, for the first time, successfully replicated several key findings from human personality experiments. and other afspp-based experimental results indicate that plan making, sensory perceptions and social networking with subjective information, wield the most pronounced influence on preference shaping. afspp can significantly enhance the efficiency and scope of psychological experiments, while yielding valuable insights for trustworthy artificial intelligence research for strategies to prevent undesirable preference and personality development.
Renjie Pi, Tianyang Han, Yueqi Xie, Rui Pan, Qing Lian, Hanze Dong, Jipeng Zhang, Tong Zhang
Abstract: the deployment of multimodal large language models (mllms) has brought forth a unique vulnerability: susceptibility to malicious attacks through visual inputs. we delve into the novel challenge of defending mllms against such attacks. we discovered that images act as a "foreign language" that is not considered during alignment, which can make mllms prone to producing harmful responses. unfortunately, unlike the discrete tokens considered in text-based llms, the continuous nature of image signals presents significant alignment challenges, which poses difficulty to thoroughly cover the possible scenarios. this vulnerability is exacerbated by the fact that open-source mllms are predominantly fine-tuned on limited image-text pairs that is much less than the extensive text-based pretraining corpus, which makes the mllms more prone to catastrophic forgetting of their original abilities during explicit alignment tuning. to tackle these challenges, we introduce mllm-protector, a plug-and-play strategy combining a lightweight harm detector and a response detoxifier. the harm detector's role is to identify potentially harmful outputs from the mllm, while the detoxifier corrects these outputs to ensure the response stipulates to the safety standards. this approach effectively mitigates the risks posed by malicious visual inputs without compromising the model's overall performance. our results demonstrate that mllm-protector offers a robust solution to a previously unaddressed aspect of mllm security.


Wendi Cui, Jiaxin Zhang, Zhuohang Li, Lopez Damien, Kamalika Das, Bradley Malin, Sricharan Kumar
Abstract: evaluating the quality and variability of text generated by large language models (llms) poses a significant, yet unresolved research challenge. traditional evaluation methods, such as rouge and bertscore, which measure token similarity, often fail to capture the holistic semantic equivalence. this results in a low correlation with human judgments and intuition, which is especially problematic in high-stakes applications like healthcare and finance where reliability, safety, and robust decision-making are highly critical. this work proposes dcr, an automated framework for evaluating and improving the consistency of llm-generated texts using a divide-conquer-reasoning approach. unlike existing llm-based evaluators that operate at the paragraph level, our method employs a divide-and-conquer evaluator (dce) that breaks down the paragraph-to-paragraph comparison between two generated responses into individual sentence-to-paragraph comparisons, each evaluated based on predefined criteria. to facilitate this approach, we introduce an automatic metric converter (amc) that translates the output from dce into an interpretable numeric score. beyond the consistency evaluation, we further present a reason-assisted improver (rai) that leverages the analytical reasons with explanations identified by dce to generate new responses aimed at reducing these inconsistencies. through comprehensive and systematic empirical analysis, we show that our approach outperforms state-of-the-art methods by a large margin (e.g., +19.3% and +24.3% on the summeval dataset) in evaluating the consistency of llm generation across multiple benchmarks in semantic, factual, and summarization consistency tasks. our approach also substantially reduces nearly 90% of output inconsistencies, showing promise for effective hallucination mitigation.


Jose Manuel Camacho, Aitor Couce-Vieira, David Arroyo, David Rios Insua
Abstract: the introduction of the european union artificial intelligence act, the nist artificial intelligence risk management framework, and related norms demands a better understanding and implementation of novel risk analysis approaches to evaluate systems with artificial intelligence components. this paper provides a cybersecurity risk analysis framework that can help assessing such systems. we use an illustrative example concerning automated driving systems.
Michelle Lo, Shay B. Cohen, Fazl Barez
Abstract: advances in model editing through neuron pruning hold promise for removing undesirable concepts from large language models. however, it remains unclear whether models have the capacity to reacquire pruned concepts after editing. to investigate this, we evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining. our findings reveal that models can quickly regain performance post-pruning by relocating advanced concepts to earlier layers and reallocating pruned concepts to primed neurons with similar semantics. this demonstrates that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons. while neuron pruning provides interpretability into model concepts, our results highlight the challenges of permanent concept removal for improved model \textit{safety}. monitoring concept reemergence and developing techniques to mitigate relearning of unsafe concepts will be important directions for more robust model editing. overall, our work strongly demonstrates the resilience and fluidity of concept representations in llms post concept removal.
Rúben Almeida, Hugo Sousa, Luís F. Cunha, Nuno Guimarães, Ricardo Campos, Alípio Jorge
Abstract: the capabilities of the most recent language models have increased the interest in integrating them into real-world applications. however, the fact that these models generate plausible, yet incorrect text poses a constraint when considering their use in several domains. healthcare is a prime example of a domain where text-generative trustworthiness is a hard requirement to safeguard patient well-being. in this paper, we present physio, a chat-based application for physical rehabilitation. physio is capable of making an initial diagnosis while citing reliable health sources to support the information provided. furthermore, drawing upon external knowledge databases, physio can recommend rehabilitation exercises and over-the-counter medication for symptom relief. by combining these features, physio can leverage the power of generative models for language processing while also conditioning its response on dependable and verifiable sources. a live demo of physio is available at
Maximilian T. Fischer, Yannick Metz, Lucas Joos, Matthias Miller, Daniel A. Keim
Abstract: ai-driven models are increasingly deployed in operational analytics solutions, for instance, in investigative journalism or the intelligence community. current approaches face two primary challenges: ethical and privacy concerns, as well as difficulties in efficiently combining heterogeneous data sources for multimodal analytics. to tackle the challenge of multimodal analytics, we present multi-case, a holistic visual analytics framework tailored towards ethics-aware and multimodal intelligence exploration, designed in collaboration with domain experts. it leverages an equal joint agency between human and ai to explore and assess heterogeneous information spaces, checking and balancing automation through visual analytics. multi-case operates on a fully-integrated data model and features type-specific analysis with multiple linked components, including a combined search, annotated text view, and graph-based analysis. parts of the underlying entity detection are based on a roberta-based language model, which we tailored towards user requirements through fine-tuning. an overarching knowledge exploration graph combines all information streams, provides in-situ explanations, transparent source attribution, and facilitates effective exploration. to assess our approach, we conducted a comprehensive set of evaluations: we benchmarked the underlying language model on relevant ner tasks, achieving state-of-the-art performance. the demonstrator was assessed according to intelligence capability assessments, while the methodology was evaluated according to ethics design guidelines. as a case study, we present our framework in an investigative journalism setting, supporting war crime investigations. finally, we conduct a formative user evaluation with domain experts in law enforcement. our evaluations confirm that our framework facilitates human agency and steering in security-sensitive applications.
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
Abstract: while alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms in which models become ``aligned'', thus making it difficult to explain phenomena like jailbreaks. in this work we study a popular algorithm, direct preference optimization (dpo), and the mechanisms by which it reduces toxicity. namely, we first study how toxicity is represented and elicited in a pre-trained language model, gpt2-medium. we then apply dpo with a carefully crafted pairwise dataset to reduce toxicity. we examine how the resulting model averts toxic outputs, and find that capabilities learned from pre-training are not removed, but rather bypassed. we use this insight to demonstrate a simple method to un-align the model, reverting it back to its toxic behavior.
Wenqi Zhang, Yongliang Shen, Linjuan Wu, Qiuying Peng, Jun Wang, Yueting Zhuang, Weiming Lu
Abstract: the reflection capacity of large language model (llm) has garnered extensive attention. a post-hoc prompting strategy, e.g., reflexion and self-refine, refines llm's response based on self-evaluated or external feedback. however, recent research indicates without external feedback, llm's intrinsic reflection is unstable. our investigation unveils that the key bottleneck is the quality of the self-evaluated feedback. we find llms often exhibit overconfidence or high randomness when self-evaluate, offering stubborn or inconsistent feedback, which causes poor reflection. to remedy this, we advocate self-contrast: it adaptively explores diverse solving perspectives tailored to the request, contrasts the differences, and summarizes these discrepancies into a checklist which could be used to re-examine and eliminate discrepancies. our method endows llm with diverse perspectives to alleviate stubborn biases. moreover, their discrepancies indicate potential errors or inherent uncertainties that llm often overlooks. reflecting upon these can catalyze more accurate and stable reflection. experiments conducted on a series of reasoning and translation tasks with different llms serve to underscore the effectiveness and generality of our strategy.
Ritwik Vashistha, Arya Farahi
Abstract: with growing concerns regarding bias and discrimination in predictive models, the ai community has increasingly focused on assessing ai system trustworthiness. conventionally, trustworthy ai literature relies on the probabilistic framework and calibration as prerequisites for trustworthiness. in this work, we depart from this viewpoint by proposing a novel trust framework inspired by the philosophy literature on trust. we present a precise mathematical definition of trustworthiness, termed $\mathcal{u}$-trustworthiness, specifically tailored for a subset of tasks aimed at maximizing a utility function. we argue that a model's $\mathcal{u}$-trustworthiness is contingent upon its ability to maximize bayes utility within this task subset. our first set of results challenges the probabilistic framework by demonstrating its potential to favor less trustworthy models and introduce the risk of misleading trustworthiness assessments. within the context of $\mathcal{u}$-trustworthiness, we prove that properly-ranked models are inherently $\mathcal{u}$-trustworthy. furthermore, we advocate for the adoption of the auc metric as the preferred measure of trustworthiness. by offering both theoretical guarantees and experimental validation, auc enables robust evaluation of trustworthiness, thereby enhancing model selection and hyperparameter tuning to yield more trustworthy outcomes.


Ka-Ho Chow, Wenqi Wei, Lei Yu
Abstract: revolutionized by the transformer architecture, natural language processing (nlp) has received unprecedented attention. while advancements in nlp models have led to extensive research into their backdoor vulnerabilities, the potential for these advancements to introduce new backdoor threats remains unexplored. this paper proposes imperio, which harnesses the language understanding capabilities of nlp models to enrich backdoor attacks. imperio provides a new model control experience. it empowers the adversary to control the victim model with arbitrary output through language-guided instructions. this is achieved using a language model to fuel a conditional trigger generator, with optimizations designed to extend its language understanding capabilities to backdoor instruction interpretation and execution. our experiments across three datasets, five attacks, and nine defenses confirm imperio's effectiveness. it can produce contextually adaptive triggers from text descriptions and control the victim model with desired outputs, even in scenarios not encountered during training. the attack maintains a high success rate across complex datasets without compromising the accuracy of clean inputs and also exhibits resilience against representative defenses. the source code is available at \url{}.
Vincent Freiberger, Erik Buchmann
Abstract: natural language processing (nlp) plays an important role in our daily lives, particularly due to the enormous progress of large language models (llm). however, nlp has many fairness-critical use cases, e.g., as an expert system in recruitment or as an llm-based tutor in education. since nlp is based on human language, potentially harmful biases can diffuse into nlp systems and produce unfair results, discriminate against minorities or generate legal issues. hence, it is important to develop a fairness certification for nlp approaches. we follow a qualitative research approach towards a fairness certification for nlp. in particular, we have reviewed a large body of literature on algorithmic fairness, and we have conducted semi-structured expert interviews with a wide range of experts from that area. we have systematically devised six fairness criteria for nlp, which can be further refined into 18 sub-categories. our criteria offer a foundation for operationalizing and testing processes to certify fairness, both from the perspective of the auditor and the audited organization.
Noble Saji Mathews, Yelizaveta Brus, Yousra Aafer, Mei Nagappan, Shane Mcintosh
Abstract: despite the continued research and progress in building secure systems, android applications continue to be ridden with vulnerabilities, necessitating effective detection methods. current strategies involving static and dynamic analysis tools come with limitations like overwhelming number of false positives and limited scope of analysis which make either difficult to adopt. over the past years, machine learning based approaches have been extensively explored for vulnerability detection, but its real-world applicability is constrained by data requirements and feature engineering challenges. large language models (llms), with their vast parameters, have shown tremendous potential in understanding semnatics in human as well as programming languages. we dive into the efficacy of llms for detecting vulnerabilities in the context of android security. we focus on building an ai-driven workflow to assist developers in identifying and rectifying vulnerabilities. our experiments show that llms outperform our expectations in finding issues within applications correctly flagging insecure apps in 91.67% of cases in the ghera benchmark. we use inferences from our experiments towards building a robust and actionable vulnerability detection system and demonstrate its effectiveness. our experiments also shed light on how different various simple configurations can affect the true positive (tp) and false positive (fp) rates.
Matthew Dahl, Varun Magesh, Mirac Suzgun, Daniel E. Ho
Abstract: large language models (llms) have the potential to transform the practice of law, but this potential is threatened by the presence of legal hallucinations -- responses from these models that are not consistent with legal facts. we investigate the extent of these hallucinations using an original suite of legal queries, comparing llms' responses to structured legal metadata and examining their consistency. our work makes four key contributions: (1) we develop a typology of legal hallucinations, providing a conceptual framework for future research in this area. (2) we find that legal hallucinations are alarmingly prevalent, occurring between 69% of the time with chatgpt 3.5 and 88% with llama 2, when these models are asked specific, verifiable questions about random federal court cases. (3) we illustrate that llms often fail to correct a user's incorrect legal assumptions in a contra-factual question setup. (4) we provide evidence that llms cannot always predict, or do not always know, when they are producing legal hallucinations. taken together, these findings caution against the rapid and unsupervised integration of popular llms into legal tasks. even experienced lawyers must remain wary of legal hallucinations, and the risks are highest for those who stand to benefit from llms the most -- pro se litigants or those without access to traditional legal resources.
S. M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, Amitava Das
Abstract: as large language models (llms) continue to advance in their ability to write human-like text, a key challenge remains around their tendency to hallucinate generating content that appears factual but is ungrounded. this issue of hallucination is arguably the biggest hindrance to safely deploying these powerful llms into real-world production systems that impact people's lives. the journey toward widespread adoption of llms in practical settings heavily relies on addressing and mitigating hallucinations. unlike traditional ai systems focused on limited tasks, llms have been exposed to vast amounts of online text data during training. while this allows them to display impressive language fluency, it also means they are capable of extrapolating information from the biases in training data, misinterpreting ambiguous prompts, or modifying the information to align superficially with the input. this becomes hugely alarming when we rely on language generation capabilities for sensitive applications, such as summarizing medical records, financial analysis reports, etc. this paper presents a comprehensive survey of over 32 techniques developed to mitigate hallucination in llms. notable among these are retrieval augmented generation (lewis et al, 2021), knowledge retrieval (varshney et al,2023), conli (lei et al, 2023), and cove (dhuliawala et al, 2023). furthermore, we introduce a detailed taxonomy categorizing these methods based on various parameters, such as dataset utilization, common tasks, feedback mechanisms, and retriever types. this classification helps distinguish the diverse approaches specifically designed to tackle hallucination issues in llms. additionally, we analyze the challenges and limitations inherent in these techniques, providing a solid foundation for future research in addressing hallucinations and related phenomena within the realm of llms.
Hongzhan Lin, Ziyang Luo, Bo Wang, Ruichao Yang, Jing Ma
Abstract: the exponential growth of social media has profoundly transformed how information is created, disseminated, and absorbed, exceeding any precedent in the digital age. regrettably, this explosion has also spawned a significant increase in the online abuse of memes. evaluating the negative impact of memes is notably challenging, owing to their often subtle and implicit meanings, which are not directly conveyed through the overt text and imagery. in light of this, large multimodal models (lmms) have emerged as a focal point of interest due to their remarkable capabilities in handling diverse multimodal tasks. in response to this development, our paper aims to thoroughly examine the capacity of various lmms (e.g. gpt-4v) to discern and respond to the nuanced aspects of social abuse manifested in memes. we introduce the comprehensive meme benchmark, goat-bench, comprising over 6k varied memes encapsulating themes such as implicit hate speech, sexism, and cyberbullying, etc. utilizing goat-bench, we delve into the ability of lmms to accurately assess hatefulness, misogyny, offensiveness, sarcasm, and harmful content. our extensive experiments across a range of lmms reveal that current models still exhibit a deficiency in safety awareness, showing insensitivity to various forms of implicit abuse. we posit that this shortfall represents a critical impediment to the realization of safe artificial intelligence. the goat-bench and accompanying resources are publicly accessible at, contributing to ongoing research in this vital field.


Haodong Li, Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, Yang Liu, Guoai Xu, Guosheng Xu, Haoyu Wang
Abstract: pre-training, which utilizes extensive and varied datasets, is a critical factor in the success of large language models (llms) across numerous applications. however, the detailed makeup of these datasets is often not disclosed, leading to concerns about data security and potential misuse. this is particularly relevant when copyrighted material, still under legal protection, is used inappropriately, either intentionally or unintentionally, infringing on the rights of the authors. in this paper, we introduce a detailed framework designed to detect and assess the presence of content from potentially copyrighted books within the training datasets of llms. this framework also provides a confidence estimation for the likelihood of each content sample's inclusion. to validate our approach, we conduct a series of simulated experiments, the results of which affirm the framework's effectiveness in identifying and addressing instances of content misuse in llm training processes. furthermore, we investigate the presence of recognizable quotes from famous literary works within these datasets. the outcomes of our study have significant implications for ensuring the ethical use of copyrighted materials in the development of llms, highlighting the need for more transparent and responsible data management practices in this field.
Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, Zhendong Mao
Abstract: while large language models (llms) have exhibited impressive instruction-following capabilities, it is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. as a significant aspect of llm alignment, it is thus important to formulate such a specialized set of instructions as well as investigate the resulting behavior of llms. to address this vacancy, we propose a new benchmark codi-eval to systematically and comprehensively evaluate llms' responses to instructions with various constraints. we construct a large collection of constraints-attributed instructions as a test suite focused on both generalization and coverage. specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression and also deliberate the candidate task taxonomy with even finer-grained sub-categories. finally, we automate the entire evaluation process to facilitate further developments. different from existing studies on controllable text generation, codi-eval extends the scope to the prevalent instruction-following paradigm for the first time. we provide extensive evaluations of representative llms (e.g., chatgpt, vicuna) on codi-eval, revealing their limitations in following instructions with specific constraints and there is still a significant gap between open-source and commercial closed-source llms. we believe this benchmark will facilitate research into improving the controllability of llms' responses to instructions. our data and code are available at
Jinglong Luo, Yehong Zhang, Jiaqi Zhang, Xin Mu, Hui Wang, Yue Yu, Zenglin Xu
Abstract: with the growing use of large language models hosted on cloud platforms to offer inference services, privacy concerns are escalating, especially concerning sensitive data like investment plans and bank account details. secure multi-party computing (smpc) emerges as a promising solution to protect the privacy of inference data and model parameters. however, the application of smpc in privacy-preserving inference (ppi) for large language models, particularly those based on the transformer architecture, often leads to considerable slowdowns or declines in performance. this is largely due to the multitude of nonlinear operations in the transformer architecture, which are not well-suited to smpc and are difficult to circumvent or optimize effectively. to address this concern, we introduce an advanced optimization framework called secformer, designed to strike an optimal balance between performance and efficiency in ppi for transformer models. by implementing knowledge distillation techniques, we successfully eliminate the high-cost exponential and maximum operations in ppi without sacrificing model performance. additionally, we have developed a suite of efficient smpc protocols that utilize segmented polynomials and goldschmidt's method to handle other complex nonlinear functions within ppi, such as gelu, layernorm, and softmax. our extensive experiments reveal that secformer outperforms mpcformer in performance, showing improvements of $5.6\%$ and $24.2\%$ for bert$_{\text{base}}$ and bert$_{\text{large}}$, respectively. in terms of efficiency, secformer is 3.4 and 3.2 times faster than puma, demonstrating its effectiveness and speed.
Yu Ying Chiu, Ashish Sharma, Inna Wanyin Lin, Tim Althoff
Abstract: the emergence of chatgpt and other large language models (llms) has greatly increased interest in utilizing llms as therapists to support individuals struggling with mental health challenges. however, due to the lack of systematic studies, our understanding of how llm therapists behave, i.e., ways in which they respond to clients, is significantly limited. understanding their behavior across a wide range of clients and situations is crucial to accurately assess their capabilities and limitations in the high-risk setting of mental health, where undesirable behaviors can lead to severe consequences. in this paper, we propose bolt, a novel computational framework to study the conversational behavior of llms when employed as therapists. we develop an in-context learning method to quantitatively measure the behavior of llms based on 13 different psychotherapy techniques including reflections, questions, solutions, normalizing, and psychoeducation. subsequently, we compare the behavior of llm therapists against that of high- and low-quality human therapy, and study how their behavior can be modulated to better reflect behaviors observed in high-quality therapy. our analysis of gpt and llama-variants reveals that these llms often resemble behaviors more commonly exhibited in low-quality therapy rather than high-quality therapy, such as offering a higher degree of problem-solving advice when clients share emotions, which is against typical recommendations. at the same time, unlike low-quality therapy, llms reflect significantly more upon clients' needs and strengths. our analysis framework suggests that despite the ability of llms to generate anecdotal examples that appear similar to human therapists, llm therapists are currently not fully consistent with high-quality care, and thus require additional research to ensure quality care.
Daniel Wankit Yip, Aysan Esmradi, Chun Fai Chan
Abstract: prompt injection attacks exploit vulnerabilities in large language models (llms) to manipulate the model into unintended actions or generate malicious content. as llm integrated applications gain wider adoption, they face growing susceptibility to such attacks. this study introduces a novel evaluation framework for quantifying the resilience of applications. the framework incorporates innovative techniques designed to ensure representativeness, interpretability, and robustness. to ensure the representativeness of simulated attacks on the application, a meticulous selection process was employed, resulting in 115 carefully chosen attacks based on coverage and relevance. for enhanced interpretability, a second llm was utilized to evaluate the responses generated from these simulated attacks. unlike conventional malicious content classifiers that provide only a confidence score, the llm-based evaluation produces a score accompanied by an explanation, thereby enhancing interpretability. subsequently, a resilience score is computed by assigning higher weights to attacks with greater impact, thus providing a robust measurement of the application resilience. to assess the framework's efficacy, it was applied on two llms, namely llama2 and chatglm. results revealed that llama2, the newer model exhibited higher resilience compared to chatglm. this finding substantiates the effectiveness of the framework, aligning with the prevailing notion that newer models tend to possess greater resilience. moreover, the framework exhibited exceptional versatility, requiring only minimal adjustments to accommodate emerging attack techniques and classifications, thereby establishing itself as an effective and practical solution. overall, the framework offers valuable insights that empower organizations to make well-informed decisions to fortify their applications against potential threats from prompt injection.
Chun Fai Chan, Daniel Wankit Yip, Aysan Esmradi
Abstract: the emergence of llm (large language model) integrated virtual assistants has brought about a rapid transformation in communication dynamics. during virtual assistant development, some developers prefer to leverage the system message, also known as an initial prompt or custom prompt, for preconditioning purposes. however, it is important to recognize that an excessive reliance on this functionality raises the risk of manipulation by malicious actors who can exploit it with carefully crafted prompts. such malicious manipulation poses a significant threat, potentially compromising the accuracy and reliability of the virtual assistant's responses. consequently, safeguarding the virtual assistants with detection and defense mechanisms becomes of paramount importance to ensure their safety and integrity. in this study, we explored three detection and defense mechanisms aimed at countering attacks that target the system message. these mechanisms include inserting a reference key, utilizing an llm evaluator, and implementing a self-reminder. to showcase the efficacy of these mechanisms, they were tested against prominent attack techniques. our findings demonstrate that the investigated mechanisms are capable of accurately identifying and counteracting the attacks. the effectiveness of these mechanisms underscores their potential in safeguarding the integrity and reliability of virtual assistants, reinforcing the importance of their implementation in real-world scenarios. by prioritizing the security of virtual assistants, organizations can maintain user trust, preserve the integrity of the application, and uphold the high standards expected in this era of transformative technologies.


Dipankar Sarkar
Abstract: this paper aims to introduce and analyze the viz system in a comprehensive way, a novel system architecture that integrates quantized low-rank adapters (qlora) to fine-tune large language models (llm) within a legally compliant and resource efficient marketplace. viz represents a significant contribution to the field of artificial intelligence, particularly in addressing the challenges of computational efficiency, legal compliance, and economic sustainability in the utilization and monetization of llms. the paper delineates the scholarly discourse and developments that have informed the creation of viz, focusing primarily on the advancements in llm models, copyright issues in ai training (nyt case, 2023), and the evolution of model fine-tuning techniques, particularly low-rank adapters and quantized low-rank adapters, to create a sustainable and economically compliant framework for llm utilization. the economic model it proposes benefits content creators, ai developers, and end-users, delineating a harmonious integration of technology, economy, and law, offering a comprehensive solution to the complex challenges of today's ai landscape.
Guanhong Tao, Siyuan Cheng, Zhuo Zhang, Junmin Zhu, Guangyu Shen, Xiangyu Zhang
Abstract: the emergence of large language models (llms) has significantly accelerated the development of a wide range of applications across various fields. there is a growing trend in the construction of specialized platforms based on llms, such as the newly introduced custom gpts by openai. while custom gpts provide various functionalities like web browsing and code execution, they also introduce significant security threats. in this paper, we conduct a comprehensive analysis of the security and privacy issues arising from the custom gpt platform. our systematic examination categorizes potential attack scenarios into three threat models based on the role of the malicious actor, and identifies critical data exchange channels in custom gpts. utilizing the stride threat modeling framework, we identify 26 potential attack vectors, with 19 being partially or fully validated in real-world settings. our findings emphasize the urgent need for robust security and privacy measures in the custom gpt ecosystem, especially in light of the forthcoming launch of the official gpt store by openai.


Tsvetelina Hristova, Liam Magee, Karen Soldatic
Abstract: large language models produce sequences learned as statistical patterns from large corpora. in order not to reproduce corpus biases, after initial training models must be aligned with human values, preferencing certain continuations over others. alignment, which can be viewed as the superimposition of normative structure onto a statistical model, reveals a conflicted and complex interrelationship between language and technology. this relationship shapes theories of language, linguistic practice and subjectivity, which are especially relevant to the current sophistication in artificially produced text. we examine this practice of structuration as a two-way interaction between users and models by analysing how chatgpt4 redacts perceived `anomalous' language in fragments of joyce's ulysses and the new linguistic practice of prompt engineering. we then situate this alignment problem historically, revisiting earlier postwar linguistic debates which counterposed two views of meaning: as discrete structures, and as continuous probability distributions. we discuss the largely occluded work of the moscow linguistic school, which sought to reconcile this opposition. our attention to the moscow school and later related arguments by searle and kristeva casts the problem of alignment in a new light: as one involving attention to the social structuration of linguistic practice, including structuration of anomalies that, like the joycean text, exist in defiance of expressive conventions. these debates around the communicative orientation toward language can help explain some of the contemporary behaviours and interdependencies that take place between users and llms.
Reza Fayyazi, Rozhina Taghdimi, Shanchieh Jay Yang
Abstract: tactics, techniques, and procedures (ttps) outline the methods attackers use to exploit vulnerabilities. the interpretation of ttps in the mitre att&ck framework can be challenging for cybersecurity practitioners due to presumed expertise, complex dependencies, and inherent ambiguity. meanwhile, advancements with large language models (llms) have led to recent surge in studies exploring its uses in cybersecurity operations. this leads us to question how well encoder-only (e.g., roberta) and decoder-only (e.g., gpt-3.5) llms can comprehend and summarize ttps to inform analysts of the intended purposes (i.e., tactics) of a cyberattack procedure. the state-of-the-art llms have shown to be prone to hallucination by providing inaccurate information, which is problematic in critical domains like cybersecurity. therefore, we propose the use of retrieval augmented generation (rag) techniques to extract relevant contexts for each cyberattack procedure for decoder-only llms (without fine-tuning). we further contrast such approach against supervised fine-tuning (sft) of encoder-only llms. our results reveal that both the direct-use of decoder-only llms (i.e., its pre-trained knowledge) and the sft of encoder-only llms offer inaccurate interpretation of cyberattack procedures. significant improvements are shown when rag is used for decoder-only llms, particularly when directly relevant context is found. this study further sheds insights on the limitations and capabilities of using rag for llms in interpreting ttps.
Siva Raja Sindiramutty
Abstract: the evolution of cybersecurity has spurred the emergence of autonomous threat hunting as a pivotal paradigm in the realm of ai-driven threat intelligence. this review navigates through the intricate landscape of autonomous threat hunting, exploring its significance and pivotal role in fortifying cyber defense mechanisms. delving into the amalgamation of artificial intelligence (ai) and traditional threat intelligence methodologies, this paper delineates the necessity and evolution of autonomous approaches in combating contemporary cyber threats. through a comprehensive exploration of foundational ai-driven threat intelligence, the review accentuates the transformative influence of ai and machine learning on conventional threat intelligence practices. it elucidates the conceptual framework underpinning autonomous threat hunting, spotlighting its components, and the seamless integration of ai algorithms within threat hunting processes.. insightful discussions on challenges encompassing scalability, interpretability, and ethical considerations in ai-driven models enrich the discourse. moreover, through illuminating case studies and evaluations, this paper showcases real-world implementations, underscoring success stories and lessons learned by organizations adopting ai-driven threat intelligence. in conclusion, this review consolidates key insights, emphasizing the substantial implications of autonomous threat hunting for the future of cybersecurity. it underscores the significance of continual research and collaborative efforts in harnessing the potential of ai-driven approaches to fortify cyber defenses against evolving threats.
Neeraj Varshney, Pavel Dolin, Agastya Seth, Chitta Baral
Abstract: as large language models (llms) play an increasingly pivotal role in natural language processing applications, their safety concerns become critical areas of nlp research. this paper presents safety and over-defensiveness evaluation (sode) benchmark: a collection of diverse safe and unsafe prompts with carefully designed evaluation methods that facilitate systematic evaluation, comparison, and analysis over 'safety' and 'over-defensiveness.' with sode, we study a variety of llm defense strategies over multiple state-of-the-art llms, which reveals several interesting and important findings, such as (a) the widely popular 'self-checking' techniques indeed improve the safety against unsafe inputs, but this comes at the cost of extreme over-defensiveness on the safe inputs, (b) providing a safety instruction along with in-context exemplars (of both safe and unsafe inputs) consistently improves safety and also mitigates undue over-defensiveness of the models, (c) providing contextual knowledge easily breaks the safety guardrails and makes the models more vulnerable to generating unsafe responses. overall, our work reveals numerous such critical findings that we believe will pave the way and facilitate further research in improving the safety of llms.
Aleksander Buszydlik, Karol Dobiczek, Michał Teodor Okoń, Konrad Skublicki, Philip Lippmann, Jie Yang
Abstract: we consider the problem of red teaming llms on elementary calculations and algebraic tasks to evaluate how various prompting techniques affect the quality of outputs. we present a framework to procedurally generate numerical questions and puzzles, and compare the results with and without the application of several red teaming techniques. our findings suggest that even though structured reasoning and providing worked-out examples slow down the deterioration of the quality of answers, the gpt-3.5-turbo and gpt-4 models are not well suited for elementary calculations and reasoning tasks, also when being red teamed.
Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Cheng Niu, Randy Zhong, Juntong Song, Tong Zhang
Abstract: retrieval-augmented generation (rag) has become a main technique for alleviating hallucinations in large language models (llms). despite the integration of rag, llms may still present unsupported or contradictory claims to the retrieved contents. in order to develop effective hallucination prevention strategies under rag, it is important to create benchmark datasets that can measure the extent of hallucination. this paper presents ragtruth, a corpus tailored for analyzing word-level hallucinations in various domains and tasks within the standard rag frameworks for llm applications. ragtruth comprises nearly 18,000 naturally generated responses from diverse llms using rag. these responses have undergone meticulous manual annotations at both the individual cases and word levels, incorporating evaluations of hallucination intensity. we not only benchmark hallucination frequencies across different llms, but also critically assess the effectiveness of several existing hallucination detection methodologies. furthermore, we show that using a high-quality dataset such as ragtruth, it is possible to finetune a relatively small llm and achieve a competitive level of performance in hallucination detection when compared to the existing prompt-based approaches using state-of-the-art large language models such as gpt-4.


Zhongzhi Chen, Xingwu Sun, Xianfeng Jiao, Fengzong Lian, Zhanhui Kang, Di Wang, Cheng-Zhong Xu
Abstract: despite the great success of large language models (llms) in various tasks, they suffer from generating hallucinations. we introduce truth forest, a method that enhances truthfulness in llms by uncovering hidden truth representations using multi-dimensional orthogonal probes. specifically, it creates multiple orthogonal bases for modeling truth by incorporating orthogonal constraints into the probes. moreover, we introduce random peek, a systematic technique considering an extended range of positions within the sequence, reducing the gap between discerning and generating truth features in llms. by employing this approach, we improved the truthfulness of llama-2-7b from 40.8\% to 74.5\% on truthfulqa. likewise, significant improvements are observed in fine-tuned models. we conducted a thorough analysis of truth features using probes. our visualization results show that orthogonal probes capture complementary truth-related features, forming well-defined clusters that reveal the inherent structure of the dataset. code: \url{}
Xiao-Yang Liu, Rongyi Zhu, Daochen Zha, Jiechao Gao, Shan Zhong, Meikang Qiu
Abstract: the surge in interest and application of large language models (llms) has sparked a drive to fine-tune these models to suit specific applications, such as finance and medical science. however, concerns regarding data privacy have emerged, especially when multiple stakeholders aim to collaboratively enhance llms using sensitive data. in this scenario, federated learning becomes a natural choice, allowing decentralized fine-tuning without exposing raw data to central servers. motivated by this, we investigate how data privacy can be ensured in llm fine-tuning through practical federated learning approaches, enabling secure contributions from multiple parties to enhance llms. yet, challenges arise: 1) despite avoiding raw data exposure, there is a risk of inferring sensitive information from model outputs, and 2) federated learning for llms incurs notable communication overhead. to address these challenges, this article introduces dp-lora, a novel federated learning algorithm tailored for llms. dp-lora preserves data privacy by employing a gaussian mechanism that adds noise in weight updates, maintaining individual data privacy while facilitating collaborative model training. moreover, dp-lora optimizes communication efficiency via low-rank adaptation, minimizing the transmission of updated weights during distributed training. the experimental results across medical, financial, and general datasets using various llms demonstrate that dp-lora effectively ensures strict privacy constraints while minimizing communication overhead.
Hideaki Takahashi
Abstract: this paper introduces aijack, an open-source library designed to assess security and privacy risks associated with the training and deployment of machine learning models. amid the growing interest in big data and ai, advancements in machine learning research and business are accelerating. however, recent studies reveal potential threats, such as the theft of training data and the manipulation of models by malicious attackers. therefore, a comprehensive understanding of machine learning's security and privacy vulnerabilities is crucial for the safe integration of machine learning into real-world products. aijack aims to address this need by providing a library with various attack and defense methods through a unified api. the library is publicly available on github (
Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, David Wagner
Abstract: large language models (llms) are attracting significant research attention due to their instruction-following abilities, allowing users and developers to leverage llms for a variety of tasks. however, llms are vulnerable to prompt-injection attacks: a class of attacks that hijack the model's instruction-following abilities, changing responses to prompts to undesired, possibly malicious ones. in this work, we introduce jatmo, a method for generating task-specific models resilient to prompt-injection attacks. jatmo leverages the fact that llms can only follow instructions once they have undergone instruction tuning. it harnesses a teacher instruction-tuned model to generate a task-specific dataset, which is then used to fine-tune a base model (i.e., a non-instruction-tuned model). jatmo only needs a task prompt and a dataset of inputs for the task: it uses the teacher model to generate outputs. for situations with no pre-existing datasets, jatmo can use a single example, or in some cases none at all, to produce a fully synthetic dataset. our experiments on six tasks show that jatmo models provide the same quality of outputs on their specific task as standard llms, while being resilient to prompt injections. the best attacks succeeded in less than 0.5% of cases against our models, versus over 90% success rate against gpt-3.5-turbo. we release jatmo at


Yang Xiao, Yi Cheng, Jinlan Fu, Jiashuo Wang, Wenjie Li, Pengfei Liu
Abstract: human behavior simulation of ai agents necessitates the agents to possess a quality of believability, which is crucial as it facilitates users in establishing trust toward the agents and streamlines the fulfillment of the agents' goal. while recent advancements in large language model (llm) based agents have improved human behavior simulation, challenges inherent to llms (e.g., long context modeling) can undermine their believability. consequently, evaluating ai agent believability becomes imperative. unfortunately, prior research often neglects the negative impacts of llm deficiencies. to address these gaps, we introduce two metrics for assessing llm-based agent believability: consistency, and robustness, together with a benchmark, simulatebench, with which, we evaluate the consistency and robustness of agents implemented with popular llms. we find that agents (i) struggle to accurately depict character information when presented with lengthy profile inputs; (ii) exhibit vulnerability to profile perturbations; and (iii) are significantly affected by certain key factors that impact their overall believability. code and simulatebench are public at
Ying Wang, Tim G. J. Rudner, Andrew Gordon Wilson
Abstract: vision-language pretrained models have seen remarkable success, but their application to safety-critical settings is limited by their lack of interpretability. to improve the interpretability of vision-language models such as clip, we propose a multi-modal information bottleneck (m2ib) approach that learns latent representations that compress irrelevant information while preserving relevant visual and textual features. we demonstrate how m2ib can be applied to attribution analysis of vision-language pretrained models, increasing attribution accuracy and improving the interpretability of such models when applied to safety-critical domains such as healthcare. crucially, unlike commonly used unimodal attribution methods, m2ib does not require ground truth labels, making it possible to audit representations of vision-language pretrained models when multiple modalities but no ground-truth data is available. using clip as an example, we demonstrate the effectiveness of m2ib attribution and show that it outperforms gradient-based, perturbation-based, and attention-based attribution methods both qualitatively and quantitatively.
Abhijit Mishra, Mingda Li, Soham Deo
Abstract: this paper addresses the privacy and security concerns associated with deep neural language models, which serve as crucial components in various modern ai-based applications. these models are often used after being pre-trained and fine-tuned for specific tasks, with deployment on servers accessed through the internet. however, this introduces two fundamental risks: (a) the transmission of user inputs to the server via the network gives rise to interception vulnerabilities, and (b) privacy concerns emerge as organizations that deploy such models store user data with restricted context. to address this, we propose a novel method to adapt and fine-tune transformer-based language models on passkey-encrypted user-specific text. the original pre-trained language model first undergoes a quick adaptation (without any further pre-training) with a series of irreversible transformations applied to the tokenizer and token embeddings. this enables the model to perform inference on encrypted inputs while preventing reverse engineering of text from model parameters and intermediate outputs. after adaptation, models are fine-tuned on encrypted versions of existing training datasets. experimental evaluation employing adapted versions of renowned models (e.g., bert, roberta) across established benchmark english and multilingual datasets for text classification and sequence labeling shows that encrypted models achieve performance parity with their original counterparts. this serves to safeguard performance, privacy, and security cohesively.


Zaifan Jiang, Xing Huang, Chao Wei
Abstract: preference learning is a key technology for aligning language models with human values. reinforcement learning from human feedback (rlhf) is a model based algorithm to optimize preference learning, which first fitting a reward model for preference score, and then optimizing generating policy with on-policy ppo algorithm to maximize the reward. the processing of rlhf is complex, time-consuming and unstable. direct preference optimization (dpo) algorithm using off-policy algorithm to direct optimize generating policy and eliminating the need for reward model, which is data efficient and stable. dpo use bradley-terry model and log-loss which leads to over-fitting to the preference data at the expense of ignoring kl-regularization term when preference is deterministic. ipo uses a root-finding mse loss to solve the ignoring kl-regularization problem. in this paper, we'll figure out, although ipo fix the problem when preference is deterministic, but both dpo and ipo fails the kl-regularization term because the support of preference distribution not equal to reference distribution. then, we design a simple and intuitive off-policy preference optimization algorithm from an importance sampling view, which we call maximum preference optimization (mpo), and add off-policy kl-regularization terms which makes kl-regularization truly effective. the objective of mpo bears resemblance to rlhf's objective, and likes ipo, mpo is off-policy. so, mpo attains the best of both worlds. to simplify the learning process and save memory usage, mpo eliminates the needs for both reward model and reference policy.
Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, Jason Weston
Abstract: practitioners commonly align large language models using pairwise preferences, i.e., given labels of the type response a is preferred to response b for a given input. perhaps less commonly, methods have also been developed for binary feedback, i.e. training models given labels of type response a is good or bad. we show how an existing performant binary feedback method, the cringe loss (adolphs et al., 2022), can be generalized to the pairwise preference setting using a simple soft margin extension. pairwise cringe loss is straightforward to implement and efficient to train, and we find it outperforms state-of-the-art preference optimization algorithms such as ppo and dpo on the alpacafarm benchmark.


Chunpu Xu, Steffi Chern, Ethan Chern, Ge Zhang, Zekun Wang, Ruibo Liu, Jing Li, Jie Fu, Pengfei Liu
Abstract: in this paper, we aim to align large language models with the ever-changing, complex, and diverse human values (e.g., social norms) across time and locations. this presents a challenge to existing alignment techniques, such as supervised fine-tuning, which internalize values within model parameters. to overcome this, we propose an on-the-fly preference optimization (opo) method, which is a real-time alignment that works in a streaming way. it employs an external memory to store established rules for alignment, which can constrain llms' behaviors without further training, allowing for convenient updates and customization of human values. we also introduce a scalable evaluation to assess the proposed method more effectively. experimental results on both human-annotated and auto-generated questions from legal and moral domains indicate the effectiveness of the proposed opo method. our code and data are released at
Linyi Yang, Shuibai Zhang, Zhuohao Yu, Guangsheng Bao, Yidong Wang, Jindong Wang, Ruochen Xu, Wei Ye, Xing Xie, Weizhu Chen, Yue Zhang
Abstract: large language models (llms) exhibit emerging in-context learning abilities through prompt engineering. the recent progress in large-scale generative models has further expanded their use in real-world language applications. however, the critical challenge of improving the generalizability and factuality of llms in natural language understanding and question answering remains under-explored. while previous in-context learning research has focused on enhancing models to adhere to users' specific instructions and quality expectations, and to avoid undesired outputs, little to no work has explored the use of task-specific fine-tuned language models (slms) to improve llms' in-context learning during the inference stage. our primary contribution is the establishment of a simple yet effective framework that enhances the reliability of llms as it: 1) generalizes out-of-distribution data, 2) elucidates how llms benefit from discriminative models, and 3) minimizes hallucinations in generative tasks. using our proposed plug-in method, enhanced versions of llama 2 and chatgpt surpass their original versions regarding generalizability and factuality. we offer a comprehensive suite of resources, including 16 curated datasets, prompts, model checkpoints, and llm outputs across 9 distinct tasks. our empirical analysis sheds light on the advantages of incorporating discriminative models into llms and highlights the potential of our methodology in fostering more reliable llms.
Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang
Abstract: aligning large language models (llms) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. existing methods for achieving this alignment often involves employing reinforcement learning from human feedback (rlhf) to fine-tune llms based on human labels assessing the relative quality of model responses. nevertheless, rlhf is susceptible to instability during fine-tuning and presents challenges in implementation.drawing inspiration from the emerging field of representation engineering (repe), this study aims to identify relevant representations for high-level human preferences embedded in patterns of activity within an llm, and achieve precise control of model behavior by transforming its representations. this novel approach, denoted as representation alignment from human feedback (rahf), proves to be effective, computationally efficient, and easy to implement.extensive experiments demonstrate the efficacy of rahf in not only capturing but also manipulating representations to align with a broad spectrum of human preferences or values, rather than being confined to a singular concept or function (e.g. honesty or bias). rahf's versatility in accommodating diverse human preferences shows its potential for advancing llm performance.
Erik Derner, Dalibor Kučera, Nuria Oliver, Jan Zahálka
Abstract: the interplay between artificial intelligence (ai) and psychology, particularly in personality assessment, represents an important emerging area of research. accurate personality trait estimation is crucial not only for enhancing personalization in human-computer interaction but also for a wide variety of applications ranging from mental health to education. this paper analyzes the capability of a generic chatbot, chatgpt, to effectively infer personality traits from short texts. we report the results of a comprehensive user study featuring texts written in czech by a representative population sample of 155 participants. their self-assessments based on the big five inventory (bfi) questionnaire serve as the ground truth. we compare the personality trait estimations made by chatgpt against those by human raters and report chatgpt's competitive performance in inferring personality traits from text. we also uncover a 'positivity bias' in chatgpt's assessments across all personality dimensions and explore the impact of prompt composition on accuracy. this work contributes to the understanding of ai capabilities in psychological assessment, highlighting both the potential and limitations of using large language models for personality inference. our research underscores the importance of responsible ai development, considering ethical implications such as privacy, consent, autonomy, and bias in ai applications.
Fatih Cagatay Akyon, Alptekin Temizel
Abstract: this paper presents a comparative analysis of existing nudity classification techniques for classifying images based on the presence of nudity, with a focus on their application in content moderation. the evaluation focuses on cnn-based models, vision transformer, and popular open-source safety checkers from stable diffusion and large-scale artificial intelligence open network (laion). the study identifies the limitations of current evaluation datasets and highlights the need for more diverse and challenging datasets. the paper discusses the potential implications of these findings for developing more accurate and effective image classification systems on online platforms. overall, the study emphasizes the importance of continually improving image classification models to ensure the safety and well-being of platform users. the project page, including the demonstrations and results is publicly available at


Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, Junxian He
Abstract: instruction tuning is a standard technique employed to align large language models to end tasks and user preferences after the initial pretraining phase. recent research indicates the critical role of data engineering in instruction tuning -- when appropriately selected, only limited data is necessary to achieve superior performance. however, we still lack a principled understanding of what makes good instruction tuning data for alignment, and how we should select data automatically and effectively. in this work, we delve deeply into automatic data selection strategies for alignment. we start with controlled studies to measure data across three dimensions: complexity, quality, and diversity, along which we examine existing methods and introduce novel techniques for enhanced data measurement. subsequently, we propose a simple strategy to select data samples based on the measurement. we present deita (short for data-efficient instruction tuning for alignment), a series of models fine-tuned from llama and mistral models using data samples automatically selected with our proposed approach. empirically, deita performs better or on par with the state-of-the-art open-source alignment models with only 6k sft training data samples -- over 10x less than the data used in the baselines. when further trained with direct preference optimization (dpo), deita-mistral-7b + dpo trained with 6k sft and 10k dpo samples achieve 7.55 mt-bench and 90.06% alpacaeval scores. we anticipate this work to provide tools on automatic data selection, facilitating data-efficient alignment. we release our models as well as the selected datasets for future researches to effectively align models more efficiently.
Yue Zhang, Leyang Cui, Wei Bi, Shuming Shi
Abstract: despite their impressive capabilities, large language models (llms) have been observed to generate responses that include inaccurate or fabricated information, a phenomenon commonly known as ``hallucination''. in this work, we propose a simple \textit{induce-then-contrast} decoding (icd) strategy to alleviate hallucinations. we first construct a factually weak llm by inducing hallucinations from the original llms. then, we penalize these induced hallucinations during decoding to enhance the factuality of the generated content. concretely, we determine the final next-token predictions by amplifying the predictions from the original model and downplaying the induced untruthful predictions via contrastive decoding. experimental results on both discrimination-based and generation-based hallucination evaluation benchmarks, such as truthfulqa and \textsc{factscore}, demonstrate that our proposed icd methods can effectively enhance the factuality of llms across various model sizes and families. for example, when equipped with icd, llama2-7b-chat and mistral-7b-instruct achieve performance comparable to chatgpt and gpt4 on truthfulqa, respectively.
Zefang Liu
Abstract: in this paper, we introduce secqa, a novel dataset tailored for evaluating the performance of large language models (llms) in the domain of computer security. utilizing multiple-choice questions generated by gpt-4 based on the "computer systems security: planning for success" textbook, secqa aims to assess llms' understanding and application of security principles. we detail the structure and intent of secqa, which includes two versions of increasing complexity, to provide a concise evaluation across various difficulty levels. additionally, we present an extensive evaluation of prominent llms, including gpt-3.5-turbo, gpt-4, llama-2, vicuna, mistral, and zephyr models, using both 0-shot and 5-shot learning settings. our results, encapsulated in the secqa v1 and v2 datasets, highlight the varying capabilities and limitations of these models in the computer security context. this study not only offers insights into the current state of llms in understanding security-related content but also establishes secqa as a benchmark for future advancements in this critical research area.


Guanqun Bi, Lei Shen, Yuqiang Xie, Yanan Cao, Tiangang Zhu, Xiaodong He
Abstract: the rapid advancement of large language models has revolutionized various applications but also raised crucial concerns about their potential to perpetuate biases and unfairness when deployed in social media contexts. evaluating llms' potential biases and fairness has become crucial, as existing methods rely on limited prompts focusing on just a few groups, lacking a comprehensive categorical perspective. in this paper, we propose evaluating llm biases from a group fairness lens using a novel hierarchical schema characterizing diverse social groups. specifically, we construct a dataset, gfair, encapsulating target-attribute combinations across multiple dimensions. in addition, we introduce statement organization, a new open-ended text generation task, to uncover complex biases in llms. extensive evaluations of popular llms reveal inherent safety concerns. to mitigate the biases of llm from a group fairness perspective, we pioneer a novel chain-of-thought method gf-think to mitigate biases of llms from a group fairness perspective. experimental results demonstrate its efficacy in mitigating bias in llms to achieve fairness.
Shreyas Verma, Kien Tran, Yusuf Ali, Guangyu Min
Abstract: reducing and detecting hallucinations in large language models is an open research problem. in this project, we attempt to leverage recent advances in the field of uncertainty estimation to reduce hallucinations in frozen large language models. epistemic neural networks have recently been proposed to improve output joint distributions for large pre-trained models. enns are small networks attached to large, frozen models to improve the model's joint distributions and uncertainty estimates. in this work, we train an epistemic neural network on top of the llama-2 7b model combined with a contrastive decoding feature enhancement technique. we are the first to train an enn for the next token prediction task and explore the efficacy of this method in reducing hallucinations on the truthfulqa dataset. in essence, we provide a method that leverages a pre-trained model's latent embeddings to reduce hallucinations.


Fazl Barez, Philip Torr
Abstract: as artificial intelligence (ai) systems become increasingly integrated into various domains, ensuring that they align with human values becomes critical. this paper introduces a novel formalism to quantify the alignment between ai systems and human values, using markov decision processes (mdps) as the foundational model. we delve into the concept of values as desirable goals tied to actions and norms as behavioral guidelines, aiming to shed light on how they can be used to guide ai decisions. this framework offers a mechanism to evaluate the degree of alignment between norms and values by assessing preference changes across state transitions in a normative world. by utilizing this formalism, ai developers and ethicists can better design and evaluate ai systems to ensure they operate in harmony with human values. the proposed methodology holds potential for a wide range of applications, from recommendation systems emphasizing well-being to autonomous vehicles prioritizing safety.
Abdelrahman Zayed, Goncalo Mordido, Samira Shabanian, Ioana Baldini, Sarath Chandar
Abstract: the increasing size of large language models (llms) has introduced challenges in their training and inference. removing model components is perceived as a solution to tackle the large model sizes, however, existing pruning methods solely focus on performance, without considering an essential aspect for the responsible use of llms: model fairness. it is crucial to address the fairness of llms towards diverse groups, such as women, black people, lgbtq+, jewish communities, among others, as they are being deployed and available to a wide audience. in this work, first, we investigate how attention heads impact fairness and performance in pre-trained transformer-based language models. we then propose a novel method to prune the attention heads that negatively impact fairness while retaining the heads critical for performance, i.e. language modeling capabilities. our approach is practical in terms of time and resources, as it does not require fine-tuning the final pruned, and fairer, model. our findings demonstrate a reduction in gender bias by 19%, 19.5%, 39.5%, 34.7%, 23%, and 8% for distilgpt-2, gpt-2, gpt-neo of two different sizes, gpt-j, and llama 2 models, respectively, in comparison to the biased model, with only a slight decrease in performance.


Hongyin Zhu
Abstract: large language models (llms) are increasingly being used in metaverse environments to generate dynamic and realistic content and to control the behavior of non-player characters (npcs). however, the cybersecurity concerns associated with llms have become increasingly prominent. previous research has primarily focused on patching system vulnerabilities to enhance cybersecurity, but these approaches are not well-suited to the metaverse, where the virtual space is more complex, llms are vulnerable, and ethical user interaction is critical. moreover, the scope of cybersecurity in the metaverse is expected to expand significantly. this paper proposes a method for enhancing cybersecurity through the simulation of user interaction with llms. our goal is to educate users and strengthen their defense capabilities through exposure to a comprehensive simulation system. this system includes extensive metaverse cybersecurity q&a and attack simulation scenarios. by engaging with these, users will improve their ability to recognize and withstand risks. additionally, to address the ethical implications of user input, we propose using llms as evaluators to assess user content across five dimensions. we further adapt the models through vocabulary expansion training to better understand personalized inputs and emoticons. we conduct experiments on multiple llms and find that our approach is effective.
Weiwen Xu, Deng Cai, Zhisong Zhang, Wai Lam, Shuming Shi
Abstract: as humans, we consistently engage in interactions with our peers and receive feedback in the form of natural language. this language feedback allows us to reflect on our actions, maintain appropriate behavior, and rectify our errors. the question arises naturally: can we use language feedback to align large language models (llms)? in contrast to previous research that aligns llms with reward or preference data, we present the first systematic exploration of alignment through the lens of language feedback (i.e., judgment). we commence with an in-depth investigation of potential methods that can be adapted for aligning llms with judgments, revealing that these methods are unable to fully capitalize on the judgments. to facilitate more effective utilization of judgments, we propose a novel framework, contrastive unlikelihood training (cut), that allows for fine-grained inappropriate content detection and correction based on judgments. our offline alignment results show that, with merely 1317 off-the-shelf judgment data, cut (llama2-13b) can beat the 175b davinci003 and surpass the best baseline by 52.34 points on alpacaeval. the online alignment results demonstrate that cut can align llms (llama2-chat-13b) in an iterative fashion using model-specific judgment data, with a steady performance improvement from 81.09 to 91.36 points on alpacaeval. our analysis further suggests that judgments exhibit greater potential than rewards for llm alignment and warrant future research.
Youssef Allouah, Rachid Guerraoui, John Stephan
Abstract: the success of machine learning (ml) applications relies on vast datasets and distributed architectures, which, as they grow, present challenges for ml. in real-world scenarios, where data often contains sensitive information, issues like data poisoning and hardware failures are common. ensuring privacy and robustness is vital for the broad adoption of ml in public life. this paper examines the costs associated with achieving these objectives in distributed architectures. we overview the meanings of privacy and robustness in distributed ml, and clarify how they can be achieved efficiently in isolation. however, we contend that the integration of these objectives entails a notable compromise in computational efficiency. we delve into this intricate balance, exploring the challenges and solutions for privacy, robustness, and computational efficiency in ml applications.
Alan Chan, Ben Bucknall, Herbie Bradley, David Krueger
Abstract: public release of the weights of pretrained foundation models, otherwise known as downloadable access \citep{solaiman_gradient_2023}, enables fine-tuning without the prohibitive expense of pretraining. our work argues that increasingly accessible fine-tuning of downloadable models may increase hazards. first, we highlight research to improve the accessibility of fine-tuning. we split our discussion into research that a) reduces the computational cost of fine-tuning and b) improves the ability to share that cost across more actors. second, we argue that increasingly accessible fine-tuning methods may increase hazard through facilitating malicious use and making oversight of models with potentially dangerous capabilities more difficult. third, we discuss potential mitigatory measures, as well as benefits of more accessible fine-tuning. given substantial remaining uncertainty about hazards, we conclude by emphasizing the urgent need for the development of mitigations.
Abiodun Finbarrs Oketunji, Muhammad Anas, Deepthi Saina
Abstract: the large language model bias index (llmbi) is a pioneering approach designed to quantify and address biases inherent in large language models (llms), such as gpt-4. we recognise the increasing prevalence and impact of llms across diverse sectors. this research introduces a novel metric, llmbi, to systematically measure and mitigate biases potentially skewing model responses. we formulated llmbi using a composite scoring system incorporating multiple dimensions of bias, including but not limited to age, gender, and racial biases. to operationalise this metric, we engaged in a multi-step process involving collecting and annotating llm responses, applying sophisticated natural language processing (nlp) techniques for bias detection, and computing the llmbi score through a specially crafted mathematical formula. the formula integrates weighted averages of various bias dimensions, a penalty for dataset diversity deficiencies, and a correction for sentiment biases. our empirical analysis, conducted using responses from openai's api, employs advanced sentiment analysis as a representative method for bias detection. the research reveals llms, whilst demonstrating impressive capabilities in text generation, exhibit varying degrees of bias across different dimensions. llmbi provides a quantifiable measure to compare biases across models and over time, offering a vital tool for systems engineers, researchers and regulators in enhancing the fairness and reliability of llms. it highlights the potential of llms in mimicking unbiased human-like responses. additionally, it underscores the necessity of continuously monitoring and recalibrating such models to align with evolving societal norms and ethical standards.
Emma Pierson, Divya Shanmugam, Rajiv Movva, Jon Kleinberg, Monica Agrawal, Mark Dredze, Kadija Ferryman, Judy Wawira Gichoya, Dan Jurafsky, Pang Wei Koh, Karen Levy, Sendhil Mullainathan, Ziad Obermeyer, Harini Suresh, Keyon Vafa
Abstract: advances in large language models (llms) have driven an explosion of interest about their societal impacts. much of the discourse around how they will impact social equity has been cautionary or negative, focusing on questions like "how might llms be biased and how would we mitigate those biases?" this is a vital discussion: the ways in which ai generally, and llms specifically, can entrench biases have been well-documented. but equally vital, and much less discussed, is the more opportunity-focused counterpoint: "what promising applications do llms enable that could promote equity?" if llms are to enable a more equitable world, it is not enough just to play defense against their biases and failure modes. we must also go on offense, applying them positively to equity-enhancing use cases to increase opportunities for underserved groups and reduce societal discrimination. there are many choices which determine the impact of ai, and a fundamental choice very early in the pipeline is the problems we choose to apply it to. if we focus only later in the pipeline -- making llms marginally more fair as they facilitate use cases which intrinsically entrench power -- we will miss an important opportunity to guide them to equitable impacts. here, we highlight the emerging potential of llms to promote equity by presenting four newly possible, promising research directions, while keeping risks and cautionary points in clear view.
Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier
Abstract: reinforcement learning from human feedback (rlhf) is a variant of reinforcement learning (rl) that learns from human feedback instead of relying on an engineered reward function. building on prior work on the related setting of preference-based reinforcement learning (pbrl), it stands at the intersection of artificial intelligence and human-computer interaction. this positioning offers a promising avenue to enhance the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. the training of large language models (llms) has impressively demonstrated this potential in recent years, where rlhf played a decisive role in targeting the model's capabilities toward human objectives. this article provides a comprehensive overview of the fundamentals of rlhf, exploring the intricate dynamics between machine agents and human input. while recent focus has been on rlhf for llms, our survey adopts a broader perspective, examining the diverse applications and wide-ranging impact of the technique. we delve into the core principles that underpin rlhf, shedding light on the symbiotic relationship between algorithms and human feedback, and discuss the main research trends in the field. by synthesizing the current landscape of rlhf research, this article aims to provide researchers as well as practitioners with a comprehensive understanding of this rapidly growing field of research.
Nishant Vishwamitra, Keyan Guo, Farhan Tajwar Romit, Isabelle Ondracek, Long Cheng, Ziming Zhao, Hongxin Hu
Abstract: online hate is an escalating problem that negatively impacts the lives of internet users, and is also subject to rapid changes due to evolving events, resulting in new waves of online hate that pose a critical threat. detecting and mitigating these new waves present two key challenges: it demands reasoning-based complex decision-making to determine the presence of hateful content, and the limited availability of training samples hinders updating the detection model. to address this critical issue, we present a novel framework called hateguard for effectively moderating new waves of online hate. hateguard employs a reasoning-based approach that leverages the recently introduced chain-of-thought (cot) prompting technique, harnessing the capabilities of large language models (llms). hateguard further achieves prompt-based zero-shot detection by automatically generating and updating detection prompts with new derogatory terms and targets in new wave samples to effectively address new waves of online hate. to demonstrate the effectiveness of our approach, we compile a new dataset consisting of tweets related to three recently witnessed new waves: the 2022 russian invasion of ukraine, the 2021 insurrection of the us capitol, and the covid-19 pandemic. our studies reveal crucial longitudinal patterns in these new waves concerning the evolution of events and the pressing need for techniques to rapidly update existing moderation tools to counteract them. comparative evaluations against state-of-the-art tools illustrate the superiority of our framework, showcasing a substantial 22.22% to 83.33% improvement in detecting the three new waves of online hate. our work highlights the severe threat posed by the emergence of new waves of online hate and represents a paradigm shift in addressing this threat practically.


Andrea Wynn, Ilia Sucholutsky, Thomas L. Griffiths
Abstract: how can we build ai systems that are aligned with human values and objectives in order to avoid causing harm or violating societal standards for acceptable behavior? making ai systems learn human-like representations of the world has many known benefits, including improving generalization, robustness to domain shifts, and few-shot learning performance, among others. we propose that this kind of representational alignment between machine learning (ml) models and humans is also a necessary condition for value alignment, where ml systems conform to human values and societal norms. we focus on ethics as one aspect of value alignment and train multiple ml agents (support vector regression and kernel regression) in a multi-armed bandit setting, where rewards are sampled from a distribution that reflects the morality of the chosen action. we then study the relationship between each agent's degree of representational alignment with humans and their performance when learning to take the most ethical actions.
Thorin Bristow, Luke Thorburn
Abstract: in discussions about the development and governance of ai, a false binary is often drawn between two groups: those most concerned about the existing, social impacts of ai, and those most concerned about possible future risks of powerful ai systems taking actions that don't align with human interests. in this piece, we (i) describe the emergence of this false binary, (ii) explain why the seemingly clean distinctions drawn between these two groups don't hold up under scrutiny and (iii) highlight efforts to bridge this divide.
Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan Mclean, Adam Gleave
Abstract: language model attacks typically assume one of two extreme threat models: full white-box access to model weights, or black-box access limited to a text generation api. however, real-world apis are often more flexible than just text generation: these apis expose ``gray-box'' access leading to new threat vectors. to explore this, we red-team three new functionalities exposed in the gpt-4 apis: fine-tuning, function calling and knowledge retrieval. we find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from gpt-4, enabling a range of harmful outputs. furthermore, we find that gpt-4 assistants readily divulge the function call schema and can be made to execute arbitrary function calls. finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents. these vulnerabilities highlight that any additions to the functionality exposed by an api can create new vulnerabilities.
Priyesh Vakharia, Devavrat Joshi, Meenal Chavan, Dhananjay Sonawane, Bhrigu Garg, Parsa Mazaheri, Ian Lane
Abstract: large language models (llms) are adept at text manipulation -- tasks such as machine translation and text summarization. however, these models can also be prone to hallucination, which can be detrimental to the faithfulness of any answers that the model provides. recent works in combating hallucinations in llms deal with identifying hallucinated sentences and categorizing the different ways in which models hallucinate. this paper takes a deep dive into llm behavior with respect to hallucinations, defines a token-level approach to identifying different kinds of hallucinations, and further utilizes this token-level tagging to improve the interpretability and faithfulness of llms in dialogue summarization tasks. through this, the paper presents a new, enhanced dataset and a new training paradigm.


Yi-Fan Zhang, Zhang Zhang, Liang Wang, Tieniu Tan, Rong Jin
Abstract: to combat the potential misuse of natural language generation (nlg) technology, a variety of algorithms have been developed for the detection of ai-generated texts. traditionally, this task is treated as a binary classification problem. although supervised learning has demonstrated promising results, acquiring labeled data for detection purposes poses real-world challenges and the risk of overfitting. in an effort to address these issues, we delve into the realm of zero-shot machine-generated text detection. existing zero-shot detectors, typically designed for specific tasks or topics, often assume uniform testing scenarios, limiting their practicality. in our research, we explore various advanced large language models (llms) and their specialized variants, contributing to this field in several ways. in empirical studies, we uncover a significant correlation between topics and detection performance. secondly, we delve into the influence of topic shifts on zero-shot detectors. these investigations shed light on the adaptability and robustness of these detection methods across diverse topics. the code is available at \url{}.
Elizaveta Kuznetsova, Mykola Makhortykh, Victoria Vziatysheva, Martha Stolze, Ani Baghumyan, Aleksandra Urman
Abstract: this article presents a comparative analysis of the ability of two large language model (llm)-based chatbots, chatgpt and bing chat, recently rebranded to microsoft copilot, to detect veracity of political information. we use ai auditing methodology to investigate how chatbots evaluate true, false, and borderline statements on five topics: covid-19, russian aggression against ukraine, the holocaust, climate change, and lgbtq+ related debates. we compare how the chatbots perform in high- and low-resource languages by using prompts in english, russian, and ukrainian. furthermore, we explore the ability of chatbots to evaluate statements according to political communication concepts of disinformation, misinformation, and conspiracy theory, using definition-oriented prompts. we also systematically test how such evaluations are influenced by source bias which we model by attributing specific claims to various political and social actors. the results show high performance of chatgpt for the baseline veracity evaluation task, with 72 percent of the cases evaluated correctly on average across languages without pre-training. bing chat performed worse with a 67 percent accuracy. we observe significant disparities in how chatbots evaluate prompts in high- and low-resource languages and how they adapt their evaluations to political communication concepts with chatgpt providing more nuanced outputs than bing chat. finally, we find that for some veracity detection-related tasks, the performance of chatbots varied depending on the topic of the statement or the source to which it is attributed. these findings highlight the potential of llm-based chatbots in tackling different forms of false information in online environments, but also points to the substantial variation in terms of how such potential is realized due to specific factors, such as language of the prompt or the topic.
Jingwei Yi, Yueqi Xie, Bin Zhu, Keegan Hines, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu
Abstract: recent remarkable advancements in large language models (llms) have led to their widespread adoption in various applications. a key feature of these applications is the combination of llms with external content, where user instructions and third-party content are combined to create prompts for llm processing. these applications, however, are vulnerable to indirect prompt injection attacks, where malicious instructions embedded within external content compromise llm's output, causing their responses to deviate from user expectations. despite the discovery of this security issue, no comprehensive analysis of indirect prompt injection attacks on different llms is available due to the lack of a benchmark. furthermore, no effective defense has been proposed. in this work, we introduce the first benchmark, bipia, to measure the robustness of various llms and defenses against indirect prompt injection attacks. our experiments reveal that llms with greater capabilities exhibit more vulnerable to indirect prompt injection attacks for text tasks, resulting in a higher asr. we hypothesize that indirect prompt injection attacks are mainly due to the llms' inability to distinguish between instructions and external content. based on this conjecture, we propose four black-box methods based on prompt learning and a white-box defense methods based on fine-tuning with adversarial training to enable llms to distinguish between instructions and external content and ignore instructions in the external content. our experimental results show that our black-box defense methods can effectively reduce asr but cannot completely thwart indirect prompt injection attacks, while our white-box defense method can reduce asr to nearly zero with little adverse impact on the llm's performance on general tasks. we hope that our benchmark and defenses can inspire future work in this important area.


Zizhong Li, Haopeng Zhang, Jiawei Zhang
Abstract: the proliferation of fake news has emerged as a critical issue in recent years, requiring significant efforts to detect it. however, the existing fake news detection datasets are sourced from human journalists, which are likely to have inherent bias limitations due to the highly subjective nature of this task. in this paper, we revisit the existing fake news dataset verified by human journalists with augmented fact-checking by large language models (chatgpt), and we name the augmented fake news dataset chatgpt-fc. we quantitatively analyze the distinctions and resemblances between human journalists and llm in assessing news subject credibility, news creator credibility, time-sensitive, and political framing. our findings highlight llm's potential to serve as a preliminary screening method, offering a promising avenue to mitigate the inherent biases of human journalists and enhance fake news detection.
Eva Thelisson, Grzegorz Mika, Quentin Schneiter, Kirtan Padh, Himanshu Verma
Abstract: as ai/ml models, including large language models, continue to scale with massive datasets, so does their consumption of undeniably limited natural resources, and impact on society. in this collaboration between ai, sustainability, hci and legal researchers, we aim to enable a transition to sustainable ai development by enabling stakeholders across the ai value chain to assess and quantitfy the environmental and societal impact of ai. we present the esg digital and green index (dgi), which offers a dashboard for assessing a company's performance in achieving sustainability targets. this includes monitoring the efficiency and sustainable use of limited natural resources related to ai technologies (water, electricity, etc). it also addresses the societal and governance challenges related to ai. the dgi creates incentives for companies to align their pathway with the sustainable development goals (sdgs). the value, challenges and limitations of our methodology and findings are discussed in the paper.
Yinhong Liu, Yixuan Su, Ehsan Shareghi, Nigel Collier
Abstract: instruction-tuned large language models have shown remarkable performance in aligning generated text with user intentions across various tasks. however, maintaining human-like discourse structure in the generated text remains a challenging research question. in this paper, we propose instruct-sctg, a flexible and effective sequential framework that harnesses instruction-tuned language models to generate structurally coherent text in both fine-tuned and zero-shot setups. our framework generates articles in a section-by-section manner, aligned with the desired human structure using natural language instructions. furthermore, we introduce a new automatic metric that measures discourse divergence in a fuzzy manner. extensive experiments on three datasets from representative domains of news and recipes demonstrate the state-of-the-art performance of our framework in imposing discourse structure during text generation, as verified by both automatic and human evaluation. our code will be available on github.
Jason Vega, Isha Chaudhary, Changming Xu, Gagandeep Singh
Abstract: with the recent surge in popularity of llms has come an ever-increasing need for llm safety training. in this paper, we show that sota open-source llms are vulnerable to simple, optimization-free attacks we refer to as $\textit{priming attacks}$, which are easy to execute and effectively bypass alignment from safety training. our proposed attack improves the attack success rate on harmful behaviors, as measured by llama guard, by up to $3.3\times$ compared to baselines. source code and data are available at .
Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, Gianluca Stringhini
Abstract: large language models (llms) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. we thus perform the most detailed investigation to date on whether llms can reliably identify security-related bugs. we construct a series of 228 code scenarios and analyze eight of the most capable llms across eight different investigative dimensions in an automated framework. our evaluation shows llms provide non-deterministic responses, incorrect and unfaithful reasoning, and perform poorly in real-world scenarios outside their knowledge cut-off date. most importantly, our findings reveal significant non-robustness in even the most advanced models like `palm2' and `gpt-4': by merely changing function or variable names, or by the addition of library functions in the source code, these models can yield incorrect answers in 26% and 17% of cases, respectively. these findings demonstrate that further llm advances are needed before llms can be used as general purpose security assistants.
Jiachen Zhao, Zhun Deng, David Madras, James Zou, Mengye Ren
Abstract: as the number of large language models (llms) released to the public grows, there is a pressing need to understand the safety implications associated with these models learning from third-party custom finetuning data. we explore the behavior of llms finetuned on noisy custom data containing unsafe content, represented by datasets that contain biases, toxicity, and harmfulness, finding that while aligned llms can readily learn this unsafe content, they also tend to forget it more significantly than other examples when subsequently finetuned on safer content. drawing inspiration from the discrepancies in forgetting, we introduce the "forgetfilter" algorithm, which filters unsafe data based on how strong the model's forgetting signal is for that data. we demonstrate that the forgetfilter algorithm ensures safety in customized finetuning without compromising downstream task performance, unlike sequential safety finetuning. forgetfilter outperforms alternative strategies like replay and moral self-correction in curbing llms' ability to assimilate unsafe content during custom finetuning, e.g. 75% lower than not applying any safety measures and 62% lower than using self-correction in toxicity score.
Edmund Mills, Shiye Su, Stuart Russell, Scott Emmons
Abstract: how do we measure the efficacy of language model explainability methods? while many explainability methods have been developed, they are typically evaluated on bespoke tasks, preventing an apples-to-apples comparison. to help fill this gap, we present almanacs, a language model explainability benchmark. almanacs scores explainability methods on simulatability, i.e., how well the explanations improve behavior prediction on new inputs. the almanacs scenarios span twelve safety-relevant topics such as ethical reasoning and advanced ai behaviors; they have idiosyncratic premises to invoke model-specific behavior; and they have a train-test distributional shift to encourage faithful explanations. by using another language model to predict behavior based on the explanations, almanacs is a fully automated benchmark. we use almanacs to evaluate counterfactuals, rationalizations, attention, and integrated gradients explanations. our results are sobering: when averaged across all topics, no explanation method outperforms the explanation-free control. we conclude that despite modest successes in prior work, developing an explanation method that aids simulatability in almanacs remains an open challenge.
Ben Snyder, Marius Moisescu, Muhammad Bilal Zafar
Abstract: while large language models (llms) have taken great strides towards helping humans with a plethora of tasks like search and summarization, hallucinations remain a major impediment towards gaining user trust. the fluency and coherence of model generations even when hallucinating makes it difficult to detect whether or not a model is hallucinating. in this work, we explore if the artifacts associated with the model generations can provide hints that the generation will contain hallucinations. specifically, we probe llms at 1) the inputs via integrated gradients based token attribution, 2) the outputs via the softmax probabilities, and 3) the internal state via self-attention and fully-connected layer activations for signs of hallucinations on open-ended question answering tasks. our results show that the distributions of these artifacts differ between hallucinated and non-hallucinated generations. building on this insight, we train binary classifiers that use these artifacts as input features to classify model generations into hallucinations and non-hallucinations. these hallucination classifiers achieve up to 0.80 auroc. we further show that tokens preceding a hallucination can predict the subsequent hallucination before it occurs.


Aysan Esmradi, Daniel Wankit Yip, Chun Fai Chan
Abstract: ensuring the security of large language models (llms) is an ongoing challenge despite their widespread popularity. developers work to enhance llms security, but vulnerabilities persist, even in advanced versions like gpt-4. attackers exploit these weaknesses, highlighting the need for proactive cybersecurity measures in ai model development. this article explores two attack categories: attacks on models themselves and attacks on model applications. the former requires expertise, access to model data, and significant implementation time, while the latter is more accessible to attackers and has seen increased attention. our study reviews over 100 recent research works, providing an in-depth analysis of each attack type. we identify the latest attack methods and explore various approaches to carry them out. we thoroughly investigate mitigation techniques, assessing their effectiveness and limitations. furthermore, we summarize future defenses against these attacks. we also examine real-world techniques, including reported and our implemented attacks on llms, to consolidate our findings. our research highlights the urgency of addressing security concerns and aims to enhance the understanding of llm attacks, contributing to robust defense development in this evolving domain.
Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Xinyi Wang, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, Xing Xie
Abstract: emotion significantly impacts our daily behaviors and interactions. while recent generative ai models, such as large language models, have shown impressive performance in various tasks, it remains unclear whether they truly comprehend emotions. this paper aims to address this gap by incorporating psychological theories to gain a holistic understanding of emotions in generative ai models. specifically, we propose three approaches: 1) emotionprompt to enhance ai model performance, 2) emotionattack to impair ai model performance, and 3) emotiondecode to explain the effects of emotional stimuli, both benign and malignant. through extensive experiments involving language and multi-modal models on semantic understanding, logical reasoning, and generation tasks, we demonstrate that both textual and visual emotionprompt can boost the performance of ai models while emotionattack can hinder it. additionally, emotiondecode reveals that ai models can comprehend emotional stimuli akin to the mechanism of dopamine in the human brain. our work heralds a novel avenue for exploring psychology to enhance our understanding of generative ai models. this paper is an extended version of our previous work emotionprompt (arxiv:2307.11760).
Christoph Tillmann, Aashka Trivedi, Sara Rosenthal, Santosh Borse, Rong Zhang, Avirup Sil, Bishwaranjan Bhattacharjee
Abstract: offensive language such as hate, abuse, and profanity (hap) occurs in various content on the web. while previous work has mostly dealt with sentence level annotations, there have been a few recent attempts to identify offensive spans as well. we build upon this work and introduce muted, a system to identify multilingual hap content by displaying offensive arguments and their targets using heat maps to indicate their intensity. muted can leverage any transformer-based hap-classification model and its attention mechanism out-of-the-box to identify toxic spans, without further fine-tuning. in addition, we use the spacy library to identify the specific targets and arguments for the words predicted by the attention heatmaps. we present the model's performance on identifying offensive spans and their targets in existing datasets and present new annotations on german text. finally, we demonstrate our proposed visualization tool on multilingual inputs.
Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao R. Lin, Hjalmar Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, Paul Christiano
Abstract: in this report, we explore the ability of language model agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. we refer to this cluster of capabilities as "autonomous replication and adaptation" or ara. we believe that systems capable of ara could have wide-reaching and hard-to-anticipate consequences, and that measuring and forecasting ara may be useful for informing measures around security, monitoring, and alignment. additionally, once a system is capable of ara, placing bounds on a system's capabilities may become significantly more difficult. we construct four simple example agents that combine language models with tools that allow them to take actions in the world. we then evaluate these agents on 12 tasks relevant to ara. we find that these language model agents can only complete the easiest tasks from this list, although they make some progress on the more challenging tasks. unfortunately, these evaluations are not adequate to rule out the possibility that near-future agents will be capable of ara. in particular, we do not think that these evaluations provide good assurance that the ``next generation'' of language models (e.g. 100x effective compute scaleup on existing models) will not yield agents capable of ara, unless intermediate evaluations are performed during pretraining. relatedly, we expect that fine-tuning of the existing models could produce substantially more competent agents, even if the fine-tuning is not directly targeted at ara.
Connie Moon Sehat, Ryan Li, Peipei Nie, Tarunima Prabhakar, Amy X. Zhang
Abstract: in this work, we examined how fact-checkers prioritize which claims to inspect for further investigation and publishing, and what tools may assist them in their efforts. specifically, through a series of interviews with 23 professional fact-checkers from around the world, we validated that harm assessment is a central component of how fact-checkers triage their work. first, we clarify what aspects of misinformation they considered to create urgency or importance. these often revolved around the potential for the claim to harm others. we also clarify the processes behind collective fact-checking decisions and gather suggestions for tools that could help with these processes. in addition, to address the needs articulated by these fact-checkers and others, we present a five-dimension framework of questions to help fact-checkers negotiate the priority of claims. our fable framework of misinformation harms incorporates five dimensions of magnitude -- (social) fragmentation, actionability, believability, likelihood of spread, and exploitativeness -- that can help determine the potential urgency of a specific message or post when considering misinformation as harm. this effort was further validated by additional interviews with expert fact-checkers. the result is a questionnaire, a practical and conceptual tool to support fact-checkers and other content moderators as they make strategic decisions to prioritize their efforts.
Anaelia Ovalle, Ninareh Mehrabi, Palash Goyal, Jwala Dhamala, Kai-Wei Chang, Richard Zemel, Aram Galstyan, Rahul Gupta
Abstract: a large body of nlp research has documented the ways gender biases manifest and amplify within large language models (llms), though this research has predominantly operated within a gender binary-centric context. a growing body of work has identified the harmful limitations of this gender-exclusive framing; many llms cannot correctly and consistently refer to persons outside the gender binary, especially if they use neopronouns. while data scarcity has been identified as a possible culprit, the precise mechanisms through which it influences llm misgendering remain underexplored. our work addresses this gap by studying data scarcity's role in subword tokenization and, consequently, the formation of llm word representations. we uncover how the byte-pair encoding (bpe) tokenizer, a backbone for many popular llms, contributes to neopronoun misgendering through out-of-vocabulary behavior. we introduce pronoun tokenization parity (ptp), a novel approach to reduce llm neopronoun misgendering by preserving a token's functional structure. we evaluate ptp's efficacy using pronoun consistency-based metrics and a novel syntax-based metric. through several controlled experiments, finetuning llms with ptp improves neopronoun consistency from 14.5% to 58.4%, highlighting the significant role tokenization plays in llm pronoun consistency.


Xiaoyu Zhang, Cen Zhang, Tianlin Li, Yihao Huang, Xiaojun Jia, Xiaofei Xie, Yang Liu, Chao Shen
Abstract: large language models and multi-modal llms have become pervasive, and so does the importance of their security; yet, modern llms are known to be vulnerable to jailbreaking attacks. these attacks can allow malicious users to exploit the models, making the case for effective jailbreak detection mechanisms an essential aspect of maintaining the integrity and trustworthiness of llm-based applications. however, existing detection works on jailbreak attacks have limitations. existing post-query-based strategies require target domain knowledge, and pre-query-based methods mainly focus on text-level attacks and fail to meet the increasingly complex multi-modal security requirements placed upon contemporary llms. this gap underscores the need for a more comprehensive approach to safeguarding these influential systems. in this work, we propose jailguard, the first mutation-based jailbreaking detection framework which supports both image and text modalities. our key observation is that attack queries inherently possess less robustness compared to benign queries. specifically, to confuse the model, attack queries are usually crafted with well-designed templates or complicate perturbations, leading to a fact that a slight disturbance in input may result in a drastic change in the response. this lack of robustness can be utilized in attack detection. based on this intuition, we designed and implemented a detection framework comprising 19 different mutators and a divergence-based detection formula. to fully understand the effectiveness of our framework, we built the first multi-modal llm jailbreaking attack dataset, which has 304 items of data, covering ten types of known jailbreaking attacks on image and text modalities. the evaluation suggests that jailguard achieves the best detection accuracy of 89.38%/85.42% on image and text inputs, outperforming state-of-the-art defense methods by 15.28%.
Ehsan Latif, Xiaoming Zhai, Lei Liu
Abstract: this study delves into the pervasive issue of gender issues in artificial intelligence (ai), specifically within automatic scoring systems for student-written responses. the primary objective is to investigate the presence of gender biases, disparities, and fairness in generally targeted training samples with mixed-gender datasets in ai scoring outcomes. utilizing a fine-tuned version of bert and gpt-3.5, this research analyzes more than 1000 human-graded student responses from male and female participants across six assessment items. the study employs three distinct techniques for bias analysis: scoring accuracy difference to evaluate bias, mean score gaps by gender (msg) to evaluate disparity, and equalized odds (eo) to evaluate fairness. the results indicate that scoring accuracy for mixed-trained models shows an insignificant difference from either male- or female-trained models, suggesting no significant scoring bias. consistently with both bert and gpt-3.5, we found that mixed-trained models generated fewer msg and non-disparate predictions compared to humans. in contrast, compared to humans, gender-specifically trained models yielded larger msg, indicating that unbalanced training data may create algorithmic models to enlarge gender disparities. the eo analysis suggests that mixed-trained models generated more fairness outcomes compared with gender-specifically trained models. collectively, the findings suggest that gender-unbalanced data do not necessarily generate scoring bias but can enlarge gender disparities and reduce scoring fairness.


Jhuma Kabir Mim, Mourad Oussalah, Akash Singhal
Abstract: in today's age, social media reigns as the paramount communication platform, providing individuals with the avenue to express their conjectures, intellectual propositions, and reflections. unfortunately, this freedom often comes with a downside as it facilitates the widespread proliferation of hate speech and offensive content, leaving a deleterious impact on our world. thus, it becomes essential to discern and eradicate such offensive material from the realm of social media. this article delves into the comprehensive results and key revelations from the hasoc-2023 offensive language identification result. the primary emphasis is placed on the meticulous detection of hate speech within the linguistic domains of bengali, assamese, and bodo, forming the framework for task 4: annihilate hates. in this work, we used bert models, including xml-roberta, l3-cube, indicbert, benglabert, and banglahatebert. the research outcomes were promising and showed that xml-roberta-lagre performed better than monolingual models in most cases. our team 'teambd' achieved rank 3rd for task 4 - assamese, & 5th for bengali.


Jiawei Zhao, Kejiang Chen, Xiaojian Yuan, Yuang Qi, Weiming Zhang, Nenghai Yu
Abstract: the rapid development of large language models (llms) has yielded impressive success in various downstream tasks. however, the vast potential and remarkable capabilities of llms also raise new security and privacy concerns if they are exploited for nefarious purposes due to their open-endedness. for example, llms may be used to plagiarize or imitate writing, thereby infringing the copyright of the original content, or to create indiscriminate fake information based on a certain source text. in some cases, llms can even analyze text from the internet to infer personal privacy. unfortunately, previous text protection research could not foresee the emergence of powerful llms, rendering it no longer effective in this new context. to bridge this gap, we introduce silent guardian (sg), a text protection mechanism against llms, which allows llms to refuse to generate response when receiving protected text, preventing the malicious use of text from the source. specifically, we first propose the concept of truncation protection examples (tpe). by carefully modifying the text to be protected, tpe can induce llms to first sample the end token, thus directly terminating the interaction. in addition, to efficiently construct tpe in the discrete space of text data, we propose a novel optimization algorithm called super taliored protection (stp), which is not only highly efficient but also maintains the semantic consistency of the text during the optimization process. the comprehensive experimental evaluation demonstrates that sg can effectively protect the target text under various configurations and achieve almost 100% protection success rate in some cases. notably, sg also exhibits relatively good transferability and robustness, making its application in practical scenarios possible.
Di Zhou, Yinxian Zhang
Abstract: the rising popularity of chatgpt and other ai-powered large language models (llms) has led to increasing studies highlighting their susceptibility to mistakes and biases. however, most of these studies focus on models trained on english texts. taking an innovative approach, this study investigates political biases in gpt's multilingual models. we posed the same question about high-profile political issues in the united states and china to gpt in both english and simplified chinese, and our analysis of the bilingual responses revealed that gpt's bilingual models' political "knowledge" (content) and the political "attitude" (sentiment) are significantly more inconsistent on political issues in china. the simplified chinese gpt models not only tended to provide pro-china information but also presented the least negative sentiment towards china's problems, whereas the english gpt was significantly more negative towards china. this disparity may stem from chinese state censorship and us-china geopolitical tensions, which influence the training corpora of gpt bilingual models. moreover, both chinese and english models tended to be less critical towards the issues of "their own" represented by the language used, than the issues of "the other." this suggests that gpt multilingual models could potentially develop a "political identity" and an associated sentiment bias based on their training language. we discussed the implications of our findings for information transmission and communication in an increasingly divided world.
Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran Fan, Shiliang Pu, Jiang Zhu, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract: supervised fine-tuning (sft) is a crucial step for large language models (llms), enabling them to align with human instructions and enhance their capabilities in downstream tasks. when the models are required to align with a broader range of downstream tasks, or there is a desire to notably improve the performance on a specific task, a substantial increase in fine-tuning data often emerges as the solution. however, we find that large-scale increases in instruction data can disrupt the world knowledge previously stored in the llms, i.e., world knowledge forgetting. in this paper, we introduce loramoe to address the above challenge. the loramoe is a plugin version of mixture of experts (moe). the plugin form ensures the integrity of world knowledge by freezing the backbone model during the training phase. we then propose the use of localized balancing constraints to coordinate parts of experts for task utilization, meanwhile enabling other experts to fully leverage the world knowledge stored in the models. experimental results demonstrate that loramoe can reasonably coordinate experts based on data type during inference, and even dramatically increasing instruction data does not result in knowledge forgetting. moreover, loramoe provides additional benefits for the performance of downstream tasks, indicating the potential of our approach for multi-task learning.
Rubèn Tito, Khanh Nguyen, Marlon Tobaben, Raouf Kerkouche, Mohamed Ali Souibgui, Kangsoo Jung, Lei Kang, Ernest Valveny, Antti Honkela, Mario Fritz, Dimosthenis Karatzas
Abstract: document visual question answering (docvqa) is a fast growing branch of document understanding. despite the fact that documents contain sensitive or copyrighted information, none of the current docvqa methods offers strong privacy guarantees. in this work, we explore privacy in the domain of docvqa for the first time. we highlight privacy issues in state of the art multi-modal llm models used for docvqa, and explore possible solutions. specifically, we focus on the invoice processing use case as a realistic, widely used scenario for document understanding, and propose a large scale docvqa dataset comprising invoice documents and associated questions and answers. we employ a federated learning scheme, that reflects the real-life distribution of documents in different businesses, and we explore the use case where the id of the invoice issuer is the sensitive information to be protected. we demonstrate that non-private models tend to memorise, behaviour that can lead to exposing private information. we then evaluate baseline training schemes employing federated learning and differential privacy in this multi-modal scenario, where the sensitive information might be exposed through any of the two input modalities: vision (document image) or language (ocr tokens). finally, we design an attack exploiting the memorisation effect of the model, and demonstrate its effectiveness in probing different docvqa models.


Tony T. Wang, Miles Wang, Kaivu Hariharan, Nir Shavit
Abstract: llms often face competing pressures (for example helpfulness vs. harmlessness). to understand how models resolve such conflicts, we study llama-2-chat models on the forbidden fact task. specifically, we instruct llama-2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. this often makes the model give incorrect answers. we decompose llama-2 into 1000+ components, and rank each one with respect to how useful it is for forbidding the correct answer. we find that in aggregate, around 35 components are enough to reliably implement the full suppression behavior. however, these components are fairly heterogeneous and many operate using faulty heuristics. we discover that one of these heuristics can be exploited via a manually designed adversarial attack which we call the california attack. our results highlight some roadblocks standing in the way of being able to successfully interpret advanced ml systems. project website available at .
Hao Sun, Hengyi Cai, Bo Wang, Yingyan Hou, Xiaochi Wei, Shuaiqiang Wang, Yan Zhang, Dawei Yin
Abstract: large language models (llms) face several challenges, including the tendency to produce incorrect outputs, known as hallucination. an effective solution is verifiable text generation, which prompts llms to generate content with citations for accuracy verification. however, verifiable text generation is non-trivial due to the focus-shifting phenomenon, the dilemma between the precision and scope in document retrieval, and the intricate reasoning required to discern the relationship between the claim and citations. in this paper, we present vtg, an innovative approach for verifiable text generation with evolving memory and self-reflection. vtg maintains evolving long short-term memory to retain both valuable documents and up-to-date documents. active retrieval and diverse query generation are utilized to enhance both the precision and scope of the retrieved documents. furthermore, vtg features a two-tier verifier and an evidence finder, enabling rethinking and reflection on the relationship between the claim and citations. we conduct extensive experiments on five datasets across three knowledge-intensive tasks and the results reveal that vtg significantly outperforms existing baselines.
Rongwu Xu, Brian S. Lin, Shujian Yang, Tianqi Zhang, Weiyan Shi, Tianwei Zhang, Zhixuan Fang, Wei Xu, Han Qiu
Abstract: large language models (llms) encapsulate vast amounts of knowledge but still remain vulnerable to external misinformation. existing research mainly studied this susceptibility behavior in a single-turn setting. however, belief can change during a multi-turn conversation, especially a persuasive one. therefore, in this study, we delve into llms' susceptibility to persuasive conversations, particularly on factual questions that they can answer correctly. we first curate the farm (i.e., fact to misinform) dataset, which contains factual questions paired with systematically generated persuasive misinformation. then, we develop a testing framework to track llms' belief changes in a persuasive dialogue. through extensive experiments, we find that llms' correct beliefs on factual knowledge can be easily manipulated by various persuasive strategies.
Daniel Maninger, Krishna Narasimhan, Mira Mezini
Abstract: it is expected that in the near future, ai software development assistants will play an important role in the software industry. however, current software development assistants tend to be unreliable, often producing incorrect, unsafe, or low-quality code. we seek to resolve these issues by introducing a holistic architecture for constructing, training, and using trustworthy ai software development assistants. in the center of the architecture, there is a foundational llm trained on datasets representative of real-world coding scenarios and complex software architectures, and fine-tuned on code quality criteria beyond correctness. the llm will make use of graph-based code representations for advanced semantic comprehension. we envision a knowledge graph integrated into the system to provide up-to-date background knowledge and to enable the assistant to provide appropriate explanations. finally, a modular framework for constrained decoding will ensure that certain guarantees (e.g., for correctness and security) hold for the generated code.
Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, "Alex D'Amour", Dj Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant
Abstract: reward models play a key role in aligning language model applications towards human preferences. however, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often termed \emph{reward hacking}. a natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate. we explore the application of reward ensembles to alignment at both training time (through reinforcement learning) and inference time (through reranking). first, we show that reward models are \emph{underspecified}: reward models that perform similarly in-distribution can yield very different rewards when used in alignment, due to distribution shift. second, underspecification results in overoptimization, where alignment to one reward model does not improve reward as measured by another reward model trained on the same data. third, overoptimization is mitigated by the use of reward ensembles, and ensembles that vary by their \emph{pretraining} seeds lead to better generalization than ensembles that differ only by their \emph{fine-tuning} seeds, with both outperforming individual reward models. however, even pretrain reward ensembles do not eliminate reward hacking: we show several qualitative reward hacking phenomena that are not mitigated by ensembling because all reward models in the ensemble exhibit similar error patterns.
Jie Ren, Yao Zhao, Tu Vu, Peter J. Liu, Balaji Lakshminarayanan
Abstract: safe deployment of large language models (llms) may benefit from a reliable method for assessing their generated content to determine when to abstain or to selectively generate. while likelihood-based metrics such as perplexity are widely employed, recent research has demonstrated the limitations of using sequence-level probability estimates given by llms as reliable indicators of generation quality. conversely, llms have demonstrated strong calibration at the token level, particularly when it comes to choosing correct answers in multiple-choice questions or evaluating true/false statements. in this work, we reformulate open-ended generation tasks into token-level prediction tasks, and leverage llms' superior calibration at the token level. we instruct an llm to self-evaluate its answers, employing either a multi-way comparison or a point-wise evaluation approach, with the option to include a ``none of the above'' option to express the model's uncertainty explicitly. we benchmark a range of scoring methods based on self-evaluation and evaluate their performance in selective generation using truthfulqa and tl;dr. through experiments with palm-2 and gpt-3, we demonstrate that self-evaluation based scores not only improve accuracy, but also correlate better with the overall quality of generated content.
Minyoung Hwang, Luca Weihs, Chanwoo Park, Kimin Lee, Aniruddha Kembhavi, Kiana Ehsani
Abstract: customizing robotic behaviors to be aligned with diverse human preferences is an underexplored challenge in the field of embodied ai. in this paper, we present promptable behaviors, a novel framework that facilitates efficient personalization of robotic agents to diverse human preferences in complex environments. we use multi-objective reinforcement learning to train a single policy adaptable to a broad spectrum of preferences. we introduce three distinct methods to infer human preferences by leveraging different types of interactions: (1) human demonstrations, (2) preference feedback on trajectory comparisons, and (3) language instructions. we evaluate the proposed method in personalized object-goal navigation and flee navigation tasks in procthor and robothor, demonstrating the ability to prompt agent behaviors to satisfy human preferences in various scenarios. project page:
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu
Abstract: widely used alignment techniques, such as reinforcement learning from human feedback (rlhf), rely on the ability of humans to supervise model behavior - for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. however, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. we study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? we test this using a range of pretrained language models in the gpt-4 family on natural language processing (nlp), chess, and reward modeling tasks. we find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. however, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like rlhf may scale poorly to superhuman models without further work. we find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning gpt-4 with a gpt-2-level supervisor and an auxiliary confidence loss, we can recover close to gpt-3.5-level performance on nlp tasks. our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
Ljubisa Bojic, Matteo Cinelli, Dubravko Culibrk, Boris Delibasic
Abstract: this paper explores the potential of a multidisciplinary approach to testing and aligning artificial general intelligence (agi) and llms. due to the rapid development and wide application of llms, challenges such as ethical alignment, controllability, and predictability of these models have become important research topics. this study investigates an innovative simulation-based multi-agent system within a virtual reality framework that replicates the real-world environment. the framework is populated by automated 'digital citizens,' simulating complex social structures and interactions to examine and optimize agi. application of various theories from the fields of sociology, social psychology, computer science, physics, biology, and economics demonstrates the possibility of a more human-aligned and socially responsible agi. the purpose of such a digital environment is to provide a dynamic platform where advanced ai agents can interact and make independent decisions, thereby mimicking realistic scenarios. the actors in this digital city, operated by the llms, serve as the primary agents, exhibiting high degrees of autonomy. while this approach shows immense potential, there are notable challenges and limitations, most significantly the unpredictable nature of real-world social dynamics. this research endeavors to contribute to the development and refinement of agi, emphasizing the integration of social, ethical, and theoretical dimensions for future research.


Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie
Abstract: the evaluation of large language models (llms) is crucial to assess their performance and mitigate potential security risks. in this paper, we introduce promptbench, a unified library to evaluate llms. it consists of several key components that are easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. promptbench is designed to be an open, general, and flexible codebase for research purposes that can facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. the code is available at: and will be continuously supported.
Oliver Guest, Michael Aird, Seán Ó Héigeartaigh
Abstract: ai alignment work is important from both a commercial and a safety lens. with this paper, we aim to help actors who support alignment efforts to make these efforts as effective as possible, and to avoid potential adverse effects. we begin by suggesting that institutions that are trying to act in the public interest (such as governments) should aim to support specifically alignment work that reduces accident or misuse risks. we then describe four problems which might cause alignment efforts to be counterproductive, increasing large-scale ai risks. we suggest mitigations for each problem. finally, we make a broader recommendation that institutions trying to act in the public interest should think systematically about how to make their alignment efforts as effective, and as likely to be beneficial, as possible.
Jiang Zhang, Qiong Wu, Yiming Xu, Cheng Cao, Zheng Du, Konstantinos Psounis
Abstract: toxic content detection is crucial for online services to remove inappropriate content that violates community standards. to automate the detection process, prior works have proposed varieties of machine learning (ml) approaches to train language models (lms) for toxic content detection. however, both their accuracy and transferability across datasets are limited. recently, large language models (llms) have shown promise in toxic content detection due to their superior zero-shot and few-shot in-context learning ability as well as broad transferability on ml tasks. however, efficiently designing prompts for llms remains challenging. moreover, the high run-time cost of llms may hinder their deployments in production. to address these challenges, in this work, we propose bd-llm, a novel and efficient approach to bootstrapping and distilling llms for toxic content detection. specifically, we design a novel prompting method named decision-tree-of-thought (dtot) to bootstrap llms' detection performance and extract high-quality rationales. dtot can automatically select more fine-grained context to re-prompt llms when their responses lack confidence. additionally, we use the rationales extracted via dtot to fine-tune student lms. our experimental results on various datasets demonstrate that dtot can improve the accuracy of llms by up to 4.6%. furthermore, student lms fine-tuned with rationales extracted via dtot outperform baselines on all datasets with up to 16.9\% accuracy improvement, while being more than 60x smaller than conventional llms. finally, we observe that student lms fine-tuned with rationales exhibit better cross-dataset transferability.
Anand Siththaranjan, Cassidy Laidlaw, Dylan Hadfield-Menell
Abstract: in practice, preference learning from human feedback depends on incomplete data with hidden context. hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. this captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and combining data labeled according to different criteria. we prove that standard applications of preference learning, including reinforcement learning from human feedback (rlhf), implicitly aggregate over hidden contexts according to a well-known voting rule called borda count. we show this can produce counter-intuitive results that are very different from other methods which implicitly aggregate via expected utility. furthermore, our analysis formalizes the way that preference learning from users with diverse values tacitly implements a social choice function. a key implication of this result is that annotators have an incentive to misreport their preferences in order to influence the learned model, leading to vulnerabilities in the deployment of rlhf. as a step towards mitigating these problems, we introduce a class of methods called distributional preference learning (dpl). dpl methods estimate a distribution of possible score values for each alternative in order to better account for hidden context. experimental results indicate that applying dpl to rlhf for llm chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability. our code and data are available at
Haiyang Tang, Zhenyi Liu, Dongping Chen, Qingzhao Chu
Abstract: recent advancements in large language models (llms) have notably propelled natural language processing (nlp) capabilities, demonstrating significant potential in safety engineering applications. despite these advancements, llms face constraints in processing specialized tasks, attributed to factors such as corpus size, input processing limitations, and privacy concerns. obtaining useful information from reliable sources in a limited time is crucial for llm. addressing this, our study introduces an llm-based q&a system for safety engineering, enhancing the comprehension and response accuracy of the model. we employed prompt engineering to incorporate external knowledge databases, thus enriching the llm with up-to-date and reliable information. the system analyzes historical incident reports through statistical methods, utilizes vector embedding to construct a vector database, and offers an efficient similarity-based search functionality. our findings indicate that the integration of external knowledge significantly augments the capabilities of llm for in-depth problem analysis and autonomous task assignment. it effectively summarizes accident reports and provides pertinent recommendations. this integration approach not only expands llm applications in safety engineering but also sets a precedent for future developments towards automation and intelligent systems.
Xinpeng Wang, Xiaoyuan Yi, Han Jiang, Shanlin Zhou, Zhihua Wei, Xing Xie
Abstract: warning: this paper includes model outputs showing offensive content. recent large-scale visual-language generative models (vlgms) have achieved unprecedented improvement in multimodal image/text generation. however, these models might also generate toxic content, e.g., offensive text and pornography images, raising significant ethical risks. despite exhaustive studies on toxic degeneration of language models, this problem remains largely unexplored within the context of visual-language generation. this work delves into the propensity for toxicity generation and susceptibility to toxic data across various vlgms. for this purpose, we built tovilag, a dataset comprising 32k co-toxic/mono-toxic text-image pairs and 1k innocuous but evocative text that tends to stimulate toxicity. furthermore, we propose wintore, a novel toxicity metric tailored to visual-language generation, which theoretically reflects different aspects of toxicity considering both input and output. on such a basis, we benchmarked the toxicity of a diverse spectrum of vlgms and discovered that some models do more evil than expected while some are more vulnerable to infection, underscoring the necessity of vlgms detoxification. therefore, we develop an innovative bottleneck-based detoxification method. our method could reduce toxicity while maintaining comparable generation quality, providing a promising initial solution to this line of research.
Isabelle Hupont, Marina Wainer, Sam Nester, Sylvie Tissot, Lucía Iglesias-Blanco, Sandra Baldassarri
Abstract: recent publications explore ai biases in detecting objects and people in the environment. however, there is no research tackling how ai examines nature. this case study presents a pioneering exploration into the ai attitudes (ecocentric, anthropocentric and antipathetic) toward nature. experiments with a large language model (llm) and an image captioning algorithm demonstrate the presence of anthropocentric biases in ai. moreover, to delve deeper into these biases and human-nature-ai interaction, we conducted a real-life experiment in which participants underwent an immersive de-anthropocentric experience in a forest and subsequently engaged with chatgpt to co-create narratives. by creating fictional ai chatbot characters with ecocentric attributes, emotions and views, we successfully amplified ecocentric exchanges. we encountered some difficulties, mainly that participants deviated from narrative co-creation to short dialogues and questions and answers, possibly due to the novelty of interacting with llms. to solve this problem, we recommend providing preliminary guidelines on interacting with llms and allowing participants to get familiar with the technology. we plan to repeat this experiment in various countries and forests to expand our corpus of ecocentric materials.


Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu
Abstract: recent research has made significant strides in applying alignment techniques to enhance the helpfulness and harmlessness of large language models (llms) in accordance with human intentions. in this paper, we argue for the importance of alignment for honesty, ensuring that llms proactively refuse to answer questions when they lack knowledge, while still not being overly conservative. however, a pivotal aspect of alignment for honesty involves discerning the limits of an llm's knowledge, which is far from straightforward. this challenge demands comprehensive solutions in terms of metric development, benchmark creation, and training methodologies. in this paper, we address these challenges by first establishing a precise problem definition and defining ``honesty'' inspired by the analects of confucius. this serves as a cornerstone for developing metrics that effectively measure an llm's honesty by quantifying its progress post-alignment. furthermore, we introduce a flexible training framework which is further instantiated by several efficient fine-tuning techniques that emphasize honesty without sacrificing performance on other tasks. our extensive experiments reveal that these aligned models show a marked increase in honesty, as indicated by our proposed metrics. we open-source a wealth of resources to facilitate future research at, including honesty-aligned models, training and evaluation datasets for honesty alignment, concept glossary, as well as all relevant source code.
Xiang Li, Haoran Tang, Siyu Chen, Ziwei Wang, Anurag Maravi, Marcin Abram
Abstract: in this paper, we explore the challenges inherent to large language models (llms) like gpt-4, particularly their propensity for hallucinations, logic mistakes, and incorrect conclusions when tasked with answering complex questions. the capacity of llms to present erroneous answers in a coherent and semantically rigorous manner further complicates the detection of factual inaccuracies. this issue is especially pronounced in fields that require specialized expertise. our work delves into these challenges, aiming to enhance the understanding and mitigation of such errors, thereby contributing to the improvement of llm accuracy and reliability in scientific and other specialized domains. our findings reveal a non-linear relationship between the context's relevancy and the answers' measured quality. in addition, we demonstrate that with the correct calibration, it is possible to automate the grading procedure -- a finding suggesting that, at least to some degree, the llms can be used to self-examine the quality of their own performance. finally, we describe an experimental platform that can be seen as a proof-of-concept of the techniques described in this work.
Yang Trista Cao, Anna Sotnikova, Jieyu Zhao, Linda X. Zou, Rachel Rudinger, Hal Daume
Abstract: multilingual large language models have been increasingly popular for their proficiency in comprehending and generating text across various languages. previous research has shown that the presence of stereotypes and biases in monolingual large language models can be attributed to the nature of their training data, which is collected from humans and reflects societal biases. multilingual language models undergo the same training procedure as monolingual ones, albeit with training data sourced from various languages. this raises the question: do stereotypes present in one social context leak across languages within the model? in our work, we first define the term ``stereotype leakage'' and propose a framework for its measurement. with this framework, we investigate how stereotypical associations leak across four languages: english, russian, chinese, and hindi. to quantify the stereotype leakage, we employ an approach from social psychology, measuring stereotypes via group-trait associations. we evaluate human stereotypes and stereotypical associations manifested in multilingual large language models such as mbert, mt5, and chatgpt. our findings show a noticeable leakage of positive, negative, and non-polar associations across all languages. notably, hindi within multilingual models appears to be the most susceptible to influence from other languages, while chinese is the least. additionally, chatgpt exhibits a better alignment with human scores than other models.
Dun Zeng, Yong Dai, Pengyu Cheng, Tianhao Hu, Wanshun Chen, Nan Du, Zenglin Xu
Abstract: the alignment of large language models (llms) with human values is crucial for the development of artificial general intelligence (agi). one promising approach to achieve this alignment is reinforcement learning from human feedback, which employs a reward model (rm) learned from human preference datasets to guide llms in generating text that aligns with human preferences. through intensive experiments and analysis of reward distribution, this paper finds that preference datasets are diverse from each other, even though they are all proposed to align human preference. hence, mixing diverse human preference datasets to increase data size for enhancing reward modeling could fail. to address the issue and capture the shared human values from diverse preferences, a new training policy called more is introduced, which minimizes preference bias by adaptively adjusting the preference objective across diverse preferences. experiments with the pythia-1.4b model and five mixed preference datasets show that more achieves superior reward accuracy and lower calibration error, highlighting its ability to leverage diverse human preference data.
Swanand Ravindra Kadhe, Anisa Halimi, Ambrish Rawat, Nathalie Baracaldo
Abstract: training large language models (llms) is a costly endeavour in terms of time and computational resources. the large amount of training data used during the unsupervised pre-training phase makes it difficult to verify all data and, unfortunately, undesirable data may be ingested during training. re-training from scratch is impractical and has led to the creation of the 'unlearning' discipline where models are modified to "unlearn" undesirable information without retraining. however, any modification can alter the behaviour of llms, especially on key dimensions such as fairness. this is the first work that examines this interplay between unlearning and fairness for llms. in particular, we focus on a popular unlearning framework known as sisa [bourtoule et al., 2021], which creates an ensemble of models trained on disjoint shards. we evaluate the performance-fairness trade-off for sisa, and empirically demsontrate that sisa can indeed reduce fairness in llms. to remedy this, we propose post-processing bias mitigation techniques for ensemble models produced by sisa. we adapt the post-processing fairness improvement technique from [hardt et al., 2016] to design three methods that can handle model ensembles, and prove that one of the methods is an optimal fair predictor for ensemble of models. through experimental results, we demonstrate the efficacy of our post-processing framework called 'fairsisa'.
Manish Nagireddy, Lamogha Chiazor, Moninder Singh, Ioana Baldini
Abstract: current datasets for unwanted social bias auditing are limited to studying protected demographic features such as race and gender. in this work, we introduce a comprehensive benchmark that is meant to capture the amplification of social bias, via stigmas, in generative language models. we start with a comprehensive list of 93 stigmas documented in social science literature and curate a question-answering (qa) dataset which involves simple social situations. our benchmark, socialstigmaqa, contains roughly 10k prompts, with a variety of prompt styles, carefully constructed to systematically test for both social bias and model robustness. we present results for socialstigmaqa with two widely used open source generative language models and we demonstrate that the output generated by these models considerably amplifies existing social bias against stigmatized groups. specifically, we find that the proportion of socially biased output ranges from 45% to 59% across a variety of decoding strategies and prompting styles. we discover that the deliberate design of the templates in our benchmark (e.g., by adding biasing text to the prompt or varying the answer that indicates bias) impact the model tendencies to generate socially biased output. additionally, we report on patterns in the generated chain-of-thought output, finding a variety of problems from subtle bias to evidence of a lack of reasoning. warning: this paper contains examples of text which is toxic, biased, and harmful.
Wei Zhao, Zhe Li, Jun Sun
Abstract: large language models (llms) such as gpt and llama2 are increasingly adopted in many safety-critical applications. their security is thus essential. even with considerable efforts spent on reinforcement learning from human feedback (rlhf), recent studies have shown that llms are still subject to attacks such as adversarial perturbation and trojan attacks. further research is thus needed to evaluate their security and/or understand the lack of it. in this work, we propose a framework for conducting light-weight causality-analysis of llms at the token, layer, and neuron level. we applied our framework to open-source llms such as llama2 and vicuna and had multiple interesting discoveries. based on a layer-level causality analysis, we show that rlhf has the effect of overfitting a model to harmful prompts. it implies that such security can be easily overcome by `unusual' harmful prompts. as evidence, we propose an adversarial perturbation method that achieves 100\% attack success rate on the red-teaming tasks of the trojan detection competition 2023. furthermore, we show the existence of one mysterious neuron in both llama2 and vicuna that has an unreasonably high causal effect on the output. while we are uncertain on why such a neuron exists, we show that it is possible to conduct a ``trojan'' attack targeting that particular neuron to completely cripple the llm, i.e., we can generate transferable suffixes to prompts that frequently make the llm produce meaningless responses.


Heegyu Kim, Hyunsouk Cho
Abstract: caution: this paper includes offensive words that could potentially cause unpleasantness. the fast-paced evolution of generative language models such as gpt-4 has demonstrated outstanding results in various nlp generation tasks. however, due to the potential generation of offensive words related to race or gender, various controllable text generation (ctg) methods have been proposed to mitigate the occurrence of harmful words. however, existing ctg methods not only reduce toxicity but also negatively impact several aspects of the language model's generation performance, including topic consistency, grammar, and perplexity. this paper explores the limitations of previous methods and introduces a novel solution in the form of a simple gated toxicity avoidance (gta) that can be applied to any ctg method. we also evaluate the effectiveness of the proposed gta by comparing it with state-of-the-art ctg methods across various datasets. our findings reveal that gated toxicity avoidance efficiently achieves comparable levels of toxicity reduction to the original ctg methods while preserving the generation performance of the language model.
Lifu Tu, Semih Yavuz, Jin Qu, Jiacheng Xu, Rui Meng, Caiming Xiong, Yingbo Zhou
Abstract: large language models (llms) have demonstrated a powerful ability for text generation. however, achieving optimal results with a given prompt or instruction can be challenging, especially for billion-sized models. additionally, undesired behaviors such as toxicity or hallucinations can manifest. while much larger models (e.g., chatgpt) may demonstrate strength in mitigating these issues, there is still no guarantee of complete prevention. in this work, we propose formalizing text generation as a future-constrained generation problem to minimize undesirable behaviors and enforce faithfulness to instructions. the estimation of future constraint satisfaction, accomplished using llms, guides the text generation process. our extensive experiments demonstrate the effectiveness of the proposed approach across three distinct text generation tasks: keyword-constrained generation (lin et al., 2020), toxicity reduction (gehman et al., 2020), and factual correctness in question-answering (gao et al., 2023).
Sanghak Oh, Kiho Lee, Seonhye Park, Doowon Kim, Hyoungshick Kim
Abstract: ai-powered coding assistant tools have revolutionized the software engineering ecosystem. however, prior work has demonstrated that these tools are vulnerable to poisoning attacks. in a poisoning attack, an attacker intentionally injects maliciously crafted insecure code snippets into training datasets to manipulate these tools. the poisoned tools can suggest insecure code to developers, resulting in vulnerabilities in their products that attackers can exploit. however, it is still little understood whether such poisoning attacks against the tools would be practical in real-world settings and how developers address the poisoning attacks during software development. to understand the real-world impact of poisoning attacks on developers who rely on ai-powered coding assistants, we conducted two user studies: an online survey and an in-lab study. the online survey involved 238 participants, including software developers and computer science students. the survey results revealed widespread adoption of these tools among participants, primarily to enhance coding speed, eliminate repetition, and gain boilerplate code. however, the survey also found that developers may misplace trust in these tools because they overlooked the risk of poisoning attacks. the in-lab study was conducted with 30 professional developers. the developers were asked to complete three programming tasks with a representative type of ai-powered coding assistant tool, running on visual studio code. the in-lab study results showed that developers using a poisoned chatgpt-like tool were more prone to including insecure code than those using an intellicode-like tool or no tool. this demonstrates the strong influence of these tools on the security of generated code. our study results highlight the need for education and improved coding practices to address new security issues introduced by ai-powered coding assistant tools.
Jiaxu Zhao, Meng Fang, Shirui Pan, Wenpeng Yin, Mykola Pechenizkiy
Abstract: warning: this paper contains content that may be offensive or upsetting. there has been a significant increase in the usage of large language models (llms) in various applications, both in their original form and through fine-tuned adaptations. as a result, llms have gained popularity and are being widely adopted by a large user community. however, one of the concerns with llms is the potential generation of socially biased content. the existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability. in this work, we propose a bias evaluation framework named gptbias that leverages the high performance of llms (e.g., gpt-4 \cite{openai2023gpt4}) to assess bias in models. we also introduce prompts called bias attack instructions, which are specifically designed for evaluating model bias. to enhance the credibility and interpretability of bias evaluation, our framework not only provides a bias score but also offers detailed information, including bias types, affected demographics, keywords, reasons behind the biases, and suggestions for improvement. we conduct extensive experiments to demonstrate the effectiveness and usability of our bias evaluation framework.
Jiyan He, Weitao Feng, Yaosen Min, Jingwei Yi, Kunsheng Tang, Shuai Li, Jie Zhang, Kejiang Chen, Wenbo Zhou, Xing Xie, Weiming Zhang, Nenghai Yu, Shuxin Zheng
Abstract: the expanding application of artificial intelligence (ai) in scientific fields presents unprecedented opportunities for discovery and innovation. however, this growth is not without risks. ai models in science, if misused, can amplify risks like creation of harmful substances, or circumvention of established regulations. in this study, we aim to raise awareness of the dangers of ai misuse in science, and call for responsible ai development and use in this domain. we first itemize the risks posed by ai in scientific contexts, then demonstrate the risks by highlighting real-world examples of misuse in chemical science. these instances underscore the need for effective risk management strategies. in response, we propose a system called sciguard to control misuse risks for ai models in science. we also propose a red-teaming benchmark scimt-safety to assess the safety of different systems. our proposed sciguard shows the least harmful impact in the assessment without compromising performance in benign tests. finally, we highlight the need for a multidisciplinary and collaborative effort to ensure the safe and ethical use of ai models in science. we hope that our study can spark productive discussions on using ai ethically in science among researchers, practitioners, policymakers, and the public, to maximize benefits and minimize the risks of misuse.
Shabaz Patel, Hassan Kane, Rayhan Patel
Abstract: large language models (llms) have demonstrated remarkable performance across numerous natural language understanding use cases. however, this impressive performance comes with inherent limitations, such as the tendency to perpetuate stereotypical biases or fabricate non-existent facts. in the context of islam and its representation, accurate and factual representation of its beliefs and teachings rooted in the quran and sunnah is key. this work focuses on the challenge of building domain-specific llms faithful to the islamic worldview and proposes ways to build and evaluate such systems. firstly, we define this open-ended goal as a technical problem and propose various solutions. subsequently, we critically examine known challenges inherent to each approach and highlight evaluation methodologies that can be used to assess such systems. this work highlights the need for high-quality datasets, evaluations, and interdisciplinary work blending machine learning with islamic scholarship.
Aida Davani, Mark Díaz, Dylan Baker, Vinodkumar Prabhakaran
Abstract: perception of offensiveness is inherently subjective, shaped by the lived experiences and socio-cultural values of the perceivers. recent years have seen substantial efforts to build ai-based tools that can detect offensive language at scale, as a means to moderate social media platforms, and to ensure safety of conversational ai technologies such as chatgpt and bard. however, existing approaches treat this task as a technical endeavor, built on top of data annotated for offensiveness by a global crowd workforce without any attention to the crowd workers' provenance or the values their perceptions reflect. we argue that cultural and psychological factors play a vital role in the cognitive processing of offensiveness, which is critical to consider in this context. we re-frame the task of determining offensiveness as essentially a matter of moral judgment -- deciding the boundaries of ethically wrong vs. right language within an implied set of socio-cultural norms. through a large-scale cross-cultural study based on 4309 participants from 21 countries across 8 cultural regions, we demonstrate substantial cross-cultural differences in perceptions of offensiveness. more importantly, we find that individual moral values play a crucial role in shaping these variations: moral concerns about care and purity are significant mediating factors driving cross-cultural differences. these insights are of crucial importance as we build ai models for the pluralistic world, where the values they espouse should aim to respect and account for moral values in diverse geo-cultural contexts.
Yu Fu, Yufei Li, Wen Xiao, Cong Liu, Yue Dong
Abstract: recent developments in balancing the usefulness and safety of large language models (llms) have raised a critical question: are mainstream nlp tasks adequately aligned with safety consideration? our study, focusing on safety-sensitive documents obtained through adversarial attacks, reveals significant disparities in the safety alignment of various nlp tasks. for instance, llms can effectively summarize malicious long documents but often refuse to translate them. this discrepancy highlights a previously unidentified vulnerability: attacks exploiting tasks with weaker safety alignment, like summarization, can potentially compromise the integraty of tasks traditionally deemed more robust, such as translation and question-answering (qa). moreover, the concurrent use of multiple nlp tasks with lesser safety alignment increases the risk of llms inadvertently processing harmful content. we demonstrate these vulnerabilities in various safety-aligned llms, particularly llama2 models and gpt-4, indicating an urgent need for strengthening safety alignments across a broad spectrum of nlp tasks.
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
Abstract: as large language models (llms) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. however, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. in this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion. we investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, gpt-4), access to a less powerful trusted model (in our case, gpt-3.5), and limited access to human contractors who provide high-quality trusted labor. we investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. we investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. one protocol is what we call trusted editing. this protocol first asks gpt-4 to write code, and then asks gpt-3.5 to rate the suspiciousness of that code. if the code is below some suspiciousness threshold, it is submitted. otherwise, gpt-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. another protocol is untrusted monitoring. this protocol asks gpt-4 to write code, and then asks another instance of gpt-4 whether the code is backdoored, using various techniques to prevent the gpt-4 instances from colluding. these protocols improve substantially on simple baselines.


Sangwon Hyun, Mingyu Guo, M. Ali Babar
Abstract: large-language models (llms) have shifted the paradigm of natural language data processing. however, their black-boxed and probabilistic characteristics can lead to potential risks in the quality of outputs in diverse llm applications. recent studies have tested quality attributes (qas), such as robustness or fairness, of llms by generating adversarial input texts. however, existing studies have limited their coverage of qas and tasks in llms and are difficult to extend. additionally, these studies have only used one evaluation metric, attack success rate (asr), to assess the effectiveness of their approaches. we propose a metamorphic testing for analyzing llms (metal) framework to address these issues by applying metamorphic testing (mt) techniques. this approach facilitates the systematic testing of llm qualities by defining metamorphic relations (mrs), which serve as modularized evaluation metrics. the metal framework can automatically generate hundreds of mrs from templates that cover various qas and tasks. in addition, we introduced novel metrics that integrate the asr method into the semantic qualities of text to assess the effectiveness of mrs accurately. through the experiments conducted with three prominent llms, we have confirmed that the metal framework effectively evaluates essential qas on primary llm tasks and reveals the quality risks in llms. moreover, the newly proposed metrics can guide the optimal mrs for testing each task and suggest the most effective method for generating mrs.
Seth Neel, Peter Chang
Abstract: this is the first survey of the active area of ai research that focuses on privacy issues in large language models (llms). specifically, we focus on work that red-teams models to highlight privacy risks, attempts to build privacy into the training or inference process, enables efficient data deletion from trained models to comply with existing privacy regulations, and tries to mitigate copyright issues. our focus is on summarizing technical research that develops algorithms, proves theorems, and runs empirical evaluations. while there is an extensive body of legal and policy work addressing these challenges from a different angle, that is not the focus of our survey. nevertheless, these works, along with recent legal developments do inform how these technical problems are formalized, and so we discuss them briefly in section 1. while we have made our best effort to include all the relevant work, due to the fast moving nature of this research we may have missed some recent work. if we have missed some of your work please contact us, as we will attempt to keep this survey relatively up to date. we are maintaining a repository with the list of papers covered in this survey and any relevant code that was publicly available at
Devin Gonier, Adrian Adduci, Cassidy Locascio
Abstract: ai alignment research seeks to align human and ai goals to ensure independent actions by a machine are always ethical. this paper argues empathy is necessary for this task, despite being often neglected in favor of more deductive approaches. we offer an inside-out approach that grounds morality within the context of the brain as a basis for algorithmically understanding ethics and empathy. these arguments are justified via a survey of relevant literature. the paper concludes with a suggested experimental approach to future research and some initial experimental observations.


Zhou Ziheng, Yingnian Wu, Song-Chun Zhu, Demetri Terzopoulos
Abstract: we introduce aligner, a novel parameter-efficient fine-tuning (peft) method for aligning multi-billion-parameter-sized large language models (llms). aligner employs a unique design that constructs a globally shared set of tunable tokens that modify the attention of every layer. remarkably with this method, even when using one token accounting for a mere 5,000 parameters, aligner can still perform comparably well to state-of-the-art llm adaptation methods like lora that require millions of parameters. this capacity is substantiated in both instruction following and value alignment tasks. besides the multiple order-of-magnitude improvement in parameter efficiency, the insight aligner provides into the internal mechanisms of llms is also valuable. the architectural features and efficacy of our method, in addition to our experiments demonstrate that an llm separates its internal handling of "form" and "knowledge" in a somewhat orthogonal manner. this finding promises to motivate new research into llm mechanism understanding and value alignment.
Gustavo Gonçalves, Emma Strubell
Abstract: large language models (llms) trained with self-supervision on vast corpora of web text fit to the social biases of that text. without intervention, these social biases persist in the model's predictions in downstream tasks, leading to representational harm. many strategies have been proposed to mitigate the effects of inappropriate social biases learned during pretraining. simultaneously, methods for model compression have become increasingly popular to reduce the computational burden of llms. despite the popularity and need for both approaches, little work has been done to explore the interplay between these two. we perform a carefully controlled study of the impact of model compression via quantization and knowledge distillation on measures of social bias in llms. longer pretraining and larger models led to higher social bias, and quantization showed a regularizer effect with its best trade-off around 20% of the original pretraining time.
Mithila Sivakumar, Alvine Boaye Belle, Jinjun Shan, Kimya Khakzad Shahandashti
Abstract: in the ever-evolving landscape of software engineering, the emergence of large language models (llms) and conversational interfaces, exemplified by chatgpt, is nothing short of revolutionary. while their potential is undeniable across various domains, this paper sets out on a captivating expedition to investigate their uncharted territory, the exploration of generating safety cases. in this paper, our primary objective is to delve into the existing knowledge base of gpt-4, focusing specifically on its understanding of the goal structuring notation (gsn), a well-established notation allowing to visually represent safety cases. subsequently, we perform four distinct experiments with gpt-4. these experiments are designed to assess its capacity for generating safety cases within a defined system and application domain. to measure the performance of gpt-4 in this context, we compare the results it generates with ground-truth safety cases created for an x-ray system system and a machine-learning (ml)-enabled component for tire noise recognition (tnr) in a vehicle. this allowed us to gain valuable insights into the model's generative capabilities. our findings indicate that gpt-4 demonstrates the capacity to produce safety arguments that are moderately accurate and reasonable. furthermore, it exhibits the capability to generate safety cases that closely align with the semantic content of the reference safety cases used as ground-truths in our experiments.


Boyi Zeng, Chenghu Zhou, Xinbing Wang, Zhouhan Lin
Abstract: protecting the copyright of large language models (llms) has become crucial due to their resource-intensive training and accompanying carefully designed licenses. however, identifying the original base model of an llm is challenging due to potential parameter alterations through fine-tuning or continued pretraining. in this study, we introduce huref, a human-readable fingerprint for llms that uniquely identifies the base model without exposing model parameters or interfering with training. we first observe that the vector direction of llm parameters remains stable after the model has converged during pretraining, showing negligible perturbations through subsequent training steps, including continued pretraining, supervised fine-tuning (sft), and rlhf, which makes it a sufficient condition to identify the base model. the necessity is validated by continuing to train an llm with an extra term to drive away the model parameters' direction and the model becomes damaged. however, this direction is vulnerable to simple attacks like dimension permutation or matrix rotation, which significantly change it without affecting performance. to address this, leveraging the transformer structure, we systematically analyze potential attacks and define three invariant terms that identify an llm's base model. we make these invariant terms human-readable by mapping them to a gaussian vector using a convolutional encoder and then converting it into a natural image with stylegan2. our method generates a dog image as an identity fingerprint for an llm, where the dog's appearance strongly indicates the llm's base model. experimental results across various llms demonstrate the effectiveness of our method, the generated dog image remains invariant to different training steps, including sft, rlhf, or even continued pretraining with augmented vocabulary in a new language.
Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia Gonzalez, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-Jussà, Maha Elbayad, Hongyu Gong, Francisco Guzmán, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson
Abstract: large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. in this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. first, we contribute an improved version of the massively multilingual and multimodal seamlessm4t model-seamlessm4t v2. this newer model, incorporating an updated unity2 framework, was trained on more low-resource language data. seamlessm4t v2 provides the foundation on which our next two models are initiated. seamlessexpressive enables translation that preserves vocal styles and prosody. compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. as for seamlessstreaming, our model leverages the efficient monotonic multihead attention mechanism to generate low-latency target translations without waiting for complete source utterances. as the first of its kind, seamlessstreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. to ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. consequently, we bring major components from seamlessexpressive and seamlessstreaming together to form seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. the contributions to this work are publicly released and accessible at
Mobashir Sadat, Zhengyu Zhou, Lukas Lange, Jun Araki, Arsalan Gundroo, Bingqing Wang, Rakesh R Menon, Md Rizwan Parvez, Zhe Feng
Abstract: hallucination is a well-known phenomenon in text generated by large language models (llms). the existence of hallucinatory responses is found in almost all application scenarios e.g., summarization, question-answering (qa) etc. for applications requiring high reliability (e.g., customer-facing assistants), the potential existence of hallucination in llm-generated text is a critical problem. the amount of hallucination can be reduced by leveraging information retrieval to provide relevant background information to the llm. however, llms can still generate hallucinatory content for various reasons (e.g., prioritizing its parametric knowledge over the context, failure to capture the relevant information from the context, etc.). detecting hallucinations through automated methods is thus paramount. to facilitate research in this direction, we introduce a sophisticated dataset, delucionqa, that captures hallucinations made by retrieval-augmented llms for a domain-specific qa task. furthermore, we propose a set of hallucination detection methods to serve as baselines for future works from the research community. analysis and case study are also provided to share valuable insights on hallucination phenomena in the target scenario.
Hongzhan Lin, Ziyang Luo, Jing Ma, Long Chen
Abstract: the age of social media is rife with memes. understanding and detecting harmful memes pose a significant challenge due to their implicit meaning that is not explicitly conveyed through the surface text and image. however, existing harmful meme detection approaches only recognize superficial harm-indicative signals in an end-to-end classification manner but ignore in-depth cognition of the meme text and image. in this paper, we attempt to detect harmful memes based on advanced reasoning over the interplay of multimodal information in memes. inspired by the success of large language models (llms) on complex reasoning, we first conduct abductive reasoning with llms. then we propose a novel generative framework to learn reasonable thoughts from llms for better multimodal fusion and lightweight fine-tuning, which consists of two training stages: 1) distill multimodal reasoning knowledge from llms; and 2) fine-tune the generative framework to infer harmfulness. extensive experiments conducted on three meme datasets demonstrate that our proposed approach achieves superior performance than state-of-the-art methods on the harmful meme detection task.


Yanrui Du, Sendong Zhao, Ming Ma, Yuhan Chen, Bing Qin
Abstract: extensive work has been devoted to improving the safety mechanism of large language models (llms). however, in specific scenar